Tech Creates Self-Training Gene Prediction Program


Contact

Georgia Tech Media Relations
Laura Diamond
laura.diamond@comm.gatech.edu
404-894-6016
Jason Maderer
maderer@gatech.edu
404-660-2926

Summaries

Summary Sentence:

First self-training program for eukaryotes

Full Summary:

Researchers have developed the first-ever computer program able to train itself to predict genes in genomic DNA sequences of eukaryotic organisms. The program may help researchers shave a year or more off genome sequencing and interpretation projects.

Researchers at the Georgia Institute of Technology have developed the first-ever computer program capable of training itself to predict genes in genomic DNA sequences of eukaryotic organisms such as animals, plants and fungi. The software program, GeneMark.hmm-ES, may help researchers save a year or more in a genome sequencing and interpretation project. The program is a new addition to the family of GeneMark gene prediction programs developed at Georgia Tech and is freely available to academic researchers.

Currently, there are more than 600 ongoing genome sequencing projects for eukaryotes, organisms whose cells carry nuclei. Decoding the DNA sequences that come out of even a single genome project is an enormous task. Still, unraveling the genetic code of living creatures allows scientists to understand the details of the cellular machinery. This knowledge helps generate ideas for a variety of future research directions. Understanding the specific features of individual genomes may lead to the development of personalized medicine, while comparing the genomes of related species can help scientists trace their evolution.

"The genomic sequence is a foundation and blueprint of molecular cellular networks and processes which dynamics need to be reconstructed to understand how the cell works. These networks are specific for each organism, so once you know the list of the genes, you start to assemble all the parts into a picture," said Mark Borodovsky, Regents' professor in the School of Biology and the Department of Biomedical Engineering, and director of the Center for Bioinformatics and Computational Genomics at Georgia Tech.

Borodovsky developed the first version of GeneMark in 1993. In 1995, this program was used by Craig Venter and his Institute for Genomic Research to find genes in the first-ever completely sequenced genomes of organisms representing the two prokaryotic domains of life, bacteria and archaea.

A self-training version of the gene-finding program for prokaryotic genomes was created by Borodovsky's group in 2001. Since 1998, GeneMark has also been frequently used for gene finding in eukaryotes, particularly in plant genomes such as rice. To date, researchers around the globe have reported using the GeneMark programs in the discovery of more than 400,000 genes in various genomes, from viruses and bacteria to rice and humans.

Now Borodovsky and his team at Georgia Tech have taken a leap forward and built a program that can train itself to make accurate gene predictions in the numerous newly sequenced genomes of eukaryotes. The program uses established general principles of genetic code organization, adjusted to the general compositional features of a particular genome, to identify at least a few regions of the anonymous genome that contain protein-coding sequences. Once these initial predictions are made, the coding and non-coding sequences are separated. This clustering makes it possible to apply machine-learning techniques that tune the parameters of the recognition algorithm to the specific patterns found in the newly identified protein-coding sequences. The prediction and training steps are then repeated, each round detecting a larger set of true coding sequences that are used to further improve the model employed in statistical pattern recognition. The last run, in which the prediction step turns up nothing new, produces the final set of predicted genes.
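
The predict-then-retrain loop described above can be illustrated with a deliberately simplified sketch. This is not the GeneMark.hmm-ES algorithm: the fixed-length windows, the GC-content seed heuristic, the 3-mer log-odds model and every function name below are assumptions chosen only to show the general shape of self-training, in which a rough first guess is refined until a round changes nothing.

```python
import math
from collections import Counter

K = 3  # k-mer length for the toy model (an assumption of this sketch)


def kmer_counts(seqs):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - K + 1):
            counts[s[i:i + K]] += 1
    return counts


def train(coding, noncoding):
    """Return per-k-mer log-odds (coding vs. non-coding) with add-one smoothing."""
    c, n = kmer_counts(coding), kmer_counts(noncoding)
    c_total = sum(c.values()) + 4 ** K
    n_total = sum(n.values()) + 4 ** K
    return {
        k: math.log((c[k] + 1) / c_total) - math.log((n[k] + 1) / n_total)
        for k in set(c) | set(n)
    }


def score(window, model):
    """Sum the log-odds of all k-mers in a window; positive means 'looks coding'."""
    return sum(model.get(window[i:i + K], 0.0) for i in range(len(window) - K + 1))


def initial_labels(windows):
    """Hypothetical seed heuristic: call a window coding if it is GC-richer than average."""
    if not windows:
        return []
    gc = [(w.count("G") + w.count("C")) / len(w) for w in windows]
    mean = sum(gc) / len(gc)
    return [g > mean for g in gc]


def self_train(windows, max_rounds=20):
    """Alternate training and prediction until the labels stop changing."""
    labels = initial_labels(windows)
    for _ in range(max_rounds):
        coding = [w for w, is_coding in zip(windows, labels) if is_coding]
        noncoding = [w for w, is_coding in zip(windows, labels) if not is_coding]
        if not coding or not noncoding:
            break  # nothing to contrast, so the model cannot be trained
        model = train(coding, noncoding)
        new_labels = [score(w, model) > 0 for w in windows]
        if new_labels == labels:
            break  # the prediction step found nothing new: final answer
        labels = new_labels
    return labels


if __name__ == "__main__":
    toy = ["GCCGCGGCCGCG", "GCGGCCGCCGCG", "ATTATATAATAT", "TATAATTATATA"]
    print(self_train(toy))
```

The stopping rule mirrors the article's description: the loop ends on the first round whose predictions match the previous round's. A real gene finder would use a far richer statistical model and real gene structure rather than fixed windows.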

Because the self-training method uses established general principles of eukaryotic gene organization to reconstruct the species-specific nucleotide sequence patterns, it speeds things up: scientists don't have to wait for an outside expert to assemble a set of sequences large enough to use for training. That can shave a year or more off a sequencing project. With the self-training method, the program does the work itself.

Details on the new program can be found in Nucleic Acids Research, volume 33, number 20, pages 6494-6506.

Groups

News Room

Status
  • Created By: David Terraso
  • Workflow Status: Published
  • Created On: Jun 15, 2006 - 8:00pm
  • Last Updated: Oct 7, 2016 - 11:00pm