Re: Reverse-engineering the T. rex genome

> > When you sequence a genome, at the most basic level, all you find is
> > the nucleotide sequence, but you don't even know which of those
> > nucleotides constitute a "gene" - after all for any given length of
> > DNA you have to examine 6 different possibilities for open reading
> > frames.
> > Genes can be on either strand, and since codons are pairs of three
> > nucleotides, there are three possibilities for reading frames per
> > strand.
> > And its very hard to tell what codes for aa's, what produces rna
> > transcripts that aren't translated (but still serve some function,
> > and could be considered a gene), what nucleotides are promoters,
> > which ones are structural, binding sites, which ones are merely
> > "spacers", etc.
> Computers can do all this (and more) standing on their heads.
> Moreover, they can annotate genomes in a high-throughput and fairly
> accurate fashion.  Sure, they get the start- and end-points of a gene
> wrong sometimes, even when it's a protein-coding gene.  For example,
> if there are two methionine codons (ATG) in close proximity near the
> beginning of a gene, the software may choose the wrong one as the
> start codon.  And some organisms have stop codons that are not stop
> codons at all, but encode for 'weird' amino acids (i.e., outside the
> standard 20, like pyrrolysine), which causes the software to
> terminate the gene prematurely.  But by and large, computers can
> annotate genomes (and metagenomes) at a fairly rapid rate.

True, but the output still needs manual proofreading. If you manage to code an 
application that speeds up that step, something like a smarter version of 
Apollo (a nice compact Java app that you can put on a USB stick and use on the 
fly, http://apollo.berkeleybop.org/current/index.html), you might not have a 
top seller, but you would have the eternal gratitude of geneticists worldwide.
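
To make the six-reading-frame point from the quoted text concrete, here is a 
minimal Python sketch of a naive ORF scan. The function names and the 
min_codons parameter are made up for illustration, and it assumes the standard 
stop codons (TAA, TAG, TGA) and plain A/C/G/T input. It always takes the first 
ATG it sees in a frame, which is exactly the kind of choice that can be wrong 
and needs proofreading.

```python
# Naive six-frame ORF scan (illustrative sketch, not a real gene finder).
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Return (strand, frame, start, end, orf) for every ATG...stop stretch.

    Coordinates are 0-based on the strand being scanned; for the "-" strand
    they refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame ATG wins (may be the wrong one)
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs
```

Calling find_orfs() on a sequence gives candidates from both strands. Whether 
any of them is a real gene is a separate question that a scan like this cannot 
answer.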

Exons and introns are a huge problem. IIRC, the smallest exons in vertebrates 
(either 3 AA or even 1 AA) are smaller than their flanking splice signals. 
Coupled with multiple splice variants (the only figure I can remember is about 
150 splice variants of one gene in _C. elegans_, but that was some years ago), 
this is really an obstacle that only human proofreading can tackle. Good 
proofreading software will at least fairly confidently figure out all the 
possible splice sites. But then you have protein splicing, and there it becomes 

If you only need the genomic sequence, things become A LOT easier. There, the 
biggest problem is faulty sequencing. Catching those errors can be sped up by 
sequencing several close relatives at the same time, because erroneous 
sequencer reads will stand out in an alignment. Of course, multiple close 
relatives might not be available; in that case, parallel runs of the same 
sequence from the same organism (multiple specimens if you have the time and 
money) will do.
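
As a rough illustration of how bad reads "stand out" in an alignment, here is 
a small Python sketch. It assumes the sequences are already aligned to equal 
length (with '-' for gaps); the function name and the min_agreement threshold 
are invented for the example. It flags any base that disagrees with a clear 
column majority. With close relatives, real differences get flagged too, so 
the output is only a list of candidates to check by eye.

```python
from collections import Counter


def flag_discordant_bases(aligned, min_agreement=0.6):
    """Flag bases that disagree with the majority base in their column.

    aligned: dict mapping a read/specimen name to its aligned sequence.
    Returns a list of (position, name, observed_base, consensus_base).
    """
    names = list(aligned)
    if not names:
        return []
    length = len(aligned[names[0]])
    if any(len(aligned[n]) != length for n in names):
        raise ValueError("all aligned sequences must have the same length")

    flagged = []
    for pos in range(length):
        column = [aligned[n][pos] for n in names]
        consensus, count = Counter(column).most_common(1)[0]
        if count / len(column) < min_agreement:
            continue  # no clear majority: leave this column to a human
        for name, base in zip(names, column):
            if base != consensus:
                flagged.append((pos, name, base, consensus))
    return flagged
```

For example, with three parallel runs where only one has a T at position 2 
and the other two have G, that single T is reported as a candidate error.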