Re: Reverse-engineering the T. rex genome
> > When you sequence a genome, at the most basic level, all
> > you find is the nucleotide sequence, but you don't even know
> > which of those nucleotides constitute a "gene" - after all,
> > for any given length of DNA you have to examine 6
> > possibilities for open reading frames.
> > Genes can be on either strand, and since codons are made up
> > of three nucleotides, there are three possible
> > reading frames per strand.
> > And it's very hard to tell what codes for aa's, what
> > produces rna transcripts that aren't translated (but
> > serve some function, and could be considered a gene), which
> > nucleotides are promoters, which ones are structural, which are
> > binding sites, which ones are merely "spacers", etc.
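To make the six-frame point concrete, here is a minimal Python sketch of a naive open-reading-frame scan. It is only an illustration, not how real annotation pipelines work: the function names (`reverse_complement`, `find_orfs`), the `min_codons` cutoff, and the restriction to ATG starts with the standard three stop codons are all assumptions made for this example.

```python
# Naive six-frame ORF scan: 3 frames on the given strand, 3 on its reverse complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard stops only (an assumption)


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Return (strand, frame, start, end, orf) tuples for ATG...stop stretches.

    Coordinates are relative to the strand being read, so "-" hits are
    positions on the reverse complement, not on the input string.
    min_codons counts the stop codon as well.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame ATG opens a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    for hit in find_orfs("CCATGAAATAGCC"):
        print(hit)
```

Even this toy version shows why the raw sequence alone says little: every stretch has to be read six ways, and an ORF found this way is only a candidate, not a confirmed gene.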
> Computers can do all this (and more) standing on their
> heads. Moreover, they can annotate genomes in a
> high-throughput and fairly accurate fashion. Sure,
> they get the start- and end-points of a gene wrong
> sometimes, even when it's a protein-coding gene. For
> example, if there are two methionine codons (ATG) in close
> proximity near the beginning of a gene, the software may
> choose the wrong one as the start codon. And some
> organisms have stop codons that are not stop codons at all,
> but encode for 'weird' amino acids (i.e., outside the
> standard 20, like pyrrolysine), which causes the software to
> terminate the gene prematurely. But by and large,
> computers can annotate genomes (and metagenomes) at a fairly
> rapid rate.
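Both failure modes mentioned above (ambiguous start codons and stop codons that are not really stops) can be illustrated with a short Python sketch. The function name `candidate_starts` and its parameters are made up for this example, and treating TAG as a sense codon simply by leaving it out of the stop set is a simplification of how pyrrolysine-encoding organisms are actually handled.

```python
DEFAULT_STOPS = frozenset({"TAA", "TAG", "TGA"})


def candidate_starts(seq, frame=0, stop_codons=DEFAULT_STOPS):
    """List every in-frame ATG that could start each ORF in one reading frame.

    Returns a list of (start_positions, end) pairs, where end is the index
    just past the terminating stop codon. More than one start position for
    the same end means the true start codon is ambiguous.
    """
    results = []
    pending = []  # in-frame ATGs seen since the last stop codon
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon == "ATG":
            pending.append(i)
        elif codon in stop_codons:
            if pending:
                results.append((pending, i + 3))
            pending = []
    return results


if __name__ == "__main__":
    gene = "ATGGCCATGAAATAGGGGTAA"
    # Standard stops: two possible starts, gene ends at the TAG.
    print(candidate_starts(gene))
    # If TAG is read as an amino acid, the same gene runs on to the TAA.
    print(candidate_starts(gene, stop_codons=frozenset({"TAA", "TGA"})))
```

With the standard stop set the example sequence is cut off at the TAG, which is exactly the premature termination described above; dropping TAG from the stop set lets the same ORF extend to the TAA. In both cases two ATGs compete as the start codon.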
True, but the annotation still needs manual proofreading. If you manage to code
an application that speeds up the process, something like a smarter version of
Apollo (a nice compact Java app that you can put on a USB stick and use on the
fly: http://apollo.berkeleybop.org/current/index.html), you might not have a top
seller, but you would have the eternal gratitude of geneticists worldwide.
A huge problem is the exon/intron structure. IIRC, the smallest exons in
vertebrates (coding for 3 AA or even just 1 AA) are smaller than their flanking
splice signals. Coupled with multiple splice variants (the only figure I can
remember is about 150 splice variants of one gene in _C. elegans_, but that was
some years ago), this is really an obstacle that only human proofreading can
tackle. Good
proofreading software will at least fairly confidently figure out all the
possible splice sites. But then you have protein splicing, and there it becomes
If you only need the genomic sequence, things become A LOT easier. There, the
biggest problem is faulty sequencing. Catching the errors can be sped up by
sequencing several close relatives at the same time, because erroneous sequencer
reads will stand out in an alignment. Of course, multiple close relatives might
not be available; in that case, parallel runs of the same sequence from the same
organism (from multiple specimens, if you have the time and money) will do it.
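The "errors stand out in an alignment" idea can be sketched in Python as a column-by-column majority vote. This assumes the reads have already been aligned to equal length by some other tool; the function name `flag_discrepancies` and the input format (a dict of name to aligned sequence) are invented for this example.

```python
from collections import Counter


def flag_discrepancies(aligned):
    """Flag alignment columns where some sequences disagree with the majority.

    aligned: dict mapping a run or specimen name to its aligned sequence.
    Returns a list of (position, majority_base, {name: base}) tuples, one per
    column containing at least one disagreeing sequence. Ties are broken by
    the order in which bases are first seen in the column.
    """
    lengths = {len(s) for s in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must be aligned to the same length")
    length = lengths.pop()

    flagged = []
    for pos in range(length):
        column = {name: seq[pos] for name, seq in aligned.items()}
        majority, _count = Counter(column.values()).most_common(1)[0]
        outliers = {name: base for name, base in column.items() if base != majority}
        if outliers:
            flagged.append((pos, majority, outliers))
    return flagged


if __name__ == "__main__":
    runs = {"run1": "ACGTAC", "run2": "ACGTAC", "run3": "ACCTAC"}
    print(flag_discrepancies(runs))
```

A flagged position is only a candidate error. With parallel runs of the same organism, a lone outlier is most likely a sequencer mistake; with close relatives it may just as well be genuine variation, so the flagged columns still need a second look.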