THE CHALLENGE OF PROTEIN MODELING
There are a myriad of steps between locating a gene locus and realizing a
three-dimensional model of the protein that it encodes.
STEP 1: Location of Transcription Start/Stop
A proper analysis to locate a genetic locus will usually have already
pinpointed at least the approximate sites of the transcriptional start and
stop. Such approximate sites are usually sufficient for determining protein
structure. It is the start and stop codons for translation that must be
determined with accuracy.
STEP 2: Location of Translation Start/Stop
The first translated codon in a messenger RNA sequence is almost always AUG. While this
reduces the number of candidate codons, the reading frame of the sequence must
also be taken into consideration.
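
As a rough illustration of the point above, the sketch below lists every AUG in an mRNA string and groups the positions by reading frame (position modulo 3). The function name aug_candidates and the example sequence are invented for illustration; this is only a minimal sketch, not a gene-finding method.

```python
def aug_candidates(mrna):
    """Return {frame: [positions]} for every AUG found in an mRNA string.

    Frame is the 0-based position modulo 3, so two AUGs in the same
    frame could belong to the same open reading frame.
    """
    mrna = mrna.upper().replace("T", "U")  # accept DNA-style input too
    frames = {0: [], 1: [], 2: []}
    pos = mrna.find("AUG")
    while pos != -1:
        frames[pos % 3].append(pos)
        pos = mrna.find("AUG", pos + 1)
    return frames


# Hypothetical example sequence, chosen only to show several candidates.
example = "GGAUGCCAUGGAUGA"
for frame, positions in aug_candidates(example).items():
    print(f"frame {frame}: AUG at {positions}")
```

Each AUG is only a candidate; which one actually starts translation still depends on the reading frame analysis described next.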
There are six reading frames possible for a given DNA sequence, three on each
strand, which must be considered, unless further information is available.
Since genes are usually transcribed away from their promoters, the definitive
location of this element can reduce the number of possible frames to three.
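
The six reading frames can be enumerated mechanically. The sketch below splits a DNA string into codons for the three offsets on the given strand and the three offsets on its reverse complement. The frame labels (+1 to +3, -1 to -3) and the function names are conventions assumed here for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def codons(seq, offset):
    """Split seq into complete triplets, starting at the given offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frames(dna):
    """Return a dict mapping a frame label to its list of codons."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = codons(dna, offset)
        frames[f"-{offset + 1}"] = codons(rc, offset)
    return frames


# Hypothetical example sequence.
for label, triplets in sorted(six_frames("ATGAAATAG").items()):
    print(label, " ".join(triplets))
```

If the promoter location fixes the coding strand, only the three frames on that strand need to be examined.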
There is no strong consensus sequence surrounding translation start codons
that holds across different species. Hence locating the appropriate start
codon will involve finding a frame in which there are no apparent abrupt stop
codons.
Knowledge of a protein's predicted molecular mass can assist this analysis.
Incorrect reading frames usually predict relatively short peptide sequences.
Therefore, it might seem deceptively simple to ascertain the correct frame. In
bacteria, such is frequently the case. However, eukaryotes add a new obstacle
to this process: INTRONS!
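
For an intron-free sequence, the frame comparison described above can be sketched as a search for the longest run of codons from an ATG to the first in-frame stop. The code below assumes the standard stop codons TAA, TAG, and TGA, and estimates mass with the common rule of thumb of roughly 110 daltons per amino acid residue. That average and the function names are assumptions for illustration, not part of the original analysis.

```python
STOPS = {"TAA", "TAG", "TGA"}
AVG_RESIDUE_DA = 110  # rough average residue mass; an assumed approximation


def longest_orf(dna, offset):
    """Return (start, end) of the longest ATG-to-stop stretch in one frame.

    Indices are 0-based, and end is exclusive and includes the stop codon.
    Returns None when the frame holds no complete open reading frame.
    """
    dna = dna.upper()
    best = None
    start = None
    for i in range(offset, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOPS:
            if best is None or (i + 3 - start) > (best[1] - best[0]):
                best = (start, i + 3)
            start = None
    return best


def approx_mass_kda(orf):
    """Approximate protein mass in kDa for an ORF given as (start, end)."""
    residues = (orf[1] - orf[0]) // 3 - 1  # the stop codon encodes no residue
    return residues * AVG_RESIDUE_DA / 1000


# Hypothetical example sequence.
dna = "CCATGAAACCCGGGTTTTAAGG"
for offset in range(3):
    orf = longest_orf(dna, offset)
    if orf is None:
        print(f"frame +{offset + 1}: no complete ORF")
    else:
        print(f"frame +{offset + 1}: ORF {orf}, about {approx_mass_kda(orf)} kDa")
```

Comparing the estimated mass of each candidate with the protein's known or predicted mass is one way to favour a frame. The sketch does not handle introns, so it applies to bacterial sequences or cDNA only.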
STEP 3: Detection of Intron/Exon Splice Sites
In eukaryotes, the reading frame is discontinuous at the level of the DNA
because of the presence of introns. Unless one is working with a cDNA sequence
in analysis, these introns must be spliced out and the exons joined to give the
sequence that actually codes for the protein.
Intron/exon splice sites can be predicted on the basis of their common
features. Most introns begin with the nucleotides GT and end with the nucleotides AG.
There is a branch sequence near the downstream end of each intron involved in
the splicing event. There is a moderate consensus around this branch site.
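
The GT start feature can be turned into a naive candidate search, sketched below, that pairs every GT with every downstream AG (the dinucleotide on which most introns end) and keeps pairs of at least a minimum length. The min_len default is an arbitrary illustrative value, the function name is invented, and the branch site consensus is not checked, so this heavily over-predicts compared with a real splice site predictor.

```python
def candidate_introns(dna, min_len=20):
    """Return (start, end) pairs where dna[start:end] begins GT and ends AG.

    Indices are 0-based with an exclusive end. min_len is an arbitrary
    illustrative threshold, not a biological constant.
    """
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i + 2 for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return [(d, a) for d in donors for a in acceptors if a - d >= min_len]


# Hypothetical example sequence, with a short threshold so a hit is found.
dna = "AAGTCCCCAGTT"
for start, end in candidate_introns(dna, min_len=6):
    print(start, end, dna[start:end])
```

A more realistic filter would also require a plausible branch sequence near the downstream end of each candidate.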
STEP 4: Prediction of 3-D Structure
With the completed primary amino acid sequence in hand, the challenge of
modeling the three-dimensional structure of the protein awaits. This process
uses a wide range of data and CPU-intensive computer analysis.
Most often, one is only able to obtain a rough model of the protein, and
several conformations of the protein may exist that are equally probable.
The best analyses will utilize data from all the following sources:
Alignment to known homologues whose conformation is more secure.
X-ray Diffraction Data:
Most ideal when some data is available on the protein of interest. However,
diffraction data from homologous proteins is also very valuable.
Physical Forces/Energy States:
Biophysical data and analyses of an amino acid sequence can be used to predict
how it will fold in space.
All of this information is used to determine the most probable locations of the
atoms of the protein in space and bond angles. Graphical programs can then use
this data to depict a three-dimensional model of the protein on the
two-dimensional computer screen.
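
To make the last step concrete, the sketch below computes a bond angle from three atom positions and flattens a 3-D coordinate onto a 2-D plane by simply dropping the z value (an orthographic projection). The coordinates are invented, and real graphical programs use far more elaborate rendering. This only illustrates the geometry involved.

```python
import math


def bond_angle(a, b, c):
    """Return the angle in degrees at atom b formed by atoms a, b, and c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    cos_theta = max(-1.0, min(1.0, dot / (n1 * n2)))  # guard against rounding
    return math.degrees(math.acos(cos_theta))


def project_xy(atom):
    """Orthographic projection: keep x and y, discard depth (z)."""
    x, y, _z = atom
    return (x, y)


# Hypothetical atom coordinates, in arbitrary units.
atoms = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.5)]
print("angle at middle atom:", round(bond_angle(*atoms), 1))
print("screen positions:", [project_xy(p) for p in atoms])
```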