THE CREATION OF SEQUENCE DATABASES
Most biological databases consist of long strings of nucleotides (guanine,
adenine, thymine, cytosine and uracil) and/or amino acids (threonine, serine,
glycine, etc.). Each sequence of nucleotides or amino acids represents a
particular gene or protein (or section thereof), respectively. Sequences are
represented in shorthand, using single-letter designations. This reduces the
space needed to store the information and speeds up processing during
analysis.
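As a rough illustration of the shorthand described above, the sketch below collapses three-letter residue names into one-letter codes and compares the lengths of the two forms. The mapping covers only a handful of amino acids, and the function name `to_shorthand` and the example peptide are hypothetical, chosen purely for illustration.

```python
# One-letter codes for a few amino acids (standard assignments; list is partial).
ONE_LETTER = {"Thr": "T", "Ser": "S", "Gly": "G", "Ala": "A", "Met": "M"}


def to_shorthand(residues):
    """Collapse a list of three-letter residue names into a one-letter string."""
    return "".join(ONE_LETTER[name] for name in residues)


# Hypothetical example peptide, not taken from any real protein record.
peptide = ["Met", "Thr", "Ser", "Gly", "Ala"]
shorthand = to_shorthand(peptide)
spelled_out = "-".join(peptide)

# The shorthand form needs one character per residue.
print(shorthand, len(shorthand))      # MTSGA 5
print(spelled_out, len(spelled_out))  # Met-Thr-Ser-Gly-Ala 19
```

The saving is small for five residues, but it scales with every residue stored in a database entry.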
While most biological databases contain nucleotide and protein sequence
information, there are also databases that include taxonomic information, such
as the structural and biochemical characteristics of organisms. The power and
ease of use of sequence information have, however, made it the method of choice
in modern analysis.
In the last three decades, contributions from the fields of biology and
chemistry have facilitated an increase in the speed of sequencing genes and
proteins. The advent of cloning technology allowed foreign DNA sequences to be
easily introduced into bacteria. In this way, rapid mass production of
particular DNA sequences, a necessary prelude to sequence determination, became
possible. Oligonucleotide synthesis provided researchers with the ability to
construct short fragments of DNA with sequences of their own choosing. These
oligonucleotides could then be used in
probing vast libraries of DNA to extract genes containing that
sequence. Alternatively, these DNA fragments could also be used in
polymerase chain reactions to amplify existing DNA sequences or to
modify these sequences. With these techniques in place, progress in biological
research accelerated sharply.
For researchers to benefit from all this information, however, two additional
things were required:
- Ready access to the collected pool of sequence information, and
- Ways to extract from this pool only those sequences of interest to a given
researcher.
Simply collecting, by hand, all the sequence information relevant to a given
project from published journal articles quickly became a formidable task. After
collection, the organization and analysis of the data still remained. It could
take weeks to months for a researcher to search sequences by hand in order to
find related genes or proteins.
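The second requirement, pulling out only the sequences of interest, can be sketched in a few lines once the pool is held in a computer. The example below is a minimal sketch that assumes the pool is a small in-memory dictionary of made-up records; the function name `find_with_motif`, the record IDs, and the sequences are all hypothetical. Real databases rely on indexed searches rather than a linear scan like this one.

```python
def find_with_motif(records, motif):
    """Return the IDs of records whose sequence contains the motif."""
    motif = motif.upper()
    return [seq_id for seq_id, seq in records.items() if motif in seq.upper()]


# Hypothetical pool of nucleotide sequences (invented for illustration).
records = {
    "seq1": "ATGGCGTACGTTAG",
    "seq2": "ATGCCCGGGTTTAA",
    "seq3": "GGGTACGTAACCTT",
}

# Keep only the entries that contain the short sequence of interest.
hits = find_with_motif(records, "TACGT")
print(hits)  # ['seq1', 'seq3']
```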
Computer technology has provided the obvious solution to this problem. Not only
can computers be used to store and organize sequence information into
databases, but they can also be used to analyze sequence data rapidly. The
evolution of computing power and storage capacity has, so far, been able to
outpace the increase in sequence information being created. Theoretical
scientists have derived new and sophisticated algorithms that allow sequences
to be readily compared using probability theory. These comparisons become the
basis for determining gene function, working out phylogenetic relationships
and building protein models.
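To give a feel for what comparing two sequences involves, the sketch below computes a global alignment score with the classic Needleman-Wunsch dynamic-programming recurrence. It is offered only as a simple illustration, not as the specific probabilistic methods referred to above: the scoring values (+1 for a match, -1 for a mismatch or gap) are arbitrary assumptions, and the function name `global_alignment_score` is hypothetical.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score for the best global alignment of a and b."""
    # prev[j] is the best score for aligning the prefix of a seen so far with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b costs i gaps
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Best of: align ca with cb, gap in b, or gap in a.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))  # 7 (identical sequences)
print(global_alignment_score("GATTACA", "GATCACA"))  # 5 (one mismatch)
```

A higher score suggests more closely related sequences under the chosen scoring scheme. Only one row of the table is kept at a time, which is enough when the score, rather than the alignment itself, is wanted.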
The physical linking of a vast array of computers in the 1970s provided a few
biologists with ready access to the expanding pool of sequence information.
This web of connections, now known as the Internet, has evolved and expanded so
that nearly everyone has access to this information and the tools necessary to
analyze it.