This BioCompanion copy is a demo version .
This section describes the methods available to search a single sequence in the databases by
similarity. After the introduction, both heuristic and rigorous searching methods are detailed.
Searching strategies are illustrated with programs of the GCG package. Data must be formatted
properly - see previous chapters for details.
The procedure to identify a single sequence in a database by similarity is very much different
from a keyword search. Keywords are expected to match exactly. This type of
keyword searching has been described earlier
. Pattern searching programs will match a pattern exactly at single, defined positions in a
sequence. The focus in this chapter is on sequence searching programs, which
are expected to report as a result of searching a query sequence The major problem of sequence searching, therefore, is to find a reasonable definition for
the similarity of sequences. Programs doing "sequence searching" in general assume that the query
sequence comparison with each entry of a sequence database results in a list of database entries
which are similar to the query sequence. The receipt of this program type reads
as follows: The use of so-called rigorous searching algorithms such as the Smith and
Waterman type of search is already known from the 'bestfit'
program , doing an accurate comparison. These programs execute the central searching loop
in a specifically tuned fashion, and run best on special hardware such as the MasPar
, Fast DNA Finder and Bioccelerator systems.
The use of heuristic searching algorithms which reduce the compute time needed
in the central searching loop is the most widely used option. Approximations in the comparison
computation are based on the assumption that sequence similarity can be expressed as tightly
accumulated regions of identical oligomers or short peptides. Either computed on-the-fly ( 'fasta'
type of program) or in a pre-processing step ( 'blast' program type),
the heuristic searching programs will run on smaller, general-purpose machines and even PCs might
occasionally be used. If used on large compute servers, the heuristic search programs achieve
impressive performance. However, keep in mind that high-throughput sequencing makes it necessary
to run thousands of searches in a competitive cost/performance set-up, which suggests to avoid
the use of PC's for sequence searching in these projects.
[next page] , or [overview] , or [table of contents] Tools for Sequence Searching
start program
initialise top-scoring list
for each entry in database \
{ |
compare database sequence with query sequence |
evaluate similarity as "score" |search loop
compare "score" to top-scoring list |across all
if "score" better than lowest entry |entries in
{ |the database
place entry in top-scoring list |
} |
} /
normalise top-scoring list
for each entry in top-scoring list \ evaluation
{ |of result
determine statistics |using all
align search sequence and database sequence |entries in
print out result |top-scoring
} / list
Depending on the algorithm or implementation, some of the steps might be missing or integrated
in other steps. Sophisticated sequence searching might be based on two approaches as
described below.
Sensitive searching, using special computers
Extremely fast searching, using approximations
JAMF source file: search.jam
Next file in HTML:
'Sequence Searching with Heuristic Methods'