Sequence Searching

This BioCompanion copy is a demo version . This section describes the methods available to search a single sequence in the databases by similarity. After the introduction, both heuristic and rigorous searching methods are detailed. Searching strategies are illustrated with programs of the GCG package. Data must be formatted properly - see previous chapters for details.


Tools for Sequence Searching

The procedure to identify a single sequence in a database by similarity is very much different from a keyword search. Keywords are expected to match exactly. This type of keyword searching has been described earlier . Pattern searching programs will match a pattern exactly at single, defined positions in a sequence. The focus in this chapter is on sequence searching programs, which are expected to report as a result of searching a query sequence

The major problem of sequence searching, therefore, is to find a reasonable definition for the similarity of sequences. Programs doing "sequence searching" in general assume that the query sequence comparison with each entry of a sequence database results in a list of database entries which are similar to the query sequence. The receipt of this program type reads as follows:

 
  
 start program   
 initialise top-scoring list  
 for each entry in database                               \  
    {                                                      |  
     compare database sequence with query sequence         |  
     evaluate similarity as "score"                        |search loop   
     compare "score" to top-scoring list                   |across all   
     if "score" better than lowest entry                   |entries in   
        {                                                  |the database  
         place entry in top-scoring list                   |  
        }                                                  |  
    }                                                     /  
 normalise top-scoring list  
 for each entry in top-scoring list                       \ evaluation  
    {                                                      |of result   
     determine statistics                                  |using all  
     align search sequence and database sequence           |entries in   
     print out result                                      |top-scoring  
    }                                                     / list   
  
Depending on the algorithm or implementation, some of the steps might be missing or integrated in other steps. Sophisticated sequence searching might be based on two approaches as described below.

Sensitive searching, using special computers

The use of so-called rigorous searching algorithms such as the Smith and Waterman type of search is already known from the 'bestfit' program , doing an accurate comparison. These programs execute the central searching loop in a specifically tuned fashion, and run best on special hardware such as the MasPar , Fast DNA Finder and Bioccelerator systems.

Extremely fast searching, using approximations

The use of heuristic searching algorithms which reduce the compute time needed in the central searching loop is the most widely used option. Approximations in the comparison computation are based on the assumption that sequence similarity can be expressed as tightly accumulated regions of identical oligomers or short peptides. Either computed on-the-fly ( 'fasta' type of program) or in a pre-processing step ( 'blast' program type), the heuristic searching programs will run on smaller, general-purpose machines and even PCs might occasionally be used. If used on large compute servers, the heuristic search programs achieve impressive performance. However, keep in mind that high-throughput sequencing makes it necessary to run thousands of searches in a competitive cost/performance set-up, which suggests to avoid the use of PC's for sequence searching in these projects.


JAMF source file: search.jam
Next file in HTML: 'Sequence Searching with Heuristic Methods'

[next page] , or [overview] , or [table of contents]