Tprofilesearch

Go back to top

TPROFILESEARCH


FUNCTION

TProfileSearch uses a profile (representing a group of aligned protein sequences) as a probe to search the nucleotide database for new sequences with possible protein products having some similarity to the group. The profile is created with the program ProfileMake.


DESCRIPTION

There is an essay on profile analysis in the Multiple Sequence Analysis chapter of the Program Manual.

Using the method of Gribskov, et al (In Methods in Enzymology, 183; 146-159 (1990)), TProfileSearch accepts a profile from ProfileMake and uses it to search a database (or any set of sequences you specify) for sequences that are similar to the aligned probe sequences used to create the profile. The algorithm calculates the score (quality) of the optimal alignment between the profile and each sequence in the database and creates a list of all of the sequences in the database with an alignment score above some threshold. The results of TProfileSearch are corrected for compositional effects of the sequence and for systematic effects of the sequence length on the score. The output list can be displayed as optimal alignments with TProfileSegments.

The gap and gap length weights specified for TProfileSearch are maximum values. The actual position-specific gap penalties at any position are determined by multiplying the gap creation penalty by the percent value in the second to the last column of the profile, and the gap extension penalty by the percent value in the last column of the profile.

TProfileSearch does a lot of computing so it will probably want to run it in the batch queue (see the CONSIDERATIONS topic below).


AUTHOR

This GCG program was modified by Peter Rice (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).


EXAMPLE

Here is a session using TProfileSearch to search through the entire DNA sequence database with a profile generated from a group of aligned helix turn helix protein sequences.

  
  
  % tprofilesearch
  
    TPROFILESEARCH with what query profile ?  hth.prf
  
    "hth.prf" is a profile of length: 738
  
    Search for query in what sequences(s) (* SwissProt:* *) ? GenEMBL:ec*
  
    What is the gap weight (* 4.5 *) ?
  
    What is the gap length weight (* 0.05 *) ?
  
    What should I call the output file (* hth.pfs *) ?
  
    Normalization not requested
  
            Sequences: 3012
  
   CPU time (seconds): 644.15 sec.
          Output file: hth.pfs
  
  %
  


OUTPUT

Here is some of the output file:

  
  
  (Peptide) TPROFILESEARCH of: hth.prf;1 Length: 39 to: GenEMBL:Ec*
  
  	 Scores are not corrected for composition effects
  
  	 Both DNA strand directions used
  
  	 Six frame translation used
  
       Gap Weight: 4.50
Gap Length Weight: 0.05
  	 Sequences Examined: 3012
  	 CPU time (seconds): 644
  *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *
  Profile information:
  (Peptide) PROFILEMAKE v4.40 of: egendocdata:hth.msf{*}  Length: 39
    Sequences: 9  MaxScore: 17.37  September 6, 1993  11:39
                       Gap: 1.00              Len: 1.00
                  GapRatio: 0.33         LenRatio: 0.10
        hth.msf{Galr_Ecoli}  From: 1         To: 39        Weight: 1.00
        hth.msf{Gals_Ecoli}  From: 1         To: 39        Weight: 1.00 . . .
  *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *
       Sequence  Strd ZScore   Orig Length ! Documentation  ..
  EMPRO:ECGALS5         +    0.00  20.72   1989 ! X62529 E.coli galS ...
  EMNEW:ECGALS5         +    0.00  20.72   1989 ! X62529 E.coli galS ...
  EMPRO:ECGALLYS        +    0.00  20.64   4293 ! J01614 E.coli galR ...
  
  //////////////////////////////////////////////////////////////////
  
  EMPRO:ECRMS2AA        -    0.00   4.82     60 ! M63619 E.coli C ge ...
  EMPRO:ECHLETX         -    0.00   4.69     66 ! M17101 E.coli heat ...
  EMPRO:EC5ERNAB        +    0.00   4.66     54 ! M16641 E.coli 5'-t ...
  
  

The output file from TProfileSearch is an ordered list of the sequences with the highest alignment scores when compared to the profile. Unless the normalization procedure is disabled or fails, the file is ordered according to the Z scores (see the NORMALIZATION OF SCORES topic below).

The documentation section of the file (before the sequence listing) is divided by rows of asterisks into three parts. The first section describes the conditions used in running TProfileSearch the number of sequences examined, and the amount of CPU time used. The second section reports the documentary information from the profile. The third section of the documentation records the process of the normalization (see the NORMALIZATION OF SCORES topic below).


RELATED PROGRAMS

PileUp creates a multiple sequence alignment from a group of related sequences. LineUp is a multiple sequence editor used to create multiple sequence alignments. Pretty displays multiple sequence alignments.

ProfileMake makes a profile from a multiple sequence alignment. ProfileSearch uses the profile to search a database for sequences with similarity to the group of aligned sequences. ProfileSegments displays optimal alignments between each sequence in the ProfileSearch output list and the group of aligned sequences (represented by the profile consensus). ProfileGap makes optimal alignments between a sequence and a group of aligned sequences represented as a profile. ProfileScan finds structural and sequence motifs in protein sequences, using predetermined parameters to determine significance.


RESTRICTIONS

We have little experience using nucleotide sequences with profile analysis.

The sequence databases have grown considerably since the GCG versions of the profile programs were released. In particular, SwissProt is now larger than the maximum size. The limits for TProfileSearch are increased by comparison with those in GCG release 7.2, so the program can also be preferred for searches of protein profiles against protein databases.

A second enhancement in TProfileSearch is the use of a minimum score to allow many more sequences to be searched. If a search stops before the last sequence in the database is reached, set a higher minimum score and repeat search, perhaps just specifying the remaining database divisions.

Because of memory constraints, searches are currently limited to a maximum of 80,000 scoring sequences. Nucleic acid searches are limited to a maximum of 80,000 scoring sequences (forward or reverse strands). The default minimum score is 0.0, therefore you cannot search the entire nucleic acid sequence database at one time without first setting a minimum search score!. If you attempt to search more than 40,000 scoring nucleotide sequences, TProfileSearch reports the results of first 40,000 sequences scoring above the minimum, as well as the name of the last sequence searched. Because of memory constraints, only the first 350,000 sequence symbols in each sequence are searched (currently the maximum sequence length in GCG databases anyway).


ALGORITHM

There is an essay on profile analysis in the Multiple Sequence Analysis chapter of the Program Manual.


NORMALIZATION OF SCORES

The scores for comparison of sequences and a profile are systematically correlated with the lengths of the sequences. TProfileSearch corrects the observed alignment scores for these systematic affects of sequence length to give normalized scores.

The relationship between the length of a sequence (SeqLen) and the observed score for comparison of the profile and sequence (Score) is usually very close to

  
  Score = C * (1 - e(A*SeqLen+B))
  

where A, B, and C are some constants. These constants can be empirically determined by fitting the alignment scores of the profile and the database sequences to the above equation. The scores for each sequence are sorted by the sequence length and pooled together in groups of 50 sequences (20 sequences per pool if there are less than 1,500 total scores to normalize). For each pool, the mean length, score, and the standard deviation of the scores are determined.

The best fit of these values to the above equation is determined by non-linear fitting by the Levenberg-Marquardt method. The fit is weighted by the standard deviations of the pools; pools that contain values that represent true similarity between the sequence and profile have large standard deviations and, therefore, make relatively small contributions to the fit.

The assumption of the normalization is that the alignment scores being normalized represent sequences unrelated to the profile. Because there are usually some sequences that are related to the profile, a second pass of normalization is made. In the second pass, the average score of each length pool is compared to the calculated score for a sequence of the same length. Pools whose average scores differ from their expected scores by more than 5.0 times the average standard deviation of all pools are eliminated from consideration. This effectively eliminates most pools that contain many sequences related to the profile, since these pools have average scores with large deviations from the predicted scores. The curve fitting procedure is now repeated and new values for A, B, and C calculated. Each normalized score (Score(N)) is then calculated

  
  Score(N) = Score(orig) / (C * (1 - e(A*SeqLen+B)))
  

The average (Score(N)(avg)) and standard deviation (SD(N)) of all normalized scores are then calculated. Normalized scores whose original scores differ from the calculated score for a sequence of the same length by more than 5.0 times the average standard deviation of all length pools are omitted from this calculation. This ensures that sequences similar to the profile will not affect the calculation of the mean and standard deviation.

The Z scores are the differences in standard deviation units between each normalized comparison score and the mean normalized comparison score for sequences unrelated to the profile. Therefore, a Z score of 5.0 means that a comparison score is significant at the 5.0 sigma level. The Z score is calculated as

  
  ZScore = (Score(N) - Score(N)(avg)) / SD(N)
  

The Z score has a mean of 0.0 and a standard deviation of 1.0.

The third section of the documentation in the TProfileSearch output file records the process of the normalization described above. First, the number of length pools used in the normalization, and the number of pools rejected because of high standard deviation are reported. The empirically derived equation of the curve is then presented, along with the correlation coefficient for the agreement between the calculated curve and the observed results. This value is typically about 0.95. If it is much lower (e.g., less than 0.90), the reported Z scores are not accurate. Finally, the equation used for the calculation of Z scores is reported, along with the number of scores omitted from the calculation of the normalized mean and standard deviation.

Optionally, a file recording the observed and calculated scores for the length pools used in the normalization procedure may be produced with the command line qualifier -FITfile. For each pool used in the calculation, the file contains the average and standard deviation of the lengths and observed scores, and the calculated score for a sequence of the average length of the pool.

Failure of Normalization

The normalization procedure described above appears to be robust, but it must be remembered that it is based on an empirically derived description of the distribution of scores. In particular, this normalization may not be appropriate for very long profiles with large numbers of rows with low gap penalty weights. If the first pass of the curve fitting procedure is unsuccessful, the Z scores are all reported as zero. If the first pass is successful, but the second pass fails, Z scores are calculated and the entries in the output file are sorted based on the Z scores; however, a warning message is produced in the documentation of the output. If fewer than 400 entries are saved in the output, normalization will not be attempted and all Z scores will be reported as zero.


CONSIDERATIONS

TProfileSearch requires the computer to make a calculation that is proportional to the product of the profile length times the length of the database. You should probably run TProfileSearch in the batch queue. (See the Batch Queue section of the User's Guide for more information.) You can Fetch the shell script tprofilesearch.csh that runs the example in the session above as a point of departure. (Remember to use % chmod to change the file's permission to be executable.)

The example search with a profile of length 39 tested against 3012 entries in the DNA sequence database used several minutes of CPU time (depending on the processing speed of the CPU). Searches of the nucleotide sequence database with profiles prepared from nucleotide or protein sequence alignments take a longer time than searches of the protein databases because of the larger size of the database, and the necessity to search both forward and reverse strands of each database sequence.

ProfileSearch- ProfileSegments finds only the best fit of the profile to any sequence. This obscures other fits in the same sequence, especially in nucleic acids.


SUGGESTIONS

Because TProfileSearch can search a maximum of 80,000 scoring nucleotide sequences at one time (see the RESTRICTIONS topic above), GenEMBL must be divided into smaller parts in order to search the entire nucleic acid sequence database unless a minimum score is set. A search of just a few sequences will give a good guide to a suitable score to exclude a large number of false hits without significant risk.

An alternative use of the nucleotide search is to compare a nucleotide profile to a new nucleotide entries only, using the database name defined at your site. (See the Specifying Sequences section of the User's Guide for help in naming GenEMBL subsections in a file of sequence names.)


INPUT FILE

Look at the entry for ProfileMake for a description and an example of what profile files look like.


COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

  
  
  Minimum syntax: % tprofilesearch [-INfile1=]pileup.prf -Default
  
  Prompted parameters:
  
  [-INfile2=]SW:*          the search set
  -GAPweight=4.50          maximum position-specific gap weight
  -LENgthweight=0.05       maximum position-specific gap length weight
  [-OUTfile=]pileup.pfs    output file name
  
  Local Data Files: None
  
  Optional Parameters:
  
  -LIStsize=100           sets a limit to the size of the output list
                          (60,000 is the default)
  -MINList=2.5            lowest Z score to report in output list.
  -MINSeq=50              minimum length sequence to examine in the search
  -MINSCore=0.0           minimum score (or average score) to include
  -[NO]SIXframe           six-frame translation of a DNA database
  -[NO]BOTHstrands        search both strands in a DNA database
  -NORmalize              turns on normalization of comparison scores
  -AVErage                adjusts quality scores for sequence composition
  -NOSEArch               normalizes a pre-existing file of search scores
  -FITfile[=pileup.fit]   output file with curve fitting information for
                          normalized scores
  -CPUlimit=1000          limits the search to 1,000 seconds (default
                          is 86,400)
  -SINce=6.90             limits search to sequences dated on or after
                          June 1990 (does not work for PIR databases)
  -NOMONitor              suppresses the screen trace for the search set
                          sequences
  


LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

The translation of codons to amino acids, the identification of potential start codons and stop codons, and the mappings of one-letter to three-letter amino acid codes are all defined in a translation table in the file translate.txt. If the standard genetic code does not apply to your sequence, you can provide a modified version of this file in your working directory or name an alternative file on the command line with an expression like -TRANSlate= mycode.txt. Translation tables are discussed in more detail in the Data Files manual.


OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

-MINSCore=0.0

defines a minimum score for consideration. Setting -MINSCore allows searches of far larger databases, but upsets the assumptions of normalization and averaging because the sample scores are no longer complete.

-SIXframe

specifies a six frame translation of a nucleotide sequence. -NOSIXframe limits the search to the three forward frames only.

-BOTHstrands

specifies a search of both strands of nucleotide sequence database entries. -NOBOTHstrands specifies the forward strand only.

-LIStsize=100

sets a limit to the number of entries in the output list. The default is the maximum number of sequences that can be searched (currently 80,000 protein or 40,000 nucleotide).

-MINlist=2.5

sets the lowest Z score for an entry to be reported in the output list. If Z scores aren't calculated, then all entries (up to the maximum number determined by the -LIStsize qualifier) are reported. The default reports an entry if its score is 2.5 standard deviations above the expected score for a sequence of the same length.

-MINSeq=50

sets the minimum length for a sequence to be searched.

-NORmalize

turns on the normalization of alignment scores. The listed sequences are sorted by their original comparison scores, rather than the scores adjusted for systematic effects of sequence length. In the output file, all Z scores are 0. Users at EMBL prefer to turn off normalization for all searches, and the normalization is not suitable when the -MINSCore option is used.

-AVErage

turns on the adjustment of scores for sequence composition. With & -AVErage) , a score due to the similarity in composition between the profile and sequence of interest is subtracted from the original alignment score. Users at EMBL prefer to run searches without averaging.

-NOSEArch

turns off the database search and normalizes a previously existing file of TProfileSearch scores, instead. The normalization occurs only if the existing file contains over 400 saved entries. This allows runs with the default options (no normalization) to be normalized later.

-FITfile=pretty.fit

writes an output file containing the observed and calculated scores for the length pools used in the normalization procedure.

-CPUlimit=1000

sets a limit in seconds beyond which the search is aborted. The limit defaults to 86,400 seconds (24 hours). ProfileSearch reports the results for the sequences searched before the time limit is exceeded.

-SINce=6.90

limits the search to sequences that have been entered into the database or modified since June 1990. As this is being written, only the EMBL, GenBank, and SWISS-PROT databases support this feature.

-MONitor=100

monitors this program's progress on your screen. Use this option to see this same monitor in the log file for a batch process. If the monitor is slowing down the program because your terminal is connected to a slow modem, suppress it with -NOMONitor.

The monitor is updated every time the program processes 100 sequences or files. You can use the optional parameter to set this monitoring interval to some other number.

-TRANSlate=filename.txt

Usually, translation is based on the translation table in a default or local data file called translate.txt. This option allows you to use a translation table in a different file. (See the Data Files manual for information about translation tables.)


REFERENCES

Gribskov, M., Luethy, R. and Eisenberg, D. (1990) "Profile Analysis. " Methods in Enzymology 183, 146-159.

Printed: April 22, 1996 15:56 (1162)