Fastacheck

FASTACHECK


FUNCTION

FastaCheck selects significant alignments from a (T)Fasta output file.


DESCRIPTION

FastaCheck applies the significance threshold of Sander and Schneider (1991; Proteins 9:56-68) to select significant hits from a FastA or TFastA output file produced with a protein search sequence. The test uses the percent identity that (T)FastA reports for the original alignment, even if another alignment would give a higher identity/length score.


AUTHOR

This program was written by Peter Rice (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).


EXAMPLE

Here is a sample session with FastaCheck. The input file is a search of gamma.pep against SwissProt:L* sequences with a word size of 1.

  
  
  % fastacheck
  
    FASTACHECK of what FASTA output file ?  ggamma.fasta
  
    What should I call the output file (* ggamma.check *) ?
  
    Hits checked: 23
    Accepted: 1
    Rejected: 22
  
  %
  


OUTPUT

The output from FastaCheck is a file containing only the alignments with significant scores. This example shows a search with a human globin sequence through the leghemoglobin sequences in SwissProt. With the threshold calculation used, only one of the leghemoglobin sequences shows a significant match, although the input file contains others in the 20-25% identity range.

  
  
  (Peptide) FASTA of: ggamma.pep  from: 1 to: 148  March 19, 1996 14:39
  
  ETRANSLATE of: gamma.seq check: 6474 from: 2179 to: 2270
  ETRANSLATE of: gamma.seq check: 6474 from: 2393 to: 2615
  ETRANSLATE of: gamma.seq check: 6474 from: 3502 to: 3630
  
     //////////////////////////////////////////
  
  (11) sw:lgb2_sesro  P14848 leghemoglobin 2. sesbania rostrata....  55    78    89
  ID   LGB2_SESRO     STANDARD;      PRT;   147 AA.
  AC   P14848;
  DT   01-APR-1990 (REL. 14, CREATED)
  DT   01-APR-1990 (REL. 14, LAST SEQUENCE UPDATE)
  DT   01-APR-1990 (REL. 14, LAST ANNOTATION UPDATE)
  DE   LEGHEMOGLOBIN 2. . . .
  
  SCORES     Init1: 55 Initn: 78 Opt: 89
        24.80% identity in 101 aa overlap
        Minimum identity: 24.8%
  
       20        30        40        50         60        70
  gamma. WGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAI-MGNPKVKAHGKKVLTSLGD
                                  |: |::::::  :||:::||::||:: : |
  lgb2_s SYEAFKQNLPGNSVLFYSFILEKAPAAKGMFSFLKDSDGVPQNNPSLQAHAEKVFGLVRD
          20        30        40        50        60        70
  
        80           90       100       110       120       130
  gamma. AIKHLDDLKGTF---AQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASW
    :  :| :   :    | |:::| :|  :|| :| :: ::|:::|    | ::: || ::|
  lgb2_s SAAQLRATGVVVLADASLGSVHVQKGVLDP-HFVVVKEALLKTLKEAAGATWSDEVSNAW
          80        90       100        110       120       130
  
          140
  gamma. QKMVTGVASALSSRYHX
    :   :|:::|::
  lgb2_s EVAYDGLSAAIKKAMS
          140
  


ALGORITHM

FastaCheck uses the significance threshold of Sander and Schneider (1991; Proteins 9:56-68) to select significant hits from a FastA or TFastA output file. The measure is based on the percent identity that FastA or TFastA reports for the alignment shown in the output, even if an alignment with a higher score could be found.

The significance is determined by a threshold value for percent identity for each alignment length. The formula used is:

  
  Threshold = 290.15 * L**-0.562
  

where L is the alignment length in residues, limited to the range 10 to 80. Below 10 residues there is no correlation, and above 80 residues a fixed limit of about 25% identity is used.

The formula was calculated using the default (T)FastA matrix and alignment. The FACTOR (290.15) and EXPONENT (-0.562) values can be changed with command line options (see below).
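
The threshold rule described above can be sketched in Python as follows. This is an illustrative sketch only, not the program's source: the function names are hypothetical, the keyword arguments simply mirror the documented command-line defaults, and the use of "greater than or equal" for acceptance is an assumption. It is not guaranteed to reproduce FastaCheck's output in every case.

```python
def identity_threshold(length, factor=290.15, exponent=-0.562,
                       min_len=10, max_len=80, max_pct=25.0):
    """Return the minimum percent identity for an alignment of `length`
    residues, or None if the alignment is too short to assess.

    Defaults mirror -FACTor, -EXPonent, -MINLen, -MAXLen and -MAXPct.
    """
    if length < min_len:
        # Below the minimum length there is no usable correlation.
        return None
    if length > max_len:
        # Beyond the maximum length a fixed identity limit applies.
        return max_pct
    # Sander and Schneider (1991): threshold = factor * L ** exponent
    return factor * length ** exponent


def is_significant(pct_identity, length, **params):
    """Assumed acceptance rule: identity at or above the threshold."""
    threshold = identity_threshold(length, **params)
    return threshold is not None and pct_identity >= threshold


if __name__ == "__main__":
    # Print the threshold at a few alignment lengths.
    for n in (5, 10, 40, 80, 150):
        print(n, identity_threshold(n))
```

With the default values the calculated threshold falls from roughly 80% identity at 10 residues to just under 25% at 80 residues, which is consistent with the fixed limit used for longer alignments.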


INPUT FILE

The input file for FastaCheck is a (T)FastA output file.
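
The two values the threshold test needs, percent identity and overlap length, appear in the (T)FastA output on lines such as "24.80% identity in 101 aa overlap" (see OUTPUT above). The following Python sketch shows one way such lines could be extracted. The regular expression and function name are assumptions based only on the sample line shown in this document, and other (T)FastA versions may format the line differently.

```python
import re

# Pattern modelled on the sample line "24.80% identity in 101 aa overlap".
IDENTITY_LINE = re.compile(
    r"(?P<pct>\d+(?:\.\d+)?)%\s+identity\s+in\s+(?P<overlap>\d+)\s+aa\s+overlap"
)


def parse_identity_lines(text):
    """Return a list of (percent_identity, overlap_length) tuples,
    one for each identity line found in `text`."""
    hits = []
    for match in IDENTITY_LINE.finditer(text):
        hits.append((float(match.group("pct")), int(match.group("overlap"))))
    return hits


sample = """\
SCORES     Init1: 55 Initn: 78 Opt: 89
      24.80% identity in 101 aa overlap
"""

if __name__ == "__main__":
    print(parse_identity_lines(sample))  # [(24.8, 101)]
```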


LOCAL DATA FILES

None.


COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

  
  
  Minimum syntax: % fastacheck [-INfile=]ggammacod.fasta -Default
  
  Prompted Parameters:
  [-OUTfile=]ggammacod.check      Output file
  
  Optional Parameters:
  
  -FACTor=290.15                  Factor for threshold calculation
  -EXPonent=-0.562                Exponent for threshold calculation
  -MINLen=10                      Minimum accepted alignment length
  -MAXLen=80                      Maximum length for calculated threshold
  -MAXPct=25.0                    Identity threshold for length above MAXLEN
  


OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

-FACTor=290.15

sets the factor used for the percent identity threshold calculation.

-EXPonent=-0.562

sets the exponent used for the percent identity threshold calculation.

-MINLen=10

sets the minimum accepted alignment length.

-MAXLen=80

sets the maximum alignment length for which the threshold is calculated from the formula.

-MAXPct=25.0

sets the threshold percent identity for all alignments longer than the maximum length (see -MAXLen above).


REFERENCES

Sander, C. and Schneider, R. (1991). Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 9, 56-68.

Printed: April 22, 1996 15:53 (1162)