FastaCheck selects significant alignments from a (T)FastA output file.
FastaCheck uses the significance threshold of Sander and Schneider (1991; Proteins 9:56-68) to select significant hits from a FastA or TFastA output file produced with a protein search sequence. The measure is based on the percent identity reported by (T)FastA for the original alignment, even if another alignment with a higher identity/length score could be found.
This program was written by Peter Rice (E-mail: pmr@sanger.ac.uk; Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).
All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).
Here is a sample session with FastaCheck. The input file is a search of gamma.pep against SwissProt:L* sequences with a word size of 1.
% fastacheck

 FASTACHECK of what FASTA output file ?  ggamma.fasta

 What should I call the output file (* ggamma.check *) ?

 Hits checked: 23 Accepted: 1 Rejected: 22

%
The output from FastaCheck is a file containing only the alignments with significant scores. This example shows a search with a human globin sequence through the leghemoglobin sequences in SwissProt. At the specified threshold, only one of the leghemoglobin sequences shows a significant match, although the input file contains others in the 20-25% identity range.
(Peptide) FASTA of: ggamma.pep from: 1 to: 148 March 19, 1996 14:39

ETRANSLATE of: gamma.seq check: 6474 from: 2179 to: 2270
ETRANSLATE of: gamma.seq check: 6474 from: 2393 to: 2615
ETRANSLATE of: gamma.seq check: 6474 from: 3502 to: 3630

//////////////////////////////////////////

(11) sw:lgb2_sesro P14848 leghemoglobin 2. sesbania rostrata.... 55 78 8 9

ID LGB2_SESRO STANDARD; PRT; 147 AA.
AC P14848;
DT 01-APR-1990 (REL. 14, CREATED)
DT 01-APR-1990 (REL. 14, LAST SEQUENCE UPDATE)
DT 01-APR-1990 (REL. 14, LAST ANNOTATION UPDATE)
DE LEGHEMOGLOBIN 2. . . .

SCORES Init1: 55 Initn: 78 Opt: 89
24.80% identity in 101 aa overlap
Minimum identity: 24.8%

20 30 40 50 60 70
gamma. WGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAI-MGNPKVKAHGKKVLTSLGD
|: |:::::: :||:::||::||:: : |
lgb2_s SYEAFKQNLPGNSVLFYSFILEKAPAAKGMFSFLKDSDGVPQNNPSLQAHAEKVFGLVRD
20 30 40 50 60 70

80 90 100 110 120 130
gamma. AIKHLDDLKGTF---AQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASW
: :| : : | |:::| :| :|| :| :: ::|:::| | ::: || ::|
lgb2_s SAAQLRATGVVVLADASLGSVHVQKGVLDP-HFVVVKEALLKTLKEAAGATWSDEVSNAW
80 90 100 110 120 130

140
gamma. QKMVTGVASALSSRYHX
: :|:::|::
lgb2_s EVAYDGLSAAIKKAMS
140
FastaCheck uses the significance threshold of Sander and Schneider (1991; Proteins 9:56-68) to select significant hits from a FastA or TFastA output file. The measure is based on the percent identity that FastA or TFastA reported for the alignment, even if an alignment with a higher score could be found.
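The reported identity and overlap length can be read from the summary line that accompanies each alignment. The following Python fragment is a minimal sketch of that step, not FastaCheck's own code. It assumes the line has the form shown in the sample output ("24.80% identity in 101 aa overlap"); the names IDENTITY_LINE and reported_identities are hypothetical.

```python
import re

# Pattern for summary lines such as "24.80% identity in 101 aa overlap".
# The line format is taken from the sample output in this document.
IDENTITY_LINE = re.compile(
    r"(\d+(?:\.\d+)?)%\s+identity\s+in\s+(\d+)\s+aa\s+overlap"
)


def reported_identities(text):
    """Return a list of (percent_identity, overlap_length) pairs,
    one for each identity summary found in `text`."""
    return [(float(pct), int(length))
            for pct, length in IDENTITY_LINE.findall(text)]


if __name__ == "__main__":
    sample = ("SCORES Init1: 55 Initn: 78 Opt: 89\n"
              "24.80% identity in 101 aa overlap\n")
    for pct, length in reported_identities(sample):
        print(pct, length)
```

Only the figures already printed by (T)FastA are used; no alignment is recomputed, which matches the description above.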
The significance is determined by a threshold value for percent identity for each alignment length. The formula used is:
Threshold = 290.15 * L**-0.562
where L is limited to the range 10 to 80 residues. Below 10 residues there is no correlation, and above 80 residues a limit of about 25% identity is used.
The formula was calculated using the default (T)FastA matrix and alignment. The FACTOR (290.15) and EXPONENT (-0.562) values can be changed with command line options (see below).
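The threshold calculation can be sketched in Python as follows. This is an illustration under stated assumptions, not the program's source: the threshold is taken to be FACTOR multiplied by L raised to EXPONENT, alignments shorter than the minimum length are rejected, alignments longer than the maximum length use a fixed identity limit, and a hit is accepted when its identity is greater than or equal to the threshold. The function names are hypothetical, and the default values are those listed for the command line options. The sample output above reports a minimum identity of 24.8% for a 101-residue overlap, so the program's treatment of long alignments may differ slightly from the fixed 25.0 used here.

```python
def identity_threshold(length, factor=290.15, exponent=-0.562,
                       min_len=10, max_len=80, max_pct=25.0):
    """Return the minimum percent identity for an alignment of `length`
    residues, or None if the alignment is too short to be assessed."""
    if length < min_len:
        return None              # assumption: too short, never accepted
    if length > max_len:
        return max_pct           # assumption: fixed limit for long alignments
    return factor * length ** exponent


def is_significant(percent_identity, length, **options):
    """Return True if the hit meets the threshold for its length.
    The >= comparison is an assumption."""
    threshold = identity_threshold(length, **options)
    return threshold is not None and percent_identity >= threshold


if __name__ == "__main__":
    # Print the threshold for a few alignment lengths.
    for n in (5, 10, 40, 80, 101):
        print(n, identity_threshold(n))
```

Keyword arguments to is_significant are passed through to identity_threshold, so a different factor or exponent can be tried in the same way as with the command line options.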
The input file for FastaCheck is a (T)FastA output file.
None.
All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
Minimum syntax: % fastacheck [-INfile=]ggammacod.fasta -Default

Prompted Parameters:

[-OUTfile=]ggammacod.check   Output file

Optional Parameters:

-FACTor=290.15       Factor for threshold calculation
-EXPonent=-0.562     Exponent for threshold calculation
-MINLen=10           Minimum accepted alignment length
-MAXLen=80           Maximum length for calculated threshold
-MAXPct=25.0         Identity threshold for length above MAXLEN
The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
-FACTor=290.15 sets the factor used for the percent identity threshold calculation.
-EXPonent=-0.562 sets the exponent used for the percent identity threshold calculation.
-MINLen=10 sets the minimum accepted alignment length.
-MAXLen=80 sets the maximum alignment length for which the threshold is calculated.
-MAXPct=25.0 sets the percent identity threshold for all alignments longer than the maximum length (see -MAXLen above).
Sander, C. and Schneider, R. (1991). Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 9, 56-68.
Printed: April 22, 1996 15:53 (1162)