Tprofilegap

TPROFILEGAP


FUNCTION

TProfileGap makes an optimal alignment between a profile and a sequence.


DESCRIPTION

There is an essay on profile analysis in the Multiple Sequence Analysis chapter of the Program Manual.

TProfileGap uses the method of Gribskov et al. (Proc. Natl. Acad. Sci. USA 84; 4355-4358 (1987)) to make an optimal alignment between a profile and a sequence. TProfileGap works like BestFit but accepts a profile instead of one of the sequences. TProfileGap uses the alignment procedure of Smith and Waterman (Advances in Applied Mathematics 2; 482-489 (1981)) to search for and align the segment of similarity. The symbol comparison values are present in the profile itself and need not be set. The gap and gap length weights specified in TProfileGap are maximum values. The actual position-specific gap penalties at any position are determined by multiplying the gap creation penalty by the percent value in the second-to-last column of the profile, and the gap extension penalty by the percent value in the last column of the profile.
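The scaling of the gap penalties described above can be illustrated with a short Python sketch. This is not code from TProfileGap; the function name and arguments are hypothetical, and the percent values are assumed to be stored on a 0 to 100 scale.

```python
def position_gap_penalties(gap_weight, length_weight, gap_pct, len_pct):
    """Scale the maximum gap weights by the percent values of one profile row.

    gap_weight, length_weight: maximum values given to the program
    gap_pct, len_pct: percent values from the last two profile columns
                      (assumed here to be on a 0-100 scale)
    """
    creation = gap_weight * (gap_pct / 100.0)
    extension = length_weight * (len_pct / 100.0)
    return creation, extension


# With the default weights, a row whose percent values are both 100
# keeps the full penalties; a row with 50 halves them.
print(position_gap_penalties(4.5, 0.05, 100, 100))
print(position_gap_penalties(4.5, 0.05, 50, 50))
```

Under these assumptions, a profile position with low percent values tolerates gaps more readily than one with values near 100.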


AUTHOR

This GCG program was modified by Peter Rice (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).


EXAMPLE

Here is a session using TProfileGap to align the E. coli lacI gene sequence (GenEMBL:eclaci) with the profile hth.prf:

  
  
  % tprofilegap
  
   TPROFILEGAP uses any sequences
  
   TPROFILEGAP of what sequence(s) ?  GenEMBL:eclaci
  
               Start (* 1 *) ?
              End (*  360 *) ?
  
   and what profile (* eclaci.prf *) ? hth.prf
  
   What is the gap weight (* 4.5 *) ?
  
   What is the gap length weight  (* 0.05 *) ?
  
   What should I call the paired output display file (* hth.pair *) ?
  
     The following levels will be marked in the alignments:
                Bar: 0.32
              Colon: 0.18
                Dot: 0.09
  
   Aligning .-..
   Empro:Eclaci
  
       Gaps:     0
    Quality:  15.7
   Quality Ratio: 0.505
     Length:    31
  
  %
  


OUTPUT

Here is part of the output file:

  
  
   (Local) TPROFILEGAP of: Eclaci  check: 7788  from: 1  to: 1113
  
  ID   ECLACI     standard; DNA; PRO; 1113 BP.
  AC   V00294;
  DT   09-JUN-1982 (Rel. 01, Created)
  DT   30-NOV-1990 (Rel. 26, Last updated, Version 1)
  DE   E. coli laci gene (codes for the lac repressor).
  KW   DNA binding protein; repressor. . . .
  
   to: Hth.Prf  check: 1753  from: 1  to: 39
  
  (Peptide) PROFILEMAKE v4.40 of: egendocdata:hth.msf{*}  Length: 39
    Sequences: 9  MaxScore: 17.37  September 6, 1993  11:39
                  Gap: 1.00              Len: 1.00
             GapRatio: 0.33         LenRatio: 0.10
  hth.msf{Galr_Ecoli}  From: 1    To: 39   Weight: 1.00
  hth.msf{Gals_Ecoli}  From: 1    To: 39   Weight: 1.00 . . .
  
      Gap Weight:  4.500      Average Match:  0.185
   Length Weight:  0.050   Average Mismatch: -0.127
  
         Quality:  15.66             Length:     31
           Ratio:   0.51               Gaps:      0
  
   Eclaci x Hth.Prf          September 6, 1993  13:45  ..
               .         .         .
  S     12 KPVTLYDVAEYAGVSYQTVSRVVNQASHVSA 42
      . :|| |||  |||| .|||||||:.. |:.
  P      8 KMATLKDVARMAGVSVATVSWVLNGSPWVSE 38
  
  


RELATED PROGRAMS

PileUp creates a multiple sequence alignment from a group of related sequences. LineUp is a multiple sequence editor used to create multiple sequence alignments. Pretty displays multiple sequence alignments.

ProfileMake makes a profile from a multiple sequence alignment. ProfileSearch uses the profile to search a database for sequences with similarity to the group of aligned sequences. ProfileSegments displays optimal alignments between each sequence in the ProfileSearch output list and the group of aligned sequences (represented by the profile consensus). ProfileGap makes optimal alignments between a sequence and a group of aligned sequences represented as a profile. ProfileScan finds structural and sequence motifs in protein sequences, using predetermined parameters to determine significance.


RESTRICTIONS

We have little experience using nucleotide sequences with profile analysis.

The surface of comparison (see the entry for BestFit in the Program Manual) may not be more than some value set within the program (5.5 million at most institutions). Profiles may not be longer than 1,000 residues or bases. Sequences that are too long for the surface of comparison are divided into smaller segments that are aligned separately (see the CONSIDERATIONS topic, below).


ALGORITHM

There is an essay on profile analysis in the Multiple Sequence Analysis chapter of the Program Manual.

TProfileGap executes a BestFit of the profile to the sequence to make an alignment. The alignment is made with the values in the profile. The alignment is displayed with the consensus sequence from the profile aligned to the sequence.

For a detailed description of Smith and Waterman style alignments, see the entry for BestFit in the Program Manual.
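As a rough illustration of a Smith and Waterman style alignment against a profile, the following Python sketch computes only the best local score. It is a simplified sketch, not the TProfileGap implementation: the profile layout (a list of dictionaries with hypothetical keys "scores", "gap_pct" and "len_pct"), the mismatch default, and the choice to cost a gap of length n as creation + n * extension using the penalties of the current profile row are all assumptions.

```python
def local_profile_score(profile, seq, gap_weight, length_weight, mismatch=-1.0):
    """Best local alignment score of a sequence against a profile (sketch)."""
    n, m = len(profile), len(seq)
    neg = float("-inf")
    # best[i][j]: best score of an alignment ending at profile row i, residue j
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    # gap_seq: alignment ends with a sequence residue opposite a gap
    gap_seq = [[neg] * (m + 1) for _ in range(n + 1)]
    # gap_prof: alignment ends with a profile row opposite a gap
    gap_prof = [[neg] * (m + 1) for _ in range(n + 1)]
    top = 0.0
    for i in range(1, n + 1):
        row = profile[i - 1]
        # Position-specific penalties: maximum weights scaled by percent values
        create = gap_weight * (row["gap_pct"] / 100.0)
        extend = length_weight * (row["len_pct"] / 100.0)
        for j in range(1, m + 1):
            gap_seq[i][j] = max(best[i][j - 1] - create - extend,
                                gap_seq[i][j - 1] - extend)
            gap_prof[i][j] = max(best[i - 1][j] - create - extend,
                                 gap_prof[i - 1][j] - extend)
            match = best[i - 1][j - 1] + row["scores"].get(seq[j - 1], mismatch)
            # The floor of zero makes the alignment local
            best[i][j] = max(0.0, match, gap_seq[i][j], gap_prof[i][j])
            top = max(top, best[i][j])
    return top


# Toy profile that prefers A, B, C, D in successive positions
toy = [{"scores": {r: 2.0}, "gap_pct": 100, "len_pct": 100} for r in "ABCD"]
print(local_profile_score(toy, "XXABCDXX", 1.0, 0.1))  # ungapped match
print(local_profile_score(toy, "ABXCD", 1.0, 0.1))     # one gap of length 1
```

Removing the floor of zero and scoring from the corners of the matrix would give the global behavior selected with -GLObal; that variant is not shown here.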


CONSIDERATIONS

There is strong reason to believe that the BestFit algorithm used by TProfileGap is the best known way to find segments of similarity, but the best parameters must be empirically determined. Like any alignment program, TProfileGap produces alignments that can differ greatly depending on the symbol comparison values and gap coefficients used to make up the profile, and on the gap weights used as input to TProfileGap.

Unless the -LIMit command line qualifier is used, sequences that are too long for the surface of comparison are always divided into smaller, overlapping segments that are aligned separately. The -LIMit qualifier may permit long sequences to be aligned without division. For a detailed description of the -LIMit qualifier, see the entry for BestFit in the Program Manual. Sequences longer than 32,000 are always divided and aligned as separate segments. Although ProfileGap and ProfileSegments overlap the points of division by the whole length of the profile, divided sequences may not align properly if the segment of similarity crosses the point where the sequence was divided.
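The division into overlapping segments can be pictured with the Python sketch below. The exact scheme used by the program is not described here, so this is only one plausible reading: the function name, the 1-based inclusive coordinates, and the segment length are hypothetical, and the overlap is set to the profile length as the text above states.

```python
def overlapping_segments(seq_len, segment_len, overlap):
    """Return 1-based inclusive (start, end) pairs covering the sequence.

    Consecutive segments share `overlap` positions, so a region of
    similarity no longer than the overlap lies wholly inside some segment.
    """
    if seq_len < 1:
        raise ValueError("sequence length must be at least 1")
    if segment_len <= overlap:
        raise ValueError("segment length must exceed the overlap")
    segments = []
    start = 1
    while True:
        end = min(start + segment_len - 1, seq_len)
        segments.append((start, end))
        if end == seq_len:
            break
        # Step back by the overlap so the next segment re-covers the boundary
        start = end - overlap + 1
    return segments


# A 100-residue sequence, hypothetical 40-residue segments, 10-position profile
print(overlapping_segments(100, 40, 10))
```

Even with such an overlap, a segment of similarity that contains gaps can span more sequence positions than the profile is long, which is one way to understand the caution above about alignments that cross a point of division.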

The command line option -GLObal makes ProfileGap and ProfileSegments display the alignment of the whole sequence to the whole profile, instead of just the most-similar segment between the sequence and the profile. This is analogous to executing a Gap between the profile and sequence.

If multiple sequences are specified as input to TProfileGap, the command line qualifiers -BEGin and -END are ignored. In this case, the entire length of each input sequence is used.


COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

  
  
  Minimum syntax: % tprofilegap [-INfile1=]SW:75kd_Chltr -
                   [-INfile2=]pileup.prf -Default
  
  Prompted Parameters:
  
  -BEGin=1 -END=652        range of interest for sequence
  -GAPweight=4.50          maximum position-specific gap weight
  -LENgthweight=0.05       maximum position-specific gap length weight
  [-OUTfile=]pileup.pair   output file for the alignment
  
  Local Data Files: None
  
  Optional Parameters:
  
  -GLObal                  aligns the whole sequence and profile (global
                        alignment)
  -LOCal                   aligns the best segment of similarity between
                        the sequence and profile (local alignment is
                        the default)
  -NOAVErage               does not adjust alignment score for sequence
                        composition
  -ENDWeight               weights end gaps like other gaps
  -LIMit1=737              lets you set a gap shift limit for the sequence
  -LIMit2=651              lets you set a gap shift limit for the profile
  -OUTfile2=75kd_chltr.gap new file for sequence 1 with gaps added
  -OUTfile3=pileup.gap     new file for the profile consensus with
                        gaps added
  -PAIr=1.0,0.5,0.1        thresholds for displaying '|', ':', and '.'
  -NOMONitor               suppresses the screen summary for each alignment
  


LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

The translation of codons to amino acids, the identification of potential start codons and stop codons, and the mappings of one-letter to three-letter amino acid codes are all defined in a translation table in the file translate.txt. If the standard genetic code does not apply to your sequence, you can provide a modified version of this file in your working directory or name an alternative file on the command line with an expression like -TRANSlate=mycode.txt. Translation tables are discussed in more detail in the Data Files manual.


OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

-GLObal

causes this program to make alignments using the method of Needleman and Wunsch instead of the default method of Smith and Waterman. The difference between these two methods is the same as the difference between the programs Gap and BestFit. The Needleman and Wunsch method displays the whole length of both sequences after alignment, while the Smith and Waterman method shows only the best segment of similarity from each sequence.

-LOCal

forces this program to make alignments using the default method of Smith and Waterman instead of the method of Needleman and Wunsch. The difference between these two methods is the same as the difference between the programs BestFit and Gap. The Smith and Waterman method shows only the best segment of similarity from each sequence, while the Needleman and Wunsch method displays the whole length of both sequences after alignment.

-NOAVErage

turns off the adjustment of scores for sequence composition. In the default (-AVErage), a score due to the similarity in composition between the profile and sequence of interest is subtracted from the original alignment score.

-ENDWeight

causes the end gaps to be penalized in the same way as all other gaps. This qualifier is ignored unless -GLObal is also present on the command line.

-LIMit1=20 and -LIMit2=20

let you set gap shift limits for each sequence (-LIMit1 sets a gap shift limit for the sequence and -LIMit2 sets a gap shift limit for the profile). When you already know of a long similarity between two sequences you can "zip" them together using this mode. The beginning coordinates for each sequence must be near the beginning of the alignment you want to see. The alignment continues so that gaps inserted do not require the sequences to get out of step by more than the gap shift limits. You can align very long sequences rapidly. The surface of comparison is still limited to one million. The size of a comparison can be predicted by multiplying the average length of the two sequences by the sum of the two shift limits.

If you add -LIMit to the command line without any qualifier value, the program prompts you to enter gap shift limits for each sequence.
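The size estimate given for -LIMit comparisons (average length of the two sequences times the sum of the two shift limits) can be written as a small Python helper. The function name and the example lengths are hypothetical.

```python
def predicted_surface(len1, len2, limit1, limit2):
    """Estimate the surface of comparison when gap shift limits are set."""
    average_length = (len1 + len2) / 2.0
    return average_length * (limit1 + limit2)


# Hypothetical case: a 1,000-residue sequence, a 500-position profile,
# and the example limits -LIMit1=20 -LIMit2=20
print(predicted_surface(1000, 500, 20, 20))
```

Comparing the result with the limit on the surface of comparison gives a quick check of whether a given pair of shift limits is likely to be accepted.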

-OUTfile2=seqname1.gap -OUTfile3=profilename.gap

This program can write three different output files. The first displays the alignment of the sequence with the profile consensus sequence. The second is a new sequence file for the sequence, possibly expanded by gaps to make it align with the profile. The third, like the second, is a new sequence file for the profile consensus, possibly expanded by gaps to make it align with sequence one. The program writes only the first file unless there are output file options on the command line. If there are any output files named on the command line, only those output files are written. If you add -OUT to the command line without any qualifying filename, then the program will write the second and third output files after prompting you for their names.

Aligned sequences (in sequence files) can be displayed with GapShow.

-PAIr=1.0,0.5,0.1

The paired output file from this program displays sequence similarity by putting a pipe character (|), colon (:), and period (.) between similar sequence symbols. The thresholds for the characters are determined by the values in the profile. The pipe character is put between symbols whose comparison value in the profile is at least the average positive value in the profile plus one tenth the difference between the maximum and average values in the profile. The colon character threshold is the average positive value in the profile. The period character threshold is the larger of the average positive value in the profile minus one tenth the difference between the maximum and average values, and one half the average value.
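The default threshold rules described above can be sketched in Python as follows. The function and argument names are hypothetical, and "one half the average value" is read here as one half of the average positive value, which is an assumption.

```python
def pair_thresholds(average_positive, maximum):
    """Return the (bar, colon, dot) display thresholds for a profile.

    average_positive: average positive comparison value in the profile
    maximum: maximum comparison value in the profile
    """
    spread = 0.1 * (maximum - average_positive)
    bar = average_positive + spread
    colon = average_positive
    dot = max(average_positive - spread, 0.5 * average_positive)
    return bar, colon, dot


# Hypothetical profile with average positive value 0.2 and maximum 1.2
bar, colon, dot = pair_thresholds(0.2, 1.2)
print(round(bar, 2), round(colon, 2), round(dot, 2))
```

Giving explicit values with -PAIr, as in -PAIr=1.0,0.5,0.1, would replace these computed thresholds.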

-NOMONitor

suppresses the screen summary, which reports some statistics for each alignment.


REFERENCES

Gribskov, M., McLachlan, M., and Eisenberg, D. (1987) "Profile Analysis: Detection of Distantly Related Proteins." Proceedings of the National Academy of Sciences USA 84, 4355-4358.

Smith, T.F. and Waterman, M.S. (1981) "Comparison of Bio-Sequences." Advances in Applied Mathematics 2, 482-489.

Printed: April 22, 1996 15:56 (1162)