Tsegments

TSEGMENTS


FUNCTION

TSegments aligns and displays the segments of similarity found by TWordSearch.


DESCRIPTION

TWordSearch uses word comparison, which is very fast, to identify regions of possible similarity between a query sequence and some set of sequences. TSegments uses optimal alignment, which is slow but precise, to display the best segment of similarity in the regions identified by TWordSearch. TWordSearch uses a method similar to the method of Wilbur and Lipman (Proc. Natl. Acad. Sci. (USA) 80; 726-730 (1983)) to find the regions of possible similarity. TSegments uses the alignment procedure of Smith and Waterman (Advances in Applied Mathematics 2; 482-489 (1981)) to search for the segments.

TSegments uses a symbol comparison table, a gap weight, and a gap length weight to find the best region of similarity between two sequences. The best region has the highest quality, where quality is the sum of the match values minus the sum of the mismatch penalties minus the penalties (gap weight and gap length weight) for the gaps added. The best region must fall within some "width" around the peak diagonal.
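As a rough illustration of the quality score, the following Python sketch scores a pair of sequences that have already been aligned. The function name, the '-' gap symbol, and the flat match and mismatch values are assumptions made for this example. TSegments itself takes comparison values from a symbol comparison table, and the exact way the gap length weight is charged is described under BestFit.

```python
def alignment_quality(top, bottom, match=1.0, mismatch=-0.6,
                      gap_weight=2.0, length_weight=0.1):
    """Score two equal-length aligned strings; '-' marks a gap symbol.

    Assumed scoring: every gap costs gap_weight once, plus
    length_weight for each gap symbol it contains.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    quality = 0.0
    gap_side = None  # which sequence the current gap is in, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            side = "top" if a == "-" else "bottom"
            if side != gap_side:
                quality -= gap_weight      # a new gap is opened
                gap_side = side
            quality -= length_weight       # one more gap symbol
        else:
            gap_side = None
            quality += match if a == b else mismatch
    return quality


print(alignment_quality("ACGT", "ACGT"))   # four matches, no gaps
print(alignment_quality("ACGT", "AC-T"))   # three matches, one gap of length 1
```

Mismatch values are negative in this sketch, so adding them lowers the quality in the same way that subtracting a penalty would.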


AUTHOR

This GCG program was modified by Peter Rice (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).


EXAMPLE

Here is a session using TSegments to display the segments in the output file from the example in the EGCG Program Manual for TWordSearch:

  
  
  % tsegments
  
   TSEGMENTS of what file ?  laci_ecoli.word
  
   What should I call the output file (* laci_ecoli.pairs *) ?
  
   Aligning ......................-...
   EmPro:Eclaci   1113 bp  Gaps:  0  Quality: 537.2 / Length: 360
   Aligning ..................-..
   EmPro:Eclac    7476 bp  Gaps:  0  Quality: 537.2 / Length: 360
  
   //////////////////////////////////////////////////////////////
  
  %
  


OUTPUT

Here is part of the output file:

  
  
   (BestFit) TSEGMENTS from: laci_ecoli.word  September 7, 1993  13:06
  
   (Peptide) WORDSEARCH of: Sw:Laci_Ecoli  check: 1939  from: 1  to: 360
  ID   LACI_ECOLI     STANDARD;      PRT;   360 AA.
  AC   P03023;
  DT   21-JUL-1986 (REL. 01, CREATED)
  DT   21-JUL-1986 (REL. 01, LAST SEQUENCE UPDATE)
  DT   01-AUG-1991 (REL. 19, LAST ANNOTATION UPDATE) . . .
  
   AvMatch: 0.54  AvMisMatch: -0.40  GapWeight: 2.00  LengthWeight: 0.10   ..
  
  Laci_Ecoli                check: 1939  from: 1      to: 360
  Empro:Eclaci              check: 7788  from: 10     to: 1113
  V00294 E. coli laci gene (codes for the lac repressor). 11/90
   Gaps: 0  Quality: 537.2  Ratio: 1.492  Score: 362  Width: 3  Limits: +/-4
               .         .         .         .         .
    1 MKPVTLYDVAEYAGVSYQTVSRVVNQASHVSAKTREKVEAAMAELNYIPN 50
      :|||||||||||||||||||||||||||||||||||||||||||||||||
   11 VKPVTLYDVAEYAGVSYQTVSRVVNQASHVSAKTREKVEAAMAELNYIPN 60
  
         ///////////////////////////////////////////////
  
  Laci_Ecoli                check: 1939  from: 1      to: 360
  Empro:Eclac               check: 4781  from: 26     to: 7476
  J01636 E.coli lac operon with lacI, lacZ, lacY and lacA. 3/92
   Gaps: 0  Quality: 537.2  Ratio: 1.492  Score: 362  Width: 3  Limits: +/-4
               .         .         .         .         .
    1 MKPVTLYDVAEYAGVSYQTVSRVVNQASHVSAKTREKVEAAMAELNYIPN 50
      :|||||||||||||||||||||||||||||||||||||||||||||||||
   27 VKPVTLYDVAEYAGVSYQTVSRVVNQASHVSAKTREKVEAAMAELNYIPN 76
  
         ///////////////////////////////////////////////
  
  Laci_Ecoli                check: 1939  from: 1      to: 360
  Empro:Eclact41            check: 446   from: 1      to: 1080
  X58469 E.coli T41 mutant lac repressor gene. 7/91
   Gaps: 0  Quality: 536.0  Ratio: 1.489  Score: 361  Width: 3  Limits: +/-4
               .         .         .         .         .
    1 MKPVTLYDVAEYAGVSYQTVSRVVNQASHVSAKTREKVEAAMAELNYIPN 50
      :|||||||||||||||||||||||||||||||||||||||||||||||||
    1 VKPVTLYDVAEYAGVSYQTVSRVVNQASHVSAKTREKVEAAMAELNYIPN 50
  
         ///////////////////////////////////////////////
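In the sample output above, the Ratio figure is consistent with Quality divided by the length of the query range (537.2 / 360 is about 1.492), and Limits is one more than Width, as described under ALGORITHM. The Python sketch below reads the numeric fields from one statistics line and repeats that check. The function name and the regular expression are assumptions for this example, not part of the package, and the Limits field is skipped because its value is not a plain number.

```python
import re


def parse_stats(line):
    """Return the 'Name: number' fields of a statistics line as floats."""
    pairs = re.findall(r"(\w+):\s*(-?\d+(?:\.\d+)?)", line)
    return {name: float(value) for name, value in pairs}


line = " Gaps: 0  Quality: 537.2  Ratio: 1.492  Score: 362  Width: 3  Limits: +/-4"
stats = parse_stats(line)
query_length = 360  # from the 'from: 1  to: 360' range of the query
print(stats["Quality"] / query_length, stats["Ratio"])
```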
  


RELATED PROGRAMS

TSegments is an automated version of the BestFit program run with the command line option -LIMit, with the limits set to +/-(width+1). The output file of TWordSearch is the input file for TSegments. Compare/DotPlot and BestFit are more flexible tools for examining the relationship between two sequences when automation is not desired.

FastA does a Pearson and Lipman search for similarity between a query sequence and any group of sequences. For nucleotide database searches, FastA is more sensitive than BLAST. TFastA does a Pearson and Lipman search for similarity between a query peptide sequence and any group of nucleotide sequences. TFastA translates the nucleotide sequences in all six reading frames before performing the comparison. It is designed to answer the question, "What implied peptide sequences in a nucleotide sequence database are similar to my peptide sequence?"


RESTRICTIONS

The diagonal of comparison cannot be longer than 30,000 and the surface of comparison may not be larger than one million. TSegments truncates sequences more than 30,000 symbols long and squeezes the gap shift limits to keep the surface within the one-million limit.


ALGORITHM

TSegments reads the query sequence and the set of sequences and diagonals in the output list from TWordSearch and then executes a limited BestFit on each pair of sequences to make an alignment near that diagonal. For a detailed description, see BestFit (-LIMit), and imagine that the gap shift limits are both set to width + 1. Width is defined as the width of a structure in the histogram from a word comparison (see the TWordSearch program). Width is the fifth column of data in the TWordSearch output file.
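The idea of a limited alignment can be sketched in Python as a local alignment that only fills cells lying within a fixed distance of a chosen diagonal. This is a simplified illustration, not the BestFit implementation: it uses flat match and mismatch values instead of a symbol comparison table and a single per-symbol gap penalty instead of separate gap and gap length weights. The function and parameter names are assumptions for this example. For TSegments, limit would correspond to width + 1.

```python
def banded_local_score(query, target, diagonal, limit,
                       match=1.0, mismatch=-0.6, gap=2.0):
    """Best local alignment score using only cells near one diagonal.

    Cell (i, j) pairs query[i-1] with target[j-1] and lies on diagonal
    j - i. Cells farther than `limit` from `diagonal` are never filled,
    so they keep a score of 0, which acts as a fresh start.
    """
    cols = len(target)
    prev = [0.0] * (cols + 1)
    best = 0.0
    for i in range(1, len(query) + 1):
        curr = [0.0] * (cols + 1)
        first = max(1, i + diagonal - limit)
        last = min(cols, i + diagonal + limit)
        for j in range(first, last + 1):
            pair = match if query[i - 1] == target[j - 1] else mismatch
            curr[j] = max(0.0,
                          prev[j - 1] + pair,   # extend along the diagonal
                          prev[j] - gap,        # gap in the target
                          curr[j - 1] - gap)    # gap in the query
            best = max(best, curr[j])
        prev = curr
    return best


# The query matches the target on diagonal 2 (query position 1, target position 3).
print(banded_local_score("ACGT", "TTACGTTT", diagonal=2, limit=1))
```

Restricting the band keeps the work proportional to the diagonal length times the band width rather than to the full product of the two sequence lengths.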


CONSIDERATIONS

There is strong reason to believe that the BestFit algorithm used by TSegments is the best way to search for segments of similarity (Lipman and Pearson, Rapid and Sensitive Protein Similarity Searches, Science 227; 1435-1441 (1985)), but the best parameters to use for TSegments are not yet clear. Like any alignment program, TSegments produces alignments that are very different depending on the values assigned for match, mismatch, gap weight, and gap length weight.

The Public Symbol Comparison Table is Quite Stringent

The public symbol comparison table segdna.cmp scores matches as +1.0 and mismatches as -0.6, which means that the segment shown is cut off if there is any significant region where mismatches outnumber matches by about a 2:1 ratio. If the words scored by TWordSearch were dispersed along the diagonal, then some of them may not appear in the alignment for that diagonal.
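The effect of these two values can be checked with a little arithmetic. The Python sketch below, in which the function name is an assumption for this example, computes the net score of a run of matches and mismatches and the mismatch-to-match ratio at which the net score reaches zero.

```python
MATCH = 1.0      # match value in segdna.cmp, as stated above
MISMATCH = -0.6  # mismatch value in segdna.cmp, as stated above


def net_score(matches, mismatches):
    """Net contribution of a gap-free run to the quality."""
    return matches * MATCH + mismatches * MISMATCH


# Mismatches per match at which a run neither gains nor loses score.
break_even = MATCH / -MISMATCH
print(round(break_even, 2))
print(net_score(1, 2))  # negative: two mismatches per match lowers the quality
```

The break-even point is a little under two mismatches per match, so a region with a 2:1 ratio of mismatches to matches lowers the quality and tends to end the displayed segment.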

The Alignments Miss Some Words

TSegments often fails to display every word scored for the peak diagonal if the words were not tightly grouped along the diagonal. You can use the command line option -WHOle to get Needleman-Wunsch alignments that traverse the entire length of the diagonal. If you run Compare with the option -WORd and plot the output with DotPlot, you see the exact pattern of word identities between two sequences.


INPUT FILES

TSegments reads the file names in the output file from TWordSearch. If any of the search set sequences have been changed or deleted, TSegments acts as if they do not exist. If the query sequence no longer exists, TSegments complains and stops. TSegments also reads the beginning and ending positions of the query sequence in the output file from TWordSearch. If TSegments cannot read this range, the entire query sequence is used.


COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

  
  
  Minimum syntax: % tsegments [-INfile=]Laci_Ecoli.Word -Default
  
  Prompted Parameters:
  
  [-OUTfile=]laci_ecoli.pairs  output file
  
  Local Data Files:
  
  -DATa=segdna.cmp         symbol comparison table for nucleic acids
  -DATa=segpep.cmp         symbol comparison table for peptide sequences
  -TRANSlate=translate.txt contains the genetic code
  
  Optional Parameters:
  
  -GAPweight=3.0     gap weight (default depends on word size)
  -LENgthweight=0.1  gap length weight
  
  -PAIr=1.0,0.5,0.1       thresholds for displaying '|', ':', and '.'
  -WIDth=50               the number of sequence symbols per line
  -PAGe=60                adds a line with a form feed every 60 lines
  -NOBIGGaps              suppresses abbreviation of large gaps with '.'s
  -MATch=+1.0        symbol match value for simplified word searches
  -MISmatch=-0.25    mismatch value (defaults to -2.0/size of Alphabet)
  -WHOle             aligns the whole diagonal, not just the best segment
  -NOMONitor         suppresses the screen monitor
  


LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

TSegments reads symbol comparison values from the file segdna.cmp (nucleic acids) or segpep.cmp (peptides). If the TWordSearch sequences were simplified, TSegments uses the same simplification table used by TWordSearch to construct a symbol comparison table.

TSegments run with the command line option -WHOle uses the files seggapdna.cmp and seggappep.cmp for symbol comparison instead of segdna.cmp and segpep.cmp.

The translation of codons to amino acids, the identification of potential start codons and stop codons, and the mappings of one-letter to three-letter amino acid codes are all defined in a translation table in the file translate.txt. If the standard genetic code does not apply to your sequence, you can provide a modified version of this file in your working directory or name an alternative file on the command line with an expression like -TRANSlate=mycode.txt. Translation tables are discussed in more detail in the Data Files manual.


OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

-GAPweight=3.0

lets you select a gap weight if you don't want the default value, which is the greater of 2.0 and word size/2.0. (See BestFit for a description of gap weights.)
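The default rule stated above can be written as a one-line Python function. The function name is an assumption for this example.

```python
def default_gap_weight(word_size):
    """Default gap weight: the greater of 2.0 and word size / 2.0."""
    return max(2.0, word_size / 2.0)


for size in (1, 2, 4, 6):
    print(size, default_gap_weight(size))
```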

-LENgthweight=0.1

lets you select a gap length weight if you don't want the default value of 0.1. (See BestFit for a description of gap length weights.)

-WHOle

causes this program to make alignments using the method of Needleman and Wunsch instead of the default method of Smith and Waterman. The difference between these two methods is the same as the difference between the programs Gap and BestFit. The Needleman and Wunsch method displays the whole length of both sequences after alignment, while the Smith and Waterman method shows only the best segment of similarity from each sequence.

The -WHOle option causes TSegments to read the local data file seggapdna.cmp (nucleic acids) or seggappep.cmp (peptides).

-PAIr=1.0,0.5,0.1

The paired output file from this program displays sequence similarity by printing one of three characters between similar sequence symbols: a pipe character (|), a colon (:), or a period (.). Normally a pipe character is put between symbols that are the same, a colon is put between symbols whose comparison value is greater than or equal to 0.50, and a period is put between symbols whose comparison value is greater than or equal to 0.10.

You can change these match display thresholds from the command line. The three parameters for -PAIr are the display thresholds for the pipe character, colon, and period. When you set the first parameter, the match display criterion for a pipe character changes from symbolic identity (the default) to that quantitative threshold: a pipe character is no longer inserted between identical symbols unless their comparison value is greater than or equal to the threshold. If you still want a pipe character to connect identical symbols, use x instead of a number as the first parameter. (See the Data Files manual for more information about scoring matrices.)
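The display rule can be sketched in Python as follows. The function name is an assumption for this example, and None stands in for the x setting (pipe on symbolic identity). The comparison values passed in are made up for illustration; real values come from the symbol comparison table.

```python
def pair_symbol(a, b, value, thresholds=(None, 0.5, 0.1)):
    """Choose the character printed between two aligned symbols.

    thresholds = (pipe, colon, period). A pipe threshold of None means
    '|' is used for identical symbols regardless of comparison value.
    """
    pipe, colon, period = thresholds
    if pipe is None:
        if a == b:
            return "|"
    elif value >= pipe:
        return "|"
    if value >= colon:
        return ":"
    if value >= period:
        return "."
    return " "


print(pair_symbol("K", "K", 1.5))                    # identical symbols
print(pair_symbol("M", "V", 0.6))                    # similar symbols
print(pair_symbol("A", "A", 0.8, (1.0, 0.5, 0.1)))   # numeric pipe threshold
```

With a numeric first threshold, identical symbols whose comparison value falls below it are marked with a colon or period instead of a pipe, which is the behavior the paragraph above describes.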

-PAGe=64

When you print the output from this program, it may cross from one page to another in a frustrating way -- especially when you print on individual sheets. This option adds form feeds to the output file in order to try to keep clusters of related information together. You can set the number of lines per page by supplying a number after the -PAGe qualifier.

-WIDth=50

puts 50 sequence symbols on each line of the output file. You can set the width to anything from 10 to 150 symbols.

-NOBIGGaps

suppresses large gap abbreviations, showing all the sequence characters across from large gaps. Usually, gaps that extend one sequence by more than one complete line of output are abbreviated with three dots arranged in a vertical line.

-MATch=1.0

If you have done a simplified word search, TSegments must make up a scoring table that looks like your simplification scheme. The table is normally made up of 1s at all the symbol comparisons you treated as equivalent and -2.0/Alphabet size for all other symbol comparisons. The -MATch and -MISmatch parameters allow you to set values other than 1.0 for matches and -2.0/Alphabet size for mismatch.
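A scoring table of this kind can be sketched in Python as shown below. The function name and the purine/pyrimidine grouping are hypothetical, and the sketch assumes that "Alphabet size" means the number of distinct symbols in the simplification scheme; the program may count the alphabet differently.

```python
def simplified_table(groups, match=1.0, mismatch=None):
    """Build a comparison table from groups of equivalent symbols.

    Symbols in the same group score `match`; all other pairs score
    `mismatch`, which defaults to -2.0 / (number of distinct symbols).
    """
    alphabet = sorted(set("".join(groups)))
    if mismatch is None:
        mismatch = -2.0 / len(alphabet)
    group_of = {sym: idx for idx, grp in enumerate(groups) for sym in grp}
    return {(a, b): (match if group_of[a] == group_of[b] else mismatch)
            for a in alphabet for b in alphabet}


# Hypothetical scheme: purines (A, G) and pyrimidines (C, T) treated as equivalent.
table = simplified_table(["AG", "CT"])
print(table[("A", "G")], table[("A", "C")])
```

Passing explicit match and mismatch arguments plays the role of the -MATch and -MISmatch parameters.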

-MISmatch=-0.50

See the -MATch parameter for a description of -MISmatch.

-MONitor

This program normally monitors its progress on your screen. However, when you use the -Default option to suppress all program interaction, you also suppress the monitor. You can turn it back on with this option. If your program is running in batch, the monitor will appear in the log file. If the monitor is slowing the program down, suppress it with -NOMONitor.

-TRANSlate=filename.txt

Usually, translation is based on the translation table in a default or local data file called translate.txt. This option allows you to use a translation table in a different file. (See the Data Files manual for information about translation tables.)


REFERENCES

Lipman, D.J. and Pearson, W.R. (1985). Rapid and Sensitive Protein Similarity Searches. Science 227, 1435-1441.

Printed: April 22, 1996 15:56 (1162)