Eoverlap

EOVERLAP

FUNCTION

EOverlap compares two sets of DNA sequences to each other in both orientations using a WordSearch-style comparison. EOverlap is an extended version of GCG's Overlap, intended for use together with the FilterOverlap program in database nonredundancy checks.

DESCRIPTION

EOverlap accepts two sets of sequences as input and uses the algorithm of Wilbur and Lipman (Proc. Natl. Acad. Sci. USA 80, 726-730 (1983)) to compare each sequence of the first set with each sequence of the second set, in both orientations. In effect, EOverlap runs a WordSearch repeatedly, using each sequence of the first set in turn as the query. Unlike GCG's WordSearch, EOverlap looks for overlaps between sequences rather than simply regions of similarity. An overlap is a highly similar region between two sequences that runs the entire length of a register of comparison. EOverlap lists the position, length, and stringency of discovered overlaps in an output file.
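The word comparison described above can be illustrated with a minimal Python sketch. It is not the GCG or EGCG implementation: the function names and the simple dictionary-based word table are assumptions made for illustration. It counts matching words on each diagonal (register of comparison) for both orientations of the second sequence.

```python
from collections import defaultdict

# Translation table for complementing DNA bases (illustrative; unambiguous bases only).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def word_positions(seq, word_size):
    """Map each word of length word_size to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - word_size + 1):
        table[seq[i:i + word_size]].append(i)
    return table


def diagonal_word_matches(query, target, word_size=7):
    """Count matching words per diagonal (query position minus target position)."""
    table = word_positions(target, word_size)
    counts = defaultdict(int)
    for i in range(len(query) - word_size + 1):
        for j in table.get(query[i:i + word_size], ()):
            counts[i - j] += 1
    return dict(counts)


def compare_both_orientations(query, target, word_size=7):
    """Compare query against target and against target's reverse complement."""
    return {
        "+": diagonal_word_matches(query, target, word_size),
        "-": diagonal_word_matches(query, reverse_complement(target), word_size),
    }
```

A diagonal with many word matches is a candidate register of comparison; whether it qualifies as an overlap depends on the stringency and minimum overlap length discussed under CONSIDERATIONS.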

AUTHOR

This GCG program was modified by Peter Rice (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).

EXAMPLE

Here is a session using EOverlap:

  
  
  % eoverlap
  
   EOVERLAP what query sequences ?  mu*.seq
  
   To what other sequences (* mu*.seq *) ?
  
   What word size (* 7 *) ?
  
   What fraction of the words in an overlap must match (* 0.8 *) ?
  
   Integrate how many adjacent diagonals (* 3 *) ?
  
   What is the minimum overlap length (* 30 *) ?
  
   What should I call the output file (* overlap.dat *) ?
  
Reading ............
   Comparing ............
  
  %

OUTPUT

Here is the output file:

  
  
   OVERLAP of: mu:*
      to: mu:*
  
   Min overlap fraction: 0.80  Min overlap length: 10  Integral width: 3
  
                    December 12, 1995 14:02
  
  Sequence1 Strand Pos Sequence2 Strand Pos Length Matches Ratio Len1 Len2 ..
  
  
  mu10           +   2 mu5           -   1    230     205  0.89  361  230
  
         ////////////////////////////////////////////////////
  
  mu32           +   6 mu9           -   1     35      35  1.00   40   39

In this example, the overlap pairs are divided into three groups, or overlap clusters, separated by blank lines. Each cluster consists of overlapping fragments that could be chained together into a single, continuous assembly.

The output file lists the length, position, and percent similarity (ratio) of each overlap in descending order of sequence and overlap length. It also gives the orientation of each sequence.
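For downstream processing, the data rows of the output file can be read with a short script. The following Python sketch assumes each row carries exactly the eleven whitespace-separated columns shown in the header line above; the record type and function name are hypothetical and are not part of EGCG.

```python
from typing import NamedTuple, Optional


class OverlapRecord(NamedTuple):
    """One data row of the overlap table (field names are illustrative)."""
    seq1: str
    strand1: str
    pos1: int
    seq2: str
    strand2: str
    pos2: int
    length: int
    matches: int
    ratio: float
    len1: int
    len2: int


def parse_overlap_line(line: str) -> Optional[OverlapRecord]:
    """Parse one row; return None for headers, separators, and blank lines."""
    fields = line.split()
    if len(fields) != 11:
        return None
    if fields[1] not in ("+", "-") or fields[4] not in ("+", "-"):
        return None
    try:
        return OverlapRecord(
            fields[0], fields[1], int(fields[2]),
            fields[3], fields[4], int(fields[5]),
            int(fields[6]), int(fields[7]), float(fields[8]),
            int(fields[9]), int(fields[10]),
        )
    except ValueError:
        # A column that should be numeric was not; treat the line as non-data.
        return None
```

In the two rows shown above, the Ratio column equals Matches divided by Length (205/230 and 35/35), which gives a simple consistency check when reading a file.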

RELATED PROGRAMS

GCG's WordSearch uses the same comparison algorithm as EOverlap. WordSearch, however, accepts a single query sequence as input and finds regions of similarity rather than overlaps. ELineUp is a screen editor for editing and displaying overlapping sequences.

RESTRICTIONS

The total number of bases in any sequence set may not exceed 350,000. No sequence in the query set may exceed 30,000 bases. The word size must be between 1 and 30. The minimum overlap length must be between 1 and 1,000. The program cannot store more than 10,000 overlaps. If this number is exceeded, the program stops after suggesting that you increase the stringency to reduce the number of overlaps.
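These limits can be checked before a run is submitted. The following Python sketch is a hypothetical pre-flight check, not part of EOverlap; the constant and function names are assumptions, and the limit on stored overlaps is not covered because it can only be detected while the program runs.

```python
# Limits taken from the RESTRICTIONS text; the names are illustrative.
MAX_SET_BASES = 350_000       # total bases allowed in one sequence set
MAX_QUERY_BASES = 30_000      # longest allowed sequence in the query set
WORD_SIZE_RANGE = (1, 30)
MIN_OVERLAP_RANGE = (1, 1_000)


def check_restrictions(query_lengths, target_lengths, word_size, min_overlap):
    """Return a list of human-readable problems; an empty list means no violation."""
    problems = []
    for name, lengths in (("query", query_lengths), ("target", target_lengths)):
        total = sum(lengths)
        if total > MAX_SET_BASES:
            problems.append(f"{name} set has {total} bases (limit {MAX_SET_BASES})")
    longest = max(query_lengths, default=0)
    if longest > MAX_QUERY_BASES:
        problems.append(f"query sequence of {longest} bases (limit {MAX_QUERY_BASES})")
    if not WORD_SIZE_RANGE[0] <= word_size <= WORD_SIZE_RANGE[1]:
        problems.append(f"word size {word_size} outside {WORD_SIZE_RANGE}")
    if not MIN_OVERLAP_RANGE[0] <= min_overlap <= MIN_OVERLAP_RANGE[1]:
        problems.append(f"minimum overlap {min_overlap} outside {MIN_OVERLAP_RANGE}")
    return problems
```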

ALGORITHM

EOverlap, like GCG's WordSearch, identifies sequence similarities using a Wilbur and Lipman-style word comparison (see the WordSearch entry in the Program Manual for details of this algorithm and considerations about using this search). EOverlap differs from WordSearch in that it accepts a set of query sequences as input and reports overlaps rather than regions of similarity.

EOverlap removes gap characters ( . ) from the input sequences before comparing them.
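The effect of this step can be mimicked when preparing sequences outside the program. A minimal Python sketch, assuming the sequence is held as a plain string (the function name is hypothetical):

```python
def strip_gaps(seq, gap_char="."):
    """Return seq with every gap character removed."""
    return seq.replace(gap_char, "")
```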

CONSIDERATIONS

For considerations in using a word comparison, see the CONSIDERATIONS topic in the WordSearch entry of the Program Manual.

EOverlap recognizes certain regions of similarity as overlaps based on the strength of similarity across the entire register of comparison and the length of the register itself. These requirements correspond to the stringency and minimum overlap values that are set in response to EOverlap's prompts. A stringency of 0.95 means that 95 percent of the bases in a given register of comparison must match for that similarity to be recognized as an overlap. A minimum overlap of 10 means that a given register of comparison must contain at least 10 bases to qualify as an overlap. The figure at the end of this entry illustrates these requirements. Examples four through six of this figure are not overlaps for the following reasons:

4. -- Although the two sequences are highly similar from B to C, the similarity over the length of the entire register, A to D, is not particularly strong. Highly similar segments that are not positioned at the end of each sequence are not reported as overlaps. The exception to this is the third example, in which a short sequence is completely similar to an internal segment of a larger sequence.

5. -- These sequences are not similar enough to contain an overlap. The minimum stringency requirement for overlaps is not met.

6. -- The register of comparison containing the similarity is not long enough; overlaps must be at least as long as the minimum overlap length.
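The two requirements can be expressed as a simple test on one register of comparison. The Python sketch below is an illustration under simplifying assumptions: it compares bases position by position without gaps, whereas EOverlap scores word matches, and the function names and the offset convention are hypothetical.

```python
def register_of_comparison(len1, len2, offset):
    """Return (start1, start2, length) of the register for one diagonal.

    offset is the position in sequence 1 that lines up with position 0 of
    sequence 2; a negative offset means sequence 2 starts before sequence 1.
    """
    start1 = max(offset, 0)
    start2 = max(-offset, 0)
    length = min(len1 - start1, len2 - start2)
    return start1, start2, max(length, 0)


def is_overlap(seq1, seq2, offset, stringency=0.8, min_overlap=10):
    """True if the whole register meets both the length and stringency tests."""
    start1, start2, length = register_of_comparison(len(seq1), len(seq2), offset)
    if length == 0 or length < min_overlap:
        return False          # register too short (example 6)
    part1 = seq1[start1:start1 + length]
    part2 = seq2[start2:start2 + length]
    matches = sum(a == b for a, b in zip(part1, part2))
    # Similarity is judged over the entire register, not its best segment
    # (examples 4 and 5 fail here).
    return matches / length >= stringency
```

Because the register always extends to the end of at least one sequence on each side, a short sequence lying entirely inside a longer one still passes (the third example), while an internal similarity flanked by mismatches does not.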

To make the search for overlaps more tolerant of gaps between sequences, EOverlap combines the scores of a user-defined number of adjacent diagonals, or registers of comparison (see the ALGORITHM topic in the WordSearch entry of the Program Manual). Thus, the reported percent similarity or ratio may be larger than the actual ratio and may even be greater than 100%. Combining the scores of adjacent diagonals in EOverlap may cause the listed overlap position to be a few bases removed from the actual overlap position.
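The following Python sketch shows why an integrated ratio can exceed 100%. It assumes a centered window with an odd number of diagonals and assumes that the summed score is divided by the length of the central register; the exact weighting used by EOverlap is not documented here, and the function name is hypothetical.

```python
def integrate_diagonals(scores, width=3):
    """Sum each diagonal's score with its neighbours in a centred window.

    scores maps a diagonal number to its match count. width is assumed to be
    odd, so the window covers width // 2 diagonals on each side.
    """
    half = width // 2
    return {
        d: sum(scores.get(d + k, 0) for k in range(-half, half + 1))
        for d in scores
    }


# Illustrative numbers only: three adjacent diagonals, register length 100.
raw = {4: 20, 5: 90, 6: 15}
integrated = integrate_diagonals(raw, width=3)
ratio = integrated[5] / 100   # exceeds 1.0 once the neighbours are added
```

In this made-up case diagonal 5 alone scores 90 matches over 100 bases, but the integrated score of 125 yields a ratio of 1.25.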

SUGGESTIONS

If you are looking only for weak overlaps, you can use the -UPPERlimit command line option to specify a maximum stringency. Overlaps containing more than this maximum fraction of matching bases are not reported in the output file. For example, if you run % eoverlap -UPPERlimit=0.7 -STRIngency=0.6, your output contains only overlaps in which 60 to 70 percent of the bases matched.
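The combined effect of the two options amounts to a window on the reported ratio. A minimal Python sketch, assuming both bounds are inclusive (the text above excludes only overlaps with more than the upper limit; the function name is hypothetical):

```python
def within_stringency_window(ratio, stringency=0.6, upper_limit=0.7):
    """True if ratio is at least stringency and no greater than upper_limit."""
    return stringency <= ratio <= upper_limit


# Illustrative ratios; only those inside the window are kept.
ratios = [0.55, 0.60, 0.65, 0.70, 0.89]
kept = [r for r in ratios if within_stringency_window(r)]
```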

COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

  
  
  Minimal Syntax: % eoverlap [-INfile1=]Mu*.Seq -Default
  
  Prompted Parameters:
  
  [-INfile2=]Mu*.Seq        second search set
  -WORdsize=5               length of word for a match
  -STRIngency=.80           minimum fraction of required word matches
  -MINOverlap=10            minimum overlap length
  -INTegrate=3              number of diagonals to integrate
  [-OUTfile=]overlap.dat    output file
  
  Local Data Files:  None
  
  Optional Parameters:
  
  -UPPERlimit=.90           upper limit on stringency
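When EOverlap is driven from a script, the command line can be assembled from the qualifiers listed above. The Python sketch below only builds the argument list and does not run the program; the helper name is hypothetical, and the qualifier spellings are copied from this summary.

```python
def build_eoverlap_command(infile1, infile2=None, wordsize=None,
                           stringency=None, minoverlap=None, integrate=None,
                           outfile=None, upperlimit=None, use_defaults=True):
    """Return an argument list for eoverlap; unset options are omitted."""
    args = ["eoverlap", f"-INfile1={infile1}"]
    optional = (
        ("-INfile2", infile2),
        ("-WORdsize", wordsize),
        ("-STRIngency", stringency),
        ("-MINOverlap", minoverlap),
        ("-INTegrate", integrate),
        ("-OUTfile", outfile),
        ("-UPPERlimit", upperlimit),
    )
    for flag, value in optional:
        if value is not None:
            args.append(f"{flag}={value}")
    if use_defaults:
        # -Default accepts the default answer for every remaining prompt.
        args.append("-Default")
    return args
```

For example, build_eoverlap_command("mu*.seq", stringency=0.6, upperlimit=0.7) produces the arguments for the weak-overlap search described under SUGGESTIONS.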