Erepeat

EREPEAT


FUNCTION

ERepeat finds direct repeats in sequences. You must set the size, stringency, and range within which the repeat must occur; all the repeats of that size or greater are displayed as short alignments. ERepeat is a version of GCG's old Repeat with command line control.


DESCRIPTION

ERepeat lets you choose a minimum repeat window and stringency and a search range and then finds all the repeats of at least that size and stringency within the search range chosen. The repeats are sorted by position and displayed in an output file as alignments of those parts of the sequence that make up the repeats. ERepeat tells you the number of repeats found for your settings of window and stringency before filing the results. If you feel there are too many repeats, you may reset the parameters before writing the repeats out to a file. You can limit the number of repeats shown, or sort the repeats by quality so that the longest repeats come at the top of the list. See the ALGORITHM topic below to understand precisely what ERepeat does.


AUTHOR

This GCG program was modified by Jaakko Hattula (Tampere University of Technology, Finland) and Peter Rice (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).


EXAMPLE

Here is a session using ERepeat to find all the direct repeats in the first 1,000 bases of gamma.seq that are at least 10 bases long, occur within 100 bases of each other, and have at least 9 out of 10 matched bases:

  
  
  % erepeat
  
    EREPEAT uses any sequence data
  
    EREPEAT of what sequence ?  gamma.seq
  
                 Start (* 1 *) ?
               End (* 11375 *) ?  1000
  
    What minimum repeat window (* 7 *) ?  10
  
    What minimum stringency (* 10.0 *) ?  9
  
    Find repeats through what range (* 50 *) ?  100
  
    There are 11 repeats, would you like to
  
     1) File the repeats
     2) Set new parameters
  
    Please choose one (* 1 *):
  
    What should I call the output file (* gamma.rpt *)
  
  %
  


OUTPUT

Each repeat is shown as an alignment of two repeated sequences along with the beginning and ending coordinates of each sequence. The size and quality of each repeat is shown to the right of the alignment. The quality is the sum of the symbol comparison values in the repeat. Here is some of the output file for the example above:

  
  
   EREPEAT of:   check: 6474  from: 1  to: 1000
  
  Human fetal beta globins G and A gamma
  from Shen, Slightom and Smithies,  Cell 26; 191-203.
  Analyzed by Smithies et al. Cell 26; 345-353.
  
   Window: 10  Stringency: 0.0  Range: 100  Repeats: 11  June 12, 1995 16:23  ..
  
   79 TGTAATCCCA 88
      || |||||||     10 9.0
  158 TGAAATCCCA 167
  
  158 TGAAATCCCATCT 170
      || ||||||| ||     13 11.0
  213 TGTAATCCCAGCT 225
  
  395 ACCAGTCTCT 404
      ||||| ||||     10 9.0
  444 ACCAGACTCT 453
  
   /////////////////////////////
  
  937 AAAAAACAAAA 947
      |||||| ||||     11 10.0
  965 AAAAAATAAAA 975
  
  965 AAAAAATAAAAA 976
      |||||||||| |     12 11.0
  985 AAAAAATAAAGA 996
  
  981 AAAGAAAAA 989
      |||||||||      9 9.0
  992 AAAGAAAAA 1000
  


RELATED PROGRAMS

Using Compare/DotPlot to create a dot-plot comparison of a sequence to itself is functionally equivalent to running ERepeat. The dot-plot is a much more graphic way to show where the repeats occur and what the background of random repeats looks like.


RESTRICTIONS

ERepeat cannot find more than 1,000 repeats.


ALGORITHM

For window/stringency comparisons, ERepeat reads a scoring matrix that defines a match value for every possible GCG symbol comparison. (See Chapter 4, Using Data Files in the User's Guide for more information.) ERepeat then slides the sequence along itself in order to generate every register of comparison (diagonal) for the search range you have set. For each diagonal, ERepeat slides a window along the pair of sequences. The match values for each pair of symbols within the window are summed to determine a score at each position. When the score under the window is greater than or equal to the set stringency, then the match criterion has been met and the repeat is recorded.
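
The window scan described above can be sketched in Python. This is a minimal illustration under stated assumptions, not ERepeat's source code: the names match_value and find_window_hits are hypothetical, the scoring is simplified to 1.0 for identical symbols and 0.0 otherwise (the real program reads its values from a scoring matrix file), the search range is assumed to be the largest offset between the two copies, and the merging of overlapping window hits into longer repeats is left out.

```python
def match_value(a, b):
    """Simplified stand-in for the scoring matrix (assumption):
    1.0 for identical symbols, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


def find_window_hits(seq, window, stringency, search_range):
    """Return (start1, start2, score) for every window position whose
    summed match values reach the stringency. Coordinates are 0-based."""
    hits = []
    n = len(seq)
    # Each offset is one register of comparison (diagonal) of the
    # sequence against itself.
    for offset in range(1, search_range + 1):
        # Slide the window along this diagonal.
        for i in range(0, n - offset - window + 1):
            score = sum(match_value(seq[i + k], seq[i + offset + k])
                        for k in range(window))
            if score >= stringency:
                hits.append((i, i + offset, score))
    return hits


# ACGT occurs at positions 0 and 4, so one window of 4 scores 4.0.
print(find_window_hits("ACGTACGTTT", window=4, stringency=4, search_range=5))
```

Lowering the stringency to 3 in this toy case also reports the neighbouring window at positions 1 and 5, which has one mismatch.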

Repeat Nibbling

Before the repeats are presented, they are nibbled from both ends so that the symbol pair on each end has a scoring matrix value of at least 0.5. Thus, repeats shorter than the minimum repeat length may be shown. You can reset this minimum match threshold with the -PAIr command line option.
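
The nibbling step can be illustrated with a short Python sketch. The function name nibble, its arguments, and the identity scoring are assumptions made for this example; only the rule itself (trim both ends until the end pairs score at least the threshold, 0.5 by default) comes from the description above.

```python
def identity_score(a, b):
    """Simplified stand-in for the scoring matrix (assumption)."""
    return 1.0 if a == b else 0.0


def nibble(seq, start1, start2, length,
           match_value=identity_score, pair_threshold=0.5):
    """Trim a repeat from both ends until the symbol pair at each end
    scores at least pair_threshold. Coordinates are 0-based.
    Returns (start1, start2, length); length is 0 if nothing survives."""
    # Nibble mismatching pairs from the left end.
    while length > 0 and match_value(seq[start1], seq[start2]) < pair_threshold:
        start1 += 1
        start2 += 1
        length -= 1
    # Nibble mismatching pairs from the right end.
    while length > 0 and match_value(seq[start1 + length - 1],
                                     seq[start2 + length - 1]) < pair_threshold:
        length -= 1
    return start1, start2, length


# GACGTC (at 0) against AACGTG (at 8): both end pairs mismatch.
seq = "GACGTC" + "TT" + "AACGTG"
print(nibble(seq, 0, 8, 6))
```

In this toy case a repeat of length 6 is reported with length 4, which is how a repeat shorter than the minimum window can appear in the output. A threshold above every value in the scoring matrix leaves a length of 0.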


CONSIDERATIONS

ERepeat can show several repeats that are part of the same structure if there is a simple sequence with a repeat period shorter than the minimum repeat length. We are considering adding a filter to ERepeat to remove these redundant repeats.


SEQUENCE TYPE

The function of ERepeat depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.


COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

  
  
  Minimum Syntax: % erepeat [-INfile=]gamma.seq -Default
  
  Prompted Parameters:
  
  -BEGin=1 -END=100           Range of interest
  [-OUTfile=]gamma.rpt        Output file
  -WINdow1=1                  Minimum repeat window
  -STRingency1=1              Minimum stringency
  -RANge1=1                   Repeat range
  -MENu1=1                    Menu response (1 = write to file)
  
  
  Local Data Files:
  
  -DATa=repeatdna.cmp   scoring matrix for nucleic acids
  -DATa=repeatpep.cmp   scoring matrix for peptides
  
  Optional Parameters:
  
  -LIMit    limits the number of repeats written into the output file
  -SORt     sorts the repeats on quality
  -PAIr=0.5 match threshold for displaying '|'
  


LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

ERepeat uses the scoring matrix found in either repeatdna.cmp or repeatpep.cmp to find the match values when determining the stringency for any position of the window. You should recognize that stringency is really the sum of the match values (defined in the scoring matrix) for the symbols compared under the window. The public version of repeatdna.cmp (for nucleic acid comparisons) scores a 1.0 for all IUB nucleic acid ambiguity symbol comparisons where there is ANY overlap between the sets defined by the symbols (see Appendix III). No symbols match the symbols X or N, however. The public version of repeatpep.cmp has 1.5 for perfect symbol matches and values less than 1.5 (depending upon the evolutionary distance) for non-matches. You can use the Fetch program to copy and modify these files to suit your own needs.
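
The overlap rule described for repeatdna.cmp can be expressed as a small Python sketch. It reconstructs the rule from the description above and does not read or reproduce the actual file; the names IUB_SETS, dna_match_value, and summed_score are hypothetical, and symbols outside the table are assumed to score 0.0.

```python
# Standard IUB nucleotide ambiguity codes and the bases each one covers.
IUB_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT",
}


def dna_match_value(a, b):
    """1.0 if the base sets of the two symbols overlap, else 0.0.
    X and N match nothing, as stated in the text above."""
    a, b = a.upper(), b.upper()
    if a in ("X", "N") or b in ("X", "N"):
        return 0.0
    # Unknown symbols get an empty set and so never match (assumption).
    set_a = set(IUB_SETS.get(a, ""))
    set_b = set(IUB_SETS.get(b, ""))
    return 1.0 if set_a & set_b else 0.0


def summed_score(top, bottom):
    """Sum of the match values for two aligned, equal-length segments."""
    return sum(dna_match_value(a, b) for a, b in zip(top, bottom))


print(dna_match_value("R", "A"))                   # R covers A and G
print(dna_match_value("R", "Y"))                   # purine vs pyrimidine
print(summed_score("TGTAATCCCA", "TGAAATCCCA"))    # first repeat in OUTPUT
```

Under these assumptions the first repeat in the OUTPUT example sums to 9.0 and the second to 11.0, in agreement with the quality values printed beside those alignments.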


OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the User's Guide.

-SORt

sorts the repeats by quality score instead of position so that the longest repeats (those with the highest quality scores) are at the top of the output.

-LIMit

limits the output report to the largest repeats. This option automatically causes the repeats to be sorted by quality score instead of position. If you use this option, the program asks you to specify how many repeats you want to see.
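
The effect of -SORt and -LIMit on the report can be sketched in Python. The function select_repeats and the tuple layout are hypothetical; the order among repeats of equal quality and the use of the first copy's start as the position are assumptions, since the text above does not specify them. The data are the six repeats shown in the OUTPUT example.

```python
def select_repeats(repeats, sort_by_quality=False, limit=None):
    """repeats: list of (start1, start2, size, quality) tuples.
    A limit implies sorting by quality, as described for -LIMit."""
    if limit is not None:
        sort_by_quality = True
    if sort_by_quality:
        # Highest quality first; ties keep their input order (assumption).
        ordered = sorted(repeats, key=lambda r: r[3], reverse=True)
    else:
        # Default report order: by position of the first copy (assumption).
        ordered = sorted(repeats, key=lambda r: r[0])
    return ordered if limit is None else ordered[:limit]


repeats = [
    (79, 158, 10, 9.0), (158, 213, 13, 11.0), (395, 444, 10, 9.0),
    (937, 965, 11, 10.0), (965, 985, 12, 11.0), (981, 992, 9, 9.0),
]
for r in select_repeats(repeats, limit=2):
    print(r)
```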

-PAIr=0.5

The output from this program has a '|' (vertical bar) between sequence symbols that match. This match display character is added to the output whenever the symbol comparison value for the two symbols in your scoring matrix is greater than or equal to 0.5. If your scoring matrix has a lot of values above 0.5, this match display threshold is too low -- many of the symbols will appear to match. The -PAIr parameter lets you specify a match display threshold appropriate for the scoring matrix you are using.

The repeat nibbling, referred to in the ALGORITHM topic above, uses the threshold value set by this command line option to decide which symbol pairs should be nibbled away from the ends of each repeat. You can set a pairing threshold high enough that all repeats are nibbled away!
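
How the match display threshold shapes the line of '|' characters can be shown with a short Python sketch. The function match_line and the identity scoring are assumptions for illustration; the rule that a bar is drawn when the comparison value is greater than or equal to the threshold is taken from the description of -PAIr above.

```python
def identity_score(a, b):
    """Simplified stand-in for the scoring matrix (assumption)."""
    return 1.0 if a == b else 0.0


def match_line(top, bottom, match_value=identity_score, pair_threshold=0.5):
    """Build the display line: '|' where the pair's comparison value
    reaches pair_threshold, a space elsewhere."""
    return "".join(
        "|" if match_value(a, b) >= pair_threshold else " "
        for a, b in zip(top, bottom)
    )


# First repeat from the OUTPUT example.
print("TGTAATCCCA")
print(match_line("TGTAATCCCA", "TGAAATCCCA"))
print("TGAAATCCCA")
```

With a threshold higher than any value in the scoring matrix, the same call returns only spaces, which mirrors the warning above that a very high pairing threshold nibbles every repeat away.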

Printed: April 22, 1996 15:53 (1162)