Polydot

POLYDOT(+)


FUNCTION

PolyDot compares two sets of sequences, draws a dotplot for each pair of sequences, and reports all identical matches of at least a specified length.


DESCRIPTION

PolyDot compares a set of sequences against itself, or against a second set.

The sequences can be nucleotide or protein, aligned or unaligned. The original applications are to compare members of a protein family and to compare the major contigs of a fragment assembly project.

PolyDot uses a word (ktup) comparison method, so that only exact matches of a given minimum length are reported. This makes it more appropriate for closely related sequences.
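
The word comparison idea can be illustrated with a short Python sketch. This is not PolyDot's own code: the function name word_matches and the choice to index the second sequence in a dictionary are assumptions made for illustration. It reports each maximal exact match once, as 1-based start positions plus a length.

```python
from collections import defaultdict


def word_matches(seq_a, seq_b, word_size=15):
    """Return (start_a, start_b, length) for each maximal exact match
    of at least word_size characters. Positions are 1-based."""
    # Index every word of length word_size in seq_b by its start offset.
    index = defaultdict(list)
    for j in range(len(seq_b) - word_size + 1):
        index[seq_b[j:j + word_size]].append(j)

    hits = []
    for i in range(len(seq_a) - word_size + 1):
        for j in index.get(seq_a[i:i + word_size], ()):
            # A seed whose preceding characters also match is only the
            # continuation of a match that started earlier on this diagonal.
            if i > 0 and j > 0 and seq_a[i - 1] == seq_b[j - 1]:
                continue
            # Extend the seed along the diagonal while characters agree.
            length = word_size
            while (i + length < len(seq_a) and j + length < len(seq_b)
                   and seq_a[i + length] == seq_b[j + length]):
                length += 1
            hits.append((i + 1, j + 1, length))
    return hits


if __name__ == "__main__":
    for hit in word_matches("AAACGTACGTTT", "GGACGTACGTCC", word_size=8):
        print(hit)
```

Because only exact words seed a match, a single mismatch every few residues is enough to hide a similarity, which is why the method suits closely related sequences.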

The output from PolyDot includes all the dotplots on a single page and to the same scale, a list of all identical regions of at least the selected length, and an input file for Segments to report the complete alignment around each detected match.


AUTHOR

This program was written by Peter Rice (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).


EXAMPLE

Here is a session with PolyDot:

  
  
  % polydot
  
   POLYDOT uses any sequences
  
   POLYDOT of what sequence(s) ?  @eclac.fil
  
              Reverse (* No *) ?
  
       EGenRunData:eclac.seq  len: 7477  wgt: 1.00
      EGenRunData:eclaca.seq  len: 1832  wgt: 1.00
      EGenRunData:eclaci.seq  len: 1113  wgt: 1.00
      EGenRunData:eclacy.seq  len: 1500  wgt: 1.00
      EGenRunData:eclacz.seq  len: 3078  wgt: 1.00
  
   What word size (* 15 *) ?
  
   What should I call the output file (* eclac.poly *) ?
  
   PostScript instructions for a LASERWRITER are now being sent to gcgplot.ps.
  
  %
  


OUTPUT

The output from a session with PolyDot is a graphics output file and a list of hits (file extension ".poly"). The command line option -WORDList also generates a set of input files for the GCG program Segments with the file extension ".word".

Part of the output from the example is shown below:

  
  
  
  EGenRunData:eclac.seq (7477) vs. EGenRunData:eclac.seq (7477)
        1    7477                  1    7477           7477
  
  EGenRunData:eclac.seq (7477) vs. EGenRunData:eclaca.seq (1832)
     5646    7477                  1    1832           1832
  
  EGenRunData:eclac.seq (7477) vs. EGenRunData:eclaci.seq (1113)
       49    1161                  1    1113           1113
  
  EGenRunData:eclac.seq (7477) vs. EGenRunData:eclacy.seq (1500)
     4305    5804                  1    1500           1500
  
  EGenRunData:eclac.seq (7477) vs. EGenRunData:eclacz.seq (3078)
     1287    4364                  1    3078           3078
  
  
          //////////////////////////////////
  
  

This is the plot from the example session
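
The hit list excerpt above can be read by a small script. The Python sketch below is an illustration only: it assumes that each header line names the two sequences with their lengths, and that each data line holds five integers meaning start and end in the first sequence, start and end in the second sequence, and match length. That column interpretation is inferred from the excerpt, not stated by the source, and the name parse_poly is hypothetical.

```python
import re

# Header lines look like: "name1 (len1) vs. name2 (len2)".
HEADER = re.compile(r"^\s*(\S+)\s+\((\d+)\)\s+vs\.\s+(\S+)\s+\((\d+)\)\s*$")


def parse_poly(text):
    """Parse hit-list text into a list of dictionaries, one per hit.
    The meaning of the five columns is an assumption (see lead-in)."""
    hits = []
    pair = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            pair = (header.group(1), header.group(3))
            continue
        fields = line.split()
        # Data lines: exactly five integers following a header line.
        if pair and len(fields) == 5 and all(f.isdigit() for f in fields):
            start_a, end_a, start_b, end_b, length = map(int, fields)
            hits.append({
                "seq_a": pair[0], "seq_b": pair[1],
                "start_a": start_a, "end_a": end_a,
                "start_b": start_b, "end_b": end_b,
                "length": length,
            })
    return hits


SAMPLE = """
  EGenRunData:eclac.seq (7477) vs. EGenRunData:eclaca.seq (1832)
     5646    7477                  1    1832           1832
"""

if __name__ == "__main__":
    for hit in parse_poly(SAMPLE):
        print(hit["seq_b"], hit["start_a"], hit["end_a"], hit["length"])
```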


CONSIDERATIONS

PolyDot can only find a match if there is a minimum number of bases or residues matching perfectly between two sequences. This limits its use to closely related sequences, but allows it to run extremely quickly. Comparisons of all contigs in a cosmid sequencing project (about 40,000 bases) can take only a few seconds to run with a suitable word size.

PolyDot can run with a very small word size to detect weaker similarities, but this drastically increases the number of lines plotted and the run time required.
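
A rough calculation shows why a small word size is so costly. The Python sketch below estimates how many word positions would match purely by chance between two random sequences, assuming every residue is independent and equally likely. Real sequences are not random, so treat the result as an order-of-magnitude guide rather than a prediction of PolyDot's output. The function name is hypothetical.

```python
def expected_chance_words(len_a, len_b, word_size, alphabet=4):
    """Expected number of chance word matches between two random
    sequences: (number of position pairs) / alphabet ** word_size."""
    positions_a = max(len_a - word_size + 1, 0)
    positions_b = max(len_b - word_size + 1, 0)
    return positions_a * positions_b / alphabet ** word_size


if __name__ == "__main__":
    # 40,000 bases compared against itself, as in a cosmid project.
    for word_size in (15, 10, 7):
        estimate = expected_chance_words(40000, 40000, word_size)
        print(word_size, round(estimate, 1))
```

Under these assumptions a 40,000 base comparison yields about 1.5 chance matches at a word size of 15, but roughly 98,000 at a word size of 7, since each step down in word size multiplies the background by four for DNA.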


SUGGESTIONS

A word value of 15 is reasonable for finding direct (tandem) repeats in DNA sequences, or for inverted repeats if one set of sequences is reversed.

The -REVerse option is always worth using in a second run. Remember when comparing contigs for a fragment assembly project that the orientation of the contigs is usually unknown and that you will need a comparison in each direction.
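
The need for a comparison in each direction can be shown with a short Python sketch. It assumes that reversing a DNA sequence means taking its reverse complement, which is the usual sense for strand comparisons but is not spelled out here. The function names are hypothetical and the code is not PolyDot's.

```python
# Translation table for DNA complement (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def shares_word(seq_a, seq_b, word_size):
    """True if the two sequences share at least one exact word."""
    words = {seq_b[j:j + word_size]
             for j in range(len(seq_b) - word_size + 1)}
    return any(seq_a[i:i + word_size] in words
               for i in range(len(seq_a) - word_size + 1))


def orientation(contig_a, contig_b, word_size=15):
    """Return (forward_match, reverse_match) for a pair of contigs."""
    forward = shares_word(contig_a, contig_b, word_size)
    reverse = shares_word(contig_a, reverse_complement(contig_b), word_size)
    return forward, reverse


if __name__ == "__main__":
    contig = "AAAACCCCAAAACCCC"
    flipped = reverse_complement(contig)
    # The overlap is invisible in the forward comparison alone.
    print(orientation(contig, flipped, word_size=4))
```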

When comparing contigs in a fragment assembly project, the qualifier -MINLENgth is very useful for excluding short fragments which are too small to see on a cosmid scale plot.


INPUT FILE

The input files for PolyDot are nucleotide or protein sequence files.


GRAPHICS

The Wisconsin Package must be configured for graphics before you run any program with graphics output! If the % setplot command is available in your installation, this is the easiest way to establish your graphics configuration, but you can also use commands like % postscript that correspond to the graphics languages the Wisconsin Package supports. See Chapter 5, Using Graphics in the User's Guide for more information about configuring your process for graphics.


CTRL-C

If you need to stop this program, use <Ctrl>C to reset your terminal and session as gracefully as possible. Searches and comparisons write out the results from the part of the search that is complete when you use <Ctrl>C. The graphics device should stop plotting the current page and start plotting the next page. If the current page is the last page, plotters should put the pen away and graphic terminals should return to interactive mode.


COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
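
The capitalization convention can be made concrete with a short Python sketch. It illustrates the rule as described above and is not the Wisconsin Package's own parser: the function names are hypothetical, and case-insensitive matching of what the user types is an assumption.

```python
def minimum_abbreviation(qualifier):
    """Return the leading capitalized letters of a qualifier name,
    for example 'WORDS' for '-WORDSize=15'."""
    name = qualifier.lstrip("-").split("=")[0]
    count = 0
    while count < len(name) and name[count].isupper():
        count += 1
    return name[:count]


def matches(qualifier, typed):
    """True if 'typed' is an acceptable spelling of 'qualifier':
    it must contain the required letters and must not stray from
    the full name. Comparison ignores case (an assumption)."""
    full = qualifier.lstrip("-").split("=")[0].upper()
    required = minimum_abbreviation(qualifier)
    given = typed.lstrip("-").split("=")[0].upper()
    return given.startswith(required) and full.startswith(given)


if __name__ == "__main__":
    print(minimum_abbreviation("-WORDSize=15"))
    print(matches("-WORDSize=15", "-words=12"))
    print(matches("-WORDSize=15", "-word=12"))
```

The sketch checks one qualifier at a time. It does not resolve cases where a typed abbreviation fits more than one qualifier, such as -OUTfile and -OUTfile2.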

  
  
  Minimum syntax: % polydot [-INfile=]genembl:eclac* -Default
  
  Prompted Parameters:
  
  -REVerse                  Reverses the first set of DNA sequences
  -WORDSize=15              Comparison word size (minimum match)
  -OUTfile=seqname.poly     Output file name
  
  Local Data Files: None
  
  Optional Parameters:
  
  -MINLENgth=1500           Ignores short sequences
  -WORDList                 Write "wordsearch" files for each sequence
  -OUTfile2=seqname.word    Names of "wordsearch" files
  -POINTLIMIT=200000        Maximum number of lines plotted
  -NOPLOT                   Do not produce the dotplot
  -NOHITlist                Do not produce the list of identity hits
  -INTerval=2               Gap between dotplots in platen units
  -MAXGAPs=3                Gaps allowed in "wordsearch" matches
  -MINSCore=15              Minimum score for "wordsearch" matches
  
  
  Most EGCG graphics programs accept these and other switches. See the Using
  Graphics chapter of the EGCG USERS GUIDE for descriptions.
  
  -DENSity=150.0        plot density in bases per 100 platen units
  -LEFTMARgin=10.0      sets the left plot margin position
  -RIGHTMARgin=140.0    sets the right plot margin position
  -BOTTOMMARgin=10.0    sets the bottom plot margin position
  -TOPMARgin=90.0       sets the top plot margin position
  -BORDer               puts a line border around the plot
  -NOBORDer             suppresses a line border
  -PAGENUMber           forces page numbering
  -NOPAGENUMber         suppresses page numbering
  -TITletext="text"     overrides the default plot title
  -NOTITletext          suppresses the plot title
  -SUBTITletext="text"  overrides the default plot subtitle
  -NOSUBTITletext       suppresses the plot subtitle
  -CHEIGHT=1.5          default plot character height
  -LINESTyle1=1         plot line style 1 (set for each line)
  -LINEPERiod1=1        plot line period 1 (set for each line)
  -LINECOLor1=0         plot line colour 1 (set for each line)
  
  All GCG graphics programs accept these and other switches. See the Using
  Graphics chapter of the USERS GUIDE for descriptions.
  
  -FIGure[=FileName]  stores plot in a file for later input to FIGURE
  -FONT=3             draws all text on the plot using font 3
  -COLor=1            draws entire plot with pen in stall 1
  -SCAle=1.2          enlarges the plot by 20 percent (zoom in)
  -XPAN=10.0          moves plot to the right 10 platen units (pan right)
  -YPAN=10.0          moves plot up 10 platen units (pan up)
  -PORtrait           rotates plot 90 degrees
  


LOCAL DATA FILES

None.


OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

-REVerse

The second set of sequences can be reversed. This is very useful when comparing a set of sequences to itself.

-OUTfile=seqname.poly

sets the name of the main output file.

-WORDSize=15

sets the word size. A value of 15 is reasonable for DNA sequencing projects, but a lower value (for example 7) may be useful for more sensitive but slower comparisons.

-MINLENgth=0

sets a minimum length for the plot. All sequences shorter than this length (small single fragments in an assembly project for example) will be ignored.

-WORDList

tells PolyDot to write "wordsearch" files for each sequence that has a match above a minimum length. This length can be more than the word size used for the dot plot. See the option -MINSCore below for more detail.

-OUTfile2=seqname.word

sets the name of each "wordsearch" output file. The default name uses the sequence name and the extension ".word" which should be good enough for most purposes.

-POINTLIMIT=200000

When a large number of lines is found on a plot (usually with a small word size, or a very large number of simple repeats) the program can take a very long time to run. This option reduces the maximum number of lines on any one dotplot to a smaller size, possibly allowing other plots to show significant matches. If the plot is truncated, the bottom left corner (near the start of each sequence) is drawn.

This limit is to the number of lines drawn on the plot. The lines can contain any number of points (matching residues).

-NOPLOT

allows the program to produce the list of hits (and possibly the "wordsearch" output files) without drawing the actual dot plots.

-NOHITlist

allows the program to produce the single page of dot plots without writing the list of all the matches (the ".poly" file). This can be very useful when testing small word sizes to avoid creating a large output file.

-INTerval=1

sets the gap between individual dot plots in GCG's "platen units". The page is "100 platen units high" for a GCG plot.

-MAXGAPs=3

sets the number of diagonals to be counted together when producing the "wordsearch" output files (with the -WORDList option on the command line).

This can be set to the number of gaps expected in aligning two contigs or used to reduce the effect of simple repeats.

-MINSCore=15

sets the minimum score used when producing the "wordsearch" output files (with the -WORDList option on the command line).

The minimum score defaults to the word size, so any diagonal will be reported. A higher score forces "wordsearch" hits to have more than the minimum word match.

These options apply to all GCG graphics programs. These and many others are described in detail in Chapter 5, Using Graphics of the User's Guide.

-FIGure=programname.figure

writes the plot as a text file of plotting instructions suitable for input to the Figure program instead of drawing the plot on your plotter.

-FONT=3

draws all text characters on the plot using Font 3 (see Appendix I).

-COLor=1

draws the entire plot with the pen in stall 1.

These options let you expand or reduce the plot (zoom), move it in either direction (pan), or rotate it 90 degrees (rotate).

-SCAle=1.2

expands the plot by 20 percent by resetting the scaling factor (normally 1.0) to 1.2 (zoom in). You can expand the axes independently with -XSCAle and -YSCAle. Numbers less than 1.0 contract the plot (zoom out).

-XPAN=30.0

moves the plot to the right by 30 platen units (pan right).

-YPAN=30.0

moves the plot up by 30 platen units (pan up).

-PORtrait

rotates the plot 90 degrees. Usually, plots are displayed with the horizontal axis longer than the vertical (landscape). Note that plots are reduced or enlarged, depending on the platen size, to fill the page.

Printed: April 22, 1996 15:55 (1162)