Eplotsimilarity

EPLOTSIMILARITY(+)


FUNCTION

EPlotSimilarity plots the running average of the similarity among the sequences in a multiple sequence alignment.


DESCRIPTION

EPlotSimilarity calculates the average similarity among all members of a group of aligned sequences at each position in the alignment, using a user-specified sliding window of comparison. The window of comparison is moved along all sequences, one position at a time, and the average similarity over the entire window is plotted at the middle position of the window. The average similarity across the entire alignment is plotted as a dotted line.
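The windowing described above can be sketched in Python. This is a minimal illustration, not the program's source: the function name running_average and the sample scores are hypothetical, and placing each point at index start + window // 2 is an assumption about what "middle position" means for a window of even length.

```python
def running_average(scores, window):
    """Average per-position scores over a sliding window.

    Returns (center_index, mean) pairs, one per full window. The center
    index start + window // 2 is an assumed convention (0-based).
    """
    if window < 1 or window > len(scores):
        return []
    points = []
    total = sum(scores[:window])
    for start in range(len(scores) - window + 1):
        if start > 0:
            # Slide by one position: add the entering score, drop the leaving one.
            total += scores[start + window - 1] - scores[start - 1]
        points.append((start + window // 2, total / window))
    return points


# Hypothetical per-position similarity scores for a short alignment.
scores = [1.0, 1.0, 0.0, 1.0, 0.0, 0.0]
points = running_average(scores, window=3)
```

With six positions and a window of 3, the sketch yields four points, one for each position the window can occupy.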

If you give EPlotSimilarity a single input sequence, you can choose the range and strand for that sequence, and then EPlotSimilarity prompts you for the name, range, and strand of a second input sequence. In this way, you can plot the average similarity between the two aligned sequences created with % gap -OUT.


AUTHOR

This GCG program was modified by Peter Rice (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).


EXAMPLE

Here is a session using EPlotSimilarity to display the similarity among the group of aligned 70 kd heat shock and heat shock cognate peptide sequences in the file hsp70.msf:

  
  
  % eplotsimilarity
  
   EPLOTSIMILARITY uses any sequences
  
   EPLOTSIMILARITY of what sequence(s) ? hsp70.msf{*}
  
               Start (* 1 *) ?
             End (*   720 *) ?
  
   hsp70.msf{hs70_plafa}
   hsp70.msf{hs70_thean}
  
   /////////////////////
  
   hsp70.msf{dnak_ecoli}
  
   What window to average (* 10 *) ?
  
   The minimum density for this plot is  626.1 residues/100 platen units.
   What density do you want (* 626.1 *) ?
  
    When your LaserWriter attached to tty07 is ready, press Return.
  
  %
  


OUTPUT

This is the plot from the example session.


RELATED PROGRAMS

PileUp creates a multiple sequence alignment of a group of related sequences. Gap uses the algorithm of Needleman and Wunsch to find the alignment of two complete sequences that maximizes the number of matches and minimizes the number of gaps. BestFit makes an optimal alignment of the best segment of similarity between two sequences. Optimal alignments are found by inserting gaps to maximize the number of matches using the local homology algorithm of Smith and Waterman. ProfileMake creates a position-specific scoring table, called a profile, that quantitatively represents the information from a group of aligned sequences. The profile can then be used for database searching (ProfileSearch) or sequence alignment (ProfileGap).

GapShow displays an alignment of two sequences by making a graph that shows the distribution of similarities and gaps.


RESTRICTIONS

The lengths of all sequences being compared must be the same.


ALGORITHM

The average similarity at a position in an alignment is the arithmetic average of the scores of all possible pairwise symbol comparisons among the sequence symbols at that position. The comparison value between any two sequence symbols can be found in the scoring matrix (see the LOCAL DATA FILES topic below). The average similarity across the entire alignment (plotted as a dotted line) is the sum of the separate window similarities divided by the number of windows.
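The calculation in the paragraph above can be sketched in Python as follows. The alignment, the window size, and the toy_score function are hypothetical stand-ins (a real run would look up values in the scoring matrix), and the sketch does not address how gap symbols are scored, which this document does not specify.

```python
from itertools import combinations


def column_similarity(column, score):
    """Arithmetic mean of score(a, b) over all symbol pairs in one column."""
    pairs = list(combinations(column, 2))
    return sum(score(a, b) for a, b in pairs) / len(pairs)


def toy_score(a, b):
    # Hypothetical stand-in for a scoring-matrix lookup.
    return 1.5 if a == b else 0.5


# Three aligned sequences of equal length (hypothetical data).
alignment = ["ACGT", "ACGA", "ATGA"]
columns = list(zip(*alignment))
per_position = [column_similarity(col, toy_score) for col in columns]

window = 2
window_means = [
    sum(per_position[i:i + window]) / window
    for i in range(len(per_position) - window + 1)
]
# Overall average: sum of the window similarities divided by the number of windows.
overall = sum(window_means) / len(window_means)
```

Here per_position holds one average per alignment column, window_means holds the values that would be plotted as the curve, and overall corresponds to the dotted line.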

If -IDEntity is on the command line, the program plots a measure of the level of identity among all sequences in the multiple sequence alignment. The calculations are done exactly as described above, but all identical symbol comparisons are given a value of 1.0; all other comparisons are given a value of 0.0.
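As a rough illustration of the identity measure, the Python sketch below replaces the scoring-matrix lookup with a 1.0/0.0 comparison. The function names and the sample columns are hypothetical; only the 1.0 and 0.0 values come from the description above.

```python
from itertools import combinations


def identity_score(a, b):
    """1.0 for identical symbols, 0.0 for everything else."""
    return 1.0 if a == b else 0.0


def column_identity(column):
    """Mean identity over all pairwise comparisons in one alignment column."""
    pairs = list(combinations(column, 2))
    return sum(identity_score(a, b) for a, b in pairs) / len(pairs)


# Hypothetical columns from a four-sequence alignment.
fully_conserved = column_identity("KKKK")
partly_conserved = column_identity("KKRR")
```

In the second column only two of the six possible pairs are identical, so the value falls to one third even though each symbol occurs twice.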

If -PROFile is on the command line, the program plots a running average of the positional conservation in a profile. The measure of conservation at any position is the difference between the greatest and least values at that position in the profile.
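The conservation measure for a profile can be sketched in Python as shown below. The profile here is a hypothetical list of rows (one row per position, one value per symbol) and does not reproduce the actual profile file format; the resulting values would then be averaged over the sliding window like any other per-position score.

```python
def conservation(profile):
    """Per-position spread: greatest minus least value in each profile row."""
    return [max(row) - min(row) for row in profile]


# Hypothetical profile: one row per alignment position, one column per symbol.
profile = [
    [1.5, -0.5, 0.0],
    [0.2, 0.1, 0.3],
    [1.0, 1.0, 1.0],
]
spread = conservation(profile)
```

A row whose values are all equal gives a spread of zero, which is how a position with no preference for any symbol would appear under this measure.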


CONSIDERATIONS

EPlotSimilarity does not create the multiple sequence alignment. You can create the alignment using PileUp, Gap, or BestFit (see the INPUT FILE topic below).


SUGGESTIONS

You can plot a measure of identity between all sequences in the alignment using the -IDEntity command line option.

You can plot a measure of the level of conservation in a profile created from a multiple sequence alignment using the -PROFile command line option. This plot provides similar information to a plot of the similarity among the sequences in the multiple sequence alignment.


GRAPHICS

The Wisconsin Package must be configured for graphics before you run any program with graphics output! If the % setplot command is available in your installation, this is the easiest way to establish your graphics configuration, but you can also use commands like % postscript that correspond to the graphics languages the Wisconsin Package supports. See Chapter 5, Using Graphics in the User's Guide for more information about configuring your process for graphics.


CTRL-C

If you need to stop this program, use Ctrl-C to reset your terminal and session as gracefully as possible. Searches and comparisons write out the results from the part of the search that is complete when you use Ctrl-C. The graphics device should stop plotting the current page and start plotting the next page. If the current page is the last page, plotters should put the pen away and graphic terminals should return to interactive mode.


INPUT FILE

The input to EPlotSimilarity is a group of two or more aligned sequences. The multiple sequence alignment created by the PileUp program can be used as input to EPlotSimilarity. The gapped output files from the Gap and BestFit programs, which were created using the -OUTfile2 and -OUTfile3 command line qualifiers, can also be used as input to EPlotSimilarity. If the first sequence entered into EPlotSimilarity is a single sequence, the program prompts you for the second sequence.

EPlotSimilarity tries to read the name of the scoring matrix from the text heading of the input file. If it can't read the matrix name, it uses the default scoring matrix (see the LOCAL DATA FILES topic below).


SEQUENCE TYPE

The function of EPlotSimilarity depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.
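A check for the type marker might look like the Python sketch below. It is only an illustration of the rule stated above: the function name sequence_type and the sample heading line are hypothetical, and the sketch looks at a single line that the caller has already identified as the last line of the text heading.

```python
import re


def sequence_type(heading_last_line):
    """Return 'N', 'P', or None based on a 'Type: N' / 'Type: P' marker."""
    match = re.search(r"Type:\s*([NP])\b", heading_last_line)
    return match.group(1) if match else None


# Hypothetical heading line; only the Type field matters here.
line = "hs70_plafa  Length: 720  Type: P  .."
kind = sequence_type(line)
```

A result of None would correspond to a sequence whose type has not been set.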


COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

  
  
  Minimal Syntax: % eplotsimilarity [-INfile1=]hsp70.msf{*} -Default
  
  Prompted Parameters:
  
  -BEGin1=1 -END1=720    the range of interest in the alignment
  -WINdow=10            comparison window
  -DENsity=626.1        the number of residues per 100 platen units
  
  Prompted Parameters for comparing 2 sequences only:
  
  [-INfile2=]ggamma.seq second input sequence
  -BEGin2=1 -END2=738   the range of interest in sequence 2
  -REVerse1 -REVerse2   strand of each sequence
  
  Local Data Files: -DATa=plotsimpep.cmp scoring matrix for peptides
               -DATa=plotsimdna.cmp scoring matrix for nucleic acids
  
  Optional Parameters:
  
  -IDEntity             plots the level of identity among the sequences
  -BARgraph             plots a bar graph (rather than a continuous curve)
  -PROFile              plots positional conservation in a profile
  -MINScale=0           sets the bottom of the similarity score scale
  -MAXScale=2           sets the top of the similarity score scale
  -EXPand               scales plot between observed min and max
                        similarity scores
  -NOAVErage            suppresses the plot of overall similarity
  -NOTITle              suppresses the plot title
  -BOXplot              adds a box around the plot
  -MINExist=0           minimum number of sequences to plot a point
  -MINLen=10            minimum length of sequence to plot points
  
  All GCG graphics programs accept these and other switches. See the Using
  Graphics chapter of the User's Guide for descriptions.
  
  -FIGure[=FileName]  stores plot in a file for later input to FIGURE
  -FONT=3             draws all text on the plot using font 3
  -COLor=1            draws entire plot with pen in stall 1
  -SCAle=1.2          enlarges the plot by 20 percent (zoom in)
  -XPAN=10.0          moves plot to the right 10 platen units (pan right)
  -YPAN=10.0          moves plot up 10 platen units (pan up)
  -PORtrait           rotates plot 90 degrees
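The capitalization convention used in the summary above can be illustrated with a short Python sketch. It is one possible reading of the rule, not the package's actual parser: the function names are hypothetical, and both the case-insensitive comparison and the acceptance of any longer prefix of the full name are assumptions.

```python
def required_prefix(qualifier):
    """Leading capital letters of a qualifier name, e.g. '-IDEntity' -> 'IDE'."""
    name = qualifier.lstrip("-")
    prefix = ""
    for ch in name:
        if not ch.isupper():
            break
        prefix += ch
    return prefix


def accepts(typed, qualifier):
    """True if `typed` covers the required letters and is a prefix of the full name.

    Case-insensitive matching is an assumption made for this sketch.
    """
    name = qualifier.lstrip("-").upper()
    given = typed.lstrip("-").upper()
    return given.startswith(required_prefix(qualifier)) and name.startswith(given)
```

Under these assumptions, accepts("-ide", "-IDEntity") is true because all three capitalized letters are present, while accepts("-id", "-IDEntity") is false.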
  


LOCAL DATA FILES

The files described below supply data to this program. The program automatically reads them from a public data directory unless one of the following occurs: 1) a different data file is named in the text heading of the input file; 2) you have a data file with exactly the same name in your current working directory; or 3) you name a file on the command line with an expression like -DATa1=mydata.dat. A file named with an expression like -DATa1=mydata.dat takes precedence over a data file named in the text heading of the input file. The concept of a local data file is described in more detail in Chapter 4, Using Data Files in the User's Guide.

EPlotSimilarity reads a scoring matrix from your local directory or the public database with the values for every possible match. EPlotSimilarity tries to read the name of the scoring matrix from the text heading of the input file. If it can't read the matrix name, it uses the default scoring matrix. The default file plotsimdna.cmp has a 1.0 at every place where the set of bases implied by the alphabetic IUB ambiguity codes (see Appendix III) overlap; all of the other locations have zeros. The default file plotsimpep.cmp has 1.5 for perfect symbol matches and values less than 1.5 (depending upon the evolutionary distance) for non-matches. You can use Fetch to copy these files and then you can modify them to suit your own needs.
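The rule described for plotsimdna.cmp can be sketched in Python as below. The sketch derives scores from the standard IUB ambiguity codes rather than reading the file, so it illustrates the stated rule only; the name overlap_score is hypothetical, and symbols outside the fifteen codes listed are not handled.

```python
# IUB nucleotide ambiguity codes and the bases each one stands for.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def overlap_score(a, b):
    """1.0 if the base sets implied by two IUB codes share a base, else 0.0."""
    return 1.0 if set(IUB[a.upper()]) & set(IUB[b.upper()]) else 0.0
```

For example, A and R (A or G) share a base and score 1.0, while R (A or G) and Y (C or T) share none and score 0.0.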


OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

-IDEntity

plots the level of identity between the sequences.

-BARgraph

plots the similarity as a bar graph (rather than a continuous curve).

-MINScale

sets the bottom of the similarity score scale.

-MAXScale

sets the top of the similarity score scale.

-EXPand

scales the plot between the observed minimum and maximum scores, rather than between the minimum and maximum scores in the scoring matrix.

-NOAVErage

suppresses the plot of overall average similarity between the sequences.

-PROFile

plots a running average of the positional conservation in a profile. The measure of conservation at any position is the difference between the greatest and least values at that position in the profile.

-NOTITle

suppresses the plot title, for example when preparing a figure for publication.

-BOXplot

draws a box border around the plot.

-MINExist=0

specifies a minimum number of sequences with a residue at any position for a point to be plotted. Where there are fewer residues, a gap will be left in the plotted line.

-MINLength=10

specifies a minimum number of residues with enough sequences included for a point to be plotted. Where there are fewer residues, a gap will be left in the plotted line.
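One way to read -MINExist and -MINLength together is sketched in Python below. This is an interpretation, not the program's code: the function name, the sample alignment, and the set of gap characters are assumptions, and -MINLength is taken to mean the shortest run of consecutive qualifying positions that is still plotted.

```python
def plottable_positions(columns, min_exist, min_length, gap_chars=".-~"):
    """Return a list of booleans: True where a point would be plotted.

    A position qualifies when at least `min_exist` sequences have a residue
    there; qualifying runs shorter than `min_length` are then dropped.
    """
    enough = [
        sum(1 for sym in col if sym not in gap_chars) >= min_exist
        for col in columns
    ]
    keep = [False] * len(enough)
    start = None
    for i, ok in enumerate(enough + [False]):  # sentinel closes the last run
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start >= min_length:
                keep[start:i] = [True] * (i - start)
            start = None
    return keep


# Hypothetical three-sequence alignment with gaps written as '.'.
alignment = ["ACGT..AC", "AC.T..AC", "A..TG.A."]
columns = list(zip(*alignment))
mask = plottable_positions(columns, min_exist=2, min_length=2)
```

Positions where mask is False correspond to the gaps left in the plotted line. In this sample, the isolated qualifying position at index 3 is dropped because its run is shorter than two positions.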

These options apply to all GCG graphics programs. These and many others are described in detail in Chapter 5, Using Graphics of the User's Guide.

-FIGure=programname.figure

writes the plot as a text file of plotting instructions suitable for input to the Figure program instead of drawing the plot on your plotter.

-FONT=3

draws all text characters on the plot using Font 3 (see Appendix I).

-COLor=1

draws the entire plot with the pen in stall 1.

These options let you expand or reduce the plot (zoom), move it in either direction (pan), or rotate it 90 degrees (rotate).

-SCAle=1.2

expands the plot by 20 percent by resetting the scaling factor (normally 1.0) to 1.2 (zoom in). You can expand the axes independently with -XSCAle and -YSCAle. Numbers less than 1.0 contract the plot (zoom out).

-XPAN=30.0

moves the plot to the right by 30 platen units (pan right).

-YPAN=30.0

moves the plot up by 30 platen units (pan up).

-PORtrait

rotates the plot 90 degrees. Usually, plots are displayed with the horizontal axis longer than the vertical (landscape). Note that plots are reduced or enlarged, depending on the platen size, to fill the page.

Printed: April 22, 1996 15:53 (1162)