EPlotSimilarity plots the running average of the similarity among the sequences in a multiple sequence alignment.
EPlotSimilarity calculates the average similarity among all members of a group of aligned sequences at each position in the alignment, using a user-specified sliding window of comparison. The window of comparison is moved along all sequences, one position at a time, and the average similarity over the entire window is plotted at the middle position of the window. The average similarity across the entire alignment is plotted as a dotted line.
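The sliding-window averaging described above can be sketched in a few lines of Python. This is only an illustration, not EPlotSimilarity's own code: the function name `running_average`, the 0-based indexing, and the choice of middle position for even-sized windows are assumptions made for the example.

```python
def running_average(scores, window):
    """Return (middle_index, window_average) pairs, sliding one position at a time.

    Indices are 0-based. For an even window the "middle" is taken as
    start + window // 2, which is an assumption of this sketch.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    points = []
    for start in range(len(scores) - window + 1):
        chunk = scores[start:start + window]
        points.append((start + window // 2, sum(chunk) / window))
    return points


# Hypothetical per-position similarity scores for a six-column alignment.
scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(running_average(scores, window=3))
# prints [(1, 2.0), (2, 3.0), (3, 4.0), (4, 5.0)]
```

Each pair is one plotted point: the x value is the middle of the window and the y value is the average over that window.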
If you give EPlotSimilarity a single input sequence, you can choose the range and strand for that sequence, and then EPlotSimilarity prompts you for the name, range, and strand of a second input sequence. In this way, you can plot the average similarity between the two aligned sequences created with % gap -OUT.
This GCG program was modified by Peter Rice (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).
All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).
Here is a session using EPlotSimilarity to display the similarity among the group of aligned 70 kd heat shock and heat shock cognate peptide sequences in the file hsp70.msf:
% eplotsimilarity

 EPLOTSIMILARITY uses any sequences

 EPLOTSIMILARITY of what sequence(s) ? hsp70.msf{*}

 Start (* 1 *) ?
 End (* 720 *) ?

 hsp70.msf{hs70_plafa}
 hsp70.msf{hs70_thean}
 /////////////////////
 hsp70.msf{dnak_ecoli}

 What window to average (* 10 *) ?

 The minimum density for this plot is 626.1 residues/100 platen units.
 What density do you want (* 626.1 *) ?

 When your LaserWriter attached to tty07 is ready, press.

%
This is the plot from the example session
PileUp creates a multiple sequence alignment of a group of related sequences. Gap uses the algorithm of Needleman and Wunsch to find the alignment of two complete sequences that maximizes the number of matches and minimizes the number of gaps. BestFit makes an optimal alignment of the best segment of similarity between two sequences. Optimal alignments are found by inserting gaps to maximize the number of matches using the local homology algorithm of Smith and Waterman. ProfileMake creates a position-specific scoring table, called a profile, that quantitatively represents the information from a group of aligned sequences. The profile can then be used for database searching (ProfileSearch) or sequence alignment (ProfileGap).
GapShow displays an alignment of two sequences by making a graph that shows the distribution of similarities and gaps.
The lengths of all sequences being compared must be the same.
The average similarity at a position in an alignment is the arithmetic average of the scores of all possible pairwise symbol comparisons among the sequence symbols at that position. The comparison value between any two sequence symbols can be found in the scoring matrix (see the LOCAL DATA FILES topic below). The average similarity across the entire alignment (plotted as a dotted line) is the sum of the separate window similarities divided by the number of windows.
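As a rough illustration of the two averages just defined, the Python sketch below computes the mean of all pairwise comparisons in one alignment column and the overall average across windows. The function names and the `toy_score` values are hypothetical stand-ins for a real scoring-matrix lookup, and the sketch makes no assumption about how the program itself treats gap symbols.

```python
from itertools import combinations


def position_similarity(column, score):
    """Arithmetic mean of score(a, b) over all pairs of symbols in one column."""
    pairs = list(combinations(column, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(score(a, b) for a, b in pairs) / len(pairs)


def overall_average(window_averages):
    """Sum of the separate window similarities divided by the number of windows."""
    return sum(window_averages) / len(window_averages)


def toy_score(a, b):
    # Hypothetical stand-in for a scoring-matrix lookup.
    return 1.5 if a == b else 0.5


column = ["A", "A", "V"]          # one column of a three-sequence alignment
print(position_similarity(column, toy_score))
print(overall_average([2.0, 3.0, 4.0]))
```

With three sequences there are three pairwise comparisons per column; with n sequences there are n(n-1)/2.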
If -IDEntity is on the command line, the program plots a measure of the level of identity among all sequences in the multiple sequence alignment. The calculations are done exactly as described above, but all identical symbol comparisons are given a value of 1.0; all other comparisons are given a value of 0.0.
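The identity measure can be sketched the same way, with the scoring-matrix lookup replaced by a 1.0/0.0 comparison. The names below are illustrative only, and exact, case-sensitive symbol comparison is an assumption of the sketch.

```python
from itertools import combinations


def identity_score(a, b):
    """1.0 for identical symbols, 0.0 for anything else."""
    return 1.0 if a == b else 0.0


def position_identity(column):
    """Mean identity over all pairwise symbol comparisons in one column."""
    pairs = list(combinations(column, 2))
    return sum(identity_score(a, b) for a, b in pairs) / len(pairs)


print(position_identity("AAAA"))  # every pair identical
print(position_identity("AABB"))  # 2 identical pairs out of 6
```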
If -PROFile is on the command line, the program plots a running average of the positional conservation in a profile. The measure of conservation at any position is the difference between the greatest and least values at that position in the profile.
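A minimal sketch of this conservation measure follows, assuming the profile is available as a list of rows with one row of numeric values per alignment position. That in-memory layout and the function names are assumptions for illustration; the sketch does not read the profile file format.

```python
def conservation(profile):
    """Difference between the greatest and least value at each position."""
    return [max(row) - min(row) for row in profile]


def running_mean(values, window):
    """Plain sliding-window average, one position at a time."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# Hypothetical three-position profile with three values per position.
profile = [[1, 5, 3], [2, 2, 2], [0, 2, 1]]
per_position = conservation(profile)
print(per_position)
print(running_mean(per_position, window=2))
```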
EPlotSimilarity does not create the multiple sequence alignment. You can create the alignment using PileUp, Gap, or BestFit (see the INPUT FILE topic below).
You can plot a measure of identity between all sequences in the alignment using the -IDEntity command line option.
You can plot a measure of the level of conservation in a profile created from a multiple sequence alignment using the -PROFile command line option. This plot provides similar information to a plot of the similarity among the sequences in the multiple sequence alignment.
The Wisconsin Package must be configured for graphics before you run any program with graphics output! If the % setplot command is available in your installation, this is the easiest way to establish your graphics configuration, but you can also use commands like % postscript that correspond to the graphics languages the Wisconsin Package supports. See Chapter 5, Using Graphics in the User's Guide for more information about configuring your process for graphics.
If you need to stop this program,
use
The input to EPlotSimilarity is a group of two or more aligned sequences. The multiple sequence alignment created by the PileUp program can be used as input to EPlotSimilarity. The gapped output files from the Gap and BestFit programs, which were created using the -OUTfile2 and -OUTfile3 command line qualifiers, can also be used as input to EPlotSimilarity. If the first sequence entered into EPlotSimilarity is a single sequence, the program prompts you for the second sequence.
EPlotSimilarity tries to read the name of the scoring matrix from the text heading of the input file. If it can't read the matrix name, it uses the default scoring matrix (see the LOCAL DATA FILES topic below).
The function of EPlotSimilarity depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.
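The type check described above amounts to looking for Type: N or Type: P in one line of text. The Python sketch below shows the idea; the function name, the regular expression, and the sample heading line are illustrative assumptions rather than the package's actual parsing rules.

```python
import re

# Matches "Type: N" or "Type: P" as a separate word in the line.
_TYPE_RE = re.compile(r"\bType:\s*([NP])\b")


def sequence_type(heading_last_line):
    """Return 'nucleotide', 'protein', or None if no type marker is present."""
    match = _TYPE_RE.search(heading_last_line)
    if match is None:
        return None
    return "nucleotide" if match.group(1) == "N" else "protein"


# Illustrative heading line; the exact field layout is an assumption.
print(sequence_type("hsp70.msf  Length: 720  Type: P  .."))
```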
All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
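The capitalization convention can be illustrated with a small Python sketch that checks whether what a user typed is an acceptable abbreviation of a qualifier written as in the summary. The function names are hypothetical, and two behaviours are assumptions of the sketch rather than documented facts: matching is treated as case-insensitive, and any longer prefix of the full name is accepted.

```python
def required_prefix(spec):
    """Leading run of capital letters in a qualifier such as '-IDEntity'."""
    prefix = ""
    for ch in spec.lstrip("-"):
        if not ch.isupper():
            break
        prefix += ch
    return prefix


def matches_qualifier(typed, spec):
    """True if `typed` contains at least the capitalized letters of `spec`."""
    typed_name = typed.lstrip("-").split("=", 1)[0].lower()
    full_name = spec.lstrip("-").split("=", 1)[0].lower()
    minimum = required_prefix(spec).lower()
    return typed_name.startswith(minimum) and full_name.startswith(typed_name)


print(matches_qualifier("-ide", "-IDEntity"))   # minimum abbreviation
print(matches_qualifier("-id", "-IDEntity"))    # too short
```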
The files described below supply data to this program. The program automatically reads them from a public data directory unless one of the following occurs: 1) a different data file is named in the text heading of the input file; 2) you have a data file with exactly the same name in your current working directory; or 3) you name a file on the command line with an expression like -DATa1=mydata.dat. A file named with an expression like -DATa1=mydata.dat takes precedence over a data file named in the text heading of the input file. The concept of a local data file is described in more detail in Chapter 4, Using Data Files in the User's Guide.
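The lookup rules above can be summarized as a simple decision function. The sketch below is a hedged illustration only: the text states that a command-line file beats one named in the input file heading, but the relative order of the heading and the working directory is not stated, so ranking the heading first is an assumption of this example. The function and parameter names are hypothetical.

```python
def choose_data_file(default_name, command_line=None, heading=None, local_files=()):
    """Return (source, file_name) for the data file that would be used.

    Order assumed here: command line, input file heading, current working
    directory, public data directory. Only the first two ranks are stated
    in the text; the rest is an assumption of this sketch.
    """
    if command_line:
        return ("command line", command_line)
    if heading:
        return ("input file heading", heading)
    if default_name in local_files:
        return ("working directory", default_name)
    return ("public data directory", default_name)


print(choose_data_file("plotsimpep.cmp"))
print(choose_data_file("plotsimpep.cmp", command_line="mydata.dat",
                       heading="other.cmp"))
```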
EPlotSimilarity reads a scoring matrix from your local directory or the public database with the values for every possible match. EPlotSimilarity tries to read the name of the scoring matrix from the text heading of the input file. If it can't read the matrix name, it uses the default scoring matrix. The default file plotsimdna.cmp has a 1.0 at every place where the sets of bases implied by the alphabetic IUB ambiguity codes (see Appendix III) overlap; all of the other locations have zeros. The default file plotsimpep.cmp has 1.5 for perfect symbol matches and values less than 1.5 (depending upon the evolutionary distance) for non-matches. You can use Fetch to copy these files and then you can modify them to suit your own needs.
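The overlap rule behind the nucleotide matrix can be sketched directly from the standard IUB ambiguity codes. The table below is the conventional IUB assignment and is written out by hand for illustration; it is not read from plotsimdna.cmp or Appendix III, and the function name is hypothetical.

```python
# Standard IUB nucleotide ambiguity codes and the bases each one implies.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def overlap_score(a, b):
    """1.0 if the base sets implied by the two codes overlap, else 0.0."""
    return 1.0 if set(IUB[a.upper()]) & set(IUB[b.upper()]) else 0.0


print(overlap_score("R", "A"))  # R = A or G, shares A
print(overlap_score("R", "Y"))  # purines vs pyrimidines, no shared base
```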
The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
-IDEntity
plots the level of identity between the sequences.
-BAR
plots the similarity as a bar graph (rather than a continuous curve).
-MINScale
sets the bottom of the similarity score scale.
-MAXScale
sets the top of the similarity score scale.
-EXPand
scales the plot between the observed minimum and maximum scores, rather than between the minimum and maximum scores in the scoring matrix.
-NOAVErage
suppresses the plot of overall average similarity between the sequences.
-PROFile
plots a running average of the positional conservation in a profile. The measure of conservation at any position is the difference between the greatest and least values at that position in the profile.
-NOTITle
suppresses the plot title, for example when preparing a figure for publication.
-BOXplot
draws a box border around the plot.
-MINExist=0
specifies a minimum number of sequences with a residue at any position for a point to be plotted. Where there are fewer residues, a gap will be left in the plotted line.
-MINLength=10
specifies a minimum number of residues with enough sequences included for a point to be plotted. Where there are fewer residues, a gap will be left in the plotted line.
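The effect of a minimum-sequence threshold such as -MINExist can be sketched as a filter over alignment columns: a point is kept only when enough sequences have a residue at that position, and otherwise left as a break in the line. The gap characters, the function names, and the use of None to mark a break are assumptions of this example, not details taken from the program.

```python
GAP_SYMBOLS = {".", "-", "~"}  # assumed gap characters for this sketch


def residues_in_column(column):
    """Count the symbols in one alignment column that are not gaps."""
    return sum(1 for symbol in column if symbol not in GAP_SYMBOLS)


def apply_min_exist(columns, values, min_exist=0):
    """Keep a value only where at least `min_exist` sequences have a residue.

    Positions that fail the test become None, standing for a gap in the
    plotted line.
    """
    return [value if residues_in_column(column) >= min_exist else None
            for column, value in zip(columns, values)]


columns = ["AAV", "A..", "..."]   # three columns of a three-sequence alignment
values = [1.0, 0.5, 0.0]          # hypothetical similarity values
print(apply_min_exist(columns, values, min_exist=2))
```

With the default of 0 every position passes the test, so no breaks are introduced.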
These options apply to all GCG graphics programs. These and many others are described in detail in Chapter 5, Using Graphics of the User's Guide.
-FIGure=programname.figure
writes the plot as a text file of plotting instructions suitable for input to the Figure program instead of drawing the plot on your plotter.
-FONT=3
draws all text characters on the plot using Font 3 (see Appendix I).
-COLor=1
draws the entire plot with the pen in stall 1.
These options let you expand or reduce the plot (zoom), move it in either direction (pan), or rotate it 90 degrees (rotate).
-SCAle=1.2
expands the plot by 20 percent by resetting the scaling factor (normally 1.0) to 1.2 (zoom in). You can expand the axes independently with -XSCAle and -YSCAle. Numbers less than 1.0 contract the plot (zoom out).
-XPAN=30.0
moves the plot to the right by 30 platen units (pan right).
-YPAN=30.0
moves the plot up by 30 platen units (pan up).
-PORtrait
rotates the plot 90 degrees. Usually, plots are displayed with the horizontal axis longer than the vertical (landscape). Note that plots are reduced or enlarged, depending on the platen size, to fill the page.
Printed: April 22, 1996 15:53 (1162)
INPUT FILE
SEQUENCE TYPE
COMMAND-LINE SUMMARY
Minimal Syntax: % eplotsimilarity [-INfile1=]hsp70.msf{*} -Default
Prompted Parameters:
-BEGin1=1 -END1=720 the range of interest in the alignment
-WINdow=10 comparison window
-DENsity=626.1 the number of bases per 100 platen units
Prompted Parameters: for comparing 2 sequences only:
[-INfile2=]ggamma.seq second input sequence
-BEGin2=1 -END2=738 the range of interest in sequence 2
-REVerse1 -REVerse2 strand of each sequence
Local Data Files: -DATa=plotsimpep.cmp scoring matrix for peptides
-DATa=plotsimdna.cmp scoring matrix for nucleic acids
Optional Parameters:
-IDEntity plots the level of identity among the sequences
-BARgraph plots a bar graph (rather than a continuous curve)
-PROFile plots positional conservation in a profile
-MINScale=0 sets the bottom of the similarity score scale
-MAXScale=2 sets the top of the similarity score scale
-EXPand scales plot between observed min and max similarity scores
-NOAVErage suppresses the plot of overall similarity
-NOTITle suppresses the plot title
-BOXplot adds a box around the plot
-MINExist=0 minimum number of sequences to plot a point
-MINLen=10 minimum length of sequence to plot points
All GCG graphics programs accept these and other switches. See the Using
Graphics chapter of the USERS GUIDE for descriptions.
-FIGure[=FileName] stores plot in a file for later input to FIGURE
-FONT=3 draws all text on the plot using font 3
-COLor=1 draws entire plot with pen in stall 1
-SCAle=1.2 enlarges the plot by 20 percent (zoom in)
-XPAN=10.0 moves plot to the right 10 platen units (pan right)
-YPAN=10.0 moves plot up 10 platen units (pan up)
-PORtrait rotates plot 90 degrees
LOCAL DATA FILES
OPTIONAL PARAMETERS
-IDEntity
-BAR
-MINScale
-MAXScale
-EXPand
-NOAVErage
-PROFile
-NOTITle
-BOXplot
-MINExist=0
-MINLength=10
-FIGure=programname.figure
-FONT=3
-COLor=1
-SCAle=1.2
-XPAN=30.0
-YPAN=30.0
-PORtrait