Prettyplot

PRETTYPLOT

FUNCTION

PrettyPlot displays multiple sequence alignments and calculates a consensus sequence. It does not create the alignment, it simply displays it.

PrettyPlot is an improved version of the GCG program Pretty which produces a boxed sequence alignment as graphics output in addition to the standard text file. PrettyPlot was originally written to produce publication-quality sequence alignment output. There are also several enhancements in the way the consensus sequence is calculated and in the options for sequence display.

PrettyPlot prints and plots sequences with their columns aligned. This utility is used after a number of sequences have had gaps added to make them all align. PrettyPlot's output allows you to look at relationships among several sequences. You should use a file of sequence names to define the sequences you want PrettyPlot to display (see Appendix VI). Although a specification such as "*.pep" is also accepted, you would not be able to use the various weighting and naming options described below.

You can change the alignments displayed by PrettyPlot with a text editor. The output from PrettyPlot can then be separated into individual sequence files by running PrettyPlot with the command line option -UGLy.

AUTHOR

This program was written by Peter Rice (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org). Additional code and suggestions were provided by David Mathog at Caltech, and by Jaakko Hattula of Tampere University of Technology, Finland.

ACKNOWLEDGEMENTS

The original Pretty program was designed with the help of Ann Palmenberg of the UW Biophysics lab. The sequences in the example were aligned for Dr. Palmenberg's work, and were used in the GCG manual.

The original suggestions for the PrettyPlot program were from Denis Duboule and Sigfried Labeit at EMBL. Gert Vriend added the star marking. Rita Grandori suggested the -NOCOLLISION option.

EXAMPLE

By repeatedly using the program Gap with the command line option -OUT, gaps were added to a group of picorna virus capsid proteins in the antigenic region to make them align with each other and with a growing consensus sequence.

This procedure can be replaced by any multiple sequence alignment procedure such as Mali (Martin Vingron), Clustal (Des Higgins), MSE (Will Gilbert) or the new PileUp program in GCG version 7. The resulting sequence files must be converted to GCG format before use in PrettyPlot, typically by extracting the individual sequences with an editor, creating individual sequence files beginning with ".." on the first line, and converting to GCG format with the Reformat program.

  
  
  % prettyplot -Consensus -LineSize=90
  
   PRETTYPLOT uses any sequences
  
   PRETTYPLOT of what sequence(s)  ?  @pretty.list
  
             Start (* 1 *) ?
           End (*   349 *) ?
  
         Fa10.Ugly, len: 349
         Fa12.Ugly, len: 349
  
         ///////////////////
  
          R14.Ugly, len: 349
           R2.Ugly, len: 349
  
   Find consensus to what minimum plurality (* 3.3 *) ?
  
  %

OUTPUT

Here is part of the text output file:

  
  
  Plurality: 2.00  Threshold: 1.00
  AveWeight 0.55  AveMatch 0.54  AvMisMatch -0.40
  
  PRETTY of: @Pretty.Fil   February 6, 1989  19:25  ..
  
        1                                                   50
  Fa10.Ugly  .......... .......... .......... ..TTttGESA D.PvtTtVE.
  Fa12.Ugly  .......... .......... .......... ..TTatGESA D.PvtTtVE.
  Fo1k.Ugly  .......... .......... .......... ..TTsaGESA D.PvtTtVE.
     E.Ugly  Gvenae.kgV tEnTna.Tad fvaqpvyLPE .nqT...... kV.AFfynrs
   P1m.Ugly  GlgqmlEsmI .DnTvreTvg AatsrdaLPn teasGPthSk EIPALTAVET
   P1s.Ugly  GlgqmlEsmI .DnTvreTvg AatsrdaLPn teasGPahSk EIPALTAVET
   P2s.Ugly  GigdmIEgaV .Egitknalv pptstnsLPg hkpsGPahSk EIPALTAVET
   P3s.Ugly  GiedlIseva .qgal..Tls lpkqqdsLPD tkasGPahSk EVPALTAVET
   Cb3.Ugly  ...gpVEdaI .......T.. Aaigr..vaD tvgTGPtnSe aIPALTAaET
   R14.Ugly  GlgdelEevI vEkT.kqTv. Asi....... ..ssGPkhtq kVPiLTAnET
    R2.Ugly  ...npVEnyI dEvlnevlv. .......vPn inssnPttSn saPALdAaET
  Consensus  G----VE--I -E-T---T-- A------LPD --TTGPGESA D-PALTAVET
  
  /////////////////////////////////////////////////////////////////

The graphics version of the output is shown below:

RELATED PROGRAMS

LineUp is a screen editor for editing multiple sequence alignments. You can edit up to 30 sequences simultaneously. New sequences can be typed in by hand or added from existing sequence files. A consensus sequence identifies places where the sequences are in conflict. Pretty displays multiple sequence alignments and calculates a consensus sequence. It does not create the alignment; it simply displays it. PileUp creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment.

Mali is Martin Vingron's Multiple Alignment program. Clustal is Des Higgins' multiple alignment program. MSE is Will Gilbert's multiple alignment editor. All these programs are available from the EBI Network File Server (NETSERV@ebi.ac.uk).

If you run Gap with the command line options for sequence output, it will write sequence files with the sequences expanded by the addition of gaps. Only two sequences can be aligned at once.

RESTRICTIONS

PrettyPlot displays sequences which have already been aligned. You can use up to 500 sequences with up to 2,000,000 symbols in total unless your site has increased the limits for GCG.

CALCULATING A CONSENSUS

If you use one of the command line options -CONsensus, -DIFferences, or -CASe, PrettyPlot calculates a consensus for the column using a symbol comparison table called PrettyPep.Cmp for peptide or PrettyDNA.Cmp for nucleic acids. The consensus is found by finding the symbol in the column for which its comparison to all of the symbols in the column (including itself) yields the greatest number of votes. A vote is cast for each symbol comparison that is over some set threshold value. The votes can be either 1.0 or some "vote weight" assigned to the sequence from which the vote comes.

If there is no coalition of votes that is larger than all of the other coalitions or if the largest coalition is below the minimum plurality, then there is a choice of consensus for the column. By default, no consensus is then displayed. The -NOCOLLision option makes PrettyPlot box all possible consensus matches, and choose the first one found for use in the consensus sequence.

The weights for each sequence, the threshold, and the minimum plurality are all real numbers.
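The voting rule described above can be sketched in Python. This is an illustration only, not PrettyPlot's actual code. The `compare` function is a toy stand-in for the comparison table: only the 1.5 identity score and the D/E match come from this document, and the 1.2 similarity score is invented. The sketch also assumes that gaps neither vote nor stand as candidates, that a comparison counts as a vote when it is at or above the threshold, and that a column below the minimum plurality never gets a consensus.

```python
# Toy stand-in for a symbol comparison table such as PrettyPep.Cmp.
# Only the 1.5 identity score and the D/E match are taken from the text;
# the 1.2 similarity score is a made-up value "over 1.0".
SIMILAR_PAIRS = {frozenset("DE")}
GAP = "."


def compare(a, b):
    """Return a comparison score for two symbols (hypothetical values)."""
    a, b = a.upper(), b.upper()
    if GAP in (a, b):
        return 0.0          # assumption: gaps never match anything
    if a == b:
        return 1.5
    if frozenset((a, b)) in SIMILAR_PAIRS:
        return 1.2
    return 0.0


def column_consensus(column, weights=None, threshold=1.0,
                     plurality=2.0, no_collisions=False):
    """Return (consensus symbol or None, votes per candidate) for one column."""
    if weights is None:
        weights = [1.0] * len(column)   # default vote of 1.0 per sequence
    votes = {}
    for candidate in column:
        candidate = candidate.upper()
        if candidate == GAP or candidate in votes:
            continue
        # Every row whose symbol scores at or above the threshold against
        # the candidate adds its vote weight to the candidate's coalition.
        votes[candidate] = sum(
            w for symbol, w in zip(column, weights)
            if compare(candidate, symbol) >= threshold
        )
    if not votes:
        return None, votes
    best = max(votes.values())
    if best < plurality:
        return None, votes              # largest coalition is too small
    winners = [c for c, v in votes.items() if v == best]
    if len(winners) > 1 and not no_collisions:
        return None, votes              # tie: no consensus by default
    return winners[0], votes            # first candidate found wins a tie
```

With five D and five E rows, `column_consensus(list("DDDDDEEEEE"))` returns no consensus symbol, because both candidates collect all ten votes; passing `no_collisions=True` returns D, the first candidate found.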

If you use -CASe, PrettyPlot will show the members of the winning coalition in upper case and others in lower case in the text output file. The graphics output will always be in upper case.

If you use -DIFferences, PrettyPlot will suppress the members of the winning coalition and show all the other positions in lower case.

If you use -CONsensus, PrettyPlot will add a line to your alignments with a "consensus" sequence. The consensus is the symbol that had the largest number of votes (vote weights) in the column. The consensus is included in both the text and graphics versions of the output.

Since different symbols could contribute to a consensus for either -CASe or -DIFferences, such a consensus will not necessarily define a consensus symbol for the consensus sequence row.

For example, in the default comparison matrix, aspartate (D) and glutamate (E) have a score of over 1.0 so they are considered to match. If an alignment has five D and five E residues at position 15, they are all considered to match for the -CASe and -DIFferences options, but neither is in the majority for defining a consensus.

To resolve these conflicts, PrettyPlot (but not Pretty) has a command line option -NOCOLLision which simply uses the first residue it finds when there is an equal choice.

-THReshold=1.0

determines the symbol comparison value below which a symbol may not vote for a coalition. Please note that in the default comparison table an exact match between two amino acid residues scores 1.5, and that some other pairs (D and E, W and Y, L and F for example) are also, by default, considered to match. You should specify -THReshold=1.5 to force exact matches only.

-PLUrality=2.0

defines the number of votes (vote weights) below which there will be no consensus. The default value is just over half the total weight. By default, each sequence has a vote of 1.0 (see threshold) in creating the consensus.

Vote Weight

If several of your sequences are very similar, you may not want their votes to dominate the consensus for the column. If your input file specification to PrettyPlot is a file of file names, you can assign each sequence a vote weight by adding a number to the line after the sequence name. The vote weight is the vote that each row casts for the consensus. The file of file names used to run the example above is shown under INPUT FILE below. Note how each kind of sequence is assigned a vote weight so that their combined impact on the election is never more than one vote.

INPUT FILE

Multiple sequence alignments are best represented with files of sequence names. For PrettyPlot these files may include a vote weight as a column of numbers. Here is the input file (pretty.list) from the example session:

  
  
  A multiple sequence alignment represented as a list file for input to
  the programs PRETTY, PROFILEMAKE and LINEUP.
  
  7/30/94   ..
  
  GenDocData:fa10.ugly    wgt: 0.5
  GenDocData:fa12.ugly    wgt: 0.5
  GenDocData:fo1k.ugly    wgt: 1.0
  GenDocData:e.ugly       wgt: 1.0
  GenDocData:p1m.ugly     wgt: 0.25
  GenDocData:p1s.ugly     wgt: 0.25
  GenDocData:p2s.ugly     wgt: 0.25
  GenDocData:p3s.ugly     wgt: 0.25
  GenDocData:cb3.ugly     wgt: 1.0
  GenDocData:r14.ugly     wgt: 0.5
  GenDocData:r2.ugly      wgt: 0.5
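As a rough sketch of how such a list file could be read, the Python below collects each sequence name and its vote weight. It is not the GCG parser. It assumes that everything up to and including the line ending in ".." is heading text, that the weight follows a `wgt:` token, and that a missing weight means 1.0; the function name `parse_list_file` is invented for this example.

```python
def parse_list_file(text):
    """Return a list of (sequence name, vote weight) pairs from a list file."""
    entries = []
    in_body = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_body:
            # Assumption: the heading ends at the first line ending in "..".
            if stripped.endswith(".."):
                in_body = True
            continue
        if not stripped:
            continue
        parts = stripped.split()
        weight = 1.0                       # default vote of 1.0
        if "wgt:" in parts[1:-1]:
            weight = float(parts[parts.index("wgt:") + 1])
        entries.append((parts[0], weight))
    return entries


EXAMPLE = """\
A multiple sequence alignment represented as a list file for input to
the programs PRETTY, PROFILEMAKE and LINEUP.

7/30/94   ..

GenDocData:fa10.ugly    wgt: 0.5
GenDocData:fa12.ugly    wgt: 0.5
GenDocData:fo1k.ugly    wgt: 1.0
GenDocData:e.ugly       wgt: 1.0
GenDocData:p1m.ugly     wgt: 0.25
GenDocData:p1s.ugly     wgt: 0.25
GenDocData:p2s.ugly     wgt: 0.25
GenDocData:p3s.ugly     wgt: 0.25
GenDocData:cb3.ugly     wgt: 1.0
GenDocData:r14.ugly     wgt: 0.5
GenDocData:r2.ugly      wgt: 0.5
"""

entries = parse_list_file(EXAMPLE)
total_weight = sum(weight for _, weight in entries)
print(len(entries), total_weight)
```

For the file above, the eleven weights add up to 6.0, so the four P sequences together carry one vote, as do the two Fa sequences and the two R sequences.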

COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
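The abbreviation rule can be illustrated with a small Python sketch. It is not how GCG parses its command line: it simply treats the leading run of capital letters in a qualifier name as the required prefix and, as an assumption, ignores case in what the user types. The function name `qualifier_matches` is invented for this example.

```python
def qualifier_matches(typed, spec):
    """Check whether a typed qualifier is an acceptable form of spec.

    spec is written as in the summary, e.g. "-CONsensus" or "-REDAa".
    """
    full = spec.lstrip("-").split("=")[0]
    # The required prefix is the leading run of capital letters.
    required = ""
    for ch in full:
        if not ch.isupper():
            break
        required += ch
    name = typed.lstrip("-").split("=")[0].upper()   # assumption: case-insensitive
    return len(name) >= len(required) and full.upper().startswith(name)


print(qualifier_matches("-con", "-CONsensus"))     # True
print(qualifier_matches("-co", "-CONsensus"))      # False
print(qualifier_matches("-RED=DE", "-REDAa"))      # False
print(qualifier_matches("-REDA=DE", "-REDAa"))     # True
```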

  
  
  Minimum Syntax: % prettyplot [-INfile=]@Pretty.Fil -Default
  
  Prompted Parameters:
  
  -BEGin=1 -END=349        range of interest
  [-OUTfile=]pretty.pretty output file
  [-OUTfile2=]pretty.fil   ugly format output file
  
  Local Data Files:
  
  [-DATa=]prettydna.cmp consensus comparison table for Nucleotides
  [-DATa=]prettypep.cmp consensus comparison table for Proteins
  -STAR=pretty.star  file of positions to be marked with asterisk
  
  Optional Parameters:
  
  -CONsensus           generates (displays) a consensus sequence
  -DIFferences[="-"]   only shows positions disagreeing with the consensus
  -CASe                shows positions agreeing with consensus in upper case
  -THReshold=1.0       sets min value for symbol to vote in consensus
  -PLUrality=2.0       defines the minimum number of votes for a consensus
  -LINESize=50         sets the number of residues per line
  -DENSity=50          same as LINESize
  -BLOcksize=10        sets the number of residues per block
  -UGLy                writes the individual sequences into new files
  -VOTes=matrix.cmp    alternative local data file (can also use -DATa as above)
  -NOTEXT              no text output file
  -NOPLOT              no graphics output file
  -NOBOX               no boxes drawn (use with color modes)
  -NOSEQNUMber         no sequence numbering on right of plot
  -NONAME              no sequence name on left of plot
  -NOTITLE             no title at top of plot
  -TOPNUMber=Consensus number every 10th position in named sequence or consensus
  -STARSEQ=Consensus   sequence positions used for asterisk
  -STAR=Pretty.Star    file of sequence positions to be marked with "*"
  -NOCOLLisions        allows more than one alternative consensus residue
  -NOSHORTname         full filename or MSF file and entry name shown
  
  Coloring of residues, in order of priority:
  
  -DOCOLors            highlight residues in color
  -BLACKaa=X           residues to color black
  -GREENaa=FLMWYIV     residues to color green
  -BLUEaa=RKH          residues to color blue
  -REDAa=DE            residues to color red (-RED is too short)
  -CYANaa=X            residues to color cyan
  -YELLOWaa=AG         residues to color yellow
  -VIOLETaa=P          residues to color violet
  
  Alternative coloring of residues, in order of priority:
  
  -CCOLors             highlight quality of consensus match
  -CONSCOLor           highlight quality of consensus match
  -CCONsensus=RED      colour for residues on consensus line
  -CIDentity=RED       colour for identity to consensus
  -CSImilarity=GREEN   colour for similarity to consensus
  -COThers=BLACK       colour for other residues

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

If you use one of the command line options -CONsensus, -DIFferences, or -CASe, PrettyPlot calculates a consensus for each column using a symbol comparison table (Appendix II). You can provide your own table called either PrettyPep.Cmp for peptides or PrettyDNA.Cmp for nucleic acids. You can define some other table with the command line specification -VOTes=FileName.

OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

-LINESize=50

specifies the number of sequence symbols to display on each line. Typically a linesize of 50 is used for text output, but for graphics, values of 100 or more can be more useful.

-BLOcksize=10

specifies the number of sequence symbols to put into each block in the text output file. The graphics output is never split into blocks.
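The effect of these two settings on the text output can be sketched in Python as follows. This is only an illustration of the layout, with an invented function name `format_blocks`; it assumes blocks are separated by a single space, as in the example output above.

```python
def format_blocks(sequence, linesize=50, blocksize=10):
    """Split a sequence into lines of linesize symbols, in blocks of blocksize."""
    lines = []
    for start in range(0, len(sequence), linesize):
        chunk = sequence[start:start + linesize]
        blocks = [chunk[i:i + blocksize] for i in range(0, len(chunk), blocksize)]
        lines.append(" ".join(blocks))
    return lines


for line in format_blocks("GlgqmlEsmI.DnTvreTvgAatsrdaLPnteasGPthSkEIPALTAVET" * 2):
    print(line)
```

Each printed line holds 50 symbols in five blocks of ten, matching the layout of the text output file shown earlier.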

-CONsensus

causes PrettyPlot to show a consensus sequence for the set of sequences you are displaying. Read how PrettyPlot finds the consensus above.

-DIFferences="-"

causes PrettyPlot to print only the positions that did not vote with the winning consensus and to print blanks at all other positions. If an optional character is added, PrettyPlot will use that character at all of the positions that agree with the consensus. The '-' character has to be enclosed in quotes if it is the last character in a command, or the VMS command interpreter will think you are starting a new command line.

-CASe

causes PrettyPlot to print all of the positions in each column that voted with the winning coalition in upper case and to print all other positions in lower case. This option overrides -DIFferences if both are used, and only applies to the text output file. In the graphics output, all positions are in upper case.
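The difference between the -CASe and -DIFferences displays in the text output can be sketched in Python. This is an illustration, not PrettyPlot's code: the `voted` list, which marks the positions that voted with the winning coalition, is a hypothetical input that the consensus calculation would have to supply, and the function name `render_row` is invented.

```python
def render_row(row, voted, case=False, differences=False, agree_char=" "):
    """Render one sequence row for the text output.

    row         the sequence symbols for this row
    voted       one boolean per position: True if it voted with the winner
    agree_char  character shown at agreeing positions under -DIFferences
    """
    out = []
    for symbol, in_coalition in zip(row, voted):
        if case:
            # -CASe takes precedence when both options are given.
            out.append(symbol.upper() if in_coalition else symbol.lower())
        elif differences:
            out.append(agree_char if in_coalition else symbol.lower())
        else:
            out.append(symbol)
    return "".join(out)


voted = [True, False, True, False]
print(render_row("GLGQ", voted, case=True))                          # GlGq
print(render_row("GLGQ", voted, differences=True, agree_char="-"))   # -l-q
```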

-THReshold=1.0

determines the symbol comparison value below which a symbol may not vote for a coalition. See the topic called CALCULATING A CONSENSUS.

-PLUrality=2.0

defines the number of votes (vote weights) below which there will be no consensus. See the topic called CALCULATING A CONSENSUS.

-UGLy

rewrites the sequences in a PrettyPlot text output file into individual sequence files in GCG format. The PrettyPlot output file must have a line with two periods ("..") separating the text in the heading from the sequences. -UGLy also causes PrettyPlot to write a file of file names to go with the new sequence files.

-TEXT

writes a text output file (the same as Pretty) as well as graphics.

-NOPLOT

cancels the graphics output.

-NONUMber

removes the sequence numbering from the graphics output.

-NONAME

removes the sequence names from the graphics output.

-NOTITLE

removes the title lines from the graphics output.

-TOPNUMber[=Consensus]

numbers every 10th position in the alignment, or every 10th position in the consensus sequence.

-STAR[=Pretty.Star]

reads a file of sequence positions to be marked with an asterisk in the graphics output. The default file name is the same as the input file with the extension ".Star". The Star file format is a heading of free text ending with "..", then every number on the remaining lines is used as a sequence position to be marked.
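A reader for the Star file format described above might look like the following Python sketch. It is an assumption-laden illustration, not the program's own parser: it treats everything up to and including the first line containing ".." as heading text and then collects every run of digits on the remaining lines. The function name `parse_star_file` and the sample file content are invented.

```python
import re


def parse_star_file(text):
    """Return the sequence positions listed after the ".." heading divider."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if ".." in line:
            body = lines[index + 1:]
            break
    else:
        return []                      # no divider found: nothing to mark
    return [int(number) for line in body for number in re.findall(r"\d+", line)]


SAMPLE = """\
Positions of interest, sites 1 to 3 (hypothetical example)
..
12 15
40, 41
"""

print(parse_star_file(SAMPLE))   # [12, 15, 40, 41]
```

Numbers in the heading, such as the "1 to 3" above, are ignored because they come before the divider.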

-STARSEQ[=Fa12.Ugly]

marks each position listed in the Star file with an asterisk, either using the consensus sequence ("=Consensus") or one of the sequence fragments as a base for the sequence position numbering.

-NOCOLLisions

allows positions where there are alternative consensus residues to have all the possible consensus residues boxed in preference to the default behaviour of boxing none. This is only important where the consensus plurality is less than half of the total sequence voting weights, but this is by default often the case as the plurality is 2.0 and each sequence has a vote of 1.0 towards the consensus.

-NOSHORTname

uses the full filename (or the MSF file name and entry name).

-DOCOLors

tells PrettyPlot to highlight selected residues in color, according to the qualifier values below. The colors are searched in the order: Black, Green, Blue, Red, Cyan, Yellow, Violet. Setting any residue to one of the earlier colors overrides any later setting.
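The priority order can be illustrated with a short Python sketch. The residue lists are the default values shown in the command-line summary; the function name `residue_color` and the choice of black for residues that appear in no list are assumptions made for this example.

```python
# Search order given in the text: earlier colors win over later ones.
COLOR_ORDER = ["BLACK", "GREEN", "BLUE", "RED", "CYAN", "YELLOW", "VIOLET"]

# Default residue lists from the command-line summary.
DEFAULT_COLORS = {
    "BLACK": "X",
    "GREEN": "FLMWYIV",
    "BLUE": "RKH",
    "RED": "DE",
    "CYAN": "X",
    "YELLOW": "AG",
    "VIOLET": "P",
}


def residue_color(residue, settings=DEFAULT_COLORS, fallback="BLACK"):
    """Return the first color, in priority order, whose list holds the residue."""
    residue = residue.upper()
    for color in COLOR_ORDER:
        if residue and residue in settings.get(color, "").upper():
            return color
    return fallback   # assumption: unlisted residues stay black


print(residue_color("D"))   # RED
print(residue_color("X"))   # BLACK (listed under both BLACK and CYAN)
```

If D were also added to the green list, for example with `dict(DEFAULT_COLORS, GREEN="FLMWYIVD")`, the sketch would report GREEN for D because green is searched before red.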

-BLACKaa=x

specifies residues to be black (the default) on a color plot.

-GREENaa=FLMWYIV

specifies residues to be green on a color plot.

-BLUEaa=RKH

specifies residues to be blue on a color plot.

-REDAa=DE

specifies residues to be red on a color plot. For RED but not for any other colour, the "aa" part of the qualifier is required. This is because GCG have a qualifier -REDuce which their graphics library uses, and it clashes with the EGCG qualifier if the name is any shorter.