Prettybox

PRETTYBOX


FUNCTION

PrettyBox displays multiple sequence alignments as shaded boxes in PostScript format (i.e., the output file must be printed or displayed on a PostScript-compatible device). PrettyBox will optionally calculate a consensus sequence. The program does not create the alignment; it simply displays it.


DESCRIPTION

PrettyBox is an improved version of the GCG program Pretty. It produces a boxed sequence alignment as graphics output in addition to the standard text file. PrettyBox was originally written to produce publication-quality sequence alignment output. There are also several enhancements in the way the consensus sequence is calculated and in the options for sequence display.

PrettyBox prints and plots sequences with their columns aligned. This utility is used after a number of sequences have had gaps added to make them all align. PrettyBox's output allows you to look at relationships among several sequences.


AUTHOR

This program was written by Rick Westerman (E-mail: westerm@aclcb.purdue.edu Post: Ag Campus Laboratory for Computational Biology, BCHM Building, Purdue University, West Lafayette, IN 47907, USA), and modified for EGCG by Peter Rice (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).


ACKNOWLEDGEMENTS

The original Pretty program was designed with the help of Ann Palmenberg of the UW Biophysics lab. The sequences in the example were aligned for Dr. Palmenberg's work, and were used in the GCG manual.


EXAMPLE

By repeatedly using the program Gap with the command line option -OUT, gaps were added to a group of picorna virus capsid proteins in the antigenic region to make them align with each other and with a growing consensus sequence.

This procedure can be replaced by any multiple sequence alignment procedure such as Mali (Martin Vingron), Clustal (Des Higgins), MSE (Will Gilbert) or the new PileUp program in GCG version 7. The resulting sequence files must be converted to GCG format before use in PrettyBox, typically by extracting the individual sequences with an editor, creating individual sequence files beginning with ".." on the first line, and converting to GCG format with the Reformat program.

  
  
  % prettybox -SeqName=Partial
  
   PRETTYBOX uses any sequences
  
   PRETTYBOX of what sequence(s)  ?  @pretty.list
  
         Fa10.Ugly, len: 349
         Fa12.Ugly, len: 349
  
         ///////////////////
  
          R14.Ugly, len: 349
           R2.Ugly, len: 349
  
             Start (* 1 *) ?
           End (*   349 *) ?
  
  Orient output as:
  
  L) Landscape
  P) Portrait
  
   Please choose one (* L *) ?
  
   Display a consensus (* No *) ?
  
   Find consensus to what plurality (* 5.6 *) ?
  
  Do numbering on:
  
  R) Right side
  T) Top side
  N) None
  
   Please choose one (* R *) ?
  
  %
  


OUTPUT

This is the plot from the example session


RELATED PROGRAMS

LineUp is a screen editor for editing multiple sequence alignments. You can edit up to 30 sequences simultaneously. New sequences can be typed in by hand or added from existing sequence files. A consensus sequence identifies places where the sequences are in conflict.

Pretty displays multiple sequence alignments and calculates a consensus sequence. It does not create the alignment; it simply displays it.

PileUp creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment.

PrettyPlot displays multiple sequence alignments and calculates a consensus sequence. It does not create the alignment; it simply displays it.

Mali is Martin Vingron's Multiple Alignment program. Clustal is Des Higgins' multiple alignment program. MSE is Will Gilbert's multiple alignment editor. All these programs are available from the EBI Network File Server (NETSERV@ebi.ac.uk).

If you run Gap with the command line options for sequence output, it will write sequence files with the sequences expanded by the addition of gaps. Only two sequences can be aligned at once.


RESTRICTIONS

PrettyBox displays sequences which have already been aligned. You can use up to 500 sequences with up to 2,000,000 symbols in total unless your site has increased the limits for GCG.

The graphics output must be PostScript for PrettyBox to work. No other graphics drivers are supported.


CALCULATING A CONSENSUS

If you use the command line option -CONsensus, PrettyBox calculates a consensus for each column using a symbol comparison table called PrettyPep.Cmp for peptides or PrettyDNA.Cmp for nucleic acids. The consensus is the symbol in the column whose comparison to all of the symbols in the column (including itself) yields the greatest number of votes. A vote is cast for each symbol comparison that meets some set threshold value. The votes can be either 1.0 or some "vote weight" assigned to the sequence from which the vote comes.

If there is no coalition of votes that is larger than all of the other coalitions, or if the largest coalition is below the minimum plurality, then there is no consensus for the column. By default, no consensus symbol is displayed for such a column.

The weights for each sequence, the threshold, and the minimum plurality are all real numbers.

If you use -CONsensus, PrettyBox will add a line to your alignments with a "consensus" sequence. The consensus is the symbol that had the largest number of votes (vote weights) in the column. The consensus is included in both the text and graphics versions of the output.

Since different symbols could contribute to a consensus for either -CASe or -DIFferences, such a consensus will not necessarily define a consensus symbol for the consensus sequence row.

For example, in the default comparison matrix, aspartate (D) and glutamate (E) have a score of over 1.0 so they are considered to match. If an alignment has five D and five E residues at position 15, they are all considered to match for the -CASe and -DIFferences options, but neither is in the majority for defining a consensus.
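
The voting scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PrettyBox's actual code: the function names compare and column_consensus are hypothetical, the tiny comparison table stands in for PrettyPep.Cmp, and the D/E score of 1.2 is an assumed value chosen only because the text says it is over 1.0. Gap symbols are not treated specially here.

```python
from collections import defaultdict


def compare(a, b):
    """Hypothetical toy comparison table, NOT the real PrettyPep.Cmp."""
    if a == b:
        return 1.5  # exact matches score 1.5 in the default table
    if {a, b} == {"D", "E"}:
        return 1.2  # assumed value; the text only says "over 1.0"
    return 0.0


def column_consensus(column, weights=None, threshold=1.0, plurality=2.0):
    """Return the consensus symbol for one column, or None if there is none."""
    if weights is None:
        weights = [1.0] * len(column)  # each sequence votes 1.0 by default
    votes = defaultdict(float)
    for candidate in set(column):
        for symbol, weight in zip(column, weights):
            # A symbol votes for the candidate only if the comparison
            # value is not below the threshold.
            if compare(candidate, symbol) >= threshold:
                votes[candidate] += weight
    if not votes:
        return None
    best = max(votes.values())
    winners = [s for s, v in votes.items() if v == best]
    # No consensus on a tie or when the largest coalition is too small.
    if len(winners) != 1 or best < plurality:
        return None
    return winners[0]
```

With this sketch, column_consensus(list("DDDDDEEEEE")) returns None because D and E collect the same number of votes, while column_consensus(list("AAAG")) returns "A".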

-THReshold=1.0

determines the symbol comparison value below which a symbol may not vote for a coalition. Please note that in the default comparison table an exact match between two amino acid residues scores 1.5, and that some other pairs (D and E, W and Y, and L and F, for example) are also, by default, considered to match. You should specify -THReshold=1.5 to force exact matches only.

-PLUrality=2.0

defines the number of votes (vote weights) below which there will be no consensus. The default value is just over half the total weight. By default, each sequence has a vote of 1.0 (see threshold) in creating the consensus.

Vote Weight

If several of your sequences are very similar, you may not want their votes to dominate the consensus for the column. If your input file specification to PrettyBox is a file of file names, you can assign each sequence a vote weight by adding a number to the line after the sequence name. The vote weight is the vote that each row casts for the consensus. Here is the file of file names used to run the example above. Note how each kind of sequence is assigned a vote weight so that their combined impact on the election is never more than one vote.


INPUT FILE

Multiple sequence alignments are best represented with files of sequence names. For PrettyBox these files may include a vote weight as a column of numbers. Here is the input file (pretty.list) from the example session:

  
  
  A multiple sequence alignment represented as a list file for input to
  the programs PRETTY, PROFILEMAKE and LINEUP.
  
  7/30/94   ..
  
  GenDocData:fa10.ugly    wgt: 0.5
  GenDocData:fa12.ugly    wgt: 0.5
  GenDocData:fo1k.ugly    wgt: 1.0
  GenDocData:e.ugly       wgt: 1.0
  GenDocData:p1m.ugly     wgt: 0.25
  GenDocData:p1s.ugly     wgt: 0.25
  GenDocData:p2s.ugly     wgt: 0.25
  GenDocData:p3s.ugly     wgt: 0.25
  GenDocData:cb3.ugly     wgt: 1.0
  GenDocData:r14.ugly     wgt: 0.5
  GenDocData:r2.ugly      wgt: 0.5
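
As an illustration of how the vote weights in such a list file could be read, here is a minimal Python sketch. It is not part of GCG: the function name parse_list_file is hypothetical, and the sketch assumes that everything up to and including the line ending in ".." is documentation and that a missing "wgt:" field means a vote of 1.0.

```python
def parse_list_file(text):
    """Return (name, weight) pairs from the body of a file of file names."""
    entries = []
    in_body = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_body:
            # Assumption: the line ending in ".." closes the documentation.
            if stripped.endswith(".."):
                in_body = True
            continue
        if not stripped:
            continue
        fields = stripped.split()
        weight = 1.0  # assumed default vote for a sequence
        for i, field in enumerate(fields[:-1]):
            if field == "wgt:":
                weight = float(fields[i + 1])
        entries.append((fields[0], weight))
    return entries


EXAMPLE = """\
A multiple sequence alignment represented as a list file for input to
the programs PRETTY, PROFILEMAKE and LINEUP.

7/30/94   ..

GenDocData:fa10.ugly    wgt: 0.5
GenDocData:fa12.ugly    wgt: 0.5
GenDocData:fo1k.ugly    wgt: 1.0
GenDocData:e.ugly       wgt: 1.0
GenDocData:p1m.ugly     wgt: 0.25
GenDocData:p1s.ugly     wgt: 0.25
GenDocData:p2s.ugly     wgt: 0.25
GenDocData:p3s.ugly     wgt: 0.25
GenDocData:cb3.ugly     wgt: 1.0
GenDocData:r14.ugly     wgt: 0.5
GenDocData:r2.ugly      wgt: 0.5
"""

entries = parse_list_file(EXAMPLE)
total_weight = sum(weight for _, weight in entries)
```

For the example file this yields 11 entries with a total weight of 6.0. The two fa entries, the four p entries, and the two r entries each sum to exactly one vote.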
  


COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

  
  
  Minimum syntax: % prettybox [-INfile=]@pretty.list -Default
  
  Prompted Parameters:
  
  -BEGin=1 -END=349           Range of interest
  -ORIentation=land           Which direction the printing will be, Landscape
                              or Portrait
  -NUMbering=right            Print sequence numbering to Right side, Top side,
                              or None
  -[NO]CONsensus              Displays a consensus sequence
  
  Local Data Files:
  
  -DATa=prettydna.cmp         Consensus comparison table for Nucleotides
  -DATa=prettypep.cmp         Consensus comparison table for Proteins
  
  Optional Parameters:
  
  -OUTput=file.name  Normally output goes to the PlotPort (postscript-compatible)
  -PROtein           Insists that your sequences are proteins, not nucleic acids
  -SIMilar           Consider similarity in generating a consensus; most
                     useful with proteins
  -[No]OFFset        Whether to offset the consensus line from the other sequences
  -IDEntity          Boxes only positions of unanimous agreement
  -CASe              Shows positions agreeing with the consensus in upper
                     case, other positions are lower case
  -THReshold=1.0     Sets minimum comparison value for symbol to vote in
                     the consensus
  -PLUrality=2.0     Defines the minimum number of votes for a consensus to exist
  -LINesize=50       Sets the number of residues per line
  -BLOcksize=10      Sets the number of residues per block
  -FONtsize=12       Font size in terms of postscript numbers
  -XMArgin=20        Left/right margins in postscript units
  -YMArgin=20        Top/bottom margins in postscript units
  -[No]HEAder        Whether to print the header
  -BLAnklines=2      Blank lines between each set of sequences
  -SEQName=partial   If the names of the sequences should be Partial (don't
                      include the file name, good for Pileup files), Full,
                      or None.
  -PAIr=1.5,1.0,0.5  Thresholds for identical, similar, and somewhat-similar
                      pair-wise matching. Protein defaults are: 1.5, 1.0,
                      0.5. Nucleic acid defaults are: 1.0, 1.0, 1.0.
  -COLor=B,L,P,W     What color identical, similar, somewhat-similar, and non-
                      matching residues will have. The colors are:
                      Black, Dark, Light, Pale, White.
  -DENsity=fine      Density of printing, this can be either Rough or Fine;
                      Rough may xerox better. This works with Dark, Light,
                      and Pale colors only.
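
One way to picture how the -PAIr thresholds and -COLor codes fit together is the following Python sketch. It rests on assumptions: the function name shade_for_score is hypothetical, and the rule that a comparison value falls into the first category whose threshold it meets is a guess at the intended behaviour, not a description of PrettyBox's code.

```python
# Color codes listed for -COLor in the summary above.
COLOR_NAMES = {"B": "Black", "D": "Dark", "L": "Light", "P": "Pale", "W": "White"}


def shade_for_score(score, pair=(1.5, 1.0, 0.5), color="B,L,P,W"):
    """Map one comparison value to a shade name (illustrative only)."""
    identical, similar, somewhat = pair
    codes = color.split(",")
    if score >= identical:
        code = codes[0]  # identical
    elif score >= similar:
        code = codes[1]  # similar
    elif score >= somewhat:
        code = codes[2]  # somewhat similar
    else:
        code = codes[3]  # non-matching
    return COLOR_NAMES[code]
```

Under these assumptions and the protein defaults, a value of 1.5 maps to Black, 1.2 to Light, 0.7 to Pale, and 0.0 to White.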
  
  


LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

If you use the command line option -CONsensus, PrettyBox calculates a consensus for each column using a symbol comparison table (Appendix II). You can provide your own table called either PrettyPep.Cmp for peptides or PrettyDNA.Cmp for nucleic acids. You can define some other table with the command line specification -VOTes=FileName.


OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

-LINesize=50

specifies the number of sequence symbols to display on each line. Typically a linesize of 50 is used for text output, but for graphics, values of 100 or more can be more useful.

-BLOcksize=10

specifies the number of sequence symbols to put into each block in the graphics output.
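
The combined effect of -LINesize and -BLOcksize can be pictured with a small Python sketch. The function name layout is hypothetical, and separating blocks with a single space is an assumption made for a text illustration. The graphics output may separate blocks differently.

```python
def layout(sequence, linesize=50, blocksize=10):
    """Split a sequence into lines of linesize symbols, grouped in blocks."""
    lines = []
    for start in range(0, len(sequence), linesize):
        chunk = sequence[start:start + linesize]
        blocks = [chunk[i:i + blocksize] for i in range(0, len(chunk), blocksize)]
        lines.append(" ".join(blocks))  # assumed separator between blocks
    return lines
```

With the defaults, a 120-symbol sequence gives three lines: two with five blocks of ten symbols each and a last line with two blocks.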

-CONsensus

causes PrettyBox to show a consensus sequence for the set of sequences you are displaying. See the topic called CALCULATING A CONSENSUS.

-THReshold=1.0

determines the symbol comparison value below which a symbol may not vote for a coalition. See the topic called CALCULATING A CONSENSUS.

-PLUrality=2.0

defines the number of votes (vote weights) below which there will be no consensus. See the topic called CALCULATING A CONSENSUS.

-SEQName=Partial

uses only the filename (or the MSF file entry name).

Printed: April 22, 1996 15:55 (1162)