Sortconsensus

SORTCONSENSUS


FUNCTION

SortConsensus identifies the strong consensus regions of an alignment in an MSF file and reports them in sorted order.


DESCRIPTION

SortConsensus calculates the consensus of a set of aligned sequences from MSF or ".pileup" files (following the conventions of PileUp MSF files), extracts the consensus regions, and sorts them by decreasing score. The score combines the length and the "quality" of each consensus region. You can weight one of these two criteria more than the other, discard any consensus whose score falls below a threshold (called the sensitivity), or supply your own scoring matrix. The results are written to an output file (default extension .cons).


AUTHOR

This program was written by Philippe Dessen (E-mail: dessen@infobiogen.fr) and colleagues at the French EMBnet node (Post: INFOBIOGEN, 7 rue Guy Moquet - BP8, 94801 Villejuif CEDEX, France).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).


EXAMPLE

Here is a sample session with SortConsensus:

  
  
  % sortconsensus
  
   SORTCONSENSUS of what file ? cah.msf
  
   Threshold (* 1.0 *) 0.8
  
   Sensitivity (* 0.0 *) 4
  
   Plurality (* 2 *)
  
   Output File (* cah.cons *)
  
  %
  


OUTPUT

Here is the output file from a session with SortConsensus:

  
  Threshold: 0.8   Penalty: 1    Sensitivity: 4  Plurality: 2
  
  SORTCONSENSUS of : cah.msf     Type: P
  
  Consensus of
     CAH2_HUMAN      Weight: 1
     CAH3_RAT        Weight: 1
     CAH1_HORSE      Weight: 1
     CAH5_HUMAN      Weight: 1
     CAH1_CHLRE      Weight: 1
     CAHC_SPIOL      Weight: 1..
  
  Length  Score   Position
    15     11.2     274   RDYWTYHGSLTTPPL
    13     9.722    192   KYPAELHLVHWNS
    12     9.178    218   DGLAVLGIFLKL
   /////////////////////////
  
  


CALCULATING AND DISPLAYING A CONSENSUS

SortConsensus calculates a consensus for each column of the alignment using the scoring matrix prettypep.cmp for peptides or prettydna.cmp for nucleic acids. The consensus symbol is the symbol in the column whose comparison to all of the symbols in the column (including itself) yields the greatest number of votes. A vote is cast for each symbol comparison whose scoring matrix value is not below the threshold; each vote counts either 1.0 or the vote weight assigned to the sequence from which the vote comes.

If there is no coalition of votes that is larger than all of the other coalitions, or if the largest coalition is below the minimum plurality, then there is no consensus for the column.

The weights for each sequence, the threshold, and the minimum plurality are all real numbers.
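
The voting rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program's actual code: the function names, the dictionary representation of the scoring matrix, and the toy matrix values (1.5 for identical symbols, 1.0 for one "similar" pair) are all hypothetical and are not taken from prettypep.cmp.

```python
from collections import defaultdict


def toy_matrix(symbols, similar_pairs):
    """Build a small symmetric comparison table (hypothetical values)."""
    matrix = {(a, a): 1.5 for a in symbols}
    for a, b in similar_pairs:
        matrix[(a, b)] = matrix[(b, a)] = 1.0
    return matrix


def column_consensus(column, weights, matrix, threshold=1.0, plurality=2.0):
    """Return the consensus symbol for one alignment column, or None.

    column  : symbols in the column, one per sequence
    weights : vote weight of each sequence, in the same order
    matrix  : dict mapping (candidate, symbol) to a comparison value
    """
    votes = defaultdict(float)
    for candidate in set(column):
        for symbol, weight in zip(column, weights):
            # A symbol may vote for the candidate only if the comparison
            # value is not below the threshold.
            if matrix.get((candidate, symbol), 0.0) >= threshold:
                votes[candidate] += weight
    if not votes:
        return None
    ranked = sorted(votes.items(), key=lambda item: item[1], reverse=True)
    best, best_votes = ranked[0]
    # No consensus if the largest coalition is below the minimum plurality.
    if best_votes < plurality:
        return None
    # No consensus if no single coalition is larger than all the others.
    if len(ranked) > 1 and ranked[1][1] == best_votes:
        return None
    return best


matrix = toy_matrix("LIK", [("L", "I")])
print(column_consensus(list("LLIK"), [1, 1, 1, 1], matrix, threshold=1.5))
print(column_consensus(list("LLIK"), [1, 1, 1, 1], matrix, threshold=1.0))
```

With the toy matrix, a threshold of 1.5 lets only identical symbols vote, so L wins with two votes. At a threshold of 1.0 the similar symbols L and I vote for each other, both coalitions reach three votes, and the tie leaves the column without a consensus.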

-THReshold=1.0

Determines the scoring matrix value below which a symbol may not vote for a coalition.

-PLUrality=2.0

Defines the number of votes (vote weights) below which there is no consensus.

Vote Weight

If several of your sequences are very similar, you may not want their votes to dominate the consensus for the column. If your input file specification to SortConsensus is a list file, you can assign each sequence a vote weight with the wgt sequence attribute. The vote weight is the vote that each row casts for the consensus. A weight of 1.0 is assumed if no vote weight is specified. (See the INPUT FILE topic below for information about the list file used to run the example above.) Note how each kind of sequence is assigned a vote weight so that their combined impact on the election is never more than one vote. For more information about list files, see "Using List Files (formerly Files of Sequence Names)" in Chapter 2, Using Sequences in the GCG User's Guide.

You can assign vote weights to sequences in an MSF file by editing the MSF file and modifying the weight on the name/weight line for each sequence at the top of the file. (See "Using Multiple Sequence Format (MSF) Files" in Chapter 2, Using Sequences in the User's Guide for a complete description of MSF files.)


INPUT FILE

We strongly recommend using PileUp to produce the MSF file required as input. SortConsensus relies on the conventions used by PileUp, and other programs do not always follow them.


ALGORITHM

SortConsensus determines the consensus the same way Pretty does. The score of a consensus is the sum of its column subscores, each of which lies in the range [0, 1]. Each subscore is derived from the values of the scoring matrix and then raised to the power of the PENalty parameter.
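
As a rough sketch of this scoring step, the Python function below sums per-column subscores after raising each to the penalty. How a subscore is derived from the scoring matrix is not specified here, so the sketch takes the subscores as given; the function name and the sample values are assumptions made for illustration.

```python
def consensus_score(subscores, penalty=1.0):
    """Sum the column subscores, each raised to the power `penalty`.

    subscores : per-column values, each expected in the range [0, 1]
    penalty   : positive exponent (stands in for the PENalty parameter)
    """
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    for value in subscores:
        if not 0.0 <= value <= 1.0:
            raise ValueError("each subscore must lie in [0, 1]")
    return sum(value ** penalty for value in subscores)


# Hypothetical subscores for a four-column consensus region.
columns = [1.0, 1.0, 0.5, 0.25]
print(consensus_score(columns, penalty=1.0))  # 2.75
print(consensus_score(columns, penalty=2.0))  # 2.3125
```

Raising the penalty leaves the perfect columns at 1.0 and shrinks the contribution of the imperfect ones, which is why the same region scores lower with a penalty of 2.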


COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

  
  
  Minimum syntax: % sortconsensus [-INfile=]cah.msf -Default
  
  Prompted Parameters:
  
  -THReshold=1.0          scoring matrix value below which a symbol
                          may not vote for a coalition
  -SENsitivity=0.0        minimum score to select consensus
  -PLUrality=2            number of votes below which there is no consensus
  -OUTfile=cah.cons       output file
  
  Local Data Files:
  
  -DATa=prettydna.cmp     consensus scoring matrix for nucleotides
  -DATa=prettypep.cmp     consensus scoring matrix for peptides
  
  Optional Parameters:
  
  -PENalty=1.0            parameter used to favor either the length
                          or the quality of the consensus
  


LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.


OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

-PENalty=1.0

Sets the parameter used to favor either the length or the quality of the consensus. It must be a positive floating-point number. The greater the penalty, the more quality is favored.

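Example: when PENalty is much less than 1, the score approaches the length of the consensus; when PENalty is much greater than 1, the score approaches the number of perfect consensus columns.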
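
These two limits follow from the definition of the score given in the ALGORITHM topic. The short derivation below uses notation introduced here only for illustration (S for the score, s_i for the subscore of column i, L for the length of the consensus, p for the penalty) and assumes that every column of a consensus has a subscore greater than zero.

```latex
S(p) = \sum_{i=1}^{L} s_i^{\,p}, \qquad 0 < s_i \le 1, \quad p > 0

\lim_{p \to 0^{+}} s_i^{\,p} = 1
\;\Longrightarrow\;
\lim_{p \to 0^{+}} S(p) = L

\lim_{p \to \infty} s_i^{\,p} =
\begin{cases}
1 & \text{if } s_i = 1 \\
0 & \text{if } s_i < 1
\end{cases}
\;\Longrightarrow\;
\lim_{p \to \infty} S(p) = \#\{\, i : s_i = 1 \,\}
```

At the default penalty of 1.0 the score is simply the sum of the subscores, which lies between these two extremes.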

-THReshold=1.0

Determines the scoring matrix value below which a symbol may not vote for a coalition (see the CALCULATING AND DISPLAYING A CONSENSUS topic above). Same as in GCG's Pretty.

-PLUrality=2.0

Defines the number of votes (vote weights) below which there is no consensus (see the CALCULATING AND DISPLAYING A CONSENSUS topic above). Same as in GCG's Pretty.

-SENsitivity=0.0

Sets the minimum value a consensus must score in order to be reported.
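
The effect of the sensitivity on the report can be sketched as a filter followed by a sort on decreasing score. The Python below is an illustration only: the function name and the tuple layout are assumptions, the first three regions are copied from the sample output above, and the fourth region is a made-up low-scoring entry added to show the filter.

```python
def select_regions(regions, sensitivity=0.0):
    """Keep regions scoring at least `sensitivity`, best score first.

    Each region is a (length, score, position, consensus) tuple.
    """
    kept = [region for region in regions if region[1] >= sensitivity]
    return sorted(kept, key=lambda region: region[1], reverse=True)


regions = [
    (13, 9.722, 192, "KYPAELHLVHWNS"),
    (15, 11.2, 274, "RDYWTYHGSLTTPPL"),
    (12, 9.178, 218, "DGLAVLGIFLKL"),
    (4, 3.1, 50, "GHSW"),  # hypothetical region, not from the sample output
]

for length, score, position, consensus in select_regions(regions, sensitivity=4):
    print(f"{length:5d} {score:8.3f} {position:6d}   {consensus}")
```

With a sensitivity of 4, the hypothetical region scoring 3.1 is dropped and the remaining three appear in the same order as in the sample output.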