SortConsensus identifies the strong consensus regions of an alignment in an MSF file and reports them in sorted order.
SortConsensus calculates the consensus of a set of aligned sequences from MSF or ".pileup" files (following the conventions of PileUp MSF files) and sorts the resulting consensus regions by decreasing score. The score combines the length and the "quality" of the consensus. The user can weight one of these two factors more heavily than the other, eliminate any consensus whose score is below a threshold (called sensitivity), or use a personal scoring matrix. The results are stored in an output file (default extension .cons).
This program was written by Philippe Dessen (E-mail: dessen@infobiogen.fr) and colleagues at the French EMBnet node (Post: INFOBIOGEN, 7 rue Guy Moquet - BP8, 94801 Villejuif CEDEX, France).
All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).
Here is a sample session with SortConsensus:

  % sortconsensus

    SORTCONSENSUS of what file ?  cah.msf

    Threshold (* 1.0 *)  0.8

    Sensitivity (* 0.0 *)  4

    Plurality (* 2 *)

    Output File (* cah.cons *)

  %
Here is the output file from a session with SortConsensus:

  Threshold: 0.8  Penalty: 1  Sensitivity: 4  Plurality: 2

  SORTCONSENSUS of : cah.msf  Type: P

  Consensus of
    CAH2_HUMAN  Weight: 1
    CAH3_RAT    Weight: 1
    CAH1_HORSE  Weight: 1
    CAH5_HUMAN  Weight: 1
    CAH1_CHLRE  Weight: 1
    CAHC_SPIOL  Weight: 1  ..

  Length  Score   Position
    15    11.2      274     RDYWTYHGSLTTPPL
    13     9.722    192     KYPAELHLVHWNS
    12     9.178    218     DGLAVLGIFLKL

  /////////////////////////
SortConsensus calculates a consensus for each column of the alignment using the scoring matrix prettypep.cmp for peptides or prettydna.cmp for nucleic acids. The consensus is determined by finding the symbol in the column for which its comparison to all of the symbols in the column (including itself) yields the greatest number of votes. A vote is cast for each symbol comparison that is over some set threshold value; votes can be either 1.0 or some vote weight assigned to the sequence from which the vote comes.
If there is no coalition of votes that is larger than all of the other coalitions, or if the largest coalition is below the minimum plurality, then there is no consensus for the column.
The weights for each sequence, the threshold, and the minimum plurality are all real numbers.
The threshold determines the scoring matrix value below which a symbol may not vote for a coalition.
The minimum plurality defines the number of votes (vote weights) below which there is no consensus.
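The voting procedure described above can be sketched in Python. This is only an illustration under stated assumptions, not the program's actual code: the function name column_consensus, the dictionary form of the matrix, and the values in TOY are all hypothetical (the real values come from prettypep.cmp or prettydna.cmp), and gap handling is left out.

```python
# Toy comparison values, invented for illustration only.
TOY = {
    ("L", "L"): 1.5, ("I", "I"): 1.5, ("V", "V"): 1.5, ("K", "K"): 1.5,
    ("L", "I"): 1.0, ("I", "L"): 1.0,
}


def column_consensus(column, weights, matrix, threshold=1.0, plurality=2.0):
    """Return the consensus symbol for one alignment column, or None."""
    votes = {}
    for candidate in set(column):
        # Each row votes for the candidate if its symbol compares at or
        # above the threshold; the vote is worth that row's weight.
        votes[candidate] = sum(
            weight
            for symbol, weight in zip(column, weights)
            if matrix.get((candidate, symbol), 0.0) >= threshold
        )
    best = max(votes.values())
    winners = [symbol for symbol, total in votes.items() if total == best]
    # No consensus on a tie or when the largest coalition is too small.
    if len(winners) != 1 or best < plurality:
        return None
    return winners[0]
```

With the toy matrix and equal weights, the column LLIK yields L at a threshold of 1.5, but no consensus at a threshold of 1.0, because L and I then tie at three votes each.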
If several of your sequences are very similar, you may not want their votes to dominate the consensus for the column. If your input file specification to SortConsensus is a list file, you can assign each sequence a vote weight with the wgt sequence attribute. The vote weight is the vote that each row casts for the consensus. A weight of 1.0 is assumed if no vote weight is specified. (See the INPUT FILE topic below for information about the list file used to run the example above.) Note how each kind of sequence is assigned a vote weight so that their combined impact on the election is never more than one vote. For more information about list files, see "Using List Files (formerly Files of Sequence Names) " in Chapter 2, Using Sequences in the GCG User's Guide.
You can assign vote weights to sequences in an MSF file by editing the MSF file and modifying the weight on the name/weight line for each sequence at the top of the file. (See "Using Multiple Sequence Format (MSF) Files" in Chapter 2, Using Sequences in the User's Guide for a complete description of MSF files.)
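Editing the weights by hand is simple for a few sequences; for many, a short script may help. The sketch below assumes the usual MSF layout, in which each header line has the form "Name: ... Len: ... Check: ... Weight: ..." and the header ends at a line containing only "//". The function name set_msf_weights is hypothetical and not part of any GCG tool.

```python
import re

# Captures the text up to and including "Weight:", the sequence name,
# the old weight, and any trailing whitespace.
NAME_LINE = re.compile(r"^(\s*Name:\s*(\S+).*?Weight:\s*)(\S+)(\s*)$")


def set_msf_weights(msf_text, new_weights):
    """Return msf_text with the header weights replaced for named sequences."""
    out = []
    in_header = True
    for line in msf_text.splitlines():
        if in_header and line.strip() == "//":
            in_header = False  # alignment block starts; leave it untouched
        elif in_header:
            match = NAME_LINE.match(line)
            if match and match.group(2) in new_weights:
                weight = new_weights[match.group(2)]
                line = f"{match.group(1)}{weight:.2f}{match.group(4)}"
        out.append(line)
    return "\n".join(out)
```

For example, two nearly identical sequences could each be given a weight of 0.5 so that together they cast a single vote.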
We strongly recommend using PileUp to produce the MSF file required as input. The method used by SortConsensus relies on the conventions followed by PileUp, and other programs do not always follow them.
SortConsensus determines the consensus the same way Pretty does. The score of a consensus is the sum of the subscores of its columns, each of which lies in the range [0, 1]. A subscore is derived from the values of the scoring matrix and then raised to the power of the PENalty parameter.
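The scoring rule can be sketched as follows. The function name consensus_score is hypothetical, and the subscores are taken as given, since how each one is derived from the scoring matrix is not spelled out here.

```python
def consensus_score(subscores, penalty=1.0):
    """Sum the per-column subscores, each raised to the power of the penalty."""
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    if any(not 0.0 <= s <= 1.0 for s in subscores):
        raise ValueError("subscores must lie in [0, 1]")
    return sum(s ** penalty for s in subscores)
```

Because every subscore is at most 1, a larger penalty shrinks the contribution of imperfect columns, while perfect columns (subscore 1) always contribute exactly 1.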
All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
Minimum syntax: % sortconsensus [-INfile=]cah.msf -Default

Prompted Parameters:

-THReshold=1.0        scoring matrix value below which a symbol may not vote for a coalition
-SENsitivity=0.0      minimum score to select consensus
-PLUrality=2          number of votes below which there is no consensus
-OUTfile=cah.cons     output file

Local Data Files:

-DATa=prettydna.cmp   consensus scoring matrix for nucleotides
-DATa=prettypep.cmp   consensus scoring matrix for peptides

Optional Parameters:

-PENalty=1.0          parameter used to favor either the length or the quality of the consensus
The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.
The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
-PENalty sets the parameter used to favor either the length or the quality of the consensus. It is a floating point number that must be positive. The greater the penalty, the more quality is favored.
Example: when PEN << 1, the score approaches the length of the consensus; when PEN >> 1, the score approaches the number of perfect consensus columns.
-THReshold determines the scoring matrix value below which a symbol may not vote for a coalition (see the CALCULATING A CONSENSUS topic above). Same as in GCG's Pretty.
-PLUrality defines the number of votes (vote weights) below which there is no consensus (see the CALCULATING A CONSENSUS topic above). Same as in GCG's Pretty.
-SENsitivity sets the minimum value a consensus must score in order to be reported.
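The combined effect of the sensitivity cutoff and the sorted report can be sketched as below. The names Region and report are hypothetical, and the sketch assumes a consensus scoring exactly at the sensitivity value is still reported.

```python
from typing import NamedTuple


class Region(NamedTuple):
    length: int
    score: float
    position: int
    sequence: str


def report(regions, sensitivity=0.0):
    """Keep regions scoring at least `sensitivity`, highest score first."""
    kept = [r for r in regions if r.score >= sensitivity]
    return sorted(kept, key=lambda r: r.score, reverse=True)
```

Applied to the three regions in the sample output above with a sensitivity of 4, all three are kept and listed in the order shown there; raising the sensitivity to 10 would leave only the region that scores 11.2.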