Eclustalw

ECLUSTALW

FUNCTION

EClustAlW calculates a multiple alignment of nucleic acid or protein sequences according to the method of Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). This is part of the original ClustalW distribution, modified for inclusion in EGCG.

DESCRIPTION

EClustAlW is a program for performing multiple alignments of up to 500 DNA or protein sequences of up to 10,000 residues (including gaps in the final alignment).

The basic multiple alignment algorithm consists of three main stages:

1) all pairs of sequences are aligned separately in order to calculate a distance matrix giving the divergence of each pair of sequences;

2) a guide tree is calculated from the distance matrix;

3) the sequences are progressively aligned according to the branching order in the guide tree.

Two files are usually produced during the alignment process:

1) a file containing a description of the dendrogram;

2) a file for the multiple alignment.

You can use the option -ONLY to specify that only the dendrogram is to be produced. This dendrogram file can later be used as input (-DENdrogram=filename). This is useful if you have a large data set (many sequences) for which it takes a long time to produce the dendrogram: if you wish to experiment with different gap parameters in the multiple alignment, you only have to produce the dendrogram file once. If you run EClustAlW with the parameter -MSF, you will obtain a multiple sequence format file (MSF file) which you can use as input for other programs like ClusTree or BoxAlign.

AUTHOR

ClustalW was written by Des Higgins (E-mail: Des.Higgins@ebi.ac.uk).

The EGCG version of the program was modified by Weiyun Chen and Karl-Heinz Glatting at the German Cancer Research Centre (DKFZ), Heidelberg, Germany.

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).

EXAMPLE

  
   % eclustalw -msf
  
   EClustAlW on what sequence(s) ? @globin.fil
  
   hbahum.pep >>> Length of sequence 1: 141 symbols <<<
   hbbhum.pep >>> Length of sequence 2: 146 symbols <<<
   hbghum.pep >>> Length of sequence 3: 146 symbols <<<
   hbhagf.pep >>> Length of sequence 4: 148 symbols <<<
   hbrlam.pep >>> Length of sequence 5: 149 symbols <<<
   mycrhi.pep >>> Length of sequence 6: 151 symbols <<<
   myohum.pep >>> Length of sequence 7: 153 symbols <<<
  
   Would you like to:
  
   A)dd more sequences
   Q)uit and compute cluster alignment
  
   Please choose one (* Q *):
  
   What should I call the  alignment output file (* globin.aln *) ?
  
   What should I call the dendrogram output file (* globin.dnd *) ?
  
   What should I call the MSF output file (* globin.msf *) ?
  
  %

OUTPUT

The final multiple alignment is sent to a file whose name is derived from the sequence input file with the addition of the ending .aln. The output is self-explanatory. Positions where all residues are identical are marked with an asterisk ( * ) and, for proteins, positions where all residues belong to the same class are marked with a dot ( . ).

Here is the output file:

  
  
  of: @globin.fil
   hbahum.pep                   ck: 9231  from:     1  to:   141  Length:  141
   hbbhum.pep                   ck: 1242  from:     1  to:   146  Length:  146
   hbghum.pep                   ck: 3104  from:     1  to:   146  Length:  146
   hbhagf.pep                   ck: 4827  from:     1  to:   148  Length:  148
   hbrlam.pep                   ck: 7737  from:     1  to:   149  Length:  149
   mycrhi.pep                   ck:  918  from:     1  to:   151  Length:  151
   myohum.pep                   ck: 4188  from:     1  to:   153  Length:  153
  
   Pairwise similarity parameter:
  
   K-Tuple length:          1
   Gap Penalty:             3
   Number of diagonals:     5
   Diagonal window size:    5
   Scoring Method:          Percentage
  
   Multiple alignment parameter:
  
   Gap Penalty (fixed):           10.00
   Gap Penalty (varying):         0.05
   Gap separation penalty range:  8
   Percent. identity for delay:   40%
   List of hydrophilic residue:   GPSNDQEKR
   Protein Weight Matrix:         blosum
  
   Used Sequences:
  
    1 /husar3/gcg/gcg ( 141) HBAHUM     HEMOGLOBIN ALPHA CHAIN, HUMAN
    2 /husar3/gcg/gcg ( 146) HBBHUM     HEMOGLOBIN BETA CHAIN, HUMAN
    3 /husar3/gcg/gcg ( 146) HBGHUM     HEMOGLOBIN GAMMA CHAIN, HUMAN
    4 /husar3/gcg/gcg ( 148) HBHAGF     HEMOGLOBIN, HAGFISH (MYXINE GLUTINOSA)
    5 /husar3/gcg/gcg ( 149) HBRLAM     HEMOGLOBIN, RIVER  LAMPREY (LAMPETRA FLUVIATILIS)
    6 /husar3/gcg/gcg ( 151) MYCRHI     MYOGLOBIN, GASTROPOD, CERITHIDE RHIZOPHORARUM
    7 /husar3/gcg/gcg ( 153) MYOHUM     MYOGLOBIN, HUMAN
  
  
                     10        20        30        40        50        60
                      .         .         .         .         .         .
  hbahum.pep      ---------VLSPADKTNVKAAWGKVG---AHAGEYGAEALERMFLSFPTTKTYFPHF--
  hbbhum.pep      --------VHLTPEEKSAVTALWGKV-----NVDEVGGEALGRLLVVYPWTQRFFESFGD
  hbghum.pep      --------GHFTEEDKATITSLWGKV-----NVEDAGGETLGRLLVVYPWTQRFFDSFGN
  hbhagf.pep      PITDHGQPPTLSEGDKKAIRESWPQIY---KNFEQNSLAVLLEFLKKFPKAQDSFPKFS-
  hbrlam.pep      PIVDSGSVAPLSAAEKTKIRSAWAPVY---SNYETSGVDILVKFFTSTPAAQEFFPKFKG
  mycrhi.pep      ---------SLQPASKSALASSWKTLAKDAATIQNNGATLFSLLFKQFPDTRNYFTHFGN
  myohum.pep      ---------GLSDGEWQLVLNVWGKVE---ADIPGHGQEVLIRLFKGHPETLEKFDKFKH
                               .   *  .                      * .   *  *
  
  hbahum.pep      ----DLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDL---HAHKLRVDPVNFK
  hbbhum.pep      LSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSEL---HCDKLHVDPENFR
  hbghum.pep      LSSASAIMGNPKVKAHGKKVLTSLGDAIKHLDDLKGTFAQLSEL---HCDKLHVDPENFK
  hbhagf.pep      -AKKSHLEQDPAVKLQAEVIINAVNHTIGLMDKEAAMKKYLKDLSTKHSTEFQVNPDMFK
  hbrlam.pep      MTSADQLKKSADVRWHAERIINAVNDAVASMDDTEKMSMKLRDLSGKHAKSFQVDPQYFK
  mycrhi.pep      -MSDAEMKTTGVGKAHSMAVFAGIGSMIDSMDDADCMNGLALKLSRNHIQRKIGASR-FG
  myohum.pep      LKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQS---HATKHKIPVKYLE
                          . .   .       .                   *
  
  hbahum.pep      LLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
  hbbhum.pep      LLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
  hbghum.pep      LLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVASALSSRYH------
  hbhagf.pep      ELSAVFVSTMGGK----------AAYEKLFSIIATLLRSTYDA-----
  hbrlam.pep      VLAAVIADTVAAG---------DAGFEKLMSMICILLRSAY-------
  mycrhi.pep      EMRQVFPNFLDEALGGGASGDVKGAWDALLAYLQDNKQAQAL------
  myohum.pep      FISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG
              .       .
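
The marker line printed under each alignment block (asterisks and dots) can be sketched as follows. This is a minimal illustration, not the program's code: the residue classes in RESIDUE_CLASSES are hypothetical, since this manual does not list the classes EClustAlW uses, and leaving columns that contain a gap unmarked is likewise an assumption.

```python
# Hypothetical residue classes, chosen only for illustration; the classes
# actually used by EClustAlW are not listed in this manual.
RESIDUE_CLASSES = [set("AVLIMC"), set("FWY"), set("STNQ"),
                   set("KRH"), set("DE"), set("GP")]


def conservation_line(rows, is_protein=True):
    """Return one marker per column: '*' identical, '.' same class, ' ' otherwise."""
    markers = []
    for column in zip(*rows):
        residues = set(column)
        if "-" in residues:
            markers.append(" ")   # assumption: columns with gaps are not marked
        elif len(residues) == 1:
            markers.append("*")   # all residues identical
        elif is_protein and any(residues <= cls for cls in RESIDUE_CLASSES):
            markers.append(".")   # all residues fall into one class
        else:
            markers.append(" ")
    return "".join(markers)


print(repr(conservation_line(["AKD-", "AKE-", "ARDG"])))
```

For nucleic acid alignments, call the function with is_protein=False so that only identical columns are marked.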

MULTIPLE SEQUENCE FILES

Using the -MSF option, EClustAlW writes the alignment into a multiple sequence format (MSF) file that interleaves the sequences to show their alignment. Any or all of the sequences in this file can be used by any other GCG/EGCG sequence analysis program. For instance, you could generate a phylogenetic tree from the sequences in the MSF file with the command % clustree clustal.msf{*}. (See the Specifying Sequences section of the User's Guide for help specifying sequences in MSF files.)

RELATED PROGRAMS

BoxAlign displays a sequence alignment graphically, marking columns of conserved amino acids or nucleotides with boxes. BoxAlign does not compute an alignment; it simply displays it.

ClusTree computes a phylogenetic tree according to the Neighbor-Joining Method of Saitou and Nei (1987). This is part of the original ClustalW distribution, modified for inclusion in EGCG. The tree will be displayed graphically.

LineUp is a screen editor for editing multiple sequence alignments. You can edit up to 30 sequences simultaneously. New sequences can be typed in by hand or added from existing sequence files. A consensus sequence identifies places where the sequences are in conflict.

Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns. Motifs can display an abstract of the current literature on each of the motifs it finds.

MultAlign does a simultaneous alignment for two or more DNA or protein sequences. It introduces a certain number of gaps into either pairwise aligned sequences or groups of sequences to find a minimal global distance. The user can influence the result by defining the order in which the sequences will be aligned. The program is based on a generalization of the algorithm of Waterman, Smith and Beyer by Krueger and Osterburg.

PileUp creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment.

PlotAlign takes a GCG format sequence alignment, and plots the mean and range of values for any amino acid parameter you supply. The "panel file" contains a list of parameters to be plotted. The main database of parameters is taken from Nakai et al. (1988), and the default panel file uses selected parameters from the 13 discrete clusters in that paper. This program is experimental. Any suggestions would be most welcome.

Pretty displays multiple sequence alignments and calculates a consensus sequence. It does not create the alignment; it simply displays it.

ProfAlign takes two existing alignments (or single sequences) and aligns them with each other. The result is one bigger alignment. This is part of the original ClustalW distribution, modified for inclusion in EGCG.

ProfileGap makes an optimal alignment between a profile and a sequence.

TProfileGap makes an optimal alignment between a profile and a sequence.

Tree produces a multiple alignment for a set of protein sequences by iteratively acting on the sequences. An approximate phylogenetic order of the sequences is first determined by a series of pairwise alignments using the Needleman and Wunsch method. Any subclusters that may exist in the set are prealigned before the final alignment is undertaken. Finally, the phylogenetic tree of the sequences is plotted in the form of a dendrogram.

RESTRICTIONS

EClustAlW can align up to 500 sequences, each of which can consist of up to 10,000 symbols. As gaps are inserted, the length of the final alignment grows, but the length in the final alignment cannot exceed 10,000 characters for any sequence. This means the maximum sequence length is 10,000 - X, where X is the number of gaps introduced by EClustAlW. For DNA, U = T. No ambiguity codes are used.

ALGORITHM

Distance Matrix/Pairwise Alignments

Fast approximate pairwise alignment

The similarity scores are calculated from fast alignments generated by the method of Wilbur and Lipman (Proc. Natl. Acad. Sci. 80: 726-730 (1983)). These are "hash" or "word" or "k-tuple" alignments carried out in 3 stages.

First you mark the positions of every fragment of sequence, K-tuple long (for proteins, the default length is 1 residue, for DNA it is 2 bases) in both sequences. Then you locate all k-tuple matches between the 2 sequences. At this stage you have to imagine a dot-matrix plot between the 2 sequences with each k-tuple match as a dot. You find those diagonals in the plot with most matches (you take the "No. of top diagonals" best ones) and mark all diagonals within "Window size" of each top diagonal. This process will define diagonal bands in the plot where you hope the most likely regions of similarity will lie.
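
The first two stages described above (locating k-tuple matches, then choosing the best diagonals and the bands around them) can be sketched as follows. This is a simplified illustration under stated assumptions, not the Wilbur and Lipman implementation: the function names are hypothetical, a diagonal is identified by the difference i - j of the two match positions, and ties between equally good diagonals are broken arbitrarily.

```python
from collections import Counter, defaultdict


def ktuple_matches(seq_a, seq_b, k=1):
    """Return (i, j) start positions where a k-tuple of seq_a equals one of seq_b."""
    positions = defaultdict(list)
    for j in range(len(seq_b) - k + 1):
        positions[seq_b[j:j + k]].append(j)
    return [(i, j)
            for i in range(len(seq_a) - k + 1)
            for j in positions.get(seq_a[i:i + k], [])]


def diagonal_bands(seq_a, seq_b, k=1, top_diagonals=5, window=5):
    """Return the set of diagonals (i - j) kept for the final alignment stage."""
    # Each match is a dot in the imaginary dot-matrix plot; count dots per diagonal.
    counts = Counter(i - j for i, j in ktuple_matches(seq_a, seq_b, k))
    best = [diag for diag, _ in counts.most_common(top_diagonals)]
    kept = set()
    for diag in best:
        # Mark every diagonal within `window` of a top diagonal.
        kept.update(range(diag - window, diag + window + 1))
    return kept


print(ktuple_matches("ABC", "XBC", k=2))
print(sorted(diagonal_bands("ABC", "XBC", k=2, top_diagonals=1, window=1)))
```

The defaults mirror the protein defaults of -KTUP, -TOPDiags and -WINdow; smaller values for top_diagonals and window shrink the search region, which is why they trade sensitivity for speed.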

The final alignment stage is to find the head-to-tail arrangement of k-tuple matches from these diagonal regions that gives the highest score. The score is calculated as the number of exactly matching residues in this alignment minus a "gap penalty" for every gap that was introduced. The "Scoring method" setting chooses whether these similarity scores are expressed as raw scores or as a percentage of the shorter sequence length.

Slow pairwise alignment

By setting the parameter -SLOW, the initial pairwise alignments will be carried out using a full dynamic programming algorithm (Myers E.W. and Miller W. CABIOS 4: 11-17 (1988); Thompson J.D. CABIOS 11: 181-186 (1995)). This method is more accurate but MUCH slower. The scores are calculated as the number of identities in the best alignment divided by the number of residues compared (gap positions are excluded). The alignment uses two gap penalties (for opening or extending gaps, see the section "Multiple Alignment") and a full amino acid weight matrix.
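
The score used by the slow method (identities divided by residues compared, with gap positions excluded) can be sketched as follows. The function name is hypothetical and the input is assumed to be two already aligned sequences of equal length that use "-" for gaps; the dynamic programming step that produces the alignment is not shown.

```python
def percent_identity(aligned_a, aligned_b):
    """Identities divided by residues compared; columns with a gap are excluded."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue              # gap positions are not counted
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


print(percent_identity("AC-GT", "ACTGA"))  # 3 identities in 4 compared columns
```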

The Guide Tree/Dendrogram

The trees used to guide the final multiple alignment process are calculated from the distance matrix of step I using the Neighbour-Joining method (Saitou N. and Nei M. Mol. Biol. Evol. 4: 406-425 (1987)). This produces unrooted trees with branch lengths proportional to estimated divergence along each branch. The tree represents the similarity of the sequences as a hierarchy. The dendrogram is written to a file "file.dnd" and is shown below for an example with 7 sequences.

  
  Example dendrogram for  7 sequences (globin.dnd):
  
                (
                (
   1            hbahum.pep:0.29290,
                (
   2            hbbhum.pep:0.13346,
   3            hbghum.pep:0.13366)
                :0.19765)
                :0.06420,
                (
                (
   4            hbhagf.pep:0.31939,
   5            hbrlam.pep:0.28872)
                :0.10264,
   6            mycrhi.pep:0.44180)
                :0.01878,
   7            myohum.pep:0.38662);

The process runs from the top down, joining more and more sequences until all are joined together. An opening parenthesis "(" marks the start of a group of joined sequences and a closing parenthesis ")" marks its end. The number after each colon is the length of that branch, calculated from the pairwise distance scores. For example, sequence 2 (hbbhum.pep) joins sequence 3 (hbghum.pep); next, sequence 1 (hbahum.pep) joins the group of sequences 2 and 3. Sequence 4 (hbhagf.pep) joins sequence 5 (hbrlam.pep); next, sequence 6 (mycrhi.pep) joins the group of sequences 4 and 5. At the end, sequence 7 (myohum.pep) joins the group of sequences 1, 2 and 3 and the group of sequences 4, 5 and 6. This is shown in the diagram below.

  
  Diagram of the sequence similarity relationships shown in the above
  dendrogram file (branch lengths are not to scale).
  
                                        0.13346
                                I-------------------- 2: hbbhum.pep
                    0.19765     I
               I----------------I
               I                I       0.13366
   0.06420     I                I-------------------- 3: hbghum.pep
  I-----------------I
  I                 I                0.29290
  I                 I------------------------------------- 1: hbahum.pep
  I
  I                                          0.31939
  I                                  I-------------------- 4: hbhagf.pep
  I                      0.10264     I
  I                 I----------------I
  I                 I                I       0.28872
  I     0.01878     I                I-------------------- 5: hbrlam.pep
  I-----------------I
  I                 I                0.44180
  I                 I------------------------------------- 6: mycrhi.pep
  I
  I                         0.38662
  I------------------------------------------------------- 7: myohum.pep
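
The joining order described above can be recovered from the dendrogram text itself. The following is a minimal sketch that assumes the file content has been reduced to a single parenthesised string (whitespace and the sequence numbers shown in the listing removed); the function names are hypothetical, and reading the groups innermost first is one plausible interpretation of the joining order rather than the program's documented behaviour.

```python
GLOBIN_DND = ("((hbahum.pep:0.29290,(hbbhum.pep:0.13346,hbghum.pep:0.13366):0.19765)"
              ":0.06420,((hbhagf.pep:0.31939,hbrlam.pep:0.28872):0.10264,"
              "mycrhi.pep:0.44180):0.01878,myohum.pep:0.38662);")


def parse_tree(text):
    """Parse a parenthesised tree into nested (node, branch_length) pairs."""
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                          # consume "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1
                children.append(parse_node())
            pos += 1                          # consume ")"
            node = tuple(children)
        else:
            start = pos
            while pos < len(text) and text[pos] not in ",:);":
                pos += 1
            node = text[start:pos]            # a sequence name
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",);":
                pos += 1
            length = float(text[start:pos])
        return node, length

    return parse_node()


def leaves(node):
    """Return the sequence names below a node, in file order."""
    subtree, _ = node
    if isinstance(subtree, str):
        return [subtree]
    return [name for child in subtree for name in leaves(child)]


def join_order(node):
    """List each joined group of sequences, innermost groups first."""
    subtree, _ = node
    if isinstance(subtree, str):
        return []
    order = []
    for child in subtree:
        order.extend(join_order(child))
    order.append(leaves(node))
    return order


for group in join_order(parse_tree(GLOBIN_DND)):
    print(" + ".join(group))
```

For the globin example the first group listed is hbbhum.pep with hbghum.pep, and the last group contains all seven sequences.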

Multiple Alignment

Having calculated a dendrogram between a set of sequences, the final multiple alignment is carried out by a series of alignments of larger and larger groups of sequences. The order is determined by the dendrogram so that the most similar sequences get aligned first. Gaps that are present in older alignments remain fixed. In the basic algorithm, new gaps that are introduced at each stage get full gap opening and extension penalties, even if they are introduced inside old gap positions. A gap opening penalty (GOP) gives the cost of opening a new gap of any length; a gap extension penalty (GEP) gives the cost of every item in a gap. Initial values can be set by the user using the parameters -GAPC and -GAPV. The program then automatically attempts to choose appropriate gap penalties for each sequence alignment, depending on the following factors.

Dependence on the weight matrix: The average score for two mismatched residues is used as a scaling factor for the GOP.

Dependence on the lengths of the sequences: The logarithm of the length of the shorter sequence is used to increase the GOP with sequence length.

Dependence on the difference in the lengths of the sequences: If one sequence is much shorter than the other, the GEP is increased to inhibit too many long gaps in the shorter sequence.

Position-specific gap penalties: The initial GOP is manipulated in a position-specific manner, in order to make gaps more or less likely at different positions.

Lowered gap penalties at existing gaps: If there are already gaps at a position, then the GOP is reduced in proportion to the number of sequences with a gap at this position and the GEP is lowered by a half.

Increased gap penalties near existing gaps: If a position does not have any gaps but is within 8 residues of an existing gap, the GOP is increased.

Reduced gap penalties in hydrophilic stretches: The residues that are to be considered hydrophilic may be set by the user using the parameter -RGAPRes (the default is "GPSNDQEKR"). If, at any position, there are no gaps and any of the sequences has a stretch of hydrophilic residues, the GOP is reduced by one third.

Residue-specific penalties: If there is no hydrophilic stretch and the position does not contain any gaps, then the GOP is manipulated depending on the residue.
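
The position-specific rules listed above can be combined into one small function. This is only a sketch under stated assumptions: the function and parameter names are hypothetical, the proportional reduction at existing gaps and the near_gap_factor of 2.0 are illustrative values that this manual does not specify, the order in which the rules are tested is an assumption, and the residue-specific adjustment is left out.

```python
def adjust_penalties(gop, gep, n_seqs, n_with_gap, dist_to_gap=None,
                     hydrophilic_stretch=False, gap_dist=8, near_gap_factor=2.0):
    """Return the (GOP, GEP) pair to use at one alignment position."""
    if n_with_gap > 0:
        # Lowered penalties at existing gaps: GOP shrinks as more sequences
        # already have a gap here (illustrative formula); GEP is halved.
        gop = gop * (n_seqs - n_with_gap) / n_seqs
        gep = gep / 2.0
    elif dist_to_gap is not None and dist_to_gap <= gap_dist:
        # Increased penalty near an existing gap (factor is an assumption).
        gop = gop * near_gap_factor
    elif hydrophilic_stretch:
        # Reduced by one third inside a hydrophilic stretch.
        gop = gop * 2.0 / 3.0
    return gop, gep


print(adjust_penalties(10.0, 0.05, n_seqs=4, n_with_gap=1))
print(adjust_penalties(9.0, 0.05, n_seqs=4, n_with_gap=0, hydrophilic_stretch=True))
```

The starting values 10.0 and 0.05 correspond to the fixed and varying gap penalties shown in the example output file.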

Weight matrices

Two main series of weight matrices are offered to the user: the Dayhoff PAM series (Dayhoff M.O., Schwartz R.M., Orcutt B.C. In Atlas of Protein Sequence and Structure (Dayhoff M.O. ed.) 5, suppl. 3: 345-352, NBRF, Washington (1978)) and the BLOSUM series (Henikoff S. and Henikoff J.G. Proc. Natl. Acad. Sci. USA 89: 10915-10919 (1992)). The default is the BLOSUM series. In each case, there is a choice of matrix ranging from strict ones, useful for comparing very closely related sequences, to very "soft" ones that are useful for comparing very distantly related sequences. Depending on the distance between the two sequences or groups of sequences to be compared, one of 4 different matrices is used. The distances are measured directly from the guide tree. The ranges of distances and tables used with the PAM series of matrices are: 80 - 100%: PAM20, 60 - 80%: PAM60, 40 - 60%: PAM120, 0 - 40%: PAM350. The ranges used with the BLOSUM series are: 80 - 100%: BLOSUM80, 60 - 80%: BLOSUM62, 30 - 60%: BLOSUM45, 0 - 30%: BLOSUM30.
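
The ranges above amount to a simple lookup. The sketch below assumes the percentage is a similarity value between 0 and 100 and that a value lying exactly on a boundary selects the stricter matrix; neither point is stated in this manual, and the function name is hypothetical.

```python
# Lower bound of each range (in percent) and the matrix used from there upwards.
MATRIX_SERIES = {
    "blosum": [(80, "BLOSUM80"), (60, "BLOSUM62"), (30, "BLOSUM45"), (0, "BLOSUM30")],
    "pam": [(80, "PAM20"), (60, "PAM60"), (40, "PAM120"), (0, "PAM350")],
}


def choose_matrix(percent, series="blosum"):
    """Pick the matrix whose range contains the given percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must lie between 0 and 100")
    for lower_bound, name in MATRIX_SERIES[series]:
        if percent >= lower_bound:
            return name


print(choose_matrix(45))          # falls in the 30 - 60% BLOSUM range
print(choose_matrix(45, "pam"))   # falls in the 40 - 60% PAM range
```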

Divergent Sequences

The most divergent sequences (most different on average from all of the other sequences) are usually the most difficult to align correctly. It is sometimes better to delay the incorporation of these sequences until all of the more easily aligned sequences are merged first. A choice is offered to set a cut-off using the parameter -MAXDiv (default is 40 % identity or less with any other sequence) that will delay the alignment of the divergent sequences until all of the rest have been aligned.
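
The delay rule can be sketched as a filter over the pairwise identity table. The sketch assumes that a sequence is delayed when its identity to every other sequence is below the cut-off, which is one reading of the description above; the function name and the example values are hypothetical.

```python
def delayed_sequences(identity, max_div=40.0):
    """Return indices of sequences whose best identity to any other is below max_div.

    identity[i][j] is the percent identity between sequences i and j.
    """
    delayed = []
    for i, row in enumerate(identity):
        others = [value for j, value in enumerate(row) if j != i]
        if others and max(others) < max_div:
            delayed.append(i)
    return delayed


# Hypothetical identity table for three sequences.
example = [
    [100.0, 85.0, 30.0],
    [85.0, 100.0, 35.0],
    [30.0, 35.0, 100.0],
]
print(delayed_sequences(example))   # only the third sequence is delayed
```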

COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

  
  
  Minimal Syntax: % eclustalw [-INfile=]@globin.fil  -Default
  
  Prompted Parameters:
  
  [-OUTfile1=]globin.aln  output file name
  [-OUTfile2=]globin.dnd  dendrogram output file name
  -BEGin1=1 -END1=100     range of interest for sequence 1
  -REV2                   strand of sequence 2 (if DNA)
  
  Local Data Files:
  
  -DATa1=pam250.clus      comparison table for protein pairwise alignments
  -DATa2=pam250.clus      comparison table for protein multiple alignments
  
  Optional Parameters:
  
  -ONLY                   produces the dendrogram file only
  -DENdrogram[=globin.dnd] uses an old dendrogram file
  -PROtein                insists that your sequences are protein sequences
  -DNA                    insists that your sequences are DNA sequences
  -MSF[=seqname.msf]      writes an MSF file
  -SLOWalign              uses a SLOW algorithm for the pairwise alignments
  
  Pairwise similarity scores (defaults for proteins):
  
  -KTUP=1                 K-Tuple size (2 for DNA)
                     increase for speed; decrease for greater sensitivity
  -GAPW=3                 gap penalty; number of exact matches to create a gap
                     (5 for DNA)
  -TOPDiags=5             number of diagonals to be considered; (4 for DNA)
                     decrease for speed; increase for sensitivity
  -WINdow=5               diagonal window size; (4 for DNA)
                     decrease for speed; increase for sensitivity
  -NOPERcent              suppresses percentage score; absolute scores are
                     recommended for sequences with little difference in
                     length
  
  Slow Pairwise Alignments:
  
  -PWMATRIX=blosum        BLOSUM, PAM or ID
                     comparison table for protein pairwise alignments
  -PW_GAPC=10.0           gap opening penalty;
  -PW_GAPV=0.1            gap extension penalty;
  
  Multiple alignment:
  
  -MATRIX=blosum          BLOSUM, PAM or ID
                     comparison table for protein multiple alignments
  -GAPC=10.0              gap penalty (fixed);
                     increase to prevent gaps; decrease to encourage them
  -GAPV=0.05              gap penalty (varying); (5.0 for DNA)
                     decrease to encourage LONGER gaps
  -UNWeighted             controls whether transitions are weighted twice as
                     much as transversions (only applies to DNA)
  -ENDGAPs                use end gap separation penalty
  -GAPDist=8              gap separation penalty range
  -NORGAP                 no residue specific gaps
  -RGAPRes="GPSNDQEKR"    list of hydrophilic residues
  -NOHGAP                 no hydrophilic gaps
  -MAXDiv=40              % ident. for delay
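
The abbreviation rule described at the start of this section (the capitalized letters are the ones you must type) can be sketched as a prefix check. This is an illustration only: it assumes that matching is case-insensitive and that any longer prefix of the full name is also accepted, and the function name is hypothetical.

```python
def matches_qualifier(typed, qualifier):
    """True if `typed` is an acceptable abbreviation of `qualifier`.

    `qualifier` is written as in the summary, e.g. "-DENdrogram", where the
    leading capitals are the letters that must be typed.
    """
    name = qualifier.lstrip("-")
    required = ""
    for ch in name:
        if ch.islower():
            break                 # the required part ends at the first lowercase letter
        required += ch
    typed_name = typed.lstrip("-").upper()
    return typed_name.startswith(required.upper()) and name.upper().startswith(typed_name)


print(matches_qualifier("-den", "-DENdrogram"))   # required letters present
print(matches_qualifier("-de", "-DENdrogram"))    # too short
```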

ACKNOWLEDGEMENT

For details about the original ClustAlW program package, including ClustAlW, ProfAlign and Clustree, see J. D. Thompson et al. (Nucleic Acids Research, 22 (22): 4673-4680 (1994)) and D. G. Higgins et al. (CABIOS 8 (2): 189-191 (1992)). For details about the overall multiple alignment algorithm see D. G. Higgins and P. M. Sharp (CABIOS 5: 151-153 (1989)).

EClustAlW is part of ClustalW, which was developed and written by Des Higgins, European Bioinformatics Institute, EMBL Outstation, Hinxton, UK. The program was added to the Package for HUSAR version 3.0 by Weiyun Chen and Karl-Heinz Glatting, DKFZ Heidelberg, Germany, and converted to EGCG by Peter Rice, Sanger Centre, Hinxton, UK.

LOCAL DATA FILES

For protein comparison, a weight matrix is used to weight aligned amino acids. The default is the BLOSUM series, but you can also use your own protein matrix as a local data file. By naming a file on the command line with an expression like -DATa1=pam250.clus or -DATa2=pam250.clus, the matrix in the file pam250.clus will be used for the pairwise alignment or for the multiple alignment respectively.

OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

-ONLY

This option allows you to calculate all the pairwise similarity scores and produce a dendrogram, without doing the final multiple alignment. The dendrogram will be sent to a file and can be used again at a later date (by specifying the option -DENdrogram).

-DENdrogram[=filename]

This option allows you to use a dendrogram file that was produced during an earlier multiple alignment (for example, globin.dnd). This is useful because some dendrograms are very time-consuming to produce. The format of the dendrogram is complicated; therefore you should only use a file produced by this program or one that was edited carefully. The number of sequences in the dendrogram file must be the same as the number of sequences in the current sequence data set. Every time you do a complete multiple alignment a dendrogram file is automatically produced.

-PROtein

insists that your sequences are protein sequences.

-DNA

insists that your sequences are DNA sequences.

-MSF[=seqname.msf]

writes an MSF output file which can be used as input for other GCG/HUSAR programs (e.g. ClusTree). See the Specifying Sequences section of the User's Guide for a complete description of the MSF files.

PAIRWISE ALIGNMENT PARAMETERS

-KTUP=1

The k-tuple size is the size of the exactly matching fragments which are used. The larger this is set (max = 2 for proteins; max = 4 for DNA), the faster but less accurate the alignment will be. For short sequences (e.g. 300 residues or fewer) or for small numbers of sequences (fewer than 20) a value of 1 will be fine; for longer sequences (especially DNA) larger values might be used.

-GAPW=3

The gap penalty parameter controls the frequency of gaps in the pairwise alignments. It will not have much effect on the scores. The higher the gap penalty, the less likely it is that gaps will occur. The penalty specifies the number of exactly matching residues that must be found in order to introduce a gap.

-TOPDiags=5

The number of top diagonals determines how many of the diagonals with the most matches in the imaginary dot-matrix plot are considered. The smaller the number of diagonals considered (it must be greater than zero), the faster but less sensitive the alignment will be.

-WINdow=5

After the diagonals with most matches are found, this window size parameter specifies a window around each top diagonal determining the number of diagonals that will be considered in the alignment. Decreasing the size of this parameter will speed up the alignments.

-NOPERcent

The similarity scores are calculated as percentage matches between 2 sequences (approximately), i.e. (score/shorter length) * 100; if you choose absolute scores then the scores are simply the number of matches (number of identical residues minus a "gap penalty" for each gap). Percentage scores are advisable if the lengths differ greatly. Absolute scores are otherwise better.
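
The two ways of expressing the fast pairwise score can be written out as follows. The function name and the example counts are hypothetical; the default gap penalty of 3 is the protein default of -GAPW.

```python
def fast_score(n_matches, n_gaps, len_a, len_b, gap_penalty=3, percent=True):
    """Similarity score from the fast pairwise alignment.

    Absolute score: identical residues minus a gap penalty for each gap.
    Percentage score: (absolute score / shorter sequence length) * 100.
    """
    absolute = n_matches - gap_penalty * n_gaps
    if not percent:
        return float(absolute)
    return 100.0 * absolute / min(len_a, len_b)


# 56 identical residues and 2 gaps between sequences of length 100 and 120.
print(fast_score(56, 2, 100, 120))                  # percentage of the shorter length
print(fast_score(56, 2, 100, 120, percent=False))   # absolute score
```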

-SLOWalign

This parameter determines the speed of the initial pairwise alignments. By setting the parameter -SLOW, the initial pairwise alignments will be carried out using a full dynamic programming algorithm. Even though this method is more accurate than the older hash/k-tuple based alignments (Wilbur and Lipman), the alignments will be performed much more slowly. By default the older and faster method is used.

SLOW PAIRWISE ALIGNMENT PARAMETERS

-PW_GAPC=10

the penalty for opening a gap in the alignment (gap opening penalty).

-PW_GAPV=0.1

the penalty for extending a gap by 1 residue (gap extension penalty: 5.0 for DNA).

-PWMATRIX=blosum

For protein comparisons, a weight matrix is used to differentially weight different pairs of aligned amino acids. The BLOSUM matrix used by default can be replaced by a PAM matrix, an ID (identity) matrix or a user-defined matrix in a local data file (see the parameter -DATa1=pam250.clus in the LOCAL DATA FILES section).

CHANGING THE PAIRWISE ALIGNMENT PARAMETERS

The main reason for changing the above parameters is to affect speed, not sensitivity. The dendrograms that are produced can only show the relationships between the sequences approximately because the similarity scores are calculated from separate pairwise alignments, not from a multiple alignment (that is what you eventually hope to produce). If the groupings of the sequences are "obvious", the above method should work well. If the relationships are obscure or weakly represented by the data, it will not make much difference playing with the parameters. The main factor influencing speed is the k-tuple size followed by the window size.

MULTIPLE ALIGNMENT PARAMETERS

-GAPC=10

The gap penalty (Fixed) parameter is a penalty for every gap that is introduced, regardless of the length of the gap. Therefore, decreasing this parameter will encourage gaps of all sizes. BEWARE: if you choose a penalty which is too small (approx. 5 or so), then the program may prefer to align each sequence opposite a long gap.

-GAPV=0.05

The gap penalty (Varying) parameter is a penalty for each item in each gap. Therefore, this is a penalty for longer gaps. Increase this and gaps will get shorter. BEWARE: if you choose a penalty which is too small, then the program may prefer to align each sequence opposite a long gap. The default for DNA is 5.0.

For a more detailed description of gap penalties see the ALGORITHM section in the program description of Gap.

-UNWeighted

If transitions are unweighted, then all nucleic acid mismatches have the same weight (all pairs of nucleotides are equally weighted). If transitions (C vs T; A vs G) are weighted more strongly than transversions (an A aligned with a G will be preferred to an A aligned with a C or a T), then transitions have an intermediate score between exact matches and other mismatches. The default is weighted transitions.
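
The effect of -UNWeighted on a single aligned pair of bases can be sketched as follows. The score values 1.0, 0.5 and 0.0 are illustrative assumptions (the manual only says that transitions score between exact matches and other mismatches), and the function name is hypothetical.

```python
PURINES = set("AG")
PYRIMIDINES = set("CT")


def dna_pair_score(a, b, weighted=True, match=1.0, transition=0.5, mismatch=0.0):
    """Score one aligned pair of bases; U is treated as T."""
    a = a.upper().replace("U", "T")
    b = b.upper().replace("U", "T")
    if a == b:
        return match
    # A transition stays within the purines (A, G) or within the pyrimidines (C, T).
    is_transition = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    if weighted and is_transition:
        return transition
    return mismatch


print(dna_pair_score("A", "G"))                   # transition, weighted
print(dna_pair_score("A", "G", weighted=False))   # treated like any other mismatch
```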

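-MATRIX=blosum

selects the weight matrix used for protein multiple alignments: BLOSUM (the default), PAM or ID (identity). A user-defined matrix can be supplied in a local data file (see the parameter -DATa2=pam250.clus in the LOCAL DATA FILES section).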

-ENDGAPs

End gap separation treats end gaps just like internal gaps for the purpose of avoiding gaps that are too close. If you turn this parameter off, end gaps will be ignored. This is useful when you wish to align fragments where the end gaps are not biologically meaningful.

-MAXDiv=40

The user can set a cut-off to delay the alignment of the most divergent sequences in a data set until all other sequences have been aligned. By default, this is set to 40% which means that if a sequence is less than 40% identical to any other sequence, its alignment will be delayed.

-GAPDist=8

This parameter defines the gap separation penalty range. Gap separation distance tries to decrease the chances of gaps being too close to each other. Gaps that are less than this distance apart are penalized more than other gaps. This does not prevent close gaps, it makes them less frequent, promoting a block-like appearance of the alignment.

-NORGAP

By setting this parameter, residue-specific gap penalties are switched off. Residue-specific penalties are amino-acid-specific gap penalties that reduce or increase the gap opening penalty at each position in the alignment or sequence.

-RGAPRes="GPSNDQEKR"

This parameter defines the list of hydrophilic residues. Hydrophilic gap penalties are used to increase the chances of a gap within a run (5 or more residues) of hydrophilic amino acids; these are likely to be loop or random coil regions where gaps are more common.
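
Finding such runs in a single sequence can be sketched as follows. The function name is hypothetical; the residue list is the default given above, and a run is taken to mean 5 or more consecutive residues from that list.

```python
def hydrophilic_runs(sequence, residues="GPSNDQEKR", min_run=5):
    """Return (start, end) pairs, end exclusive, of runs of hydrophilic residues."""
    runs = []
    start = None
    for i, ch in enumerate(sequence + "\0"):   # the sentinel closes a trailing run
        if ch in residues:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i))
            start = None
    return runs


print(hydrophilic_runs("AAGPSNDAAKR"))   # GPSND qualifies; KR is too short
```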

-NOHGAP

By setting this parameter, hydrophilic gap penalties are switched off.

REFERENCES

Wilbur W.J. and Lipman D.J. (1983). "Rapid Similarity Searches of Nucleic Acid and Protein Data Banks." Proc. Natl. Acad. Sci. U.S.A. 80, 726-730.

Myers E.W. and Miller W. (1988). "Optimal alignments in linear space." Comput. Appl. Biosci. 4, 11-17.

Thompson J.D. (1995). "Introducing variable gap penalties to sequence alignment in linear space." Comput. Appl. Biosci. 11, 181-186.

Saitou N. and Nei M. (1987). "The neighbor-joining method: a new method for reconstructing phylogenetic trees." Mol. Biol. Evol. 4, 406-425.

Henikoff S. and Henikoff J.G. (1992). "Amino acid substitution matrices from protein blocks." Proc. Natl. Acad. Sci. U.S.A. 89, 10915-10919.

Thompson J.D., Higgins D.G. and Gibson T.J. (1994). "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice." Nucleic Acids Res. 22, 4673-4680.

Higgins D.G., Bleasby A.J. and Fuchs R. (1992). "CLUSTAL V: improved software for multiple sequence alignment." Comput. Appl. Biosci. 8, 189-191.

Higgins D.G. and Sharp P.M. (1989). "Fast and sensitive multiple sequence alignments on a microcomputer." Comput. Appl. Biosci. 5, 151-153.

Dayhoff M.O., Schwartz R.M. and Orcutt B.C. (1978). In Atlas of Protein Sequence and Structure (Dayhoff M.O., ed.), Vol. 5, suppl. 3, 345-352. NBRF, Washington.

Printed: April 22, 1996 15:52 (1162)