Multalign

MULTALIGN


FUNCTION

MultAlign does a simultaneous alignment for two or more DNA or protein sequences. It introduces a certain number of gaps into either pairwise aligned sequences or groups of sequences to find a minimal global distance. The user can influence the result by defining the order in which the sequences will be aligned. The program is based on a generalization of the algorithm of Waterman, Smith and Beyer by Krueger and Osterburg.


DESCRIPTION

The program is based on a generalization of the algorithm of M.S. Waterman, T.F. Smith and W.A. Beyer (Adv. Math. Vol. 20, pp. 367-387 (1976)) by M. Krueger and G. Osterburg (Comp. Prog. in Biomed. Vol. 16, pp. 68-69 (1983)). The necessary changes (insertions) are counted with a penalty factor (defined below), and their contribution is subtracted from the total score of the matching residues. The resulting optimal alignment is printed with equal letters marked by an asterisk and similarities (purines or pyrimidines for DNA, physicochemically related amino acid groups for proteins) marked by different signs.
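The asterisk marking described above can be illustrated with a minimal Python sketch. It is not MultAlign's own code: the function name identity_line is hypothetical, only identical symbols are marked, and the similarity signs and consensus digits that the program also prints are left out.

```python
def identity_line(seq_a, seq_b, gap="-"):
    """Return a marker line with '*' under identical, non-gap symbols.

    Hypothetical helper for illustration only; similarity groups are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return "".join(
        "*" if a == b and a != gap else " "
        for a, b in zip(seq_a, seq_b)
    )


# First 12 aligned residues of HBBHUM and HBGHUM from the OUTPUT section.
print(repr(identity_line("VHLTPEEKSAVT", "GHFTEEDKATIT")))
```

The marker string produced for these twelve residues has the same pattern as the start of the asterisk line under the 2 3 alignment in the sample output.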


AUTHOR

This program was written by Weiyun Chen and Karl-Heinz Glatting at the German Cancer Research Centre (DKFZ), Heidelberg, Germany.

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).


EXAMPLE

  
  % multalign
  
   MultAlign on what sequence(s) ?  @globin.fil
  
   hbahum.pep >>> Length of sequence 1: 141 symbols <<<
   hbbhum.pep >>> Length of sequence 2: 146 symbols <<<
   hbghum.pep >>> Length of sequence 3: 146 symbols <<<
   hbhagf.pep >>> Length of sequence 4: 148 symbols <<<
   hbrlam.pep >>> Length of sequence 5: 149 symbols <<<
   mycrhi.pep >>> Length of sequence 6: 151 symbols <<<
   myohum.pep >>> Length of sequence 7: 153 symbols <<<
  
   Would you like to:
  
   A)dd more sequences
   Q)uit and compute alignment
  
   Please choose one (* Q *):
  
   What should I call the output file (* globin.mult *) ?
   What value for DIND  (* 80 *)  ?
   What is the gap weight  (* 5.0 *) ? 1
   What is the gap length weight  (* 1.0 *) ?
   Enter tree to define grouping of sequences
   ( * 7 6 5 4 3 2 1  *)
  1(2 3)7(4 5)6
  
   USED SEQUENCES:
  
    1 HBAHUMPEP  HBAHUM     HEMOGLOBIN ALPHA CHAIN, HUMAN
    2 HBBHUMPEP  HBBHUM     HEMOGLOBIN BETA CHAIN, HUMAN
    3 HBGHUMPEP  HBGHUM     HEMOGLOBIN GAMMA CHAIN, HUMAN
    4 HBHAGFPEP  HBHAGF     HEMOGLOBIN, HAGFISH (MYXINE GLUTINOSA)
    5 HBRLAMPEP  HBRLAM     HEMOGLOBIN, RIVER  LAMPREY (LAMPETRA FLUVIATILIS)
    6 MYCRHIPEP  MYCRHI     MYOGLOBIN, GASTROPOD, CERITHIDEA RHIZOPHORARUM
    7 MYOHUMPEP  MYOHUM     MYOGLOBIN, HUMAN
  
  %
  


OUTPUT

The result looks like:

  
  
                    MultAlign  September 14, 1990 16:51
  
  of: @globin.fil
   hbahum.pep                   ck: 9231  from:     1  to:   141  Length:  141
   hbbhum.pep                   ck: 1242  from:     1  to:   146  Length:  146
   hbghum.pep                   ck: 3104  from:     1  to:   146  Length:  146
   hbhagf.pep                   ck: 4827  from:     1  to:   148  Length:  148
   hbrlam.pep                   ck: 7737  from:     1  to:   149  Length:  149
   mycrhi.pep                   ck:  918  from:     1  to:   151  Length:  151
   myohum.pep                   ck: 4188  from:     1  to:   153  Length:  153
  
   PARAMETER SET:
  
   DIND:            80
   Percent:         50
   Gap Weight:     1.0
   Length Weight:  1.0
   Limit1:          20
   Limit2:          20
  
   Symbol comparison table: gendatabase:multpep.cmp
   Consensus group file: gendatabase:multpep.grp
  
   Specified grouping:
  
   S Nucl./A.A.                  Description
   1STPAG                        HYDROXYL / SMALL ALIPHATIC
   2STPAGNDEQ                    HYDROPHILIC
   3NDEQ                         ACID / ACID AMIDE
   4HRK                          BASIC
   5EDHKR                        CHARGED
   6AMILV                        ALIPHATIC
   7FYW                          AROMATIC
   8ALIVMFYW                     HYDROPHOBIC
    C                            CYSTEINE
  
   USED SEQUENCES:
  
    1 HBAHUMPEP  HBAHUM     HEMOGLOBIN ALPHA CHAIN, HUMAN
    2 HBBHUMPEP  HBBHUM     HEMOGLOBIN BETA CHAIN, HUMAN
    3 HBGHUMPEP  HBGHUM     HEMOGLOBIN GAMMA CHAIN, HUMAN
    4 HBHAGFPEP  HBHAGF     HEMOGLOBIN, HAGFISH (MYXINE GLUTINOSA)
    5 HBRLAMPEP  HBRLAM     HEMOGLOBIN, RIVER  LAMPREY (LAMPETRA FLUVIATILIS)
    6 MYCRHIPEP  MYCRHI     MYOGLOBIN, GASTROPOD, CERITHIDEA RHIZOPHORARUM
    7 MYOHUMPEP  MYOHUM     MYOGLOBIN, HUMAN
  
   USED TREE:
  
  1(2 3)7(4 5)6
  
   alignment of sequences:
  2 3
  
          10        20        30        40        50        60
           .         .         .         .         .         .
1 VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
   HBBHUMPEP
1 GHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKV
   HBGHUMPEP
   * * * *   * *******   *** *************** *** **   * ******
   H8T2E2K116T1LWGKVNV226GGE1LGRLLVVYPWTQRFF2SFG2LS112A6MGNPKV
   CONSENSUS
  
         .         .         .         .         .         .
    61 KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
   HBBHUMPEP
    61 KAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGK
   HBGHUMPEP
  ********    *   *** ****** **************** ******* *** ****
  KAHGKKVL1181D16 HLD2LKGTFA2LSELHCDKLHVDPENF4LLGNVLV VLA HFGK
   CONSENSUS
  
           .         .         .         .         .         .
   121 EFTPPVQAAYQKVVAGVANALAHKYH                                   HBBHUMPEP
   121 EFTPEVQASWQKMVTGVASALSSRYH                                   HBGHUMPEP
  **** ***  ** * *** **   **
  EFTP2VQA17QK6V1GVA2AL1 4YH                                   CONSENSUS
  
   alignment of sequences:
  4 5
  
          10        20        30        40        50        60
           .         .         .         .         .         .
1 PITDHGQPPTLSEGDKKAIRESWPQIYKNFEQNSLAVLLEFLKKFPKAQDSFPKFSAKKS
   HBHAGFPEP
1 PIVDSGSVAPLSAAEKTKIRSAWAPVYSNYETSGVDILVKFFTSTPAAQEFFPKFKGMTS
   HBRLAMPEP
  ** * *    **   *  **  *   * * *      *  *    * **  ****    *
  PI D G2 11LS212K  IR21W126Y N7E221626L65F8   P AQ2 FPKF 1  S CONSENSUS
  
           .         .         .         .         .         .
    61 --HLEQDPAVKLQAEVIINAVNHTIGLMDKEAAMKKYLKDLSTKHSTEFQVNPDMFKELS
   HBHAGFPEP
    61 ADQLKKSADVRWHAERIINAVNDAVASMDDTEKMSMKLRDLSGKHAKSFQVDPQYFKVLA
   HBRLAMPEP
     *     *   ** ******     **    *   * *** **   *** *  ** *
  12 L5 212V48 AE IINAVN5161 MD522 M   L4DLS1KH1 2FQV2P28FK L1 CONSENSUS
  
           .         .         .         .         .         .
   119 AVFVSTMG-GKAAYEKLFSIIATLLRSTYDA                              HBHAGFPEP
   121 AVIADTVAAGDAGFEKLMSMICILLRSA--Y                              HBRLAMPEP
  **   *   * *  *** * *  ****
  AV862T611G5A17EKL8S6I  LLRS1728111                           CONSENSUS
  
   alignment of sequences:
  1(2 3)7(4 5)6
  
          10        20        30        40        50        60
           .         .         .         .         .         .
1 ---------VLSPADKTNVKAAW---GKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-D
   HBAHUMPEP
1 --------VHLTPEEKSAVTALW---GKVNVDEV--GGEALGRLLVVYPWTQRFFESFGD
   HBBHUMPEP
1 --------GHFTEEDKATITSLW---GKVNVEDA--GGETLGRLLVVYPWTQRFFDSFGN
   HBGHUMPEP
1 PITDHGQPPTLSEGDKKAIRESW---PQIYKNFEQNSLAVLLEFLKKFPKAQDSFPKFSA
   HBHAGFPEP
1 PIVDSGSVAPLSAAEKTKIRSAW---APVYSNYETSGVDILVKFFTSTPAAQEFFPKFKG
   HBRLAMPEP
1 ---------SLQPASKSALASSWKTLAKDAATIQNNGATLFSLLFKQFPDTRNYFTHFGN
   MYCRHIPEP
1 --------GL-SDGEWQLVLNVW---GKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKH
   MYOHUMPEP
            ..   .      *   ...       . . . ...   * ..  *  *
  16 2 12   LS222K  6 2 W416GKV     2 G E L RLF   P TQ  F2 F   CONSENSUS
  
           .         .         .         .         .         .
    48 LSH-----GSAQVKGHGKKVADALTNAVAHVDD---MPNALSALSDLHAHKLRVDPVNFK
   HBAHUMPEP
    48 LSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDN---LKGTFATLSELHCDKLHVDPENFR
   HBBHUMPEP
    48 LSSASAIMGNPKVKAHGKKVLTSLGDAIKHLDD---LKGTFAQLSELHCDKLHVDPENFK
   HBGHUMPEP
    58 KKS--HLEQDPAVKLQAEVIINAVNHTIGLMDKEAAMKKYLKDLSTKHSTEFQVNPDMFK
   HBHAGFPEP
    58 MTSADQLKKSADVRWHAERIINAVNDAVASMDDTEKMSMKLRDLSGKHAKSFQVDPQYFK
   HBRLAMPEP
    52 M-SDAEMKTTGVGKAHSMAVFAGIGSMIDSMDDADCMNGLALKLSRNHIQR-KIGASRFG
   MYCRHIPEP
    49 LKSEDEMKASEDLKKHGATVLTALGGILKKKGH---HEAEIKPLAQSHATKHKIPVKYLE
   MYOHUMPEP
  . .         .. ..  .  .        ..   .      *.  *  .  ...  ..
  L S22 6  22 VK HG  V82A82  6   DD22 M   8  LS  H  K  VDP  FK CONSENSUS
  
           .         .         .         .         .         .
   100 LLSHCLLVTLAAHLPA--EFTPAVHASLDKFLASVSTVLTSKYR------           HBAHUMPEP
   105 LLGNVLVCVLAHHFGK--EFTPPVQAAYQKVVAGVANALAHKYH------           HBBHUMPEP
   105 LLGNVLVTVLAIHFGK--EFTPEVQASWQKMVTGVASALSSRYH------           HBGHUMPEP
   116 ELSAVFVSTMG-GKAAYEKLFSIIATLLRSTYDA----------------           HBHAGFPEP
   118 VLAAVIADTVAAGDAGFEKLMSMICILLRSA--Y----------------           HBRLAMPEP
   110 EMRQVFPNFLDEALGGGAS--GDVKGAWDALLAYLQDNKQAQAL------           MYCRHIPEP
   106 FISECIIQVLQSKHPG--DFGADAQGAMNKALELFRKDMASNYKELGFQG
        MYOHUMPEP
   .  .    ..        .   .     .            .
   L  V8   LA   1  2 F 1 V   8 K 82 8    2  Y 261721           CONSENSUS
  
  
   Alignment of 7 different sequences using 0.13 minutes of CPU time
  
   Output file: globin.mult.
  


MULTIPLE SEQUENCE FILES

Using the -msf option, Multalign writes the alignment into a multiple sequence format (MSF) file that interleaves the sequences to show their alignment. Any or all of the sequences in this file can be used by any other GCG sequence analysis program. For instance, you could generate a profile from the sequences in an MSF file with a command like % profilemake multalign.msf{*} and then use that profile to search the database for sequences similar to the sequences in the alignment. (See the Specifying Sequences section of the User's Guide for help specifying sequences in MSF files.)


RELATED PROGRAMS

EClustAlW calculates a multiple alignment of nucleic acid or protein sequences according to the method of Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). This is part of the original ClustalW distribution, modified for inclusion in EGCG.

Tree produces a multiple alignment for a set of protein sequences by iteratively acting on the sequences. An approximate phylogenetic order of the sequences is first determined by a series of pairwise alignments using the Needleman and Wunsch method. Any subclusters that may exist in the set are prealigned before the final alignment is undertaken. Finally, the phylogenetic tree of the sequences is plotted in the form of a dendrogram.

PileUp creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment.

LineUp is a screen editor for editing multiple sequence alignments. You can edit up to 30 sequences simultaneously. New sequences can be typed in by hand or added from existing sequence files. A consensus sequence identifies places where the sequences are in conflict.

Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns. Motifs can display an abstract of the current literature on each of the motifs it finds.

ProfileGap makes an optimal alignment between a profile and a sequence.

Pretty displays multiple sequence alignments and calculates a consensus sequence. It does not create the alignment; it simply displays it.

PrettyPlot displays multiple sequence alignments and calculates a consensus sequence. It does not create the alignment; it simply displays it.

PrettyBox displays multiple sequence alignments as shaded boxes in Postscript format (i.e., the output file must be printed and/or displayed on a Postscript-compatible device). PrettyBox will optionally calculate a consensus sequence. The program does not create the alignment; it simply displays it.

PlotAlign takes a GCG format sequence alignment, and plots the mean and range of values for any amino acid parameter you supply. The "panel file" contains a list of parameters to be plotted. The main database of parameters is taken from Nakai et al. (1988), and the default panel file uses selected parameters from the 13 discrete clusters in that paper. This program is experimental. Any suggestions would be most welcome.


RESTRICTIONS

Multalign can (theoretically) align up to 1,000 sequences of up to 2,500 symbols each. As gaps are inserted, the length of the final alignment grows, but no sequence may exceed 2,500 characters in the final alignment. This means that the maximum sequence length is 2,500 - X, where X is the number of gaps introduced into that sequence by Multalign.
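The length arithmetic above can be checked with a small Python sketch. The names MAX_ALIGNED_LENGTH and fits_in_alignment are hypothetical and chosen for illustration; only the limit of 2,500 comes from this section.

```python
MAX_ALIGNED_LENGTH = 2500  # limit stated in this section


def fits_in_alignment(sequence_length, gaps_inserted):
    """True if a sequence plus its inserted gaps stays within the limit."""
    return sequence_length + gaps_inserted <= MAX_ALIGNED_LENGTH


# A 2,490-symbol sequence can take at most 10 gaps.
print(fits_in_alignment(2490, 10))
print(fits_in_alignment(2490, 11))
```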


ALGORITHM

MultAlign is based on a generalisation of the algorithm of M.S. Waterman, T.F. Smith and W.A. Beyer (Adv. Math. Vol. 20, pp. 367-387 (1976)) by M. Krueger and G. Osterburg (Comp. Prog. in Biomed. Vol. 16, pp. 68-69 (1983)).


ALIGNMENT ORDER

Internally, MultAlign aligns the sequences sequentially. The program starts with two sequences, takes the resulting alignment and aligns it with another sequence or alignment, and so on. Therefore, the result of MultAlign depends strongly on the order in which the sequences are compared. You can specify a tree to define that order. Let's look at an example: we want to align seven sequences (HBAHUM, HBBHUM, HBGHUM, HBHAGF, HBRLAM, MYCRHI, MYOHUM) and we specify the following tree: 1 (2 3) 7 (4 5) 6

  
  
  
                    hbghum(3)      hbbhum(2)
                            \    /     hbahum(1)
                             \  /     /
                              \/     /
                               \    /     myohum(7)
                                \  /     /
                                 \/     /
                                  \    /     hbhagf(4)
                                   \  /     |
                                    \/      |
                                     \      |___ hbrlam(5)
                                      \    /
                                       \  /       mycrhi(6)
                                        \/       /
                                         \      /
                                          \    /
                                           \  /
                                            \/
  
  
  
This means that the second and third sequences entered (HBBHUM and HBGHUM) will be aligned first. In the next step the resulting alignment will be aligned with sequence number 1 (HBAHUM), followed by sequence number 7 (MYOHUM). Independently, sequences 4 and 5 will be compared, and finally the overall alignment will be computed, including sequence number 6 (MYCRHI). Generally, closely related sequences should be aligned first, while distantly related sequences (or sequence groups) should be compared in later steps. The tree can be entered interactively, given with the -TREE parameter (note that you have to replace spaces with underscores "_"), or stored in a local data file and specified with the parameter -DATa3 (see the LOCAL DATA FILES topic below).
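The reading of the tree notation given above can be sketched in Python. This is an illustration under stated assumptions, not MultAlign's own parser: the function names parse_tree, merge_steps and to_command_line are hypothetical, and the left-to-right merging within each level is inferred from the prose description of the example tree.

```python
def parse_tree(text):
    """Parse a tree such as "1(2 3)7(4 5)6" into nested lists of ints."""
    pos = 0

    def parse_group(inside_parens):
        nonlocal pos
        items = []
        while pos < len(text):
            ch = text[pos]
            if ch in " _":  # separators: space, or underscore as used with -TREE
                pos += 1
            elif ch == "(":
                pos += 1
                items.append(parse_group(True))
            elif ch == ")":
                if not inside_parens:
                    raise ValueError("unbalanced ')'")
                pos += 1
                return items
            elif ch.isdigit():
                start = pos
                while pos < len(text) and text[pos].isdigit():
                    pos += 1
                items.append(int(text[start:pos]))
            else:
                raise ValueError(f"unexpected character {ch!r}")
        if inside_parens:
            raise ValueError("missing ')'")
        return items

    return parse_group(False)


def merge_steps(node, steps):
    """Collect pairwise merges, assuming left-to-right merging per level."""
    if isinstance(node, int):
        return [node]
    if not node:
        raise ValueError("empty group")
    current = merge_steps(node[0], steps)
    for child in node[1:]:
        other = merge_steps(child, steps)  # parenthesised groups align first
        steps.append((tuple(current), tuple(other)))
        current = current + other
    return current


def to_command_line(tree):
    """Replace spaces by underscores, as required for the -TREE parameter."""
    return "_".join(tree.split())


steps = []
merge_steps(parse_tree("1(2 3)7(4 5)6"), steps)
for left, right in steps:
    print(left, "with", right)
print(to_command_line("1 (2 3) 7 (4 5) 6"))
```

Under these assumptions the first step pairs sequences 2 and 3, the pair 4 and 5 is aligned on its own before it joins the growing alignment, and sequence 6 is added last, which matches the order described in the text.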


CONSIDERATIONS

There are several important parameters which profoundly influence the resulting alignments. Though the program is installed with a default value for each parameter, they should be carefully checked to avoid meaningless results.


COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

  
  
  Minimal Syntax: % multalign [-INfile=]@globin.fil  -Default
  
  Prompted Parameters:
  
  [-OUTfile=]globin.mult  output file name
  -BEGin1=1 -END1=100      range of interest for sequence 1
  -REV1    -NOREV3         strand of each sequence
  -TREE="((1_2)_3)"        order in which to align the sequences
  -DIND=80                 weighting for longer insertions
  -GAPweight=5.0           gap weight
  -LENgthweight=1.0        gap length weight
  -BATch[=long]            run MultAlign in specified batch-queue
  
  
  Local Data Files:
  
  -DATa1=multdna.cmp   distance matrix for DNA sequences
  -DATa2=multdna.grp   grouping of sequence symbols to be used for the
                  (DNA) consensus sequence
  -DATa3=tree.mult     contains tree (alternative to -TREE)
  
  Optional Parameters:
  
  -LIMit1=20               sum of all gaps in sequence 2 is restricted
                      so that sequence 2 does not come out of
                      phase with sequence 1 for more than
                      20 elements
  -LIMit2=20               sum of all gaps in sequence 1 is restricted
                      so that sequence 1 does not come out of
                      phase with sequence 2 for more than
                      20 elements
  
  All limits are initial values. The program itself checks for
  every sequence whether they are sufficient to compute a
  complete alignment. If not, values are changed to the lowest
  ones that allow computation.
  
  -PERCent=50              defines how many percent of identities are
                      necessary to obtain a letter in the consensus.
  -ENDWeight               weights end gaps like other gaps
  -WIDTH=60                number of bases per line
  -NOCONSensus             suppresses consensus
  -PROtein                 insists that your sequences are protein
  -DNA                     insists that your sequences are DNA
  -MSF[=globin.msf]        writes an MSF file
  


ACKNOWLEDGEMENTS

MultAlign is based on the ALIGNSTAT program of Michael Krueger. It has been implemented and adapted to HUSAR by Ulrike Goebel and Karl-Heinz Glatting, DKFZ Heidelberg.


LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

MultAlign uses a symbol distance table found in either multdna.cmp or multpep.cmp to find the distance value between each pair of symbols. The public version of multdna.cmp is a unitary matrix for nucleic acids; the public version of multpep.cmp is a distance matrix according to Dayhoff (a modified LOM-matrix). You can choose your own distance matrix via the -DATa1 parameter. Each distance table has a corresponding "group" file, in which the possible consensus sequence symbols are stored. The standard group files are multdna.grp and multpep.grp for nucleotides and amino acids, respectively. Other group files may be chosen via the -DATa2 parameter. You can fetch a unitary distance matrix for amino acids named multpepuni.cmp as well as a corresponding group file named multpepuni.grp. Furthermore, you can store the tree defining the order in which to align the sequences in a local data file and specify it with -DATa3. Fetch the file globintree.mult as an example.


REQUIRED PARAMETERS

-GAPweight=5.0 and -LENgthweight=1.0

These variables define the function for weighting insertions of length k with the penalty factor

Weight(k) = GapWeight + k * LengthWeight

Please note that -GAPweight should never be less than the minimum distance between two different letters. There is no simple rule for selecting these variables. Although in most cases the default values lead to alignments that are acceptable as a first approximation, both parameters have to be optimized patiently to find the optimum or to handle more difficult situations (defaults: -GAPweight=5.0 and -LENgthweight=1.0). For a more detailed description of gap penalties see the ALGORITHM topic in the program description of Gap.
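As a worked example of the penalty function Weight(k) = GapWeight + k * LengthWeight, here is a minimal Python sketch. The function name insertion_weight and its argument names are hypothetical; the default values are the documented defaults of -GAPweight and -LENgthweight. The additional effect of -DIND on longer insertions is not modelled.

```python
def insertion_weight(k, gap=5.0, length=1.0):
    """Penalty for an insertion of k symbols: gap + k * length."""
    if k < 1:
        raise ValueError("insertion length must be at least 1")
    return gap + k * length


# Default parameters: a 3-symbol insertion costs 5.0 + 3 * 1.0.
print(insertion_weight(3))
# Parameters of the EXAMPLE session (gap weight 1, length weight 1.0).
print(insertion_weight(3, gap=1.0, length=1.0))
```

With the defaults a single long insertion is cheaper than several short ones of the same total length, because the gap weight is charged once per insertion.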

-DIND=80

This variable affects the weighting for longer insertions: an insertion of length k is weighted as k/DIND insertions of length DIND. To get optimal alignments with longer insertions, reasonable values for DIND are 80 to 100. Otherwise, small values also lead to reasonable results (default: -DIND=80).


OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

-LIMit1=20 and -LIMit2=20

The sum of all gaps in the first/second (group of) sequence(s) is restricted so that the second/first (group of) sequence(s) does not come out of phase for more than 20 symbols.

BOTH LIMITS ARE INITIAL VALUES. THE PROGRAM CHECKS FOR EACH (GROUP OF) SEQUENCE(S) WHETHER THEY ARE SUFFICIENT TO COMPUTE A COMPLETE ALIGNMENT. IF NOT, THE VALUES ARE CHANGED TO THE LOWEST ONES THAT ALLOW COMPUTATION !

Both parameters are set to 20 by default. Usually, you will only have to change them if you want to align sequences of nearly equal length but with homologies within different regions.

-PERCent=50

defines the percentage of identities necessary to obtain a letter in the consensus sequence. For example, if there are three sequences, only two of the three symbols at a position are identical, and -PERCent is set to 70, the consensus sequence will contain a space at this position (default value: 50).
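The threshold rule can be sketched in Python as follows. This is an illustration, not the program's implementation: the function names are hypothetical, the comparison is assumed to be "greater than or equal", gap symbols are assumed never to appear in the consensus, and the group digits taken from the consensus group file are ignored.

```python
from collections import Counter


def consensus_symbol(column, percent=50, gap="-"):
    """Most frequent non-gap symbol if it reaches the threshold, else a space."""
    counts = Counter(symbol for symbol in column if symbol != gap)
    if not counts:
        return " "
    symbol, count = counts.most_common(1)[0]
    share = 100.0 * count / len(column)  # percentage over all sequences
    return symbol if share >= percent else " "


def consensus(aligned_sequences, percent=50):
    """Build a consensus line column by column from equal-length sequences."""
    return "".join(
        consensus_symbol(column, percent) for column in zip(*aligned_sequences)
    )


# Two of three symbols identical: about 66.7 percent identity.
print(repr(consensus_symbol("AAG", percent=70)))
print(repr(consensus_symbol("AAG", percent=50)))
```

With -PERCent=70 the two-out-of-three column falls below the threshold and gives a space, as in the example above; with the default of 50 the same column gives the letter.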

-ENDweight

causes the end gaps to be weighted in the same way as all other gaps.

-WIDTH=60

puts 60 symbols on each line in the output file. You can set the width to anything from 10 to 150 symbols.

-NOCONSensus

suppresses the display of the consensus sequence.

-PROtein

insists that your sequences are protein sequences.

-DNA

insists that your sequences are nucleotide sequences.

-MSF[=seqname.msf]

writes an MSF output file which can be used as input for other programs (e.g. ELineUp). See the Specifying Sequences section of the GCG User's Guide for a complete description of MSF files.


REFERENCES

Waterman M.S., Smith T.F., Beyer W.A. (1976) "Some biological sequence metrics." Adv. Math. 20, 367-387.

Krueger M. and Osterburg G. (1983) Comp. Prog. in Biomed. 16, 68-69.

Printed: April 22, 1996 15:54 (1162)