MultAlign does a simultaneous alignment for two or more DNA or protein sequences. It introduces a certain number of gaps into either pairwise aligned sequences or groups of sequences to find a minimal global distance. The user can influence the result by defining the order in which the sequences will be aligned. The program is based on a generalization of the algorithm of Waterman, Smith and Beyer by Krueger and Osterburg.
The program is based on a generalization of the algorithm of M.S. Waterman, T.F. Smith and W.A. Beyer (Adv. Math. Vol. 20, pp. 367-387 (1976)) by M. Krueger and G. Osterburg (Comp. Prog. in Biomed. Vol. 16, pp. 68-69 (1983)). The necessary changes (insertions) are counted with a penalty factor (defined below), and their contribution is subtracted from the total score of the matching residues. The resulting optimal alignment is printed with equal letters marked by an asterisk and similarities (purines or pyrimidines for DNA, physicochemically related amino acid groups for proteins) marked by different signs.
This program was written by Weiyun Chen and Karl-Heinz Glatting at the German Cancer Research Centre (DKFZ), Heidelberg, Germany.
All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).
% multalign
MultAlign on what sequence(s) ? @globin.fil
hbahum.pep
>>> Length of sequence 1: 141 symbols <<<
hbbhum.pep
>>> Length of sequence 2: 146 symbols <<<
hbghum.pep
>>> Length of sequence 3: 146 symbols <<<
hbhagf.pep
>>> Length of sequence 4: 148 symbols <<<
hbrlam.pep
>>> Length of sequence 5: 149 symbols <<<
mycrhi.pep
>>> Length of sequence 6: 151 symbols <<<
myohum.pep
>>> Length of sequence 7: 153 symbols <<<
Would you like to:
A)dd more sequences
Q)uit and compute alignment
Please choose one (* Q *):
What should I call the output file (* globin.mult *) ?
What value for DIND (* 80 *) ?
What is the gap weight (* 5.0 *) ? 1
What is the gap length weight (* 1.0 *) ?
Enter tree to define grouping of sequences
( * 7 6 5 4 3 2 1 *) 1(2 3)7(4 5)6
USED SEQUENCES:
1 HBAHUMPEP HBAHUM HEMOGLOBIN ALPHA CHAIN, HUMAN
2 HBBHUMPEP HBBHUM HEMOGLOBIN BETA CHAIN, HUMAN
3 HBGHUMPEP HBGHUM HEMOGLOBIN GAMMA CHAIN, HUMAN
4 HBHAGFPEP HBHAGF HEMOGLOBIN, HAGFISH (MYXINE GLUTINOSA)
5 HBRLAMPEP HBRLAM HEMOGLOBIN, RIVER LAMPREY (LAMPETRA FLUVIATILIS)
6 MYCRHIPEP MYCRHI MYOGLOBIN, GASTROPOD, CERITHIDEA RHIZOPHORARUM
7 MYOHUMPEP MYOHUM MYOGLOBIN, HUMAN
%
The result looks like:
MultAlign September 14, 1990 16:51 of: @globin.fil
hbahum.pep ck: 9231 from: 1 to: 141 Length: 141
hbbhum.pep ck: 1242 from: 1 to: 146 Length: 146
hbghum.pep ck: 3104 from: 1 to: 146 Length: 146
hbhagf.pep ck: 4827 from: 1 to: 148 Length: 148
hbrlam.pep ck: 7737 from: 1 to: 149 Length: 149
mycrhi.pep ck: 918 from: 1 to: 151 Length: 151
myohum.pep ck: 4188 from: 1 to: 153 Length: 153
PARAMETER SET:
DIND: 80 Percent: 50
Gap Weight: 1.0 Length Weight: 1.0
Limit1: 20 Limit2: 20
Symbol comparison table: gendatabase:multpep.cmp
Consensus group file: gendatabase:multpep.grp
Specified grouping:
S Nucl./A.A. Description
1 STPAG HYDROXYL / SMALL ALIPHATIC
2 STPAGNDEQ HYDROPHILIC
3 NDEQ ACID / ACID AMIDE
4 HRK BASIC
5 EDHKR CHARGED
6 AMILV ALIPHATIC
7 FYW AROMATIC
8 ALIVMFYW HYDROPHOBIC
C CYSTEINE
USED SEQUENCES:
1 HBAHUMPEP HBAHUM HEMOGLOBIN ALPHA CHAIN, HUMAN
2 HBBHUMPEP HBBHUM HEMOGLOBIN BETA CHAIN, HUMAN
3 HBGHUMPEP HBGHUM HEMOGLOBIN GAMMA CHAIN, HUMAN
4 HBHAGFPEP HBHAGF HEMOGLOBIN, HAGFISH (MYXINE GLUTINOSA)
5 HBRLAMPEP HBRLAM HEMOGLOBIN, RIVER LAMPREY (LAMPETRA FLUVIATILIS)
6 MYCRHIPEP MYCRHI MYOGLOBIN, GASTROPOD, CERITHIDEA RHIZOPHORARUM
7 MYOHUMPEP MYOHUM MYOGLOBIN, HUMAN
USED TREE: 1(2 3)7(4 5)6
alignment of sequences: 2 3
10 20 30 40 50 60
. . . . . .
1 VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV HBBHUMPEP
1 GHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKV HBGHUMPEP
* * * * * ******* *** *************** *** ** * ******
H8T2E2K116T1LWGKVNV226GGE1LGRLLVVYPWTQRFF2SFG2LS112A6MGNPKV CONSENSUS
. . . . . .
61 KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK HBBHUMPEP
61 KAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGK HBGHUMPEP
******** * *** ****** **************** ******* *** ****
KAHGKKVL1181D16 HLD2LKGTFA2LSELHCDKLHVDPENF4LLGNVLV VLA HFGK CONSENSUS
. . . . . .
121 EFTPPVQAAYQKVVAGVANALAHKYH HBBHUMPEP
121 EFTPEVQASWQKMVTGVASALSSRYH HBGHUMPEP
**** *** ** * *** ** **
EFTP2VQA17QK6V1GVA2AL1 4YH CONSENSUS
alignment of sequences:
alignment of sequences: 4 5
10 20 30 40 50 60
. . . . . .
1 PITDHGQPPTLSEGDKKAIRESWPQIYKNFEQNSLAVLLEFLKKFPKAQDSFPKFSAKKS HBHAGFPEP
1 PIVDSGSVAPLSAAEKTKIRSAWAPVYSNYETSGVDILVKFFTSTPAAQEFFPKFKGMTS HBRLAMPEP
** * * ** * ** * * * * * * * ** **** *
PI D G2 11LS212K IR21W126Y N7E221626L65F8 P AQ2 FPKF 1 S CONSENSUS
. . . . . .
61 --HLEQDPAVKLQAEVIINAVNHTIGLMDKEAAMKKYLKDLSTKHSTEFQVNPDMFKELS HBHAGFPEP
61 ADQLKKSADVRWHAERIINAVNDAVASMDDTEKMSMKLRDLSGKHAKSFQVDPQYFKVLA HBRLAMPEP
* * ** ****** ** * * *** ** *** * ** *
12 L5 212V48 AE IINAVN5161 MD522 M L4DLS1KH1 2FQV2P28FK L1 CONSENSUS
. . . . . .
119 AVFVSTMG-GKAAYEKLFSIIATLLRSTYDA HBHAGFPEP
121 AVIADTVAAGDAGFEKLMSMICILLRSA--Y HBRLAMPEP
** * * * *** * * ****
AV862T611G5A17EKL8S6I LLRS1728111 CONSENSUS
alignment of sequences: 1(2 3)7(4 5)6
10 20 30 40 50 60
. . . . . .
1 ---------VLSPADKTNVKAAW---GKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-D HBAHUMPEP
1 --------VHLTPEEKSAVTALW---GKVNVDEV--GGEALGRLLVVYPWTQRFFESFGD HBBHUMPEP
1 --------GHFTEEDKATITSLW---GKVNVEDA--GGETLGRLLVVYPWTQRFFDSFGN HBGHUMPEP
1 PITDHGQPPTLSEGDKKAIRESW---PQIYKNFEQNSLAVLLEFLKKFPKAQDSFPKFSA HBHAGFPEP
1 PIVDSGSVAPLSAAEKTKIRSAW---APVYSNYETSGVDILVKFFTSTPAAQEFFPKFKG HBRLAMPEP
1 ---------SLQPASKSALASSWKTLAKDAATIQNNGATLFSLLFKQFPDTRNYFTHFGN MYCRHIPEP
1 --------GL-SDGEWQLVLNVW---GKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKH MYOHUMPEP
.. . * ... . . . ... * .. * *
16 2 12 LS222K 6 2 W416GKV 2 G E L RLF P TQ F2 F CONSENSUS
. . . . . .
48 LSH-----GSAQVKGHGKKVADALTNAVAHVDD---MPNALSALSDLHAHKLRVDPVNFK HBAHUMPEP
48 LSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDN---LKGTFATLSELHCDKLHVDPENFR HBBHUMPEP
48 LSSASAIMGNPKVKAHGKKVLTSLGDAIKHLDD---LKGTFAQLSELHCDKLHVDPENFK HBGHUMPEP
58 KKS--HLEQDPAVKLQAEVIINAVNHTIGLMDKEAAMKKYLKDLSTKHSTEFQVNPDMFK HBHAGFPEP
58 MTSADQLKKSADVRWHAERIINAVNDAVASMDDTEKMSMKLRDLSGKHAKSFQVDPQYFK HBRLAMPEP
52 M-SDAEMKTTGVGKAHSMAVFAGIGSMIDSMDDADCMNGLALKLSRNHIQR-KIGASRFG MYCRHIPEP
49 LKSEDEMKASEDLKKHGATVLTALGGILKKKGH---HEAEIKPLAQSHATKHKIPVKYLE MYOHUMPEP
. . .. .. . . .. . *. * . ... ..
L S22 6 22 VK HG V82A82 6 DD22 M 8 LS H K VDP FK CONSENSUS
. . . . . .
100 LLSHCLLVTLAAHLPA--EFTPAVHASLDKFLASVSTVLTSKYR------ HBAHUMPEP
105 LLGNVLVCVLAHHFGK--EFTPPVQAAYQKVVAGVANALAHKYH------ HBBHUMPEP
105 LLGNVLVTVLAIHFGK--EFTPEVQASWQKMVTGVASALSSRYH------ HBGHUMPEP
116 ELSAVFVSTMG-GKAAYEKLFSIIATLLRSTYDA---------------- HBHAGFPEP
118 VLAAVIADTVAAGDAGFEKLMSMICILLRSA--Y---------------- HBRLAMPEP
110 EMRQVFPNFLDEALGGGAS--GDVKGAWDALLAYLQDNKQAQAL------ MYCRHIPEP
106 FISECIIQVLQSKHPG--DFGADAQGAMNKALELFRKDMASNYKELGFQG MYOHUMPEP
. . .. . . . .
L V8 LA 1 2 F 1 V 8 K 82 8 2 Y 261721 CONSENSUS
Alignment of 7 different sequences using 0.13 minutes of CPU time
Output file: globin.mult.
Using the -msf option, Multalign writes the alignment into a multiple sequence format (MSF) file that interleaves the sequences to show their alignment. Any or all of the sequences in this file can be used by any other GCG sequence analysis program. For instance, you could generate a profile from the sequences in an MSF file with a command like % profilemake multalign.msf{*} and then use that profile to search the database for sequences similar to the sequences in the alignment. (See the Specifying Sequences section of the User's Guide for help specifying sequences in MSF files.)
EClustAlW calculates a multiple alignment of nucleic acid or protein sequences according to the method of Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). This is part of the original ClustalW distribution, modified for inclusion in EGCG.
Tree produces a multiple alignment for a set of protein sequences by iteratively acting on the sequences. An approximate phylogenetic order of the sequences is first determined by a series of pairwise alignments using the Needleman and Wunsch method. Any subclusters that may exist in the set are prealigned before the final alignment is undertaken. Finally, the phylogenetic tree of the sequences is plotted in the form of a dendrogram.
PileUp creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment.
LineUp is a screen editor for editing multiple sequence alignments. You can edit up to 30 sequences simultaneously. New sequences can be typed in by hand or added from existing sequence files. A consensus sequence identifies places where the sequences are in conflict.
Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns. Motifs can display an abstract of the current literature on each of the motifs it finds.
ProfileGap makes an optimal alignment between a profile and a sequence.
Pretty displays multiple sequence alignments and calculates a consensus sequence. It does not create the alignment; it simply displays it.
PrettyPlot displays multiple sequence alignments and calculates a consensus sequence. It does not create the alignment; it simply displays it.
PrettyBox displays multiple sequence alignments as shaded boxes in PostScript format (i.e., the output file must be printed and/or displayed on a PostScript-compatible device). PrettyBox will optionally calculate a consensus sequence. The program does not create the alignment; it simply displays it.
PlotAlign takes a GCG format sequence alignment, and plots the mean and range of values for any amino acid parameter you supply. The "panel file" contains a list of parameters to be plotted. The main database of parameters is taken from Nakai et al. (1988), and the default panel file uses selected parameters from the 13 discrete clusters in that paper. This program is experimental. Any suggestions would be most welcome.
MultAlign can (theoretically) align up to 1,000 sequences of up to 2,500 symbols each. As gaps are inserted, the length of the final alignment grows, but no sequence in the final alignment may exceed 2,500 characters. That is, the maximum sequence length is 2,500 - X, where X is the number of gaps introduced by MultAlign.
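The limits above can be checked before a run. The following is a minimal sketch in Python and is not part of MultAlign; the function name fits_multalign_limits and its arguments are hypothetical, and the gap counts would have to come from an earlier run or an estimate.

```python
MAX_SEQUENCES = 1000        # stated upper bound on the number of sequences
MAX_ALIGNED_LENGTH = 2500   # stated upper bound on symbols per aligned sequence


def fits_multalign_limits(lengths, gaps_inserted):
    """Return True if every sequence still fits after gap insertion.

    lengths       -- original sequence lengths
    gaps_inserted -- number of gap symbols added to each sequence (X above)
    """
    if len(lengths) != len(gaps_inserted):
        raise ValueError("need one gap count per sequence")
    if len(lengths) > MAX_SEQUENCES:
        return False
    return all(n + x <= MAX_ALIGNED_LENGTH
               for n, x in zip(lengths, gaps_inserted))
```

For example, a sequence of 2,495 symbols that receives 10 gaps would exceed the limit, while the globin sequences of the example session are far below it.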
MultAlign is based on a generalisation of the algorithm of M.S. Waterman, T.F. Smith and W.A. Beyer (Adv. Math. Vol. 20, pp. 367-387 (1976)) by M. Krueger and G. Osterburg (Comp. Prog. in Biomed. Vol. 16, pp. 68-69 (1983)).
Internally, MultAlign aligns the sequences sequentially. The program starts with two sequences, takes the resulting alignment, aligns it with another sequence or alignment, and so on. Therefore, the result of MultAlign depends strongly on the order in which the sequences are compared. You can specify a tree to define that order. Let's look at an example: we want to align seven sequences (HBAHUM, HBBHUM, HBGHUM, HBHAGF, HBRLAM, MYOHUM, MYCRHI) and we specify the following tree: 1 (2 3) 7 (4 5) 6
hbghum(3)  hbbhum(2)
     \    /     hbahum(1)
      \  /     /
       \/     /
        \    /     myohum(7)
         \  /     /
          \/     /
           \    /    hbhagf(4)
            \  /     |
             \/      |
              \      |___ hbrlam(5)
               \    /
                \  /       mycrhi(6)
                 \/       /
                  \      /
                   \    /
                    \  /
                     \/
This means that the second and third sequences entered (HBBHUM and HBGHUM) will be aligned first. In the next step, the resulting alignment will be aligned to sequence number 1 (HBAHUM), followed by sequence number 7 (MYOHUM). Independently, sequences 4 and 5 will be compared, and finally the overall alignment will be computed, including sequence number 6 (MYCRHI). Generally, closely related sequences should be aligned first, while distantly related sequences (or sequence groups) should be compared in later steps. The tree can be entered interactively, given with the -TREE parameter (note that you have to replace spaces by underscores "_"), or stored in a local data file named with the -DATa3 parameter (see the LOCAL DATA FILES topic below).
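The tree notation can be illustrated with a short Python sketch. It is not taken from MultAlign: the helper names parse_tree, leaves, check_tree and tree_option are hypothetical, and the grammar (sequence numbers and parentheses, separated by spaces or underscores) is an assumption inferred from the examples in this document.

```python
import re


def parse_tree(text):
    """Parse a tree such as '1 (2 3) 7 (4 5) 6' into nested lists.

    Integers are sequence numbers; each parenthesised group becomes a
    nested list, i.e. a sub-alignment that is computed on its own.
    """
    tokens = re.findall(r"\d+|[()]", text)
    stack = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            group = stack.pop()
            stack[-1].append(group)
        else:
            stack[-1].append(int(tok))
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]


def leaves(tree):
    """Yield every sequence number in the tree, left to right."""
    for node in tree:
        if isinstance(node, list):
            yield from leaves(node)
        else:
            yield node


def check_tree(tree, n_sequences):
    """Raise ValueError unless each of 1..n_sequences occurs exactly once."""
    if sorted(leaves(tree)) != list(range(1, n_sequences + 1)):
        raise ValueError("tree must name every sequence exactly once")


def tree_option(text):
    """Build the -TREE option: spaces are replaced by underscores."""
    return '-TREE="' + "_".join(text.split()) + '"'
```

With the tree of the example, parse_tree("1 (2 3) 7 (4 5) 6") gives [1, [2, 3], 7, [4, 5], 6], so the two bracketed pairs show up as separate groups, and tree_option turns the same string into the underscore form expected on the command line. Checking that every sequence number occurs exactly once catches the most likely typing error before the program is started.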
There are several important parameters which profoundly influence the resulting alignments. Though the program is installed with a default value for each parameter, they should be carefully checked to avoid meaningless results.
All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
Minimal Syntax: % multalign [-INfile=]@globin.fil -Default

Prompted Parameters:
[-OUTfile=]globin.mult   output file name
-BEGin1=1 -END1=100      range of interest for sequence 1
-REV1 -NOREV3            strand of each sequence
-TREE="((1_2)_3)"        order in which to align the sequences
-DIND=80                 weighting for longer insertions
-GAPweight=5.0           gap weight
-LENgthweight=1.0        gap length weight
-BATch[=long]            run MultAlign in specified batch-queue

Local Data Files:
-DATa1=multdna.cmp       distance matrix for DNA sequences
-DATa2=multdna.grp       grouping of sequence symbols to be used for the
                         (DNA) consensus sequence
-DATa3=tree.mult         contains tree (alternative to -TREE)

Optional Parameters:
-LIMit1=20               sum of all gaps in sequence 2 is restricted so that
                         sequence 2 does not come out of phase with
                         sequence 1 for more than 20 elements
-LIMit2=20               sum of all gaps in sequence 1 is restricted so that
                         sequence 1 does not come out of phase with
                         sequence 2 for more than 20 elements
                         All limits are initial values. The program itself
                         checks for every sequence whether they are
                         sufficient to compute a complete alignment. If not,
                         values are changed to the lowest ones that allow
                         computation.
-PERCent=50              defines how many percent of identities are
                         necessary to obtain a letter in the consensus.
-ENDWeight               weights end gaps like other gaps
-WIDTH=60                number of bases per line
-NOCONSensus             suppresses consensus
-PROtein                 insists that your sequences are protein
-DNA                     insists that your sequences are DNA
-MSF[=globin.msf]        writes an MSF file
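For scripted runs, the command line can be assembled from the options in the summary above. The following Python sketch is only an illustration: the function multalign_argv is hypothetical, it covers just a few of the listed options, and it assumes that -Default is given last, as in the minimal syntax line.

```python
import shlex


def multalign_argv(infile, outfile=None, tree=None, dind=None,
                   gap_weight=None, length_weight=None, msf=None):
    """Assemble an argument list using only options from the summary."""
    argv = ["multalign", f"-INfile={infile}"]
    if outfile is not None:
        argv.append(f"-OUTfile={outfile}")
    if tree is not None:
        # Spaces in the tree are replaced by underscores on the command line.
        argv.append("-TREE=" + "_".join(tree.split()))
    if dind is not None:
        argv.append(f"-DIND={dind}")
    if gap_weight is not None:
        argv.append(f"-GAPweight={gap_weight}")
    if length_weight is not None:
        argv.append(f"-LENgthweight={length_weight}")
    if msf is not None:
        argv.append(f"-MSF={msf}")
    argv.append("-Default")
    return argv


# The settings of the example session: gap weight 1, tree 1(2 3)7(4 5)6.
argv = multalign_argv("@globin.fil", outfile="globin.mult",
                      tree="1 (2 3) 7 (4 5) 6", gap_weight=1.0)
print(shlex.join(argv))  # shell-quoted form, e.g. for a batch script
```

Keeping the arguments as a list avoids quoting problems with the parentheses in the tree; shlex.join (Python 3.8 or later) is used only to print a shell-safe version of the same command.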
MultAlign is based on the ALIGNSTAT program of Michael Krueger. It has been implemented and adapted to HUSAR by Ulrike Goebel and Karl-Heinz Glatting, DKFZ Heidelberg.
The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.
MultAlign uses a symbol distance table found in either multdna.cmp or multpep.cmp to find the distance value between each pair of symbols. The public version of multdna.cmp is a unitary matrix for nucleic acids; the public version of multpep.cmp is a distance matrix according to Dayhoff (a modified LOM-matrix). You can choose your own distance matrix via the -DATa1 parameter. Each distance table has a corresponding "group" file, where possible consensus sequence symbols are stored. Standard group files are multdna.grp and multpep.grp for nucleotides and amino acids, respectively. Other group files may be chosen via the -DATa2 parameter. You can fetch a unitary distance matrix for amino acids named multpepuni.cmp as well as a corresponding group file named multpepuni.grp. Furthermore, you can store the tree defining the order in which to align the sequences in a local data file and specify it with -DATa3. Fetch the file globintree.mult as an example.
The variables -GAPweight and -LENgthweight define the function for weighting insertions of length k with the penalty factor
Weight(k) = GapWeight + k * LengthWeight
Please note that -GAPweight should never be less than the minimum distance between two different letters. There is no simple rule for selecting these variables. In most cases the default values lead to alignments that are acceptable as a first approximation, but both parameters have to be optimized with great patience to find the optimum or to handle more difficult situations (default: -GAPweight=5.0 and -LENgthweight=1.0). For a more detailed description of gap penalties see the ALGORITHM topic in the program description of Gap.
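The penalty formula can be written out as a small Python function. This is a sketch for trying out parameter values, not MultAlign's code; the function name gap_penalty is hypothetical, and the defaults are the documented defaults of -GAPweight and -LENgthweight.

```python
def gap_penalty(k, gap_weight=5.0, length_weight=1.0):
    """Weight(k) = GapWeight + k * LengthWeight for an insertion of length k."""
    if k < 1:
        raise ValueError("insertion length must be at least 1")
    return gap_weight + k * length_weight


# One insertion of length 4 versus two insertions of length 2 each.
one_long = gap_penalty(4)
two_short = 2 * gap_penalty(2)
```

Because GapWeight is charged once per insertion, a single insertion of length 4 costs less than two separate insertions of length 2 under the default values. Lowering -GAPweight, as in the example session where it is set to 1, narrows that difference and makes scattered short gaps cheaper.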
-DIND affects the weighting for longer insertions: an insertion of length k is weighted as k/DIND insertions of length DIND. To get optimal alignments with longer insertions, reasonable values for DIND are 80 to 100. Otherwise, small values also lead to reasonable results (default: -DIND=80).
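Read literally, the sentence about DIND can be turned into arithmetic as follows. This Python sketch is one possible reading only: the function name long_insertion_weight is hypothetical, it assumes that an insertion of length DIND is weighted with GapWeight + DIND * LengthWeight, and this description does not say from which insertion length the program actually applies the rule.

```python
def long_insertion_weight(k, dind=80, gap_weight=5.0, length_weight=1.0):
    """Weight an insertion of length k as k/DIND insertions of length DIND.

    Assumption: an insertion of length DIND itself costs
    GapWeight + DIND * LengthWeight.
    """
    if k < 1 or dind < 1:
        raise ValueError("k and dind must be at least 1")
    weight_of_dind = gap_weight + dind * length_weight
    return (k / dind) * weight_of_dind
```

Under this reading, an insertion twice as long as DIND costs exactly twice as much as one of length DIND, so the one-time gap weight is spread over DIND symbols.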
The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
-LIMit1 and -LIMit2: the sum of all gaps in the first/second (group of) sequence(s) is restricted so that the second/first (group of) sequence(s) does not come out of phase for more than 20 symbols.
BOTH LIMITS ARE INITIAL VALUES. THE PROGRAM CHECKS FOR EACH (GROUP OF) SEQUENCE(S) WHETHER THEY ARE SUFFICIENT TO COMPUTE A COMPLETE ALIGNMENT. IF NOT, THE VALUES ARE CHANGED TO THE LOWEST ONES THAT ALLOW COMPUTATION!
Both parameters are set to 20 by default. Usually, you will only have to change them if you want to align sequences of nearly equal length but with homologies within different regions.
-PERCent defines what percentage of identities is necessary to obtain a letter in the consensus sequence. For example, if there are three sequences, only two of the three symbols at a position are identical, and -PERCent is set to 70, the consensus sequence will contain a space at this position (default value: 50).
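The effect of -PERCent on a single alignment column can be sketched in Python. This is an illustration, not the program's code: the function name consensus_symbol is hypothetical, only identities are considered (the group symbols from the group file and any special treatment of gaps are left out), and treating the threshold as inclusive is an assumption.

```python
from collections import Counter


def consensus_symbol(column, percent=50):
    """Return the most common symbol of a column if its share of the
    column reaches `percent`, otherwise a space."""
    if not column:
        raise ValueError("column must not be empty")
    symbol, count = Counter(column).most_common(1)[0]
    share = 100.0 * count / len(column)
    return symbol if share >= percent else " "
```

For the example above, the column "AAG" has two identical symbols out of three, about 67 percent. With the default of 50 the consensus shows "A"; with -PERCent=70 it shows a space.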
-ENDWeight causes the end gaps to be weighted in the same way as all other gaps.
-WIDTH=60 puts 60 symbols on each line in the output file. You can set the width to anything from 10 to 150 symbols.
-NOCONSensus suppresses the display of the consensus sequence.
-PROtein insists that your sequences are protein sequences.
-DNA insists that your sequences are nucleotide sequences.
-MSF writes an MSF output file which can be used as input for other programs (e.g. ELineUp). See the Specifying Sequences section of the GCG User's Guide for a complete description of MSF files.
Waterman M.S., Smith T.F., Beyer W.A. (1976) "Some biological sequence metrics." Adv. Math. 20, 367-387.
Krueger M. and Osterburg G. (1983) Comp. Prog. in Biomed. 16, 68-69.
Printed: April 22, 1996 15:54 (1162)