Profalign

PROFALIGN

FUNCTION

ProfAlign takes two existing alignments (or single sequences) and aligns them with each other. The result is one larger alignment. This is part of the original ClustalW distribution, modified for inclusion in EGCG.

DESCRIPTION

ProfAlign performs a multiple alignment of two sets (profiles) of either DNA or protein sequences. Together, the two profiles can contain up to 500 sequences. Either profile can consist of already aligned sequences (usually written into multiple sequence format [MSF] files) or of a single sequence. This program is very similar to the multiple alignment step in the program EClustAlW, except that no dendrogram is involved because none is needed.

ProfAlign is useful when you wish to build up a multiple alignment gradually, choosing different parameters manually or correcting intermediate errors as the alignment proceeds. Often, just a few sequences cause misalignments in the progressive algorithm; these can be removed from the process and then added at the end by profile alignment. A second use is where you have a high-quality reference alignment and wish to keep it fixed while adding new sequences automatically.

A file is usually produced during the alignment process containing the multiple alignment.

AUTHOR

ClustalW was written by Des Higgins (E-mail: Des.Higgins@ebi.ac.uk).

The EGCG version of the program was modified by Weiyun Chen and Karl-Heinz Glatting at the German Cancer Research Centre (DKFZ), Heidelberg, Germany.

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).

EXAMPLE

  
  % profalign -msf
  
   ProfAlign of what first profile ? globinhum.msf{}
  
   ProfAlign of what second profile ? hbrlam.pep
  
   What should I call the alignment output file (* profalign.aln *) ?
  
   What should I call the MSF output file (* profalign.msf *) ?
  
   Start of Pairwise alignments
   Aligning...
   Sequences (1:4) Aligned. Score:  31
   Sequences (2:4) Aligned. Score:  19
   Sequences (3:4) Aligned. Score:  25
  
   Start of Multiple Alignment
   Sequence:4     Score:881
  
   Alignment Score 2370
  
   Consensus length = 159
  
  %

OUTPUT

The final multiple alignment is written to a file named profalign.aln by default. The output is self-explanatory. Positions where all residues are identical are marked with an asterisk ( * ) and, for proteins, positions where all residues are "similar" are marked with a dot ( . ).

Here is the output file:

  
  
  
                     profalign August 21, 1995 16:58
  
  of: globinhum.msf{} and globinfish.msf{}
  
   Multiple alignment parameter:
  
   Gap Penalty (fixed):           10.00
   Gap Penalty (varying):         .05
   Gap separation penalty range:  8
   Percent. identity for delay:   0%
   List of hydrophilic residue:   GPSNDQEKR
   Protein Weight Matrix:         blosum
  
                     10        20        30        40        50        60
                      .         .         .         .         .         .
  hbahum.pep      ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-----
  hbbhum.pep      --------VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLST
  hbghum.pep      --------GHFTEEDKATITSLWGKV--NVEDAGGETLGRLLVVYPWTQRFFDSFGNLSS
  myohum.pep      ---------GLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKS
  hbhagf.pep      PITDHGQPPTLSEGDKKAIRESWPQIYKNFEQNSLAVLLEFLKKFPKAQDSFPKFSAKKS
  hbrlam.pep      PIVDSGSVAPLSAAEKTKIRSAWAPVYSNYETSGVDILVKFFTSTPAAQEFFPKFKGMTS
                        .  .   .   *  .           *       * .   *  *
  
  hbahum.pep      -DLSHGSAQVKGHGKKVADALTNAVAHVD---DMPNALSALSDLHAHKLRVDPVNFKLLS
  hbbhum.pep      PDAVMGNPKVKAHGKKVLGAFSDGLAHLD---NLKGTFATLSELHCDKLHVDPENFRLLG
  hbghum.pep      ASAIMGNPKVKAHGKKVLTSLGDAIKHLD---DLKGTFAQLSELHCDKLHVDPENFKLLG
  myohum.pep      EDEMKASEDLKKHGATVLTALGGILKKKG---HHEAEIKPLAQSHATKHKIPVKYLEFIS
  hbhagf.pep      --HLEQDPAVKLQAEVIINAVNHTIGLMDKEAAMKKYLKDLSTKHSTEFQVNPDMFKELS
  hbrlam.pep      ADQLKKSADVRWHAERIINAVNDAVASMDDTEKMSMKLRDLSGKHAKSFQVDPQYFKVLA
                      .. .   .  .    .               *.  *.   ..       .
  
  hbahum.pep      HCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR-------
  hbbhum.pep      NVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH-------
  hbghum.pep      NVLVTVLAIHFGKEFTPEVQASWQKMVTGVASALSSRYH-------
  myohum.pep      ECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQ-G
  hbhagf.pep      AVFVSTMG----------GKAAYEKLFSIIATLLRSTYDA------
  hbrlam.pep      AVIADTVAA---------GDAGFEKLMSMICILLRSAY--------
                   .                 *        .   *
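
The asterisk convention described above can be illustrated with a short Python sketch. It is not part of ProfAlign; the function name identity_markers is hypothetical, and the choice never to mark columns that contain a gap is an assumption. Similarity dots are left out because they depend on the weight matrix in use.

```python
def identity_markers(rows):
    """Return a marker line with '*' under every fully identical column.

    Assumption: a column containing a gap ('-') is never marked.
    Similarity dots are omitted; they depend on the weight matrix.
    """
    if not rows:
        return ""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all aligned rows must have the same length")
    marks = []
    for column in zip(*rows):
        first = column[0]
        identical = first != "-" and all(c == first for c in column)
        marks.append("*" if identical else " ")
    return "".join(marks)


if __name__ == "__main__":
    # Three short, made-up aligned fragments; only the last column is identical.
    rows = ["VLSPADK", "HLTPEEK", "HFTEEDK"]
    for row in rows:
        print(row)
    print(identity_markers(rows))
```

Each row must already be padded with gap characters to the same length, as in the output file shown above.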

MULTIPLE SEQUENCE FILES

Using the -msf option, ProfAlign writes the alignment into a multiple sequence format (MSF) file that interleaves the sequences to show their alignment. Any or all of the sequences in this file can be used by any other GCG or EGCG sequence analysis program. For instance, you could generate a profile from the sequences in an MSF file with a command like % profilemake profalign.msf{*} and then use that profile to search the database for sequences similar to the sequences in the alignment. (See the Specifying Sequences section of the GCG User's Guide for help specifying sequences in MSF files.)

RELATED PROGRAMS

BoxAlign displays a sequence alignment graphically, marking columns of conserved amino acids or nucleotides with boxes. BoxAlign does not compute an alignment; it simply displays it.

EClustAlW calculates a multiple alignment of nucleic acid or protein sequences according to the method of Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). This is part of the original ClustalW distribution, modified for inclusion in EGCG.

ClusTree computes a phylogenetic tree according to the Neighbor-Joining Method of Saitou and Nei (1987). This is part of the original ClustalW distribution, modified for inclusion in EGCG. The tree will be displayed graphically.

LineUp is a screen editor for editing multiple sequence alignments. You can edit up to 30 sequences simultaneously. New sequences can be typed in by hand or added from existing sequence files. A consensus sequence identifies places where the sequences are in conflict.

Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns. Motifs can display an abstract of the current literature on each of the motifs it finds.

MultAlign does a simultaneous alignment for two or more DNA or protein sequences. It introduces a certain number of gaps into either pairwise aligned sequences or groups of sequences to find a minimal global distance. The user can influence the result by defining the order in which the sequences will be aligned. The program is based on a generalization of the algorithm of Waterman, Smith and Beyer by Krueger and Osterburg.

PileUp creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment.

PlotAlign takes a GCG format sequence alignment, and plots the mean and range of values for any amino acid parameter you supply. The "panel file" contains a list of parameters to be plotted. The main database of parameters is taken from Nakai et al. (1988), and the default panel file uses selected parameters from the 13 discrete clusters in that paper. This program is experimental. Any suggestions would be most welcome.

Pretty displays multiple sequence alignments and calculates a consensus sequence. It does not create the alignment; it simply displays it.

ProfileGap makes an optimal alignment between a profile and a sequence.

TProfileGap makes an optimal alignment between a profile and a sequence.

Tree produces a multiple alignment for a set of protein sequences by iteratively acting on the sequences. An approximate phylogenetic order of the sequences is first determined by a series of pairwise alignments using the Needleman and Wunsch method. Any subclusters that may exist in the set are prealigned before the final alignment is undertaken. Finally, the phylogenetic tree of the sequences is plotted in the form of a dendrogram.

RESTRICTIONS

ProfAlign always requires two sets of sequences as input (usually written as multiple sequence format [MSF] files). The two sets together must not contain more than 500 sequences, and each sequence can be at most 10,000 characters long. As gaps are inserted, the alignment grows, but no sequence in the final alignment can exceed 10,000 characters. The maximum input sequence length is therefore 10,000 - X, where X is the number of gaps introduced by ProfAlign. For DNA, U = T. No ambiguity codes are used.
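
The limits above amount to simple arithmetic, sketched here in Python. This is only an illustration: the function check_limits and its arguments are hypothetical, and gaps_added stands for the X in the text, which is not known until the alignment has been computed.

```python
MAX_SEQUENCES = 500          # combined limit over both profiles (from the text)
MAX_ALIGNED_LENGTH = 10_000  # limit for any sequence in the final alignment


def check_limits(profile1_lengths, profile2_lengths, gaps_added=0):
    """Return a list of problems; an empty list means the input is within limits.

    profile1_lengths / profile2_lengths: sequence lengths in each profile.
    gaps_added: hypothetical estimate of the gaps ProfAlign will introduce (X).
    """
    problems = []
    lengths = list(profile1_lengths) + list(profile2_lengths)
    total = len(lengths)
    if total > MAX_SEQUENCES:
        problems.append(f"{total} sequences exceed the limit of {MAX_SEQUENCES}")
    longest = max(lengths, default=0)
    if longest + gaps_added > MAX_ALIGNED_LENGTH:
        problems.append(
            f"longest sequence ({longest}) plus {gaps_added} gaps exceeds "
            f"{MAX_ALIGNED_LENGTH} characters"
        )
    return problems


if __name__ == "__main__":
    print(check_limits([150] * 4, [149]))             # within limits
    print(check_limits([9990], [100], gaps_added=20)) # too long once gapped
```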

ALGORITHM

Multiple Alignment

The basic algorithm attempts to minimize the distance between groups of sequences. A full dynamic programming algorithm (Myers E.W. and Miller W. CABIOS 4: 11-17 (1988); Thompson J.D. CABIOS 11: 181-186 (1995)) is used with a residue weight matrix and penalties for opening and extending gaps. For a detailed description of the method used for aligning two existing alignments or sequences, see the entry for Clustal in the Program Manual.
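
To make the idea of dynamic programming with gap opening and extension penalties concrete, here is a minimal Python sketch. It is deliberately simplified and is not the ProfAlign implementation: it aligns two single sequences rather than two profiles, uses quadratic memory instead of the linear-space method of Myers and Miller, and returns only the score. The function name affine_global_score and the toy scoring function are assumptions made for illustration.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, score, gap_open=10.0, gap_extend=0.05):
    """Best global alignment score; a gap of length k costs gap_open + k * gap_extend.

    Terminal gaps are penalized like internal ones.
    """
    n, m = len(a), len(b)
    # M: a[i-1] aligned with b[j-1]; X: a[i-1] opposite a gap; Y: b[j-1] opposite a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + i * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + j * gap_extend)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = score(a[i - 1], b[j - 1])
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Open a new gap (fixed + first position) or extend an existing one.
            X[i][j] = max(M[i - 1][j] - gap_open - gap_extend,
                          Y[i - 1][j] - gap_open - gap_extend,
                          X[i - 1][j] - gap_extend)
            Y[i][j] = max(M[i][j - 1] - gap_open - gap_extend,
                          X[i][j - 1] - gap_open - gap_extend,
                          Y[i][j - 1] - gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])


def toy_score(x, y):
    """Hypothetical stand-in for a residue weight matrix."""
    return 2.0 if x == y else -1.0


if __name__ == "__main__":
    print(affine_global_score("ACGT", "AGT", toy_score, gap_open=3.0, gap_extend=1.0))
```

In a profile alignment, the score for a pair of columns would be computed from all residues in both columns rather than from two single residues; that step is omitted here.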

COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

  
  
  Minimal Syntax: % profalign  [-PROFile1=]name1.msf{*} \
                               [-PROFile2=]name2.msf{*} -Default
  
  Prompted Parameters:
  
  [-OUTfile=]profalign.aln     output file name
  
  Local Data Files:
  
  -DATa=pam250.clus       comparison table for protein alignments
  
  Optional Parameters:
  
  -MSF[=profalign.msf]    writes an MSF file
  
  Multiple alignment:
  
  -MATRIX=blosum          BLOSUM, PAM or ID
                     comparison table for protein multiple alignments
  -GAPC=10.0              gap penalty (fixed);
                     increase to prevent gaps; decrease to encourage them
  -GAPV=5.0               gap penalty (varying);
                     decrease to encourage LONGER gaps
  -UNWeighted             controls whether transitions are weighted twice as
                     much as transversions (only applies to DNA)
  -ENDGAPs                no end gap separation penalty
  -GAPDist=8              gap separation penalty range
  -NORGAP                 no residue specific gaps
  -RGAPRes="GPSNDQEKR"    list of hydrophilic residues
  -NOHGAP                 no hydrophilic gaps
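
The abbreviation rule for qualifier names (only the capitalized letters must be typed) can be sketched in Python as follows. This is an illustration, not GCG code: the helper names are hypothetical, and two behaviours are assumed rather than taken from the text, namely that matching is case-insensitive and that any extra letters typed must agree with the full qualifier name.

```python
def required_prefix(spec):
    """Leading run of capital letters in a qualifier spec such as 'CHEck'."""
    prefix = []
    for ch in spec:
        if not ch.isupper():
            break
        prefix.append(ch)
    return "".join(prefix)


def matches_qualifier(typed, spec):
    """True if `typed` is an acceptable abbreviation of the qualifier `spec`.

    Assumptions: case-insensitive comparison; letters typed beyond the
    required prefix must match the full qualifier name.
    """
    typed_name = typed.lstrip("-").upper()
    spec_name = spec.lstrip("-")
    needed = required_prefix(spec_name)
    return typed_name.startswith(needed) and spec_name.upper().startswith(typed_name)


if __name__ == "__main__":
    for typed in ("-che", "-check", "-ch", "-checkx"):
        print(typed, matches_qualifier(typed, "-CHEck"))
```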

ACKNOWLEDGEMENT

For details about the ClustalW program package, including ClustalW, ProfAlign and ClusTree, see J. D. Thompson et al. (Nucleic Acids Research 22 (22): 4673-4680 (1994)) and D. G. Higgins et al. (CABIOS 8 (2): 189-191 (1992)). For details about the overall multiple alignment algorithm, see D. G. Higgins and P. M. Sharp (CABIOS 5: 151-153 (1989)).

ProfAlign is part of ClustalW which was developed and written by Des Higgins, European Bioinformatics Institute, EMBL Outstation, Hinxton, UK. The program was added to the Package for HUSAR version 3.0 by Weiyun Chen and Karl-Heinz Glatting, DKFZ Heidelberg, Germany, and converted to EGCG by Peter Rice, Sanger Centre, Hinxton, UK.

LOCAL DATA FILES

For protein comparison, a weight matrix is used to weight aligned amino acids. The default is the BLOSUM series, but you can also supply your own protein matrix as a local data file. Naming a file on the command line with an expression like -DATa=pam250.clus causes the matrix in the file pam250.clus to be used.

OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

-MSF[=seqname.msf]

writes an MSF output file that can be used as input for other GCG/EGCG programs (e.g. ClusTree). See the Specifying Sequences section of the GCG User's Guide for a complete description of MSF files.

MULTIPLE ALIGNMENT PARAMETERS

-MATRIX=blosum

For protein comparisons, a weight matrix is used to weight different pairs of aligned amino acids differently. The BLOSUM matrix used by default can be replaced by a PAM matrix, an ID (identity) matrix or a user-defined matrix in a local data file (see the LOCAL DATA FILES section).

-GAPC=10

The gap penalty (fixed) parameter is a penalty for every gap that is introduced, regardless of the length of the gap. Decreasing this parameter therefore encourages gaps of all sizes. Terminal gaps are penalized just like all others. BEWARE: if you set the penalty too low (around 5), the program may prefer to align each sequence opposite a long gap.

-GAPV=0.05

The gap penalty (varying) parameter is a penalty for each position within a gap, so it penalizes longer gaps. Increase it and gaps will get shorter. BEWARE: if you set the penalty too low, the program may prefer to align each sequence opposite a long gap. The default for DNA is 5.0.

For a more detailed description of gap penalties see the ALGORITHM section in the program description of Gap.
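
Taken together, the two penalties give the cost of a single gap as the fixed penalty plus the varying penalty times the gap length. A minimal Python sketch of that arithmetic follows; the function name gap_cost is hypothetical, and the defaults are the protein values quoted above.

```python
def gap_cost(length, gapc=10.0, gapv=0.05):
    """Cost of one gap: fixed penalty plus varying penalty per gap position."""
    if length < 0:
        raise ValueError("gap length must not be negative")
    if length == 0:
        return 0.0
    return gapc + gapv * length


if __name__ == "__main__":
    # One gap of six positions is much cheaper than two gaps of three,
    # because the fixed penalty is paid once per gap.
    print(gap_cost(6))
    print(2 * gap_cost(3))
```

With these defaults the fixed penalty dominates, which is why lowering -GAPC has a much stronger effect on the number of gaps than lowering -GAPV.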

-UNWeighted

If transitions are unweighted, all nucleic acid mismatches have the same weight. If transitions (C vs T; A vs G) are weighted, they receive an intermediate score between exact matches and other mismatches, so an A aligned with a G is preferred to an A aligned with a C or a T. The default is weighted transitions.
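
The effect of this switch can be sketched in Python. The numeric scores below are placeholders chosen only to show the ordering (match > transition > transversion); they are assumptions, not the values ProfAlign uses, and the function name dna_pair_score is hypothetical. U is treated as T, as stated under RESTRICTIONS.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_pair_score(x, y, weighted=True, match=1.0, transition=0.5, mismatch=0.0):
    """Score one aligned nucleotide pair (placeholder values, see lead-in)."""
    x = "T" if x.upper() == "U" else x.upper()
    y = "T" if y.upper() == "U" else y.upper()
    if x == y:
        return match
    pair = {x, y}
    is_transition = pair <= PURINES or pair <= PYRIMIDINES
    if weighted and is_transition:
        return transition  # intermediate between a match and other mismatches
    return mismatch


if __name__ == "__main__":
    print(dna_pair_score("A", "G"))                  # transition, weighted
    print(dna_pair_score("A", "C"))                  # transversion
    print(dna_pair_score("A", "G", weighted=False))  # as with -UNWeighted
```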

-ENDGAPs

End gap separation treats end gaps just like internal gaps for the purposes of avoiding gaps that are too close. If you turn this parameter off, end gaps will be ignored. This is useful when you wish to align fragments where the end gaps are not biologically meaningful.

-GAPDist=8

This parameter defines the gap separation penalty range. Gap separation distance tries to decrease the chances of gaps being too close to each other. Gaps that are less than this distance apart are penalized more than other gaps. This does not prevent close gaps; it makes them less frequent, promoting a block-like appearance of the alignment.
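
As an illustration of the idea only, the Python sketch below raises the gap opening penalty for positions that lie within the separation range of an existing gap. The linear scaling is an assumption made for this example and should not be read as the formula ProfAlign uses; the function name gap_open_near_gaps is likewise hypothetical.

```python
def gap_open_near_gaps(position, existing_gaps, gapc=10.0, gapdist=8):
    """Gap opening penalty at `position`, raised near existing gaps.

    Assumption: the penalty grows linearly from gapc (at distance gapdist
    or more) to 2 * gapc (at distance 0). The real scaling may differ.
    """
    if not existing_gaps:
        return gapc
    distance = min(abs(position - g) for g in existing_gaps)
    if distance >= gapdist:
        return gapc
    return gapc * (1.0 + (gapdist - distance) / gapdist)


if __name__ == "__main__":
    gaps = [5]
    for pos in (7, 13, 20):
        print(pos, gap_open_near_gaps(pos, gaps))
```

Because the penalty is raised rather than made infinite, close gaps remain possible when the sequences strongly support them.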

-NORGAP

Setting this parameter turns off residue-specific gap penalties. Residue-specific penalties are amino acid specific gap penalties that reduce or increase the gap opening penalties at each position in the alignment or sequence.

-RGAPRes="GPSNDQEKR"

This parameter defines the list of hydrophilic residues. Hydrophilic gap penalties are used to increase the chances of a gap within a run (5 or more residues) of hydrophilic amino acids; these are likely to be loop or random coil regions, where gaps are more common.
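
Finding such runs is straightforward, as the following Python sketch shows. It only locates the runs; how the gap penalty is then lowered inside them is not shown. The function name hydrophilic_runs is hypothetical, and the residue list and minimum run length are the values quoted above.

```python
def hydrophilic_runs(sequence, residues="GPSNDQEKR", min_run=5):
    """Return (start, end) pairs, end exclusive, for each run of at least
    `min_run` consecutive hydrophilic residues in `sequence`."""
    allowed = set(residues.upper())
    seq = sequence.upper()
    runs = []
    start = None
    for i, ch in enumerate(seq):
        if ch in allowed:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i))
            start = None
    # A run may extend to the end of the sequence.
    if start is not None and len(seq) - start >= min_run:
        runs.append((start, len(seq)))
    return runs


if __name__ == "__main__":
    print(hydrophilic_runs("AAGPSNDAA"))
    print(hydrophilic_runs("GPSN"))
```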

-NOHGAP

Setting this parameter turns off hydrophilic gap penalties.

REFERENCES

Myers E.W. and Miller W. (1988). "Optimal alignments in linear space." Comput. Appl. Biosci. 4, 11-17.

Thompson J.D. (1995). "Introducing variable gap penalties to sequence alignment in linear space." Comput. Appl. Biosci. 11, 181-186.

Thompson J.D., Higgins D.G. and Gibson T.J. (1994). "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice." Nucleic Acids Research 22, 4673-4680.

Higgins D.G., Bleasby A.J. and Fuchs R. (1992). "CLUSTAL V: improved software for multiple sequence alignment." Comput. Appl. Biosci. 8, 189-191.

Higgins D.G. and Sharp P.M. (1989). "Fast and sensitive multiple sequence alignments on a microcomputer." Comput. Appl. Biosci. 5, 151-153.

Printed: April 22, 1996 15:55 (1162)