Eprotdist

Go back to top

EPROTDIST*

EPROTDIST*

FUNCTION

EProtDist computes a distance measure for protein sequences, using maximum likelihood estimates based on the Dayhoff PAM matrix, Kimura's 1983 approximation to it, or a model based on the genetic code plus a constraint on changing to a different category of amino acid.

EProtDist is a modified version of the PHYLIP version 3.572c's PROTDIST, by Joseph Felsenstein, with command line control added.

DESCRIPTION

EProtPars uses protein sequences to compute a distance matrix, under three different models of amino acid replacement. These models of amino acid substitution are one which is based on the PAM matrixes of Margaret Dayhoff, one due to Kimura (1983) which approximates it based simply on the fraction of similar amino acids, and one based on a model in which the amino acids are divided up into groups, with change occurring based on the genetic code but with greater difficulty of changing between groups. The program correctly takes into account a variety of sequence ambiguities. The distance for each pair of species estimates the total branch length between the two species, and can be used in the distance matrix programs EFitch, EKitsch or ENeighbor. This is an alternative to use of the sequence data itself in the parsimony program EProtPars.

The input file for EProtDist can be an MSF or PHYLIP formated file.

AUTHOR

This program was originally written by Joe Felsenstein (E-mail:joe@evolution.genetics.washington.edu. Post: Department of Genetics, University of Washington, Box 357360, Seattle, Washington 98195-7360, U.S.A.)

This version was modified for inclusion in EGCG by Maria Jesus Martin (E-mail: martin@ebi.ac.uk; Post: EMBL Outstation Hinxton, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SQ or E-mail: martin@tdi.es; Post: Tecnologia para Diagnostico e Investigacion, Condes de Torreanaz 5, 28028 Madrid).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).

EXAMPLE

Here is a session with EProtDist

  
  
  % eprotdist -options
  
   EPROTDIST of what sequences file ?  fos.msf{*}
  
   What should I call the output file (* fos.eprotdist *) ?
  
   Distance methods :
  
     D)ayhoff PAM matrix.
     K)imura formula.
     C)ategories model.
  
   Choose the method to use  (* D *) ?
  
   Print out the data at start of run (* No *) ?
  
  Computing distances:
    FOSAVINK
    FOSCHICK    .
    FOSMOUSE    ..
    FOSRAT      ...
    FOSHUMAN    ....
    FOSXMSVFR   .....
    FOSMSVFB    ......
    FOSBMOUSE   .......
    FOSBSTAEP   ........
  
  Output written to fos.eprotdist
  
  %

INPUT FILE

The input file for EProtDist is either GCG MSF protein sequence file or a PHYLIP protein sequence file.

In the PHYLIP format the first line contains the number of species and the number of nucleic acid positions (counting any stop codons that you want to include), separated by blanks. Next come the species data. Each sequence starts on a new line, has a ten-character species name that must be blank-filled to be of that length, followed immediately by the species data in the one-letter code. The sequences must either be in the "interleaved" or "sequential" formats. In the interleaved format, some lines giving the first part of each of the sequences, then lines giving the next part of each, and so on. Thus the sequences mightlook like this:

  
  
   6 50
  APHAVP1C    ---------- ---------- -------TAC
  APHAVP1D    ---------- ---------- -------TAC
  APHAVP1A    ------GTCA CCACCACCNN GGAGAACTAC
  APHAVP1B    ---------- ---------- ----------
  APHAVP1E    ---------- ---------- ----------
  APHAVP1F    GACCCTGTCA CCACCACCGT GGAGAACTAC
  
  GGCGGTCAGA CACAAACCCA
  GGCGGTCAGA CACAAACCCA
  GGCGGTGAGA CACAAACCCA
  ------GAGA CACAAACCCA
  ---------- ---AAACCCA
  GGCGGTGAGA CACAAACCCA

The "sequential" format has all of the data for the first species, then all of the characters for the next species, and so on.

Here is an example with a sequential PHYLIP format:

  
  
   6 50
  APHAVP1C    ---------- ---------- -------TAC
  GGCGGTCAGA CACAAACCCA
  APHAVP1D    ---------- ---------- -------TAC
  GGCGGTCAGA CACAAACCCA
  APHAVP1A    ------GTCA CCACCACCNN GGAGAACTAC
  GGCGGTGAGA CACAAACCCA
  APHAVP1B    ---------- ---------- ----------
  ------GAGA CACAAACCCA
  APHAVP1E    ---------- ---------- ----------
  ---------- ---AAACCCA
  APHAVP1F    GACCCTGTCA CCACCACCGT GGAGAACTAC
  GGCGGTGAGA CACAAACCCA

For more information about the Phylip format, please see the "main.doc" file from PHYLIP (Phylogeny Inference Package) distribution Version 3.57c by Joseph Felsenstein, available by anonymous FTP at evolution.genetics.washington.edu in directory pub/phylip.

OUTPUT FILE

The output file contains on its first line the number of species. The distance matrix is then printed in standard form, with each species starting on a new line with the species name, followed by the distances to the species in order. These continue onto a new line after every nine distances. The distance matrix is square with zero distances on the diagonal. In general the format of the distance matrix is such that it can serve as input to any of the distance matrix programs.

Here is the output file from the example session.

  
  
 9
  FOSAVINK      0.00000 0.00000 0.24002 0.24380 0.22374 0.32911 0.42018 0.71864 2.77796
  FOSCHICK      0.00000 0.00000 0.23111 0.23436 0.21710 0.30266 0.38712 0.74135 2.77796
  FOSMOUSE      0.24002 0.23111 0.00000 0.02962 0.06026 0.04391 0.11485 0.70792 2.75141
  FOSRAT        0.24380 0.23436 0.02962 0.00000 0.05493 0.07287 0.14781 0.71646 2.67681
  FOSHUMAN      0.22374 0.21710 0.06026 0.05493 0.00000 0.09834 0.18213 0.73742 2.63869
  FOSXMSVFR     0.32911 0.30266 0.04391 0.07287 0.09834 0.00000 0.05646 0.71709 2.87444
  FOSMSVFB      0.42018 0.38712 0.11485 0.14781 0.18213 0.05646 0.00000 0.88217 2.80303
  FOSBMOUSE     0.71864 0.74135 0.70792 0.71646 0.73742 0.71709 0.88217 0.00000 3.13469
  FOSBSTAEP     2.77796 2.77796 2.75141 2.67681 2.63869 2.87444 2.80303 3.13469 0.00000

RELATED PROGRAMS

PileUp creates a multiple sequence alignment from a group of related sequences using progressive pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment. LineUp creates and edits multiple sequence alignments. Pretty displays multiple sequence alignments. Distances creates a table of the pairwise distances within a group of aligned sequences. GrowTree creates a phylogenetic tree from a distance matrix created by Distances using either the UPGMA or neighbor-joining method. You can create a text or graphics output file.

ToPhylip writes GCG sequences into a single file in PHYLIP format. ESeqBoot produces multiple data sets from a molecular sequence data set by bootstrap, jackknife, or permutation resampling. EDnaDist computes a distance matrix from nucleic acid sequences, under four different models of nucleotide substitution (Jukes and Cantor (1969), Kimura (1980), Jin and Nei(1990) and a model of maximum likelihood (Felsenstein, 1981)). EFitch estimates phylogenies from distance matrix data under the "additive tree model" according to which the distances are expected to equal the sums of branch lengths between the species. EKitsch estimates phylogenies from distance matrix data under the "ultrametric" model which is the same as the additive tree model except that an evolutionary clock is assumed. ENeighbor estimates phylogenies from distance matrix data using the Neighbor-Joining method or the UPGMA method of clustering. EDnaPars estimates phylogenies from nucleic acid sequences using the parsimony method. EProtPars estimates phylogenies from amino acid sequences using the parsimony method. EDnaML estimates phylogenies from nucleotide sequences by maximum likelihood. EDnaMLK does the same as EDnaML but assumes a molecular clock. EConsense computes consensus trees by the majority-rule consensus tree. It can be used as the final step in doing bootstrap analyses.

ALGORITHM

The distance methods available are three:

1. The Dayhoff PAM matrix. This uses Dayhoff's PAM 001 matrix from Dayhoff (1979). The PAM model is an empirical one that scales probabilities of change from one amino acid to another in terms of a unit which is an expected 1% change between two amino acid sequences. The PAM 001 matrix is used to make a transition probability matrix which allows prediction of the probability of changing from any one amino acid to any other, and also predicts equilibrium amino acid composition. The program assumes that these probabilities are correct and bases its computations of distance on them. The distance that is computed is scaled in units of expected fraction of amino acids changed.

2. Kimura's distance. This is a rough-and-ready distance formula for approximating PAM distance by simply measuring the fraction of amino acids, p, that differs between two sequences and computing the distance as (Kimura, 1983) D = -log(e) (1 - p - 0.2 p)(2). This is very quick to do but has some obvious limitations. It does not take into account which amino acids differ or to what amino acids they change, so some information is lost. The units of the distance measure are fraction of amino acids differing, as also in the case of the PAM distance. If the fraction of amino acids differing gets larger than 0.8541 the distance becomes infinite.

3. The Categories distance.. Similar to Kimura's 2-parameter model, with the exception that some changes of amino acids are less likely than others (adapted by Felsenstein). The amino acids are grouped into a series of categories. Any base change that does not change which category the amino acid is in is allowed, but if an amino acid changes category this is allowed only a certain fraction of the time. The fraction is called the "ease" and there is a parameter for it, which is 1.0 when all changes are allowed and near 0.0 when changes between categories are nearly impossible.

In this option, the user can select the Transition/Transversion ratio, which of several genetic codes to use, and which categorization of amino acids to use. There are three of them, a somewhat random sample:

(a) The George-Hunt-Barker (1988) classification of amino acids.

(b) A Ben Hall's classification.

Interestingly enough, all of them are consistent with the same, linear, ordering of amino acids, which they divide up in different ways. For the Categories model the default is the George/Hunt/Barker classification with the "ease" parameter set to 0.457 which is approximately the value implied by the empirical rates in the Dayhoff PAM matrix.

The method uses Kimura's (1980) 2-parameter model of DNA change. The Kimura "2-parameter" model allows for a difference between transition and transversion rates. Its transition probability matrix for a short interval of time is:

  
  
           To:     A        G        C        T
                ---------------------------------
            A  | 1-a-2b     a         b       b
    From:   G  |   a      1-a-2b      b       b
            C  |   b        b       1-a-2b    a
            T  |   b        b         a     1-a-2b

where a is u dt, the product of the rate of transitions per unit time and dt is the length dt of the time interval, and b is v dt, the product of half the rate of transversions (i.e., the rate of a specific transversion) and the length dt of the time interval.

Each distance that is calculated is an estimate, from that particular pair of species, of the divergence time between those two species. The Kimura distance is straightforward to compute. The other two are considerably slower, as they look at all positions, and find that distance which makes the likelihood highest. This likelihood is in effect the length of the internal branch in a two-species tree that connects these two species. Its likelihood is just the product, under the model, of the probabilities of each position having the (one or) two amino acids that are actually found. This is fairly slow to compute.

The computation proceeds from an eigenanalysis (spectral decomposition) of the transition probability matrix. In the case of the PAM 001 matrix the eigenvalues and eigenvectors are precomputed and are hard-coded into the program in over 400 statements. In the case of the Categories model the program computes the eigenvalues and eigenvectors itself, which will add a delay. But the delay is independent of the number of species as the calculation is done only once, at the outset.

The actual algorithm for estimating the distance is in both cases a bisection algorithm which tries to find the point at which the derivative of the likelihood is zero. Some of the kinds of ambiguous amino acids like "glx" are correctly taken into account. However, gaps are treated as if they are unkown nucleotides, which means those positions get dropped from that particular comparison. However, they are not dropped from the whole analysis. You need not eliminate regions containing gaps, as long as you are reasonably sure of the alignment there.

Note that the program is looking at all positions, including those that have not changed at all. It is important not to restrict attention to some positions based on whether or not they have changed; doing that would bias the distances by making them too large, and that in turn would cause the distances to misinterpret the meaning of those positions that had changed.

CONSIDERATIONS

In the Categories model of substitution, the distances printed on the output file are scaled in terms of expected numbers of substitutions, counting both transitions and transversions but not replacements of a base by itself, and scaled so that the average rate of change is set to 1.0. For the Dayhoff PAM and Kimura models the distance are scaled in terms of the expected numbers of amino acid substitutions per site. Of course, when a branch is twice as long this does not mean that there will be twice as much net change expected along it, since some of the changes may occur in the same site and overlie or even reverse each other. The branch lengths estimates here are in terms of the expected underlying numbers of changes. That means that a branch of length 0.26 is 26 times as long as one which would show a 1% difference between the protein (or nucleotide) sequences at the beginning and end of the branch. But we would not expect the sequences at the beginning and end of the branch to be 26% different, as there would be some overlaying of changes.

One problem that can arise is that two or more of the species can be so dissimilar that the distance between them would have to be infinite, as the likelihood rises indefinitely as the estimated divergence time increases. For example, with the Kimura model, if the two sequences differ in 85.41% or more of their positions then the estimate of divergence time would be infinite. Since there is no way to represent an infinite distance in the output file, the program regards this as an error, issues a warning message indicating which pair of species are causing the problem, and computes a distance of -1.0.

COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

  
  
  Minimum Syntax: % eprotdist [-INfile=]file.msf{*} -default
  
  
  EPROTDIST does not support complete command-line control.
  
  

  Prompted Parameters:
  
  [-OUTfile=]file.eprotdist the output file.
  -INTERLeaved              interleaved PHYLIP formated input file
                             (only for PHYLIP formated input file).
  -NOINTERLeaved            sequencial PHYLIP formated input file
                             (only for PHYLIP formated input file).
  -MENu=D                   menu for the distance method, where:
                             D)ayhoff PAM matrix.
                             K)imura formula.
                             C)ategories model.
  -CODe=U                   menu for the genetic code, where:
                             U)niversal.
                             M)itochondrial.
                             V)ertebrate mitochondrial.
                             F)ly mitochondrial.
                             Y)east mitochondrial.
                             (only for categories model. MENu=C)
  -CATEGories=G             menu for the categorizations of amino acids,
                             where :
                             G)eorge/Hunt/Barker: (Cys),(Met Val Leu Ileu),
                              (Gly Ala Ser Thr  Pro).
                             C)hemical: (Cys Met),(Val Leu Ileu
                               Gly Ala Ser Thr),(Pro).
                             H)all: (Cys),(Met Val Leu Ileu),
                               (Gly Ala Ser Thr),(Pro).
                             (only for categories model. MENu=C)
  -EASE=0.457               Ease of changing category of amino acid.
                             (only for categories model. MENu=C)
  -TTRATio=2.0              Transition/transversion ratio. It can be any
                             number from 0.5 upwards.
                             (only for categories model. MENu=C)
  
  
  Optional Parameters:
  
  -OPTions                  makes the program ask for further specific
                             options.
  -SETS=2                   multiple data sets.
                             (only for PHYLIP formated input file).
  -SHOWData                 print data in the output file.

OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

-OPTions

makes the program ask for all specific options.

-SETS=2

tells the program how many data sets there are from the input file. This is possible only with a PHYLIP formated input file.

-SHOWData

prints the sequences data on the output file before the distance matrix.

Below is an example of the output file when using these options.

  
  
  Name            Sequences
  ----            ---------
  
  Alpha        AACGTGGCCA CAT
  Beta         ..G..C.... ..C
  Gamma        C.GT.C.... ..A
  Delta        G.GA.TT..G .C.
  Epsilon      G.GA.CT..G .CC
  
  
  
 5
  Alpha          0.00000  0.47285  0.88304  1.29841  2.12269
  Beta           0.47285  0.00000  0.45192  1.34185  0.84009
  Gamma          0.88304  0.45192  0.00000  1.30693  1.21582
  Delta          1.29841  1.34185  1.30693  0.00000  0.27536
  Epsilon        2.12269  0.84009  1.21582  0.27536  0.00000

REFERENCES

Conn, E. E. and P. K. Stumpf. 1963. Outlines of Biochemistry. John Wiley and Sons, New York.

Dayhoff, M. O. 1979. Atlas of Protein Sequence and Structure, Volume 5, Supplement 3, 1978. National Biomedical Research Foundation, Washington, D.C.

George, D. G., L. T. Hunt, and W. C. Barker. 1988. Current methods in sequence comparison and analysis. pp. 127-149 in Macromolecular Sequencing and Synthesis, ed. D. H. Schlesinger. Alan R. Liss, New York.

Kimura, M. 1980. A simple model for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution 16: 111-120.

Kimura, M. 1983. The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge.

For further information please refer to the "Main.doc" and "protdist.doc" files from the PHYLIP (Phylogeny Inference Package) distribution Version 3.57c by Joseph Felsenstein (available by anonymous FTP at evolution.genetics.washington.edu in directory pub/phylip).

Printed: November 15, 1996 11:46 (1162)