Ednadist

Go back to top

EDNADIST*

EDNADIST*

FUNCTION

EDnaDist computes a distance matrix from nucleic acid sequences, under four different models of nucleotide substitution (Jukes and Cantor (1969), Kimura (1980), Jin and Nei(1990) and a model of maximum likelihood (Felsenstein, 1981)).

EDnaDist is a modified version of the PHYLIP version 3.572c's DNADIST, by Joseph Felsenstein, with command line control added.

DESCRIPTION

EDnaDist uses nucleotide sequences to compute a distance matrix, under four different models of nucleotide substitution. These models are those of Jukes and Cantor (1969), Kimura (1980), a modification of Kimura's model to allow for unequal rates of substitution at different sites by Jin and Nei (1990) and a model of maximum likelihood by Felsenstein (1981). The distance for each pair of species estimates the total branch length between the two species, and can be used in the distance matrix programs EFitch, EKitsch or ENeighbor. This is an alternative to use of the sequence data itself in the maximum likelihood program EDnaML or the parsimony program EDnaPars.

The input file for EDnaDist can be an MSF or PHYLIP formated file.

AUTHOR

This program was originally written by Joe Felsenstein (E-mail:joe@evolution.genetics.washington.edu. Post: Department of Genetics, University of Washington, Box 357360, Seattle, Washington 98195-7360, U.S.A.)

This version was modified for inclusion in EGCG by Maria Jesus Martin (E-mail: martin@ebi.ac.uk; Post: EMBL Outstation Hinxton, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SQ or E-mail: martin@tdi.es; Post: Tecnologia para Diagnostico e Investigacion, Condes de Torreanaz 5, 28028 Madrid).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).

EXAMPLE

Here is a session with EDnaDist

  
  
  % ednadist -options
  
   EDNADIST of what sequences file ?  fmdv.msf{*}
  
   What should I call the output file (* fmdv.ednadist *) ?
  
   Distance methods :
  
     1) Kimura 2-parameter distance.
     2) Jin and Nei distance.
     3) Maximum Likelihood distance.
     4) Jukes-Cantor distance.
  
   Choose the method to use  (* 1 *) ?
  
   Transition/transversion ratio (* 2.0 *) ?
  
   Form of distance matrix:
  
     S)quare.
     L)ower-triangular.
  
   Form (* S *) ?
  
   Print out the data at start of run (* No *) ?
  
  Distances calculated for species
 APHAVP1C     .....
 APHAVP1D     ....
 APHAVP1A     ...
 APHAVP1B     ..
 APHAVP1E     .
 APHAVP1F
  
  Distances written to fmdv.ednadist
  
  %

INPUT FILE

The input file for EDnaDist is either GCG MSF protein sequence file or a PHYLIP protein sequence file.

In the PHYLIP format the first line contains the number of species and the number of nucleic acid positions (counting any stop codons that you want to include), separated by blanks. Next come the species data. Each sequence starts on a new line, has a ten-character species name that must be blank-filled to be of that length, followed immediately by the species data in the one-letter code. The sequences must either be in the "interleaved" or "sequential" formats. In the interleaved format, some lines giving the first part of each of the sequences, then lines giving the next part of each, and so on. Thus the sequences mightlook like this:

  
  
   6 50
  APHAVP1C    ---------- ---------- -------TAC
  APHAVP1D    ---------- ---------- -------TAC
  APHAVP1A    ------GTCA CCACCACCNN GGAGAACTAC
  APHAVP1B    ---------- ---------- ----------
  APHAVP1E    ---------- ---------- ----------
  APHAVP1F    GACCCTGTCA CCACCACCGT GGAGAACTAC
  
  GGCGGTCAGA CACAAACCCA
  GGCGGTCAGA CACAAACCCA
  GGCGGTGAGA CACAAACCCA
  ------GAGA CACAAACCCA
  ---------- ---AAACCCA
  GGCGGTGAGA CACAAACCCA

The "sequential" format has all of the data for the first species, then all of the characters for the next species, and so on.

Here is an example with a sequential PHYLIP format:

  
  
   6 50
  APHAVP1C    ---------- ---------- -------TAC
  GGCGGTCAGA CACAAACCCA
  APHAVP1D    ---------- ---------- -------TAC
  GGCGGTCAGA CACAAACCCA
  APHAVP1A    ------GTCA CCACCACCNN GGAGAACTAC
  GGCGGTGAGA CACAAACCCA
  APHAVP1B    ---------- ---------- ----------
  ------GAGA CACAAACCCA
  APHAVP1E    ---------- ---------- ----------
  ---------- ---AAACCCA
  APHAVP1F    GACCCTGTCA CCACCACCGT GGAGAACTAC
  GGCGGTGAGA CACAAACCCA

For more information about the Phylip format, please see the "main.doc" file from PHYLIP (Phylogeny Inference Package) distribution Version 3.57c by Joseph Felsenstein, available by anonymous FTP at evolution.genetics.washington.edu in directory pub/phylip.

OUTPUT FILE

The output file contains the number of species on its first line. The distance matrix is then printed in standard form, with each species starting on a new line with the species name, followed by the distances to the species in order. These continue onto a new line after every nine distances. In lower triangular form only the distances to the other species that precede each species are printed. The format of the distance matrix is such that it can serve as input to any of the distance matrix programs.

Here is the output file from the example session.

  
  
 6
  APHAVP1C    0.0000  0.0083  0.0392  0.0348  0.0324  0.0353
  APHAVP1D    0.0083  0.0000  0.0412  0.0368  0.0344  0.0406
  APHAVP1A    0.0392  0.0412  0.0000  0.0034  0.0106  0.0178
  APHAVP1B    0.0348  0.0368  0.0034  0.0000  0.0071  0.0189
  APHAVP1E    0.0324  0.0344  0.0106  0.0071  0.0000  0.0232
  APHAVP1F    0.0353  0.0406  0.0178  0.0189  0.0232  0.0000

RELATED PROGRAMS

PileUp creates a multiple sequence alignment from a group of related sequences using progressive pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment. LineUp creates and edits multiple sequence alignments. Pretty displays multiple sequence alignments. Distances creates a table of the pairwise distances within a group of aligned sequences. GrowTree creates a phylogenetic tree from a distance matrix created by Distances using either the UPGMA or neighbor-joining method. You can create a text or graphics output file.

ToPhylip writes GCG sequences into a single file in PHYLIP format. ESeqBoot produces multiple data sets from a molecular sequence data set by bootstrap, jackknife, or permutation resampling. EProtDist computes a distance measure for protein sequences, using maximum likelihood estimates based on the Dayhoff PAM matrix, Kimura's 1983 approximation to it, or a model based on the genetic code plus a constraint on changing to a different category of amino acid. EFitch estimates phylogenies from distance matrix data under the "additive tree model" according to which the distances are expected to equal the sums of branch lengths between the species. EKitsch estimates phylogenies from distance matrix data under the "ultrametric" model which is the same as the additive tree model except that an evolutionary clock is assumed. ENeighbor estimates phylogenies from distance matrix data using the Neighbor-Joining method or the UPGMA method of clustering. EDnaPars estimates phylogenies from nucleic acid sequences using the parsimony method. EProtPars estimates phylogenies from amino acid sequences using the parsimony method. EDnaML estimates phylogenies from nucleotide sequences by maximum likelihood. EDnaMLK does the same as EDnaML but assumes a molecular clock. EConsense computes consensus trees by the majority-rule consensus tree. It can be used as the final step in doing bootstrap analyses.

ALGORITHM

The distance methods available are four:

1. Jukes and Cantor's (1969) model assumes that there is independent change at all sites, with equal probability. Whether a base changes is independent of its identity, and when it changes there is an equal probability of ending up with each of the other three bases. Thus the transition probability matrix (this is a technical term from probability theory and has nothing to do with transitions as opposed to transversions) for a short period of time dt is:

  
  
           To:    A        G        C        T
                ---------------------------------
            A  | 1-3a      a         a       a
    From:   G  |  a       1-3a       a       a
            C  |  a        a        1-3a     a
            T  |  a        a         a      1-3a

where a is u dt, the product of the rate of substitution per unit time (u) and the length dt of the time interval. For longer periods of time this implies that the probability that two sequences will differ at a given site is:

p = 3/4 (1 - e( - 4/3 u t))

and hence that if we observe p, we can compute an estimate of the branch length ut by inverting this to get

ut = -3/4 log(e) ( 1 - 4/3 p )

2. The Kimura "2-parameter" model is almost as symmetric as this, but allows for a difference between transition and transversion rates. Its transition probability matrix for a short interval of time is:

  
  
           To:     A        G        C        T
                ---------------------------------
            A  | 1-a-2b     a         b       b
    From:   G  |   a      1-a-2b      b       b
            C  |   b        b       1-a-2b    a
            T  |   b        b         a     1-a-2b

where a is u dt, the product of the rate of transitions per unit time and dt is the length dt of the time interval, and b is v dt, the product of half the rate of transversions (i.e., the rate of a specific transversion) and the length dt of the time interval.

3. The third model used is a model incorporating different rates of transition and transversion, but also allowing for different frequencies of the four nucleotides. It is the model which is used in EDnaML, the maximum likelihood nucleotide sequence phylogenies program . You will find the model described in the document for that program. The transition probabilities for this model are also given by Kishino and Hasegawa (1989).

These three models are closely related. The DNAML model reduces to Kimura's two-parameter model if we assume that the equilibrium frequencies of the four bases are equal. The Jukes-Cantor model in turn is a special case of the Kimura 2-parameter model where a = b. Thus each model is a special case of the ones that follow it, Jukes-Cantor being a special case of both of the others.

4. The Jin and Nei (1990) distance uses Kimura's model of base substitution, but assumes that the rate of substitution varies from site to site according to a gamma distribution, with a coefficient of variation that is specified by the user.

Each distance that is calculated is an estimate, from that particular pair of species, of the divergence time between those two species. For the Jukes- Cantor model, the estimate is computed using the formula for ut given above, as long as the nucleotide symbols in the two sequences are all either A, C, G, T, U, N, X, ?, or - (the latter four indicate a deletion or an unknown nucleotide). This estimate is a maximum likelihood estimate for that model. For the Kimura 2-parameter model, with only these nucleotide symbols, formulas special to that estimate are also computed. These are also, in effect, computing the maximum likelihood estimate for that model. In the Kimura case it depends on the observed sequences only through the sequence length and the observed number of transition and transversion differences between those two sequences. The calculation in that case is a maximum likelihood estimate and will differ somewhat from the estimate obtained from the formulas in Kimura's original paper. That formula was also a maximum likelihood estimate, but with the transition/transversion ratio estimated empirically, separately for each pair of sequences. In the present case, one overall preset transition/transversion ratio is used which makes the computations harder but achieves greater consistency between different comparisons.

For the DNAML model, or for any of the models where one or both sequences contain at least one of the other ambiguity codes such as Y, R, etc., a maximum likelihood calculation is also done using code which was originally written for DNAML. Its disadvantage is that it is slow. The resulting distance is in effect a maximum likelihood estimate of the divergence time (total branch length between) the two sequences.

Note that there is an assumption that we are looking at all sites, including those that have not changed at all. It is important not to restrict attention to some sites based on whether or not they have changed; doing that would bias the distances by making them too large, and that in turn would cause the distances to misinterpret the meaning of those sites that had changed.

A major innovation in this program is that, for all of these distance methods, the program allows us to specify that "third position" bases have a different rate of substitution than first and second positions, that introns have a different rate than exons, and so on. The Categories option allows us to make up to 9 categories of sites and specify different rates of change for them. Note that this Categories option is different from the one used in EDnaML and EDnaMLK where you do not have to specify which sites are in which categories.

CONSIDERATIONS

Phylip format

When using a PHYLIP formated input file, EDnaDist show some extra options.

The "categories" option specifies the relative rates of substitution at different sites. The sites are organized into up to nine categories. You are supposed to specify the relative rates of substitution in these categories. The user can provide the number of categories (up to a maximum of 9) aswering the question Number of categories of substitution rates ? which appears when using the -OPTions command-line option. Then the user has to enter the relative rates of change in the categories, as nonnegative real numbers. The default is one category with rate 1.0. IT IS IMPORTANT to note that this option requires one piece of information in the input file, which associates sites with categories. This is one or more lines, which are placed after the initial line of the input file, and also after the lines containing the Weights, if any, but before the sequences. It consists of a line whose first characters are ignored, until the maximum length of a species name has been reached (it is therefore convenient, if species names are a maximum of ten characters as in the program as distributed, to put CATEGORIES in the first ten characters of that line, just to remind yourself what it is). The line then contains single digits (1 through 9) indicating which category each site is in. The information can continue to a new line anytime in the middle of these digits. For example the line may read:

  
  CATEGORIES 5555555555 5123123123 1231231231 2344444444 4441231235 5555

(that is an example imagining five categories for the three codon positions, intron positions, and flanking sequence positions). A site may in effect be dropped from the analysis by placing it in a category which has an extremely high rate of expected change.

If you have a "multiple data sets" input file, answer 'yes' to the 'Analyze multiple data sets ?' question or use the -SETS= n command-line option (where n is the number of data sets). The data sets must have the same format as the first data set. Here is an (very small) input file with two five-species data sets:

5 6 Alpha CCACCA Beta CCAAAA Gamma CAACCA Delta AACAAC Epsilon AACCCA 5 6 Alpha CACACA Beta CCAACC Gamma CAACAC Delta GCCTGG Epsilon TGCAAT

Using the program ESeqBoot you can make multiple data sets by bootstrapping. A distance matrix can be produced for all of these using this option.

Output file

When answering 'yes' to the 'Print out the data at start of run ?' or using -SHOWData command-line option, the output file will precede the data by more complete information on the input and the options selected. The output file begins by giving the number of species and the number of characters, and the identity of the distance measure that is being used. If the "categories" option is used a table of the relative rates of expected substitution at each category of sites is printed, and a listing of the categories each site is in. There will then follow the equilibrium frequencies of the four bases. If the Jukes-Cantor or Kimura distances are used, these will necessarily be 0.25 : 0.25 : 0.25 : 0.25. The output then shows the transition/transversion ratio that was specified or used by default. In the case of the Jukes-Cantor distance this will always be 0.5. The transition-transversion parameter (as opposed to the ratio) is also printed out. This is used within the program and can be ignored. There then follow the sequence data, with the sequence bases printed in groups of ten bases.

The distances printed out are scaled in terms of expected numbers of substitutions, counting both transitions and transversions but not replacements of a base by itself, and scaled so that the average rate of change, averaged over all sites analyzed, is set to 1.0 if there are multiple categories of sites. This means that whether or not there are multiple categories of sites, the expected fraction of change for very small branches is equal to the branch length. Of course, when a branch is twice as long this does not mean that there will be twice as much net change expected along it, since some of the changes may occur in the same site and overlie or even reverse each other. The branch lengths estimates here are in terms of the expected underlying numbers of changes. That means that a branch of length 0.26 is 26 times as long as one which would show a 1% difference between the nucleotide sequences at the beginning and end of the branch. But we would not expect the sequences at the beginning and end of the branch to be 26% different, as there would be some overlaying of changes.

One problem that can arise is that two or more of the species can be so dissimilar that the distance between them would have to be infinite, as the likelihood rises indefinitely as the estimated divergence time increases. For example, with the Jukes-Cantor model, if the two sequences differ in 75% or more of their positions then the estimate of divergence time would be infinite. Since there is no way to represent an infinite distance in the output file, the program regards this as an error, issues an error message indicating which pair of species are causing the problem, and stops. It might be that, had it continued running, it would have also run into the same problem with other pairs of species. If the Kimura distance is being used there may be no error message; the program may simply give a large distance value (it is iterating towards infinity and the value is just where the iteration stopped). Likewise some maximum likelihood estimates may also become large for the same reason (the sequences showing more divergence than is expected even with infinite branch length).

Others considerations

When the Maximum Likelihood distance is selected and you are using -OPTions on the command-line, the "base frequencies" option appears. This distance requires that the program be provided with the equilibrium frequencies of the four bases A, C, G, and T (or U). The default use the empirical frequencies of the bases, observed in the input sequences, as the base frequencies. These empirical frequencies are not really the maximum likelihood estimates of the base frequencies, but they will often be close to those values (what they are is maximum likelihood estimates under a "star" or "explosion" phylogeny). If you answer 'no' to the Use empirical base frequencies? you will be prompted for the frequencies of the four bases. These must add to 1.

COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

Minimum Syntax: % ednadist [-INfile=]dna.msf{*} -default

EDNADIST does not support complete command-line control.

Prompted Parameters: [-OUTfile=]dna.ednadist output file. -INTERLeaved interleaved PHYLIP formated input file (only for PHYLIP formated input file). -NOINTERLeaved sequencial PHYLIP formated input file (only for PHYLIP formated input file). -MENu=1 menu for distance method, where: 1) Kimura 2-parameter distance. 2) Jin and Nei distance. 3) Maximum Likelihood distance. 4) Jukes-Cantor distance. -TTRATio=2.0 Transition/transversion ratio. -FORM=S Form of distance matrix. S)quare or L)ower-triangular. Optional Parameters: -OPTions makes the program ask for further specific options. -SETS=1 multiple data sets. (only for PHYLIP formated input file). -SHOWData print data in the output file.

OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

-OPTions

makes the program ask for all specific options.

-SETS=2

tells the program how many data sets there are from the input file. This is possible only for PHYLIP formated input file.

-SHOWData

prints the sequences data on the output file before the distance matrix.

Below is an example of the output file when using these options.

  
  
  Ednadist of fmdv.phy. September  5, 1996 16:51
  
    5 species,   13 sites
  
    Kimura 2-parameter Distance
  
   Base Frequencies:
 A       0.25000
 C       0.25000
 G       0.25000
T(U)     0.25000
  
  Transition/transversion ratio =   2.000000
  
  (Transition/transversion parameter =   1.500000)
  
  
  Name            Sequences
  ----            ---------
  
  Alpha        AACGTGGCCA CAT
  Beta         ..G..C.... ..C
  Gamma        C.GT.C.... ..A
  Delta        G.GA.TT..G .C.
  Epsilon      G.GA.CT..G .CC
  
  
 5
  Alpha       0.0000  0.2997  0.7820  1.1716  1.4617
  Beta        0.2997  0.0000  0.3219  0.8997  0.5653
  Gamma       0.7820  0.3219  0.0000  1.4481  1.0726
  Delta       1.1716  0.8997  1.4481  0.0000  0.1679
  Epsilon     1.4617  0.5653  1.0726  0.1679  0.0000

REFERENCES

Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Molecular Evolution 17: 368-376.

Jin, L. and M. Nei. 1990. Limitations of the evolutionary parsimony method of phylogenetic analysis. Molecular Biology and Evolution 7: 82-102.

Jukes, T. H. and C. R. Cantor. 1969. Evolution of protein molecules. pp. 21-132 in Mammalian Protein Metabolism, ed. H. N. Munro. Academic Press, New York.

Kimura, M. 1980. A simple model for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution 16: 111-120.

Kishino, H. and M. Hasegawa. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. Journal of Molecular Evolution 29:170-179.

For further information please refer to the "main.doc" and "dnadist.doc" files from the PHYLIP (Phylogeny Inference Package) distribution Version 3.57c by Joseph Felsenstein (available by anonymous FTP at evolution.genetics.washington.edu in directory pub/phylip).

Printed: November 15, 1996 11:46 (1162)