Ednamlk

Go back to top

EDNAMLK*


FUNCTION

EDnaMLK estimates phylogenies from nucleotide sequences by maximum likelihood assuming a molecular clock.

EDnaMLK is a modified version of the PHYLIP version 3.572c's DNAMLK, by Joseph Felsenstein, with command line control added.


DESCRIPTION

EDnaMLK estimates phylogenies from nucleotide sequences by maximum likelihood under the constraint that the trees estimated must be consistent with a molecular clock. The molecular clock is the assumption that the tips of the tree are all equidistant, in branch length, from its root. This program is fairly slow, and can be expensive to run. The program is indirectly related to EDnaMLK The use of the two programs together permits a likelihood ratio test of the molecular clock hypothesis to be made.

The input file for EDnaMLK can be an MSF or PHYLIP formated file.


AUTHOR

This program was originally written by Joe Felsenstein (E-mail:joe@evolution.genetics.washington.edu. Post: Department of Genetics, University of Washington, Box 357360, Seattle, Washington 98195-7360, U.S.A.)

This version was modified for inclusion in EGCG by Maria Jesus Martin (E-mail: martin@ebi.ac.uk; Post: EMBL Outstation Hinxton, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SQ or E-mail: martin@tdi.es; Post: Tecnologia para Diagnostico e Investigacion, Condes de Torreanaz 5, 28028 Madrid).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).


EXAMPLE

Here is a session with EDnaMLK

  
  
  % ednamlk -options
  
   EDNAMLK of what sequences file ?  fmdv.msf{*}
  
   What should I call the output file (* fmdv.ednamlk *) ?
  
   Transition/transversion ratio (* 2.0 *) ?
  
   Use empirical base frequencies (* Yes *) ?
  
   Number of categories of substitution rates (* 1 *) ?
  
   Global rearrangements (* No *) ?
  
   Randomize input order of sequences  (* No *) ?
  
   Print out the data at start of run (* No *) ?
  
  Adding species:
APHAVP1C
APHAVP1D
APHAVP1A
APHAVP1B
APHAVP1E
APHAVP1F
  
  Output written into fmdv.ednamlk
  
  Tree(s) also written into fmdv.ednamlktrees
  
  %
  


INPUT FILE

The input file for EDnaMLK is either a GCG MSF nucleic acid sequences file or a PHYLIP nucleic acid sequences file.

In the PHYLIP format the first line contains the number of species and the number of nucleic acid positions, separated by blanks. Next come the species data. Each sequence starts on a new line, has a ten-character species name that must be blank-filled to be of that length, followed immediately by the species data in the one-letter code. The sequences must either be in the "interleaved" or "sequential" formats. In the interleaved format, some lines giving the first part of each of the sequences, then lines giving the next part of each, and so on. Thus the sequences mightlook like this:

  
  
   6 50
  APHAVP1C    ---------- ---------- -------TAC
  APHAVP1D    ---------- ---------- -------TAC
  APHAVP1A    ------GTCA CCACCACCNN GGAGAACTAC
  APHAVP1B    ---------- ---------- ----------
  APHAVP1E    ---------- ---------- ----------
  APHAVP1F    GACCCTGTCA CCACCACCGT GGAGAACTAC
  
  GGCGGTCAGA CACAAACCCA
  GGCGGTCAGA CACAAACCCA
  GGCGGTGAGA CACAAACCCA
  ------GAGA CACAAACCCA
  ---------- ---AAACCCA
  GGCGGTGAGA CACAAACCCA
  
  
The "sequential" format has all of the data for the first species, then all of the characters for the next species, and so on. For the PHYLIP formats, there is an option ( Use user trees in input file?) which signals that one or more user-defined "in nested-pairs parenthesis notation" trees are to be provided for evaluation. This "user tree" is supplied in the input file after the species data, with a line containing the number of user-defined trees being defined.

Here is an example with one user-defined tree in a sequential PHYLIP format:

  
  
   6 50
  APHAVP1C    ---------- ---------- -------TAC
  GGCGGTCAGA CACAAACCCA
  APHAVP1D    ---------- ---------- -------TAC
  GGCGGTCAGA CACAAACCCA
  APHAVP1A    ------GTCA CCACCACCNN GGAGAACTAC
  GGCGGTGAGA CACAAACCCA
  APHAVP1B    ---------- ---------- ----------
  ------GAGA CACAAACCCA
  APHAVP1E    ---------- ---------- ----------
  ---------- ---AAACCCA
  APHAVP1F    GACCCTGTCA CCACCACCGT GGAGAACTAC
  GGCGGTGAGA CACAAACCCA
  1
  (APHAVP1D,(APHAVP1F((APHAVP1E,APHAVP1B),APHAVP1A)),APHAVP1C);
  
  

For more information about the Phylip format, please see the "main.doc" file from the PHYLIP (Phylogeny Inference Package) distribution Version 3.57c by Joseph Felsenstein, available by anonymous FTP at evolution.genetics.washington.edu in directory pub/phylip.


OUTPUT FILE

The output from EDnaMLK are two files, one containing an ASCII representation of the most parsimonius tree and another containing the tree in nested-pairs parenthesis notation.

Here is the output file from the example session.

  
  
  Ednamlk Phylogram of fmdv.msf{*}. October  8, 1996 11:11
  
  
     6 Species,  645 Sites
  
  Empirical Base Frequencies:
  
     A       0.24506
     C       0.29693
     G       0.26317
    T(U)     0.19484
  
  Transition/transversion ratio =   2.000000
  
  (Transition/transversion parameter =   1.543784)
  
  
  
  
  
          +APHAVP1E
       +--4
       !  !  +APHAVP1A
    +--5  +--3
    !  !     +APHAVP1B
    !  !
  --2  +APHAVP1F
    !
    !  +APHAVP1D
    +--1
       +APHAVP1C
  
  
  Ln Likelihood = -1114.18126
  
   Ancestor      Node      Node Time     Length
   --------      ----      ---- ----     ------
   root            2
     2             5         0.00873     0.00873
     5             4         0.01588     0.00714
     4          APHAVP1E     0.02009     0.00421
     4             3         0.01802     0.00214
     3          APHAVP1A     0.02009     0.00206
     3          APHAVP1B     0.02009     0.00206
     5          APHAVP1F     0.02009     0.01135
     2             1         0.01574     0.01574
     1          APHAVP1D     0.02009     0.00434
     1          APHAVP1C     0.02009     0.00434
  
  

Here is the output tree file from the example session.

  
  
  

(((APHAVP1E:0.00421,(APHAVP1A:0.00206,APHAVP1B:0.00206):0.00214):0.00714, APHAVP1F:0.01135):0.00873,(APHAVP1D:0.00434,APHAVP1C:0.00434):0.01574);


RELATED PROGRAMS

PileUp creates a multiple sequence alignment from a group of related sequences using progressive pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment. The user should note that this tree is not a phylogenetic tree. LineUp creates and edits multiple sequence alignments. Pretty displays multiple sequence alignments.

ToPhylip writes GCG sequences into a single file in PHYLIP format. Phylip2Tree displays trees computed with one of the PHYLIP-programs or with EProtPars, EDnaPars, EDnaML, EDnaMLK ENeighbor, EFitch and EKitsch, in GCG style. EDnaML estimates phylogenies from nucleotide sequences by maximum likelihood. ESeqBoot produces multiple data sets from a molecular sequence data set by bootstrap, jackknife, or permutation resampling. EProtPars estimates phylogenies from amino acid sequences using the parsimony method. EDnaPars estimates phylogenies from nucleic acid sequences using the parsimony method. EDnaDist computes a distance matrix from nucleic acid sequences, under four different models of nucleotide substitution (Jukes and Cantor (1969), Kimura (1980), Jin and Nei(1990) and a model of maximum likelihood (Felsenstein, 1981)). EProtDist computes a distance measure for protein sequences, using maximum likelihood estimates based on the Dayhoff PAM matrix, Kimura's 1983 approximation to it, or a model based on the genetic code plus a constraint on changing to a different category of amino acid. ENeighbor estimates phylogenies from distance matrix data using the Neighbor-Joining method or the UPGMA method of clustering. EFitch estimates phylogenies from distance matrix data under the "additive tree model" according to which the distances are expected to equal the sums of branch lengths between the species. EKitsch estimates phylogenies from distance matrix data under the "ultrametric" model which is the same as the additive tree model except that an evolutionary clock is assumed. EConsense computes consensus trees by the majority-rule consensus tree. It can be used as the final step in doing bootstrap analyses.


ALGORITHM

EDnaML implements the maximum likelihood method for DNA sequences. Details of the algorithm are described by Felsenstein (1981). The assumptions of the present model are:

1. Each site in the sequence evolves independently.

2. Different lineages evolve independently.

3. Each site undergoes substitution at an expected rate which is chosen from a series of rates (each with a probability of occurrence) which we specify.

4. All relevant sites are included in the sequence, not just those that have changed or those that are "phylogenetically informative".

5. A substitution consists of one of two sorts of events:

a. The first kind of event consists of the replacement of the existing base by a base drawn from a pool of purines or a pool of pyrimidines (depending on whether the base being replaced was a purine or a pyrimidine). It can lead either to no change or to a transition.

b. The second kind of event consists of the replacement of the existing base by a base drawn at random from a pool of bases at known frequencies, independently of the identity of the base which is being replaced. This could lead either to a no change, to a transition or to a transversion.

The ratio of the two purines in the purine replacement pool is the same as their ratio in the overall pool, and similarly for the pyrimidines.

The ratios of transitions to transversions can be set by the user. The substitution process can be diagrammed as follows: Suppose that you specified A, C, G, and T base frequencies of 0.24, 0.28, 0.27, and 0.21.

First kind of event:

1. Determine whether the existing base is a purine or a pyrimidine.

2. Draw from the proper pool:

  
  
   Purine pool:                Pyrimidine pool:
  
  I               I            I               I
  I   0.4706 A    I            I   0.5714 C    I
  I   0.5294 G    I            I   0.4286 T    I
  I (ratio is     I            I (ratio is     I
  I  0.24 : 0.27) I            I  0.28 : 0.21) I
  -----------------            -----------------
  

Second kind of event:

  
Draw from the overall pool:
  
           I                  I
           I      0.24 A      I
           I      0.28 C      I
           I      0.27 G      I
           I      0.21 T      I
           I                  I
            ------------------
  
Note that if the existing base is, say, an A, the first kind of event has a 0.4706 probability of "replacing" it by another A. The second kind of event has a 0.24 chance of replacing it by another A. This rather disconcerting model is used because it has nice mathematical properties that make likelihood calculations far easier. A closely similar, but not precisely identical model having different rates of transitions and transversions has been used by Hasegawa et. al. (1985). The transition probability formulas for the current model were given by Kishino and Hasegawa (1989).

Note the assumption that the program is looking at all sites, including those that have not changed at all. It is important not to restrict attention to some sites based on whether or not they have changed; doing that would bias branch lengths by making them too long, and that in turn would cause the method to misinterpret the meaning of those sites that had changed.

An important new development in this program is the Hidden Markov Chain method of inferring different rates of evolution at different sites. This will be described in a future paper by Felsenstein and Gary Churchill, of Cornell University. It allows to specify to the program that there will be a number of different possible evolutionary rates, what the prior probabilities of occurrence of each is, and what the average length of a patch of sites all having the same rate. The program then computes the likelihood by summing it over all possible assignments of rates to sites, weighting each by its prior probability of occurrence.

For example, if the user specifies that there are three possible rates of evolution, 1.0, 2.4, and 0.0, that the prior probabilities of a site having these rates are 0.4, 0.3, and 0.3, and that the average patch length (number of consecutive sites with the same rate) is 2.0, the program will sum the likelihood over all possibilities, but giving less weight to those that (say) assign all sites to rate 2.4, or that fail to have consecutive sites that have the same rate.

This feature effectively removes the artificial assumption that all sites have the same rate, and also means that we need not know in advance the identities of the sites that have a particular rate of evolution.


CONSIDERATIONS

If you want to use the empirical frequencies of the bases, answer 'yes' to the 'Use empirical base frequencies ?' question. These empirical frequencies are not really the maximum likelihood estimates of the base frequencies, but they will often be close to those values (what they are is maximum likelihood estimates under a "star" or "explosion" phylogeny). If you answer 'no', the program asks for the frequencies of the four bases. These must add to 1.

The user can specify how many categories of substitution rates there will be, and what are the rates and probabilities for each. The user first is asked how many categories there will be (for the moment there is an upper limit of 9). Then the program asks for the rates for each category. These rates are only meaningful relative to each other, so that rates 1.0, 2.0, and 2.4 have the exact same effect as rates 2.0, 4.0, and 4.8. Note that a category can have rate of change 0, so that this allows us to take into account that there may be a category of sites that are invariant. Note that the run time of the program will be proportional to the number of rate categories: twice as many categories means twice as long a run. Finally the program will ask for the probabilities of a random site falling into each of these categories. These probabilities must be nonnegative and sum to 1. Default for the program is one category, with rate 1.0 and probability 1.0 (actually the rate does not matter in that case).

If more than one category is specified, EDnaMLK asks for the rates at adjacent sites correlated. This allows us to specify that we want to assume that sites that have the same rate category are expected to be clustered. The program asks for the value of the average patch length. This is an expected length of patches that have the same rate. If it is 1, the rates of successive sites will be independent. If it is, say, 10.25, then the chance of change to a new rate will be 1/10.25 after every site. However the "new rate" is randomly drawn from the mix of rates, and hence could even be the same. So the actual observed length of patches with the same rate will be somewhat larger than 10.25. Note below that if you choose multiple patches, there will be an estimate in the output file as to which combination of rate categories contributed most to the likelihood.

With these options, the program has the ability to infer different rates at different sites and estimate phylogenies under a more realistic model. Note that Likelihood Ratio Tests can be used to test whether one combination of rates is significantly better than another.

The "global rearrangements" option causes, after the last species is added to the tree, each possible group to be removed and re-added. This improves the result, since the position of every species is reconsidered. It approximately triples the run-time of the program.

If you answer 'yes' to the 'Use user trees in input file ?' question or you use the -USERTree command-line option, EDnaMLK asks 'Use lengths from user trees ?. If you answer 'yes' (or you use -USELength command-line option) , the program take any branch lengths that are in the user tree and simply evaluate the likelihood of that tree, without further altering those branch lengths. This means that if some branches have lengths and others do not, the program will estimate the lengths of those that do not have lengths given in the user tree.

Output file

The output file starts by giving the base frequencies for A, C, G, and T that have been specified. It then prints out the transition/transversion ratio that was specified or used by default. It also uses the base frequencies to compute the actual transition/transversion ratio implied by the parameter.

If the "categories" option is used a table of the relative rates of expected substitution at each category of sites is printed, as well as the probabilities of each of those rates.

There then follow the data sequences, with the base sequences printed in groups of ten bases along the lines. The trees found are printed as a rooted tree topology. The internal nodes are numbered arbitrarily for the sake of identification. The number of trees evaluated so far and the log likelihood of the tree are also given. The branch lengths in the diagram are roughly proportional to the estimated branch lengths, except that very short branches are printed out at least three characters in length so that the connections can be seen.

A table is printed showing the length of each tree segment, and the time (in units of expected nucleotide substitutions per site) of each fork in the tree, measured from the root of the tree. I have not attempted in include code for approximate confidence limits on branch points, as I have done for branch lengths in EDnaML, both because of the extreme crudeness of that test, and because the variation of time for different forks would be highly correlated.

The log likelihood printed out with the final tree can be used to perform various likelihood ratio tests. One can, for example, compare runs with different values of the expected transition/transversion ratio to determine which value is the maximum likelihood estimate, and what is the allowable range of values (using a likelihood ratio test, which you will find described in mathematical statistics books). One could also estimate the base frequencies in the same way. Both of these, particularly the latter, require multiple runs of the program to evaluate different possible values, and this might get expensive.

This program makes possible a legitimate statistical test of the molecular clock. To do such a test, run EDnaML and EDnaMLK on the same data. If the trees obtained are of the same topology (when considered as unrooted), it is legitimate to compare their likelihoods by the likelihood ratio test. In EDnaML the likelihood has been computed by estimating 2n-3 branch lengths, if their are n tips on the tree. In EDnaMLK it has been computed by estimating n-1 branching times (in effect, n-1 branch lengths). The difference in the number of parameters is (2n-3)-(n-1) = n-2. To perform the test take the difference in log likelihoods between the two runs (EDnaML should be the higher of the two, barring numerical iteration difficulties) and double it. Look this up on a chi-square distribution with n-2 degrees of freedom. If the result is significant, the log likelihood has been significantly increased by allowing all 2n-3 branch lengths to be estimated instead of just n-1, and molecular clock may be rejected.

If more than one tree is supplied, and the program is not told to assume autocorrelation between the rates at different sites, the program also performs a statistical test of each of these trees against the one with highest likelihood. This test, due to Kishino and Hasegawa (1989), uses the mean and variance of log-likelihood differences between trees, taken across sites. If the mean is more than 1.96 standard deviations different then the trees are declared significantly different. This use of the empirical variance of log-likelihood differences is more robust and nonparametric than the classical likelihood ratio test, and may to some extent compensate for the any lack of realism in the model underlying this program. The program prints out a table of the log-likelihoods of each tree, the differences of each from the highest one, the variance of that quantity as determined by the log-likelihood differences at individual sites, and a conclusion as to whether that tree is or is not significantly worse than the best one. However the Kishino-Hasegawa-Templeton test is not available if we assume that there is autocorrelation of rates at neighboring sites and is not done in those cases.

The branch lengths printed out are scaled in terms of expected numbers of substitutions, counting both transitions and transversions but not replacements of a base by itself, and scaled so that the average rate of change, averaged over all sites analyzed, is set to 1.0 if there are multiple categories of sites. This means that whether or not there are multiple categories of sites, the expected fraction of change for very small branches is equal to the branch length. Of course, when a branch is twice as long this does not mean that there will be twice as much net change expected along it, since some of the changes occur in the same site and overlie or even reverse each other. The branch lengths estimates here are in terms of the expected underlying numbers of changes. That means that a branch of length 0.26 is 26 times as long as one which would show a 1% difference between the nucleotide sequences at the beginning and end of the branch. But we would not expect the sequences at the beginning and end of the branch to be 26% different, as there would be some overlaying of changes.

Because of limitations of the numerical algorithm, branch length estimates of zero will often print out as small numbers such as 0.00001. If you see a branch length that small, it is really estimated to be of zero length.

Another possible source of confusion is the existence of negative values for the log likelihood. This is not really a problem; the log likelihood is not a probability but the logarithm of a probability. When it is negative it simply means that the corresponding probability is less than one (since we are seeing its logarithm). The log likelihood is maximized by being made more positive: -30.23 is worse than -29.14.

At the end of the output, if you are using multiple rate categories, the program will print a list of what site categories contributed the most to the final likelihood. This combination of rate categories need not have contributed a majority of the likelihood, just a plurality. Still, it will be helpful as a view of where the program infers that the higher and lower rates are. Note that the use in this calculations of the prior probabilities of different rates, and the average patch length, gives this inference a "smoothed" appearance: some other combination of rates might make a greater contribution to the likelihood, but be discounted because it conflicts with this prior information.


COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

  
  
  Minimum Syntax: % ednamlk [-INfile=]file.msf{*} -default
  
  

EDNAMLK does not support complete command-line control.

Prompted Parameters: [-OUTfile=]file.ednaml output file. -TTRATio=2.0 Transition/transversion ratio. -BASEFrequence uses empirical base frequencies. Optional Parameters: -OPTions makes the program ask for further specific options. -USERTree one or more user-defined rooted trees are to be provided for evaluation in the input file. -USELength the tree is evaluated with their lengths fixed. (only for "user tree" option). -GLOBal Do global rearrangements. (not available with the USERTree parameter) -RANDom=3 use a random number generator to choose the input order of species. The seed should be an integer between 1 and 32767. -JUMnumber=10 number of times to restart the process (with different orders of species). -SETS=2 multiple data sets. -SHOWData print data in the output file.


OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

-OPTions

makes the program ask for all specific options.

-USERTree

tells the program that one or more user-defined trees are to be provided for evaluation in the input file.

-USELength

tells the program that it should take any branch lengths that are in the user tree and simply evaluate the likelihood of that tree, without further altering those branch lengths. Available for "user tree" option only.

-GLOBal

global rearrangements. It is an additional stage to the search for the best tree. Each possible subtree is removed from the tree and added back in all possible places. This process continues until all subtrees can be removed and added again without any improvement in the tree. The purpose of this extra rearrangement is to make it less likely that one or more a species gets "stuck" in a suboptimal region of the space of all possible trees. The use of global optimization results in approximately a tripling (3x) of the run-time. It is not available with the "user tree" option.

-RANDom=1

use a random number generator to choose the input order of species. Must be odd.

-JUMnumber=10

causes the program to ask you how many times you want to restart the process. If you answer 10, the program will try ten different orders of species in constructing the trees, and the results printed out will reflect this entire search process (that is, the best trees found among all 10 runs will be printed out, not the best trees from each individual run). Of course this is slow, taking 10 times longer than a single run. The program will print out the best tree found overall.

-SETS=2

tells the program how many data sets there are from the input file.

-SHOWData

prints the sequences data on the output file before the distance matrix.


REFERENCES

Felsenstein, J. 1973. Maximum likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters. Systematic Zoology 22: 240-249.

Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Molecular Evolution 17: 368-376.

Felsenstein, J. and E. Sober. 1986. Parsimony and likelihood: an exchange. Systematic Zoology 35: 617-626.

Hasegawa, M. and T. Yano. 1984. Maximum likelihood method of phylogenetic inference from DNA sequence data. Bulletin of the Biometric Society of Japan No. 5: 1-7.

Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution 22: 160-174.

Kishino, H. and M. Hasegawa. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. Journal of Molecular Evolution 29: 170-179.

For further information please refer to the "main.doc" and "dnamlk.doc" files from the PHYLIP (Phylogeny Inference Package) distribution Version 3.57c by Joseph Felsenstein (available by anonymous FTP at evolution.genetics.washington.edu in directory pub/phylip).

Printed: November 15, 1996 11:46 (1162)