EDnaPars (DNA Sequence Parsimony Method) infers an unrooted phylogeny using nucleic acid sequences.
EDnaPars is a modified version of the PHYLIP version 3.572c's DNAPARS, by Joseph Felsenstein, with command line control added.
EDnaPars estimates phylogenies by the parsimony method using nucleic acid sequences. It allows use of the full IUB ambiguity codes and estimates ancestral nucleotide states. Gaps are treated as a fifth nucleotide state.
The input file for EDnaPars can be an MSF or PHYLIP formatted file.
This program was originally written by Joe Felsenstein (E-mail:joe@evolution.genetics.washington.edu. Post: Department of Genetics, University of Washington, Box 357360, Seattle, Washington 98195-7360, U.S.A.)
This version was modified for inclusion in EGCG by Maria Jesus Martin (E-mail: martin@ebi.ac.uk; Post: EMBL Outstation Hinxton, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SQ or E-mail: martin@tdi.es; Post: Tecnologia para Diagnostico e Investigacion, Condes de Torreanaz 5, 28028 Madrid).
All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).
Here is a session with EDnaPars:

  % ednapars -options

   EDNAPARS of what sequences file ?  fmdv.msf{*}

   What should I call the output file (* fmdv.ednapars *) ?

   Randomize input order of sequences (* No *) ?

   OutGroup root (* No *) ?

   Use threshold parsimony (* No *) ?

   Print out the data at start of run (* No *) ?

   Print out steps in each site (* No *) ?

   Print sequences at all nodes of tree (* No *) ?

  Adding species:
     APHAVP1C
     APHAVP1D
     APHAVP1A
     APHAVP1B
     APHAVP1E
     APHAVP1F

  Doing global rearrangements
    !-----------!
     ...........

  Output written to fmdv.ednapars

  Trees also written onto fmdv.ednaparstrees

  %
The input file for EDnaPars is either a GCG MSF nucleic acid sequences file or a PHYLIP nucleic acid sequences file.
In the PHYLIP format the first line contains the number of species and the number of nucleic acid positions, separated by blanks. Next come the species data. Each sequence starts on a new line and has a ten-character species name, which must be blank-filled to that length, followed immediately by the species data in the one-letter code. The sequences must be in either the "interleaved" or the "sequential" format. In the interleaved format, the first lines give the first part of each of the sequences, the following lines give the next part of each, and so on. Thus the sequences might look like this:

     6   50
  APHAVP1C  ---------- ---------- -------TAC
  APHAVP1D  ---------- ---------- -------TAC
  APHAVP1A  ------GTCA CCACCACCNN GGAGAACTAC
  APHAVP1B  ---------- ---------- ----------
  APHAVP1E  ---------- ---------- ----------
  APHAVP1F  GACCCTGTCA CCACCACCGT GGAGAACTAC

            GGCGGTCAGA CACAAACCCA
            GGCGGTCAGA CACAAACCCA
            GGCGGTGAGA CACAAACCCA
            ------GAGA CACAAACCCA
            ---------- ---AAACCCA
            GGCGGTGAGA CACAAACCCA

The "sequential" format has all of the data for the first species, then all of the characters for the next species, and so on.

For the PHYLIP formats, there is an option (Use user trees in input file?) which signals that one or more user-defined trees, in nested-pairs parenthesis notation, are to be provided for evaluation. These "user trees" are supplied in the input file after the species data, preceded by a line containing the number of user-defined trees.
Here is an example with one user-defined tree in a sequential PHYLIP format:
     6   50
  APHAVP1C  ---------- ---------- -------TAC GGCGGTCAGA CACAAACCCA
  APHAVP1D  ---------- ---------- -------TAC GGCGGTCAGA CACAAACCCA
  APHAVP1A  ------GTCA CCACCACCNN GGAGAACTAC GGCGGTGAGA CACAAACCCA
  APHAVP1B  ---------- ---------- ---------- ------GAGA CACAAACCCA
  APHAVP1E  ---------- ---------- ---------- ---------- ---AAACCCA
  APHAVP1F  GACCCTGTCA CCACCACCGT GGAGAACTAC GGCGGTGAGA CACAAACCCA
  1
  ((((APHAVP1E,APHAVP1B),(APHAVP1F,APHAVP1A)),APHAVP1D),APHAVP1C);
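As an illustration of the layout just described, the following Python sketch reads one sequential-format PHYLIP data set. It is not part of EDnaPars; the function name parse_sequential_phylip is hypothetical, and the sketch assumes that each user tree occupies a single line, as in the example above.

```python
def parse_sequential_phylip(text):
    """Parse one sequential-format PHYLIP data set.

    Returns (names, sequences, user_trees). Assumes each user tree
    occupies a single line, as in the example in this document.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_species, n_sites = (int(tok) for tok in lines[0].split()[:2])
    names, sequences = [], []
    pos = 1
    for _ in range(n_species):
        # The first ten characters are the blank-filled species name.
        names.append(lines[pos][:10].strip())
        chars = lines[pos][10:].replace(" ", "")
        pos += 1
        # Sequential data may continue on the following lines.
        while len(chars) < n_sites:
            chars += lines[pos].replace(" ", "")
            pos += 1
        sequences.append(chars)
    trees = []
    if pos < len(lines):
        # Optional user trees: a count line, then one tree per line.
        n_trees = int(lines[pos])
        trees = [ln.strip() for ln in lines[pos + 1:pos + 1 + n_trees]]
    return names, sequences, trees


SAMPLE = """\
   6   50
APHAVP1C  ---------- ---------- -------TAC GGCGGTCAGA CACAAACCCA
APHAVP1D  ---------- ---------- -------TAC GGCGGTCAGA CACAAACCCA
APHAVP1A  ------GTCA CCACCACCNN GGAGAACTAC GGCGGTGAGA CACAAACCCA
APHAVP1B  ---------- ---------- ---------- ------GAGA CACAAACCCA
APHAVP1E  ---------- ---------- ---------- ---------- ---AAACCCA
APHAVP1F  GACCCTGTCA CCACCACCGT GGAGAACTAC GGCGGTGAGA CACAAACCCA
1
((((APHAVP1E,APHAVP1B),(APHAVP1F,APHAVP1A)),APHAVP1D),APHAVP1C);
"""

names, sequences, trees = parse_sequential_phylip(SAMPLE)
print(names)
print([len(seq) for seq in sequences])
print(trees)
```

The blanks that group the bases into tens are discarded, so each parsed sequence has exactly the number of positions announced on the first line.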
For more information about the Phylip format, please see the "main.doc" file from the PHYLIP (Phylogeny Inference Package) distribution Version 3.57c by Joseph Felsenstein, available by anonymous FTP at evolution.genetics.washington.edu in directory pub/phylip.
EDnaPars writes two output files, one containing an ASCII representation of the most parsimonious tree and another containing the tree in nested-pairs parenthesis notation.
Here is the output file from the example session.
  EDnaPars Phylogram of fmdv.msf{*}.  August 23, 1996 16:32

  One most parsimonious tree found:

                +--APHAVP1E
          +-----4
          !     +--APHAVP1B
       +--3
       !  !     +--APHAVP1F
    +--2  +-----5
    !  !        +--APHAVP1A
  --1  !
    !  +-----------APHAVP1D
    !
    +--------------APHAVP1C

  remember: this is an unrooted tree!

  requires a total of     85.000
Here is the output tree file from the example session.
((((APHAVP1E,APHAVP1B),(APHAVP1F,APHAVP1A)),APHAVP1D),APHAVP1C);
PileUp creates a multiple sequence alignment from a group of related sequences using progressive pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment. The user should note that this tree is not a phylogenetic tree. LineUp creates and edits multiple sequence alignments. Pretty displays multiple sequence alignments.
ToPhylip writes GCG sequences into a single file in PHYLIP format. Phylip2Tree displays trees computed with one of the PHYLIP programs or with EProtPars, EDnaPars, EDnaML, EDnaMLK, ENeighbor, EFitch and EKitsch, in GCG style. ESeqBoot produces multiple data sets from a molecular sequence data set by bootstrap, jackknife, or permutation resampling. EProtPars estimates phylogenies from amino acid sequences using the parsimony method. EDnaDist computes a distance matrix from nucleic acid sequences, under four different models of nucleotide substitution (Jukes and Cantor (1969), Kimura (1980), Jin and Nei (1990) and a model of maximum likelihood (Felsenstein, 1981)). EProtDist computes a distance measure for protein sequences, using maximum likelihood estimates based on the Dayhoff PAM matrix, Kimura's 1983 approximation to it, or a model based on the genetic code plus a constraint on changing to a different category of amino acid. ENeighbor estimates phylogenies from distance matrix data using the Neighbor-Joining method or the UPGMA method of clustering. EFitch estimates phylogenies from distance matrix data under the "additive tree model", according to which the distances are expected to equal the sums of branch lengths between the species. EKitsch estimates phylogenies from distance matrix data under the "ultrametric" model, which is the same as the additive tree model except that an evolutionary clock is assumed. EDnaML estimates phylogenies from nucleotide sequences by maximum likelihood. EDnaMLK does the same as EDnaML but assumes a molecular clock. EConsense computes consensus trees by the majority-rule consensus tree method. It can be used as the final step in doing bootstrap analyses.
EDnaPars carries out unrooted parsimony (analogous to Wagner trees) (Eck and Dayhoff, 1966; Kluge and Farris, 1969) on DNA sequences. The method of Fitch (1971) is used to count the number of changes of base needed on a given tree. Other than that, the algorithm is a direct modification of program WAGNER (an ancestor of MIX which was formerly in this package). The assumptions of this method are exactly analogous to those of MIX:
(1) Each site evolves independently.
(2) Different lineages evolve independently.
(3) The probability of a base substitution at a given site is small over the lengths of time involved in a branch of the phylogeny.
(4) The expected amounts of change in different branches of the phylogeny do not vary by so much that two changes in a high-rate branch are more probable than one change in a low-rate branch.
(5) The expected amounts of change do not vary enough among sites that two changes in one site are more probable than one change in another.
That these are the assumptions of parsimony methods has been documented in a series of papers by Felsenstein: (1973a, 1978b, 1979, 1981b, 1983b, 1988b). For an opposing view arguing that the parsimony methods make no substantive assumptions such as these, see the papers by Farris (1983) and Sober (1983a, 1983b, 1988), but also read the exchange between Felsenstein and Sober (1986).
Change from an occupied site to a deletion is counted as one change. Reversion from a deletion to an occupied site is allowed and is also counted as one change. Note that this in effect assumes that a deletion N bases long is N separate events.
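The counting rule can be sketched in Python as follows. This is a minimal illustration of Fitch-style counting on a fixed binary tree with the gap character "-" treated as a fifth state; it is not the EDnaPars source, it ignores IUB ambiguity codes, and the names TREE, FIRST_TEN, fitch and steps_per_site are hypothetical. The tree and the first ten alignment columns are taken from the fmdv example in this document.

```python
# Tree from the example session, written as nested pairs of tip names.
TREE = (((("APHAVP1E", "APHAVP1B"), ("APHAVP1F", "APHAVP1A")),
         "APHAVP1D"), "APHAVP1C")

# First ten alignment columns of the fmdv example.
FIRST_TEN = {
    "APHAVP1C": "----------",
    "APHAVP1D": "----------",
    "APHAVP1A": "------GTCA",
    "APHAVP1B": "----------",
    "APHAVP1E": "----------",
    "APHAVP1F": "GACCCTGTCA",
}


def fitch(node, site):
    """Return (state set, steps) for one site on the subtree at node.

    site maps each tip name to its character; "-" is an ordinary state.
    """
    if isinstance(node, str):
        return {site[node]}, 0
    left_set, left_steps = fitch(node[0], site)
    right_set, right_steps = fitch(node[1], site)
    common = left_set & right_set
    if common:
        # The two subtrees can agree: no extra change is needed here.
        return common, left_steps + right_steps
    # No shared state: one change is required on this pair of branches.
    return left_set | right_set, left_steps + right_steps + 1


def steps_per_site(tree, alignment):
    """Count steps column by column for an alignment of equal-length rows."""
    n_sites = len(next(iter(alignment.values())))
    return [
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_sites)
    ]


steps = steps_per_site(TREE, FIRST_TEN)
print(steps)
print(sum(steps[:6]))
```

In the first six columns only APHAVP1F carries bases, so each column costs one step and the six-base gap contributes six steps in total, which is the "N separate events" behaviour described above.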
When using a PHYLIP formatted input file, EDnaPars shows some extra options.
If a "user tree" or "user trees" are supplied in the input file, as described in the input file section, EDnaPars reads the tree or trees from the input file and evaluates them. To do so, say yes to the 'Use user trees in input file?' question or use the -USERTRee command-line option. When more than one tree is supplied, the program also performs a statistical test of each of these trees against the best tree. This test is a version of the test proposed by Alan Templeton (1983) and evaluated in a test case by Felsenstein (1985). It is closely parallel to a test using log likelihood differences described by Kishino and Hasegawa (1989), and uses the mean and variance of step differences between trees, taken across positions. If the mean difference is more than 1.96 standard deviations from zero, the trees are declared significantly different. The program prints out a table of the steps for each tree, the differences of each from the best one, the variance of that quantity as determined by the step differences at individual positions, and a conclusion as to whether that tree is or is not significantly worse than the best one.
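The following Python sketch shows one way such a comparison of two trees could be computed from their per-site step counts. The exact variance formula used by the program is not given in this document, so the estimator below (the variance of the summed difference, estimated from the per-site differences) is an assumption, as is the function name step_difference_test. The step counts for the best tree are taken from the small example output later in this document; those for the rival tree are hypothetical.

```python
import math


def step_difference_test(steps_tree, steps_best):
    """Compare two trees by their per-site step counts.

    Returns (total difference, standard deviation of the total,
    significantly different?). Assumption: the variance of the total
    is estimated as n/(n-1) times the sum of squared deviations of the
    per-site differences from their mean.
    """
    diffs = [a - b for a, b in zip(steps_tree, steps_best)]
    n = len(diffs)
    total = sum(diffs)
    mean = total / n
    var_total = n / (n - 1) * sum((d - mean) ** 2 for d in diffs)
    sd = math.sqrt(var_total)
    # Dividing total and sd by n gives the same criterion for the mean.
    significant = sd > 0 and abs(total) > 1.96 * sd
    return total, sd, significant


# Steps per site on the best tree (from the example output in this document).
best = [2, 1, 3, 2, 0, 2, 1, 1, 1, 1, 1, 1, 3]
# Hypothetical rival tree needing one extra step at each of two sites.
rival = [2, 1, 3, 2, 0, 3, 1, 1, 1, 1, 2, 1, 3]

total, sd, significant = step_difference_test(rival, best)
print(total, round(sd, 3), significant)
```

With only two extra steps spread over thirteen sites, the difference stays within 1.96 standard deviations, so this hypothetical rival would not be declared significantly worse.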
If you have a "multiple data sets" input file, answer 'yes' to the 'Analyze multiple data sets ?' question or use the -SETS=n command-line option (where n is the number of data sets). The data sets have the same format as the first data set. Here is a (very small) input file with two five-species data sets:

      5    6
  Alpha     CCACCA
  Beta      CCAAAA
  Gamma     CAACCA
  Delta     AACAAC
  Epsilon   AACCCA
      5    6
  Alpha     CACACA
  Beta      CCAACC
  Gamma     CAACAC
  Delta     GCCTGG
  Epsilon   TGCAAT

Using the program ESeqBoot you can make multiple data sets by bootstrapping. Trees can be produced for all of these using this option.
The exact contents of the output file depend on which options you have selected. If you select all possible output information, the output will consist of (1) the name of the program and date, (2) the input information printed out, (3) a series of phylogenies, some with associated information indicating how much change there was in each character or on each part of the tree.
Answer 'yes' to the 'Print out the data at start of run ?' question or use the -SHOWDATA command-line option for the data to appear in the output file, with the convention that "." means "the same as in the first species".
It is important to realize that the lengths of the segments of the printed tree are not significant, but purely conventional and are presented just to make the topology visible.
If you answer 'yes' to 'Print out steps in each site ?' or use the -STEPs command-line option, the program prints out a table containing the number of steps that different characters (or sites) require on the tree.
If you answer 'yes' to 'Print sequences at all nodes of tree?' or use -PRINTSeqs command-line option, a table is printed out after each tree, showing for each branch whether there are known to be changes in the branch, and what the states are inferred to have been at the top end of the branch. If the inferred state is a "?" or one of the IUB ambiguity symbols, there will be multiple equally-parsimonious assignments of states; the users must work these out for themselves by hand. A "?" in the reconstructed states means that in addition to one or more bases, a deletion may or may not be present.
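To help read such a table, the Python sketch below expands each symbol of a reconstructed node string into the bases it stands for, using the standard IUB ambiguity codes. The function name possible_bases is hypothetical. Mapping "?" to every base plus the gap is an assumption that gives the widest possible reading; the text above only says that a deletion may or may not be present in addition to one or more bases.

```python
# Standard IUB nucleotide ambiguity codes.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
    "-": "-",
    # Assumption: "?" is read as "any base, possibly a deletion".
    "?": "ACGT-",
}


def possible_bases(node_string):
    """Return, for each position, the states the symbol may stand for."""
    return [IUB[symbol] for symbol in node_string.upper() if symbol != " "]


# A node string of the kind printed in the table of reconstructed states.
for position, bases in enumerate(possible_bases("AABGTSGCCA AAY"), start=1):
    print(position, bases)
```

Positions that expand to more than one base are the ones where several equally parsimonious assignments exist and still have to be resolved by hand.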
The exact details of the search of different trees depend on the order of input of species. You have the option to tell the program to use a random number generator to choose the input order of species. The program will then prompt you for a "seed" for the random number generator (or you can supply it with the -RANDOM=1 command-line option). The seed should be an integer between 1 and 32767, and should be of the form 4n+1, which means that it must give a remainder of 1 when divided by 4. This can be judged by looking at the last two digits of the number. Each different seed leads to a different sequence of addition of species. By simply changing the random number seed and re-running the program, one can look for other, and better, trees. If the seed entered is not odd, the program will not proceed, but will prompt for another seed.

The Jumble option also causes the program to ask you how many times you want to restart the process. If you answer 10, the program will try ten different orders of species in constructing the trees, and the results printed out will reflect this entire search process (that is, the best trees found among all 10 runs will be printed out, not the best trees from each individual run). Of course this is slow, taking 10 times longer than a single run. But it does give us a much greater chance of finding all of the most parsimonious trees. In practice, it is advisable to use the Jumble option to evaluate many different orderings of the input species and specify that it be done many times (as many as ten).
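The seed rule can be checked before running the program with a few lines of Python. This is only an illustration of the stated rule (an integer from 1 to 32767 of the form 4n+1); the function name is_valid_seed is hypothetical, and the program itself is described above as rejecting only seeds that are not odd.

```python
def is_valid_seed(seed):
    """True if seed is in 1..32767 and leaves remainder 1 when divided by 4."""
    return 1 <= seed <= 32767 and seed % 4 == 1


def last_two_digits_rule(seed):
    """Same remainder test using only the last two digits (100 = 4 * 25)."""
    return (seed % 100) % 4 == 1


for candidate in (1, 3, 4321, 32765, 32767):
    print(candidate, is_valid_seed(candidate), last_two_digits_rule(candidate))
```

Because 100 is a multiple of 4, the remainder of the whole number equals the remainder of its last two digits, which is why the last two digits are enough to judge a seed.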
The Outgroup option ( -OUTGRoup=1 command-line option) specifies which species is to be used to root the tree by having it become the outgroup (the species being taken in the numerical order that they occur in the input file).
The Threshold option (-THREshold=1000 command-line option) sets a threshold such that if the number of steps counted in a character is higher than the threshold, it will be taken to be the threshold value rather than the actual number of steps. The default is a threshold so high that it will never be surpassed (this will be a positive real number greater than 1). Thresholds less than or equal to 1.0 do not have any meaning and should not be used: they will result in a tree dependent only on the input order of species and not at all on the data! The use of thresholds to obtain methods intermediate between parsimony and compatibility methods is described by Felsenstein (1981).
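The effect of the threshold on a tree's score can be sketched in Python as follows. This is an illustration of the capping rule described above, not the program's code; the function name threshold_score is hypothetical, and the step counts are those printed in the small example output later in this document.

```python
def threshold_score(steps_per_site, threshold):
    """Sum the steps over sites, counting no site for more than threshold."""
    return sum(min(steps, threshold) for steps in steps_per_site)


# Steps per site from the example output in this document (total 19).
steps = [2, 1, 3, 2, 0, 2, 1, 1, 1, 1, 1, 1, 3]

print(threshold_score(steps, 1000))  # default-like threshold: plain parsimony
print(threshold_score(steps, 2))     # sites needing 3 steps count as 2
```

With a very high threshold the score is the ordinary parsimony length, while lower thresholds reduce the weight of sites that need many changes.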
All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
Minimum Syntax: % ednapars [-INfile=]dna.msf{*} -default

Prompted Parameters:

[-OUTfile=]dna.ednapars  output file.
-INTERLeaved             interleaved PHYLIP formatted input file
                         (only for PHYLIP formatted input file).
-NOINTERLeaved           sequential PHYLIP formatted input file
                         (only for PHYLIP formatted input file).

Optional Parameters:

-OPTions                 makes the program ask for further specific options.
-USERTree                one or more user-defined trees are provided for
                         evaluation in the input file
                         (only for PHYLIP formatted input file).
-RANDom=1                use a random number generator to choose the input
                         order of species. The seed should be an integer
                         between 1 and 32767.
-JUMnumber=10            number of times to restart the process
                         (with different orders of species).
-OUTGroup=1              species used to root the tree.
-THREShold=1000          threshold for the number of steps counted in a
                         character.
-SETS=2                  multiple data sets
                         (only for PHYLIP formatted input file).
-SHOWData                print data in the output file.
-SHOWSteps               print out a table of the number of steps that
                         different characters require on the tree.
-SHOWChanges             print sequences at all nodes of tree in the
                         output file.
The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
-OPTions
makes the program ask for all specific options.

-USERTree
tells the program that one or more user-defined trees are provided for evaluation in the input file. This is possible only for PHYLIP formatted input files. When more than one tree is supplied, the program also performs a statistical test of each of these trees against the best tree. The program prints out a table of the steps for each tree, the differences of each from the best one, the variance of that quantity as determined by the step differences at individual positions, and a conclusion as to whether that tree is or is not significantly worse than the best one.

-RANDom=1
uses a random number generator to choose the input order of species. The seed should be an integer between 1 and 32767, and should be of the form 4n+1, which means that it must give a remainder of 1 when divided by 4. Each different seed leads to a different sequence of addition of species. If the seed entered is not odd, the program will not proceed, but will prompt for another seed.

-JUMnumber=10
causes the program to ask you how many times you want to restart the process. If you answer 10, the program will try ten different orders of species in constructing the trees, and the results printed out will reflect this entire search process (that is, the best trees found among all 10 runs will be printed out, not the best trees from each individual run). Of course this is slow, taking 10 times longer than a single run. But it does give us a much greater chance of finding all of the most parsimonious trees.

-OUTGroup=1
specifies which species is to be used to root the tree by having it become the outgroup (the species being taken in the numerical order that they occur in the input file).

-THREShold=1000
sets a threshold such that if the number of steps counted in a character is higher than the threshold, it will be taken to be the threshold value rather than the actual number of steps. The default is a threshold so high that it will never be surpassed (this will be a positive real number greater than 1).

-SETS=2
tells the program how many data sets there are in the input file. This is possible only for PHYLIP formatted input files.

-SHOWData
prints the sequence data in the output file, with the convention that "." means "the same as in the first species".
-SHOWSteps
prints out a table of the number of steps that different characters (or sites) require on the tree. A typical example looks like this:

   steps in each site:
           0   1   2   3   4   5   6   7   8   9
       *-----------------------------------------
      0!       2   2   2   2   1   1   2   2   1
     10!   1   2   3   1   1   1   1   1   1   2
     20!   1   2   2   1   2   2   1   1   1   2
     30!   1   2   1   1   1   2   1   3   1   1
     40!   1

The numbers across the top and down the side indicate which site is being referred to. Thus site 23 is column "3" of row "20" and has 1 step in this case.
-SHOWChanges
prints out a table after each tree, showing for each branch whether there are known to be changes in the branch, and what the states are inferred to have been at the top end of the branch. If the inferred state is a "?" there will be multiple equally-parsimonious assignments of states; the user must work these out by hand.
Below is an example of the output file when using these options.
  Name            Sequences
  ----            ---------

  Alpha        AACGUGGCCA AAU
  Beta         ..G..C.... ..C
  Gamma        C.UU.C.U.. C.A
  Delta        GGUA.UU.GG CC.
  Epsilon      GGGA.CU.GG CCC

  One most parsimonious tree found:

             +--Epsilon
          +--4
       +--3  +--Delta
       !  !
    +--2  +-----Gamma
    !  !
  --1  +--------Beta
    !
    +-----------Alpha

  remember: this is an unrooted tree!

  requires a total of     19.000

   steps in each site:
           0   1   2   3   4   5   6   7   8   9
       *-----------------------------------------
      0!       2   1   3   2   0   2   1   1   1
     10!   1   1   1   3

  From    To     Any Steps?    State at upper node
                               ( . means same as in the node below it on tree)

            1                  AABGTSGCCA AAY
    1       2        maybe     .....C.... ...
    2       3         yes      V.KD...... C..
    3       4         yes      GG.A..T.GG .C.
    4    Epsilon     maybe     ..G....... ..C
    4    Delta        yes      ..T..T.... ..T
    3    Gamma        yes      C.TT...T.. ..A
    2    Beta        maybe     ..G....... ..C
    1    Alpha       maybe     ..C..G.... ..T
Eck, R. V., and M. O. Dayhoff. 1966. Atlas of Protein Sequence and Structure 1966. National Biomedical Research Foundation, Silver Spring, Maryland.
Farris, J. S. 1983. The logical basis of phylogenetic analysis. pp. 1-47 in Advances in Cladistics, Volume 2, Proceedings of the Second Meeting of the Willi Hennig Society. ed. Norman I. Platnick and V. A. Funk. Columbia University Press, New York.
Felsenstein, J. 1973. Maximum likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters. Systematic Zoology 22: 240-249.
Felsenstein, J. 1978. Cases in which parsimony and compatibility methods will be positively misleading. Systematic Zoology 27: 401-410.
Felsenstein, J. 1979. Alternative methods of phylogenetic inference and their interrelationship. Systematic Zoology 28: 49-62.
Felsenstein, J. 1981. A likelihood approach to character weighting and what it tells us about parsimony and compatibility. Biological Journal of the Linnean Society 16: 183-196.
Felsenstein, J. 1983. Parsimony in systematics: biological and statistical issues. Annual Review of Ecology and Systematics 14: 313-333.
Felsenstein, J. 1985. Confidence limits on phylogenies with a molecular clock. Systematic Zoology 34: 152-161.
Felsenstein, J. and E. Sober. 1986. Parsimony and likelihood: an exchange. Systematic Zoology 35: 617-626.
Felsenstein, J. 1988. Phylogenies from molecular sequences: inference and reliability. Annual Review of Genetics 22: 521-565.
Fitch, W. M. 1971. Toward defining the course of evolution: minimum change for a specified tree topology. Systematic Zoology 20: 406-416.
Kishino, H. and M. Hasegawa. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. Journal of Molecular Evolution 29: 170-179.
Kluge, A. G., and J. S. Farris. 1969. Quantitative phyletics and the evolution of anurans. Systematic Zoology 18: 1-32.
Sober, E. 1983a. Parsimony in systematics: philosophical issues. Annual Review of Ecology and Systematics 14: 335-357.
Sober, E. 1983b. A likelihood justification of parsimony. Cladistics 1: 209-233.
Sober, E. 1988. Reconstructing the Past: Parsimony, Evolution, and Inference. MIT Press, Cambridge, Massachusetts.
Templeton, A. R. 1983. Phylogenetic inference from restriction endonuclease cleavage site maps with particular reference to the evolution of humans and the apes. Evolution 37: 221-244.
For further information please refer to the "main.doc" and "dnapars.doc" files from the PHYLIP (Phylogeny Inference Package) distribution Version 3.57c by Joseph Felsenstein (available by anonymous FTP at evolution.genetics.washington.edu in directory pub/phylip).
Printed: November 15, 1996 11:46 (1162)