EProtPars (Protein Sequence Parsimony Method) infers an unrooted phylogeny from protein sequences, using a new method intermediate between the approaches of Eck and Dayhoff (1966) and Fitch (1971).
EProtPars is a modified version of PROTPARS from PHYLIP version 3.572c, by Joseph Felsenstein, with command-line control added.
EProtPars estimates phylogenies from protein sequences (input using the standard one-letter code for amino acids) using the parsimony method, in a variant which counts only those nucleotide changes that change the amino acid, on the assumption that silent changes are more easily accomplished.
The input file for EProtPars can be an MSF or PHYLIP formatted file.
This program was originally written by Joe Felsenstein (E-mail: joe@evolution.genetics.washington.edu; Post: Department of Genetics, University of Washington, Box 357360, Seattle, Washington 98195-7360, U.S.A.).
This version was modified for inclusion in EGCG by Maria Jesus Martin (E-mail: martin@ebi.ac.uk; Post: EMBL Outstation Hinxton, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SQ or E-mail: martin@tdi.es; Post: Tecnologia para Diagnostico e Investigacion, Condes de Torreanaz 5, 28028 Madrid).
All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).
Here is a session with EProtPars:
  % eprotpars -options
  EPROTPARS of what sequences file ?  fos.msf{*}
  What should I call the output file (* fos.eprotpars *) ?
  Randomize input order of sequences (* No *) ?
  OutGroup root (* No *) ?
  Use threshold parsimony (* No *) ?
  Print out the data at start of run (* No *) ?
  Print out steps in each site (* No *) ?
  Print sequences at all nodes of tree (* No *) ?
  Adding species:
     FOSAVINK
     FOSCHICK
     FOSMOUSE
     FOSRAT
     FOSHUMAN
     FOSXMSVFR
     FOSMSVFB
     FOSBMOUSE
     FOSBSTAEP
  Doing global rearrangements
    !-----------------!
     .................
  Output written to fos.eprotpars
  Trees also written onto fos.eprotparstrees
  %
The input file for EProtPars is either a GCG MSF protein sequence file or a PHYLIP protein sequence file.
In the PHYLIP format the first line contains the number of species and the number of amino acid positions (counting any stop codons that you want to include), separated by blanks. Next come the species data. Each sequence starts on a new line and has a ten-character species name, which must be blank-filled to that length, followed immediately by the species data in the one-letter code. The sequences must be in either the "interleaved" or the "sequential" format. In the interleaved format, the first lines give the first part of each of the sequences, the following lines give the next part of each, and so on. Thus the sequences might look like this:
      9   50
  FOSAVINK   ---------- ---------- ----------
  FOSCHICK   MMYQGFAGEY EAPSSRCSSA SPAGDSLTYY
  FOSMOUSE   MMFSGFNADY EASSSRCSSA SPAGDSLSYY
  FOSRAT     MMFSGFNADY EASSSRCSSA SPAGDSLSYY
  FOSHUMAN   MMFSGFNADY EASSSRCSSA SPAGDSLSYY
  FOSXMSVFR  ---------- ---------- ----DSLSYY
  FOSMSVFB   MMFSGFNADY EASSFRCSSA SPAGDSLSYY
  FOSBMOUSE  -MFQAFPGDY DSGSRCSSSP SAESQ----Y
  FOSBSTAEP  ---------- ---------- ----------
             ---------- -----SQDFC
             PSPADSFSSM GSPVNSQDFC
             HSPADSFSSM GSPVNTQDFC
             HSPADSFSSM GSPVNTQDFC
             HSPADSFSSM GSPVNAQDFC
             HSPADSFSSM GSPVNTQDFC
             HSPADSFSSM GSPVNTQDFC
             LSSVDSFGSP PTAAASQE-C
             ---------- ----------
The "sequential" format has all of the data for the first species, then all of the characters for the next species, and so on. For the PHYLIP formats, there is an option (Use user trees in input file?) which signals that one or more user-defined trees, written in nested-pairs parenthesis notation, are to be provided for evaluation. These "user trees" are supplied in the input file after the species data, preceded by a line containing the number of user-defined trees.
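The interleaved layout described above can be read with a few lines of Python. This is only a minimal sketch, not the reader used by EProtPars: the function name `parse_interleaved_phylip` is hypothetical, blanks inside the sequence data are assumed to be ignorable, and user trees following the data are not handled.

```python
def parse_interleaved_phylip(text):
    """Read an interleaved PHYLIP alignment into a {name: sequence} dict.

    Assumptions (not taken from the EProtPars source): names occupy the
    first ten columns of the first block, blanks in the data are ignored,
    and no user trees follow the sequence data.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_species, n_sites = (int(x) for x in lines[0].split())

    names, seqs = [], []
    # First block: ten-character name followed by the first part of the data.
    for ln in lines[1:1 + n_species]:
        names.append(ln[:10].strip())
        seqs.append(ln[10:].replace(" ", ""))

    # Later blocks: one line per species, in the same order, without names.
    for i, ln in enumerate(lines[1 + n_species:]):
        seqs[i % n_species] += ln.replace(" ", "")

    for name, seq in zip(names, seqs):
        if len(seq) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, got {len(seq)}")
    return dict(zip(names, seqs))
```

The length check at the end catches the most common formatting slip, a name that is not blank-filled to ten characters, because the data then starts in the wrong column.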
Here is an example with one user-defined tree in a sequential PHYLIP format:
      9   50
  FOSAVINK   ---------- ---------- ---------- ---------- -----SQDFC
  FOSCHICK   MMYQGFAGEY EAPSSRCSSA SPAGDSLTYY PSPADSFSSM GSPVNSQDFC
  FOSMOUSE   MMFSGFNADY EASSSRCSSA SPAGDSLSYY HSPADSFSSM GSPVNTQDFC
  FOSRAT     MMFSGFNADY EASSSRCSSA SPAGDSLSYY HSPADSFSSM GSPVNTQDFC
  FOSHUMAN   MMFSGFNADY EASSSRCSSA SPAGDSLSYY HSPADSFSSM GSPVNAQDFC
  FOSXMSVFR  ---------- ---------- ----DSLSYY HSPADSFSSM GSPVNTQDFC
  FOSMSVFB   MMFSGFNADY EASSFRCSSA SPAGDSLSYY HSPADSFSSM GSPVNTQDFC
  FOSBMOUSE  -MFQAFPGDY DSGSRCSSSP SAESQ----Y LSSVDSFGSP PTAAASQE.C
  FOSBSTAEP  ---------- ---------- ---------- ---------- ----------
  1
  ((((FOSBMOUSE,(FOSBSTAEP,FOSXMSVFR)),(FOSHUMAN,(FOSRAT,(FOSMSVFB,
  FOSMOUSE)))),FOSCHICK),FOSAVINK);
For more information about the Phylip format, please see the "main.doc" file from PHYLIP (Phylogeny Inference Package) distribution Version 3.57c by Joseph Felsenstein, available by anonymous FTP at evolution.genetics.washington.edu in directory pub/phylip.
EProtPars writes two output files: one containing an ASCII representation of the most parsimonious tree, and another containing the tree in nested-pairs parenthesis notation.
Here is the output file from the example session.
  Eprotpars Phylogram of fos.msf{*}.  August 19, 1996 12:20
  One most parsimonious tree found:
                      +-----FOSBMOUSE
          +-----------7
          !           !  +--FOSBSTAEP
          !           +--8
       +--5              +--FOSXMSVFR
       !  !
       !  !        +--------FOSHUMAN
       !  +--------4
       !           !  +-----FOSRAT
    +--2           +--3
    !  !              !  +--FOSMSVFB
    !  !              +--6
  --1  !                 +--FOSMOUSE
    !  !
    !  +--------------------FOSCHICK
    !
    +-----------------------FOSAVINK
  remember: this is an unrooted tree!
  requires a total of   1677.000
Here is the output tree file from the example session.
((((FOSBMOUSE,(FOSBSTAEP,FOSXMSVFR)),(FOSHUMAN,(FOSRAT,(FOSMSVFB, FOSMOUSE)))),FOSCHICK),FOSAVINK);
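A tree file in nested-pairs parenthesis notation like the one above is easy to post-process. The following is a small sketch, assuming the tree carries no branch lengths or internal labels (as in this example output); the names `parse_tree` and `leaves` are hypothetical helpers, not part of EGCG or PHYLIP.

```python
def parse_tree(text):
    """Parse nested-pairs parenthesis notation into nested tuples of names.

    Assumes plain species names only: no branch lengths, no internal labels.
    """
    s = "".join(text.split()).rstrip(";")  # drop line breaks and the final ';'
    pos = 0

    def node():
        nonlocal pos
        if s[pos] == "(":
            pos += 1                       # consume '('
            children = [node()]
            while s[pos] == ",":
                pos += 1                   # consume ','
                children.append(node())
            if s[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1                       # consume ')'
            return tuple(children)
        start = pos
        while pos < len(s) and s[pos] not in "(),":
            pos += 1
        return s[start:pos]                # a species name

    tree = node()
    if pos != len(s):
        raise ValueError("unexpected text after the end of the tree")
    return tree


def leaves(tree):
    """Return the species names of a parsed tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]
```

Applied to the example tree file, `leaves(parse_tree(...))` lists the nine FOS sequences in the order in which they appear in the parenthesis notation.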
PileUp creates a multiple sequence alignment from a group of related sequences using progressive pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment. The user should note that this tree is not a phylogenetic tree. LineUp creates and edits multiple sequence alignments. Pretty displays multiple sequence alignments.
ToPhylip writes GCG sequences into a single file in PHYLIP format. Phylip2Tree displays trees computed with one of the PHYLIP programs or with EProtPars, EDnaPars, EDnaML, EDnaMLK, ENeighbor, EFitch and EKitsch, in GCG style. ESeqBoot produces multiple data sets from a molecular sequence data set by bootstrap, jackknife, or permutation resampling. EDnaPars estimates phylogenies from nucleic acid sequences using the parsimony method. EDnaDist computes a distance matrix from nucleic acid sequences, under four different models of nucleotide substitution (Jukes and Cantor (1969), Kimura (1980), Jin and Nei (1990) and a model of maximum likelihood (Felsenstein, 1981)). EProtDist computes a distance measure for protein sequences, using maximum likelihood estimates based on the Dayhoff PAM matrix, Kimura's 1983 approximation to it, or a model based on the genetic code plus a constraint on changing to a different category of amino acid. ENeighbor estimates phylogenies from distance matrix data using the Neighbor-Joining method or the UPGMA method of clustering. EFitch estimates phylogenies from distance matrix data under the "additive tree model", according to which the distances are expected to equal the sums of branch lengths between the species. EKitsch estimates phylogenies from distance matrix data under the "ultrametric" model, which is the same as the additive tree model except that an evolutionary clock is assumed. EConsense computes consensus trees by the majority-rule consensus tree method. It can be used as the final step in doing bootstrap analyses.
EProtPars infers an unrooted phylogeny from protein sequences, using a new method intermediate between the approaches of Eck and Dayhoff (1966) and Fitch (1971). Eck and Dayhoff (1966) allowed any amino acid to change to any other, and counted the number of such changes needed to evolve the protein sequences on each given phylogeny. This has the problem that it allows replacements which are not consistent with the genetic code, counting them equally with replacements that are consistent. Fitch, on the other hand, counted the minimum number of nucleotide substitutions that would be needed to achieve the given protein sequences. This counts silent changes equally with those that change the amino acid.
The present method insists that any changes of amino acid be consistent with the genetic code so that, for example, lysine is allowed to change to methionine but not to proline. However, changes between two amino acids via a third are allowed and counted as two changes if each of the two replacements is individually allowed. This sometimes allows changes that at first sight you would think should be outlawed. Thus we can change from phenylalanine to glutamine via leucine in two steps total. Consulting the genetic code, you will find that there is a leucine codon one step away from a phenylalanine codon, and a leucine codon one step away from glutamine. But they are not the same leucine codon. It actually takes three base substitutions to get from either of the phenylalanine codons UUU and UUC to either of the glutamine codons CAA or CAG. Why then does this program count only two? The answer is that recent DNA sequence comparisons seem to show that synonymous changes are considerably faster and easier than ones that change the amino acid. We are assuming that, in effect, synonymous changes occur so much more readily that they need not be counted. Thus, in the chain of changes UUU (Phe) --> CUU (Leu) --> CUA (Leu) --> CAA (Gln), the middle one is not counted because it does not change the amino acid (leucine).
To maintain consistency with the genetic code, it is necessary for the program internally to treat serine as two separate states (ser1 and ser2) since the two groups of serine codons are not adjacent in the code. Changes to the state "deletion" are counted as three steps to prevent the algorithm from assuming unnecessary deletions. The state "unknown" is simply taken to mean that the amino acid, which has not been determined, will in each tree that is evaluated be assumed to be whichever one causes the fewest steps.
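The counting rule described above can be illustrated with a short Python sketch. It is not the PROTPARS implementation: it simply builds a graph in which two amino acids are joined when some codon of one is a single base change away from some codon of the other, then counts the shortest path, so that synonymous changes are free. The standard genetic code is assumed, serine is split into `S1` (UCN) and `S2` (AGU, AGC) as the text describes, and stop codons are left out of the graph, which is an assumption of this sketch. The names `CODE`, `build_neighbours` and `min_changes` are hypothetical.

```python
from collections import deque
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...; '*' = stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def state(codon, aa):
    """Split serine into two states: S1 for UCN codons, S2 for AGU/AGC."""
    if aa == "S":
        return "S1" if codon[0] == "U" else "S2"
    return aa


CODE = {"".join(c): state("".join(c), a)
        for c, a in zip(product(BASES, repeat=3), AMINO)}


def build_neighbours():
    """Join two states when one base change converts a codon of one into the other."""
    adj = {s: set() for s in CODE.values() if s != "*"}
    for codon, a in CODE.items():
        if a == "*":
            continue  # assumption of this sketch: stop codons are not used
        for i in range(3):
            for base in BASES:
                if base == codon[i]:
                    continue
                b = CODE[codon[:i] + base + codon[i + 1:]]
                if b != "*" and b != a:
                    adj[a].add(b)
    return adj


def min_changes(a, b, adj):
    """Fewest amino acid replacements from a to b (breadth-first search)."""
    dist = {a: 0}
    queue = deque([a])
    while queue:
        cur = queue.popleft()
        if cur == b:
            return dist[cur]
        for nxt in adj[cur]:
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    raise ValueError(f"no path from {a} to {b}")
```

With `adj = build_neighbours()`, lysine to methionine (`min_changes("K", "M", adj)`) costs one step, lysine to proline costs two, and phenylalanine to glutamine costs two even though three base substitutions separate their codons, which matches the examples in the text.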
The assumptions of this method (which has not been described in the literature) are thus something like this:
(1) Change in different sites is independent.
(2) Change in different lineages is independent.
(3) The probability of a base substitution that changes the amino acid sequence is small over the lengths of time involved in a branch of the phylogeny.
(4) The expected amounts of change in different branches of the phylogeny do not vary by so much that two changes in a high-rate branch are more probable than one change in a low-rate branch.
(5) The expected amounts of change do not vary enough among sites that two changes in one site are more probable than one change in another.
(6) The probability of a base change that is synonymous is much higher than the probability of a change that is not synonymous.
That these are the assumptions of parsimony methods has been documented by Felsenstein (1973, 1978, 1979, 1981, 1983, 1988). For an opposing view arguing that the parsimony methods make no substantive assumptions such as these, see the works by Farris (1983) and Sober (1983a, 1983b, 1988), but also read the exchange between Felsenstein and Sober (1986).
When using a PHYLIP formatted input file, EProtPars shows some extra options.
If a "user tree" or "user trees" are supplied in the input file, as described in the input file section, EProtPars reads the tree or trees from the input file and evaluates them. For that, answer 'yes' to the 'Use user trees in input file?' question or use the -USERTRee command-line option. When more than one tree is supplied, the program also performs a statistical test of each of these trees against the best tree. This test is a version of the test proposed by Alan Templeton (1983) and evaluated in a test case by Felsenstein (1985). It is closely parallel to a test using log likelihood differences described by Kishino and Hasegawa (1989), and uses the mean and variance of step differences between trees, taken across positions. If the mean is more than 1.96 standard deviations different, then the trees are declared significantly different. The program prints out a table of the steps for each tree, the differences of each from the best one, the variance of that quantity as determined by the step differences at individual positions, and a conclusion as to whether that tree is or is not significantly worse than the best one.
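The comparison of a user tree against the best tree can be sketched as follows. This is a hedged illustration of the idea described above, not the program's code: it assumes the per-position step counts for both trees are available as lists, works with the total step difference (equivalent to comparing the mean difference with its standard error), and estimates the variance of that total from the per-position differences. The function name `compare_to_best` is hypothetical.

```python
import math


def compare_to_best(steps_best, steps_other, z_crit=1.96):
    """Compare a tree with the best tree using per-position step counts.

    Returns (total difference, variance of the total, significantly worse?).
    Sketch only: the variance formula is an assumption of this example.
    """
    if len(steps_best) != len(steps_other):
        raise ValueError("both trees need one step count per position")
    n = len(steps_best)
    if n < 2:
        raise ValueError("need at least two positions to estimate a variance")

    diffs = [other - best for best, other in zip(steps_best, steps_other)]
    total = sum(diffs)
    sum_sq = sum(d * d for d in diffs)
    # Variance of the summed difference, estimated across positions.
    variance = n / (n - 1) * (sum_sq - total * total / n)
    sd = math.sqrt(max(variance, 0.0))
    return total, variance, total > z_crit * sd
```

For example, a tree needing one extra step at a single position out of four gives a total difference of 1 with a standard deviation of about 1, so it would not be declared significantly worse.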
If you have a "multiple data sets" input file, answer 'yes' to the 'Analyze multiple data sets ?' question or use the -SETS=n command-line option (where n is the number of data sets). The data sets have the same format as the first data set. Here is a (very small) input file with two five-species data sets:
      5    6
  Alpha     CCACCA
  Beta      CCAAAA
  Gamma     CAACCA
  Delta     AACAAC
  Epsilon   AACCCA
      5    6
  Alpha     CACACA
  Beta      CCAACC
  Gamma     CAACAC
  Delta     GCCTGG
  Epsilon   TGCAAT
Using the program ESeqBoot you can make multiple data sets by bootstrapping. Trees can be produced for all of these using this option.
The exact contents of the output file depend on which options you have selected. If you select all possible output information, the output will consist of (1) the name of the program and date, (2) the input information printed out, (3) a series of phylogenies, some with associated information indicating how much change there was in each character or on each part of the tree.
Answer 'yes' to the 'Print out the data at start of run ?' question or use the -SHOWData command-line option for the data to appear in the output file, with the convention that "." means "the same as in the first species".
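The "." convention is simple to undo when the printed data need to be reused. The helper below is a minimal sketch (the name `expand_dots` is hypothetical) that restores a dotted row by copying the corresponding characters from the first species.

```python
def expand_dots(first, row):
    """Replace each '.' in row with the character at that site in first."""
    if len(first) != len(row):
        raise ValueError("rows must have the same length")
    return "".join(f if c == "." else c for f, c in zip(first, row))
```

Gap characters ("-") and unknowns ("?") are left untouched, since only "." carries the "same as first species" meaning.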
It is important to realize that the lengths of the segments of the printed tree are not significant, but purely conventional and are presented just to make the topology visible.
If you answer yes to 'Print out steps in each site ?' or use the -SHOWSteps command-line option, the program prints out a table containing the number of steps that different characters (or sites) require on the tree.
If you answer yes to 'Print sequences at all nodes of tree?' or use the -SHOWChanges command-line option, a table is printed out after each tree, showing for each branch whether there are known to be changes in the branch, and what the states are inferred to have been at the top end of the branch. If the inferred state is a "?" there will be multiple equally-parsimonious assignments of states; users must work these out for themselves by hand.
The exact details of the search of different trees depend on the order of input of species. You have the option to tell the program to use a random number generator to choose the input order of species. The program will then prompt you for a "seed" for the random number generator (or you can supply it with the -RANDom=1 command-line option). The seed should be an integer between 1 and 32767, and should be of the form 4n+1, which means that it must give a remainder of 1 when divided by 4. This can be judged by looking at the last two digits of the number. Each different seed leads to a different sequence of addition of species. By simply changing the random number seed and re-running the program, one can look for other, and better, trees. If the seed entered is not odd, the program will not proceed, but will prompt for another seed. The Jumble option also causes the program to ask you how many times you want to restart the process. If you answer 10, the program will try ten different orders of species in constructing the trees, and the results printed out will reflect this entire search process (that is, the best trees found among all 10 runs will be printed out, not the best trees from each individual run). Of course this is slow, taking 10 times longer than a single run. But it does give us a much greater chance of finding all of the most parsimonious trees. In practice, it is advisable to use the Jumble option to evaluate many different orderings of the input species and specify that it be done many times (as many as ten).
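The seed rule above (an integer between 1 and 32767 of the form 4n+1) can be checked before starting a run. A minimal sketch, with the hypothetical helper name `is_valid_seed`:

```python
def is_valid_seed(seed):
    """True if seed is in 1..32767 and leaves remainder 1 when divided by 4."""
    return 1 <= seed <= 32767 and seed % 4 == 1
```

Because 100 is divisible by 4, the remainder depends only on the last two digits of the seed, which is why the text says the rule can be judged from them.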
The Outgroup option (the -OUTGroup=1 command-line option) specifies which species is to be used to root the tree by having it become the outgroup (the species being taken in the numerical order that they occur in the input file).
The Threshold option (the -THREShold=1000 command-line option) sets a threshold such that if the number of steps counted in a character is higher than the threshold, it will be taken to be the threshold value rather than the actual number of steps. The default is a threshold so high that it will never be surpassed (this will be a positive real number greater than 1). The use of thresholds to obtain methods intermediate between parsimony and compatibility methods is described by Felsenstein (1981).
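The effect of the threshold on the tree length can be written in one line. This is only an illustration of the rule stated above, with a hypothetical function name; the per-character step counts are assumed to be already known.

```python
def threshold_length(steps_per_character, threshold):
    """Sum the steps over characters, counting at most `threshold` per character."""
    return sum(min(steps, threshold) for steps in steps_per_character)
```

With step counts of 1, 5 and 12 and a threshold of 10, the third character contributes 10 rather than 12, for a total of 16; with the very high default threshold the total is the ordinary parsimony length, 18.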
All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
Minimum Syntax: % eprotpars [-INfile=]file.msf{*} -default
Prompted Parameters:
[-OUTfile=]file.eprotpars  output file.
-INTERLeaved               interleaved PHYLIP formatted input file (only for PHYLIP formatted input file).
-NOINTERLeaved             sequential PHYLIP formatted input file (only for PHYLIP formatted input file).
Optional Parameters:
-OPTions                   makes the program ask for further specific options.
-USERTree                  one or more user-defined trees are to be provided for evaluation in the input file (only for PHYLIP formatted input file).
-RANDom=1                  use a random number generator to choose the input order of species. The seed should be an integer between 1 and 32767.
-JUMnumber=10              number of times to restart the process (with different orders of species).
-OUTGroup=1                species used to root the tree.
-THREShold=1000            threshold for the number of steps counted in a character.
-SETS=2                    multiple data sets (only for PHYLIP formatted input file).
-SHOWData                  print data in the output file.
-SHOWSteps                 print out a table of the number of steps that different characters require on the tree.
-SHOWChanges               print sequences at all nodes of tree in the output file.
The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
-OPTions
makes the program ask for all specific options.
-USERTree
tells the program that one or more user-defined trees are to be provided for evaluation in the input file. This is possible only when using PHYLIP formatted input files. When more than one tree is supplied, the program also performs a statistical test of each of these trees against the best tree. The program prints out a table of the steps for each tree, the differences of each from the best one, the variance of that quantity as determined by the step differences at individual positions, and a conclusion as to whether that tree is or is not significantly worse than the best one.
-RANDom=1
uses a random number generator to choose the input order of species. The seed should be an integer between 1 and 32767, and should be of the form 4n+1, which means that it must give a remainder of 1 when divided by 4. Each different seed leads to a different sequence of addition of species. If the seed entered is not odd, the program will not proceed, but will prompt for another seed.
-JUMnumber=10
causes the program to ask you how many times you want to restart the process. If you answer 10, the program will try ten different orders of species in constructing the trees, and the results printed out will reflect this entire search process (that is, the best trees found among all 10 runs will be printed out, not the best trees from each individual run). Of course this is slow, taking 10 times longer than a single run. But it does give us a much greater chance of finding all of the most parsimonious trees.
-OUTGroup=1
specifies which species is to be used to root the tree by having it become the outgroup (the species being taken in the numerical order that they occur in the input file).
-THREShold=1000
sets a threshold such that if the number of steps counted in a character is higher than the threshold, it will be taken to be the threshold value rather than the actual number of steps. The default is a threshold so high that it will never be surpassed (this will be a positive real number greater than 1).
-SETS=2
tells the program how many data sets there are in the input file. This is possible only for PHYLIP formatted input files.
-SHOWData
prints the sequence data in the output file, with the convention that "." means "the same as in the first species".
-SHOWSteps
prints out a table of the number of steps that different characters (or sites) require on the tree. A typical example looks like this:
  steps in each site:
           0   1   2   3   4   5   6   7   8   9
       *-----------------------------------------
      0!       2   2   2   2   1   1   2   2   1
     10!   1   2   3   1   1   1   1   1   1   2
     20!   1   2   2   1   2   2   1   1   1   2
     30!   1   2   1   1   1   2   1   3   1   1
     40!   1
The numbers across the top and down the side indicate which site is being referred to. Thus site 23 is column "3" of row "20" and has 1 step in this case.
-SHOWChanges
prints out a table after each tree, showing for each branch whether there are known to be changes in the branch, and what the states are inferred to have been at the top end of the branch. If the inferred state is a "?" there will be multiple equally-parsimonious assignments of states; users must work these out for themselves by hand.
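The "steps in each site" table described above can be turned back into a per-site lookup. The sketch below assumes the layout shown in this document: each data row starts with a row label followed by "!", the column position within the row gives the last digit of the site number, and row 0 starts at site 1 because there is no site 0. The function name `parse_steps_table` is hypothetical.

```python
def parse_steps_table(lines):
    """Map site number -> steps from the rows of a 'steps in each site' table."""
    steps = {}
    for line in lines:
        if "!" not in line:
            continue                      # skip the title, header and ruler lines
        label, _, rest = line.partition("!")
        row = int(label)
        first_column = 1 if row == 0 else 0   # row 0 has no entry for site 0
        for offset, value in enumerate(rest.split(), start=first_column):
            steps[row + offset] = int(value)
    return steps
```

Looking up a site then mirrors reading the table by hand: site 23 is found under row 20, column 3.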
Below is an example of the output file when using these options.
  Name         Sequences
  ----         ---------
  Alpha        ABCDEFGHIK
  Beta         ..--......
  Gamma        ?...S...??
  Delta        CIK.......
  Epsilon      DIK.......

  3 trees in all found

       +--------Gamma
       !
    +--2     +--Epsilon
    !  !  +--4
    !  +--3  +--Delta
  --1     !
    !     +-----Beta
    !
    +-----------Alpha

  remember: this is an unrooted tree!
  requires a total of     14.000

  steps in each position:
           0   1   2   3   4   5   6   7   8   9
       *-----------------------------------------
      0!       3   1   5   3   2   0   0   0   0
     10!   0

  From    To       Any Steps?   State at upper node
                                ( . means same as in the node below it on tree)
          1                     ANCDEFGHIK
  1       2        no           ..........
  2       Gamma    yes          ?B..S...??
  2       3        yes          ..?.......
  3       4        yes          ?IK.......
  4       Epsilon  maybe        D.........
  4       Delta    yes          C.........
  3       Beta     yes          .B--......
  1       Alpha    maybe        .B........

             +--Epsilon
          +--4
       +--3  +--Delta
       !  !
    +--2  +-----Gamma
    !  !
  --1  +--------Beta
    !
    +-----------Alpha

  remember: this is an unrooted tree!
  requires a total of     14.000

  steps in each position:
           0   1   2   3   4   5   6   7   8   9
       *-----------------------------------------
      0!       3   1   5   3   2   0   0   0   0
     10!   0

  From    To       Any Steps?   State at upper node
                                ( . means same as in the node below it on tree)
          1                     ANCDEFGHIK
  1       2        no           ..........
  2       3        maybe        ?.........
  3       4        yes          .IK.......
  4       Epsilon  maybe        D.........
  4       Delta    yes          C.........
  3       Gamma    yes          ?B..S...??
  2       Beta     yes          .B--......
  1       Alpha    maybe        .B........

             +--Epsilon
       +-----4
       !     +--Delta
    +--3
    !  !     +--Gamma
  --1  +-----2
    !        +--Beta
    !
    +-----------Alpha

  remember: this is an unrooted tree!
  requires a total of     14.000

  steps in each position:
           0   1   2   3   4   5   6   7   8   9
       *-----------------------------------------
      0!       3   1   5   3   2   0   0   0   0
     10!   0

  From    To       Any Steps?   State at upper node
                                ( . means same as in the node below it on tree)
          1                     ANCDEFGHIK
  1       3        no           ..........
  3       4        yes          ?IK.......
  4       Epsilon  maybe        D.........
  4       Delta    yes          C.........
  3       2        no           ..........
  2       Gamma    yes          ?B..S...??
  2       Beta     yes          .B--......
  1       Alpha    maybe        .B........
Eck, R. V., and M. O. Dayhoff. 1966. Atlas of Protein Sequence and Structure 1966. National Biomedical Research Foundation, Silver Spring, Maryland.
Farris, J. S. 1983. The logical basis of phylogenetic analysis. pp. 1-47 in Advances in Cladistics, Volume 2, Proceedings of the Second Meeting of the Willi Hennig Society. ed. Norman I. Platnick and V. A. Funk. Columbia University Press, New York.
Felsenstein, J. 1973. Maximum likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters. Systematic Zoology 22: 240-249.
Felsenstein, J. 1978. Cases in which parsimony and compatibility methods will be positively misleading. Systematic Zoology 27: 401-410.
Felsenstein, J. 1979. Alternative methods of phylogenetic inference and their interrelationship. Systematic Zoology 28: 49-62.
Felsenstein, J. 1981. A likelihood approach to character weighting and what it tells us about parsimony and compatibility. Biological Journal of the Linnean Society 16: 183-196.
Felsenstein, J. 1983. Parsimony in systematics: biological and statistical issues. Annual Review of Ecology and Systematics 14:313-333.
Felsenstein, J. 1985. Confidence limits on phylogenies with a molecular clock. Systematic Zoology 34: 152-161.
Felsenstein, J. and E. Sober. 1986. Parsimony and likelihood: an exchange. Systematic Zoology 35: 617-626.
Felsenstein, J. 1988. Phylogenies from molecular sequences: inference and reliability. Annual Review of Genetics 22: 521-565.
Fitch, W. M. 1971. Toward defining the course of evolution: minimum change for a specified tree topology. Systematic Zoology 20: 406-416.
Kishino, H. and M. Hasegawa. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. Journal of Molecular Evolution 29: 170-179.
Sober, E. 1983a. Parsimony in systematics: philosophical issues. Annual Review of Ecology and Systematics 14: 335-357.
Sober, E. 1983b. A likelihood justification of parsimony. Cladistics 1: 209-233.
Sober, E. 1988. Reconstructing the Past: Parsimony, Evolution, and Inference. MIT Press, Cambridge, Massachusetts.
Templeton, A. R. 1983. Phylogenetic inference from restriction endonuclease cleavage site maps with particular reference to the evolution of humans and the apes. Evolution 37: 221-244.
For further information please refer to the "main.doc" and "protpars.doc" files from the PHYLIP (Phylogeny Inference Package) distribution Version 3.57c by Joseph Felsenstein (available by anonymous FTP at evolution.genetics.washington.edu in directory pub/phylip).
Printed: November 15, 1996 11:47 (1162)