ENeighbor estimates phylogenies from distance matrix data using the Neighbor-Joining method or the UPGMA method of clustering.
ENeighbor is a modified version of the PHYLIP version 3.572c's NEIGHBOR, by Joseph Felsenstein, with command line control added.
ENeighbor implements the Neighbor-Joining method of Saitou and Nei (1987) and the UPGMA method of clustering. Neighbor-Joining is a distance matrix method that produces an unrooted tree without assuming an evolutionary clock. UPGMA does assume an evolutionary clock. The branch lengths are not optimized by the least squares criterion, but the methods are very fast and can therefore handle much larger data sets.
The input file for ENeighbor is the output file from EDnaDist or EProtDist.
This program was originally written by Joe Felsenstein (E-mail: joe@evolution.genetics.washington.edu; Post: Department of Genetics, University of Washington, Box 357360, Seattle, Washington 98195-7360, U.S.A.).
This version was modified for inclusion in EGCG by Maria Jesus Martin (E-mail: martin@ebi.ac.uk; Post: EMBL Outstation Hinxton, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SQ or E-mail: martin@tdi.es; Post: Tecnologia para Diagnostico e Investigacion, Condes de Torreanaz 5, 28028 Madrid).
All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).
Here is a session with ENeighbor:

  % eneighbor -options

  ENEIGHBOR of what distance matrix file ? fmdv.ednadist

  What should I call the output file (* fmdv.eneighbor *) ?

  Phylogenetic method :

       N)eighbor Joining method.
       U)PGMA (Average Linkage clustering) method.

  Choose the method to use (* N *) ?

  Data matrix form :

       S)quare.
       L)ower-triangular.
       U)pper-triangular.

  Choose the matrix form (* S *) ?

  OutGroup root (* No *) ?
  Subreplicates (* No *) ?
  Randomize input order of sequences (* No *) ?
  Analyze multiple data sets (* No *) ?
  Print out the data at start of run (* No *) ?

  CYCLE   3: OTU 1 ( 0.00274) JOINS OTU 2 ( 0.00556)
  CYCLE   2: NODE 1 ( 0.02307) JOINS OTU 6 ( 0.01072)
  CYCLE   1: OTU 3 ( 0.00299) JOINS OTU 4 ( 0.00041)
  LAST CYCLE:
   NODE 1 ( 0.00484) JOINS NODE 3 ( 0.00266) JOINS OTU 5 ( 0.00449)

  Output written into fmdv.eneighbor
  Tree written into fmdv.eneighbortrees

  %
The input file for ENeighbor is the output file from EDnaDist or EProtDist. The first line of the input file contains the number of species. The species data follow, one entry per species, each starting with the species name. The species name is ten characters long and must be padded with blanks if shorter. After the name comes the set of distances from that species to the other species (options allow the distance matrix to be square, upper-triangular, or lower-triangular).
Here is the input file for the example session.
    6
APHAVP1C    0.0000  0.0083  0.0392  0.0348  0.0324  0.0353
APHAVP1D    0.0083  0.0000  0.0412  0.0368  0.0344  0.0406
APHAVP1A    0.0392  0.0412  0.0000  0.0034  0.0106  0.0178
APHAVP1B    0.0348  0.0368  0.0034  0.0000  0.0071  0.0189
APHAVP1E    0.0324  0.0344  0.0106  0.0071  0.0000  0.0232
APHAVP1F    0.0353  0.0406  0.0178  0.0189  0.0232  0.0000
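The input layout described above can be read with a few lines of Python. This is only a minimal sketch, not part of ENeighbor: the function name parse_square_matrix is hypothetical, and it assumes the square matrix form with each species on a single line, the name occupying the first ten columns.

```python
def parse_square_matrix(text):
    """Parse a square distance matrix: a count line, then one line per species.

    Assumption (for illustration only): each species occupies exactly one
    line, with the name in columns 1-10 and the distances after it.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0].split()[0])          # first line: number of species
    names, rows = [], []
    for ln in lines[1:n + 1]:
        names.append(ln[:10].strip())     # ten-character, blank-padded name
        values = [float(tok) for tok in ln[10:].split()]
        if len(values) != n:
            raise ValueError(f"expected {n} distances for {names[-1]!r}")
        rows.append(values)
    return names, rows


EXAMPLE = """\
    6
APHAVP1C    0.0000  0.0083  0.0392  0.0348  0.0324  0.0353
APHAVP1D    0.0083  0.0000  0.0412  0.0368  0.0344  0.0406
APHAVP1A    0.0392  0.0412  0.0000  0.0034  0.0106  0.0178
APHAVP1B    0.0348  0.0368  0.0034  0.0000  0.0071  0.0189
APHAVP1E    0.0324  0.0344  0.0106  0.0071  0.0000  0.0232
APHAVP1F    0.0353  0.0406  0.0178  0.0189  0.0232  0.0000
"""

names, rows = parse_square_matrix(EXAMPLE)
print(names)
print(rows[0])
```

Slicing the first ten columns, rather than splitting on whitespace, keeps names that contain blanks intact.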
The output consists of a tree (rooted if UPGMA, unrooted if Neighbor-Joining) and the lengths of the interior segments. The Average Percent Standard Deviation is not computed or printed out. If the tree found by ENeighbor is fed into EFitch as a User Tree, EFitch will compute this quantity, provided the user also answers 'yes' to the 'Use lengths from user trees?' question of EFitch so that none of the branch lengths is re-estimated.
Here is the output file from the example session.
   6 Populations

Neighbor-joining method

 Negative branch lengths allowed

     +APHAVP1A
  +--3
  !  +APHAVP1B
  !
--4APHAVP1E
  !
  !     +APHAVP1C
  !  +--1
  +--2  +APHAVP1D
     !
     +APHAVP1F

remember: this is an unrooted tree!

Between        And            Length
-------        ---            ------
   4             3            0.00266
   3          APHAVP1A        0.00299
   3          APHAVP1B        0.00041
   4          APHAVP1E        0.00449
   4             2            0.00484
   2             1            0.02307
   1          APHAVP1C        0.00274
   1          APHAVP1D        0.00556
   2          APHAVP1F        0.01072
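One way to read the branch table is to add up the lengths along the path between two species and compare the sum with the corresponding entry of the input matrix. The following Python sketch does this for the example output. The helper name path_length and the encoding of the table as a list of tuples are illustrative assumptions, not ENeighbor features.

```python
from collections import defaultdict

# Branch table from the example output, as (between, and, length).
BRANCHES = [
    ("4", "3", 0.00266), ("3", "APHAVP1A", 0.00299), ("3", "APHAVP1B", 0.00041),
    ("4", "APHAVP1E", 0.00449), ("4", "2", 0.00484), ("2", "1", 0.02307),
    ("1", "APHAVP1C", 0.00274), ("1", "APHAVP1D", 0.00556),
    ("2", "APHAVP1F", 0.01072),
]


def path_length(branches, start, goal):
    """Sum the branch lengths on the unique tree path from start to goal."""
    adjacency = defaultdict(list)
    for a, b, length in branches:
        adjacency[a].append((b, length))
        adjacency[b].append((a, length))
    stack = [(start, None, 0.0)]          # (node, node we came from, distance)
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for nxt, length in adjacency[node]:
            if nxt != parent:             # a tree has no cycles to guard against
                stack.append((nxt, node, dist + length))
    raise ValueError(f"no path from {start} to {goal}")


for a, b in [("APHAVP1C", "APHAVP1D"), ("APHAVP1A", "APHAVP1B"),
             ("APHAVP1C", "APHAVP1F")]:
    print(a, b, round(path_length(BRANCHES, a, b), 5))
```

For the pairs that were joined directly (APHAVP1C with APHAVP1D, APHAVP1A with APHAVP1B) the path length matches the input distance (0.0083 and 0.0034). For more distant pairs the tree distance is only an approximation: APHAVP1C to APHAVP1F sums to 0.03653, against 0.0353 in the input matrix.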
PileUp creates a multiple sequence alignment from a group of related sequences using progressive pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment.
LineUp creates and edits multiple sequence alignments.
Pretty displays multiple sequence alignments.
Distances creates a table of the pairwise distances within a group of aligned sequences.
GrowTree creates a phylogenetic tree from a distance matrix created by Distances using either the UPGMA or neighbor-joining method. You can create a text or graphics output file.
Phylip2Tree displays trees computed with one of the PHYLIP programs or with EProtPars, EDnaPars, EDnaML, EDnaMLK, ENeighbor, EFitch and EKitsch, in GCG style.
EDnaDist computes a distance matrix from nucleic acid sequences, under four different models of nucleotide substitution (Jukes and Cantor (1969), Kimura (1980), Jin and Nei (1990) and a model of maximum likelihood (Felsenstein, 1981)).
EProtDist computes a distance measure for protein sequences, using maximum likelihood estimates based on the Dayhoff PAM matrix, Kimura's 1983 approximation to it, or a model based on the genetic code plus a constraint on changing to a different category of amino acid.
EFitch estimates phylogenies from distance matrix data under the "additive tree model" according to which the distances are expected to equal the sums of branch lengths between the species.
EKitsch estimates phylogenies from distance matrix data under the "ultrametric" model which is the same as the additive tree model except that an evolutionary clock is assumed.
EDnaPars estimates phylogenies from nucleic acid sequences using the parsimony method.
EProtPars estimates phylogenies from amino acid sequences using the parsimony method.
EDnaML estimates phylogenies from nucleotide sequences by maximum likelihood.
EDnaMLK does the same as EDnaML but assumes a molecular clock.
ESeqBoot produces multiple data sets from a molecular sequence data set by bootstrap, jackknife, or permutation resampling.
EConsense computes consensus trees by the majority-rule consensus tree method. It can be used as the final step in doing bootstrap analyses.
Two phylogenetic methods are available:
1. The Neighbor-Joining method of Saitou and Nei (1987). It constructs a tree by successive clustering of lineages, setting branch lengths as the lineages join. The tree is not rearranged thereafter. The method does not assume an evolutionary clock, so the result is in effect an unrooted tree. It should be somewhat similar to the tree obtained by EFitch. The program cannot evaluate a User tree, nor can it prevent branch lengths from becoming negative. However, the algorithm is far faster than EFitch or EKitsch. This makes it particularly effective in their place for large studies or for bootstrap or jackknife resampling studies, which require runs on multiple data sets.
2. The UPGMA method. It constructs a tree by successive (agglomerative) clustering using an average-linkage method of clustering.
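The successive clustering used by Neighbor-Joining can be sketched in Python as follows. This is a textbook formulation of the method (in the form given by Studier and Keppler, 1988), not ENeighbor's source code. The function name neighbor_joining, the "nodeN" labels, and the node numbering are illustrative assumptions and differ from the numbering ENeighbor prints.

```python
def neighbor_joining(names, dist):
    """Return branches as (interior_node, child, length) for an unrooted tree.

    Sketch only: needs at least three taxa and a full square matrix.
    """
    labels = list(names)
    d = [row[:] for row in dist]
    branches = []
    next_node = 1
    while len(labels) > 3:
        n = len(labels)
        r = [sum(row) for row in d]                  # net divergence per taxon
        # Join the pair with the smallest (n-2)*d[i][j] - r[i] - r[j].
        _, i, j = min(((n - 2) * d[i][j] - r[i] - r[j], i, j)
                      for i in range(n) for j in range(i + 1, n))
        len_i = 0.5 * d[i][j] + (r[i] - r[j]) / (2 * (n - 2))
        len_j = d[i][j] - len_i
        node = f"node{next_node}"
        next_node += 1
        branches += [(node, labels[i], len_i), (node, labels[j], len_j)]
        # Distances from the new node to every remaining taxon.
        keep = [k for k in range(n) if k not in (i, j)]
        new_row = [0.5 * (d[i][k] + d[j][k] - d[i][j]) for k in keep]
        d = [[d[a][b] for b in keep] + [new_row[idx]]
             for idx, a in enumerate(keep)]
        d.append(new_row + [0.0])
        labels = [labels[k] for k in keep] + [node]
    # Last cycle: the three remaining lineages meet at one interior node.
    node = f"node{next_node}"
    branches.append((node, labels[0], 0.5 * (d[0][1] + d[0][2] - d[1][2])))
    branches.append((node, labels[1], 0.5 * (d[0][1] + d[1][2] - d[0][2])))
    branches.append((node, labels[2], 0.5 * (d[0][2] + d[1][2] - d[0][1])))
    return branches


NAMES = ["APHAVP1C", "APHAVP1D", "APHAVP1A", "APHAVP1B", "APHAVP1E", "APHAVP1F"]
MATRIX = [
    [0.0000, 0.0083, 0.0392, 0.0348, 0.0324, 0.0353],
    [0.0083, 0.0000, 0.0412, 0.0368, 0.0344, 0.0406],
    [0.0392, 0.0412, 0.0000, 0.0034, 0.0106, 0.0178],
    [0.0348, 0.0368, 0.0034, 0.0000, 0.0071, 0.0189],
    [0.0324, 0.0344, 0.0106, 0.0071, 0.0000, 0.0232],
    [0.0353, 0.0406, 0.0178, 0.0189, 0.0232, 0.0000],
]

for parent, child, length in neighbor_joining(NAMES, MATRIX):
    print(f"{parent:>6}  {child:<10} {length:.5f}")
```

The branch lengths are fixed at the moment two lineages join and are never revisited, which is why the method is fast and also why nothing prevents a length from becoming negative.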
This distance matrix program implicitly assumes that:
a) Each distance is measured independently from the others: no item of data contributes to more than one distance.
b) The distance between each pair of taxa is drawn from a distribution with an expectation which is the sum of values (in effect amounts of evolution) along the tree from one tip to the other. The variance of the distribution is proportional to a power p of the expectation.
For more information, please see the Distance Matrix Programs documentation file ("distance.doc") from the PHYLIP (Phylogeny Inference Package) distribution Version 3.57c by Joseph Felsenstein, available by anonymous FTP at evolution.genetics.washington.edu in directory pub/phylip.
The major advantage of ENeighbor is its speed: it requires a time only proportional to the square of the number of species. By contrast, EFitch and EKitsch require a time that rises as the fourth power of the number of species. Thus ENeighbor is well suited to bootstrapping studies and to the analysis of very large trees. Simulation studies by Kuhner, Yamato and Felsenstein show that, contrary to statements in the literature by others, ENeighbor does not get as accurate an estimate of the phylogeny as does EFitch. However, it does nearly as well, and in view of its speed this makes it a quite useful program.
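To give a feel for what these growth rates mean, the short Python sketch below compares the two stated scalings for a few data set sizes. It takes the square and fourth-power figures at face value and ignores constant factors, so the numbers are relative work only, not predicted running times. The function name relative_cost is hypothetical.

```python
def relative_cost(n_species, exponent):
    """Work units for a method whose time grows as n_species ** exponent."""
    return n_species ** exponent


for n in (10, 50, 100, 500):
    ratio = relative_cost(n, 4) / relative_cost(n, 2)
    print(f"{n:>4} species: fourth-power method needs {ratio:,.0f} times "
          f"the work of the square-law method")
```

Because the ratio itself grows as the square of the number of species, the gap widens quickly with larger data sets, and it is multiplied again by the number of replicates in a bootstrap study.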
The "subreplication" option is present only to allow ENeighbor to read the input data. The number of replicates is actually ignored, even though it is read in. Note that this means that one cannot use it to have missing data in the input file, if ENeighbor is to be used.
When ENeighbor runs, it prints out an account of the successive clustering levels. In this printout of cluster levels the word "OTU" refers to a tip species, and the word "NODE" to an interior node of the resulting tree.
All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
Minimum Syntax: % eneighbor [-INfile=]file.ednadist -default

Prompted Parameters:

[-OUTfile=]file.eneighbor  output file.
-MENu=N                    menu for the phylogenetic method, where:
                             N)eighbor Joining method.
                             U)PGMA (Average Linkage clustering) method.
-MATrix=S                  form of the data matrix, where:
                             S)quare.
                             L)ower-triangular.
                             U)pper-triangular.

Optional Parameters:

-OPTions         makes the program ask for further specific options.
-RANDom=1        use a random number generator to choose the input order
                 of species. The seed should be an integer between
                 1 and 32767.
-OUTGroup=1      species used to root the tree.
                 (only available for Neighbor Joining method)
-SUBREPlicates   subreplication option (not available).
-SETS=2          multiple data sets.
-SHOWData        print data in the output file.
The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
-OPTions
  makes the program ask for all specific options.
-SETS=2
  tells the program how many data sets there are in the input file.
-RANDom=1
  uses a random number generator to choose the input order of species.
-OUTGroup=1
  specifies which species is to be used to root the tree by having it become the outgroup (the species being numbered in the order in which they occur in the input file).
-SUBREPlicates
  allows the number of replicates to be read from the input data. The number is ignored, even though it is read in.
-SHOWData
  prints the sequence data in the output file before the distance matrix.
Farris, J. S. 1981. Distance data in phylogenetic analysis. pp. 3-23 in Advances in Cladistics: Proceedings of the first meeting of the Willi Hennig Society, ed. V. A. Funk and D. R. Brooks. New York Botanical Garden, Bronx, New York.
Farris, J. S. 1985. Distance data revisited. Cladistics 1: 67-85.
Farris, J. S. 1986. Distances and statistics. Cladistics 2: 144-157.
Felsenstein, J. 1984. Distance methods for inferring phylogenies: a justification. Evolution 38: 16-24.
Felsenstein, J. 1986. Distance methods: a reply to Farris. Cladistics 2: 130-144.
Felsenstein, J. 1988. Phylogenies from molecular sequences: inference and reliability. Annual Review of Genetics 22: 521-565.
Saitou, N. and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4: 406-425.
Studier, J. A. and K. J. Keppler. 1988. A note on the neighbor-joining algorithm of Saitou and Nei. Molecular Biology and Evolution 5: 729-731.