ESeqBoot produces multiple data sets from a molecular sequence data set by bootstrap, jackknife, or permutation resampling.
ESeqBoot is a modified version of the PHYLIP version 3.572c's SEQBOOT, by Joseph Felsenstein, with command line control added.
ESeqBoot is a general bootstrapping tool. It generates multiple data sets from one input data set by bootstrap, jackknife, or permutation resampling. Other programs such as EProtPars, EDnaPars, EProtDist, EDnaDist, EDnaML and EDnaMLK can analyze these multiple data sets. Together with these programs and EConsense, ESeqBoot can be used to do bootstrap (or delete-half-jackknife) analyses with parsimony, distance and maximum likelihood methods. ESeqBoot can only handle molecular sequences (DNA or protein sequences), whereas SEQBOOT, from the PHYLIP package, also reads binary characters, restriction sites, and gene frequencies.
The input file for ESeqBoot can be an MSF or PHYLIP formatted file.
This program was originally written by Joe Felsenstein (E-mail: joe@evolution.genetics.washington.edu; Post: Department of Genetics, University of Washington, Box 357360, Seattle, Washington 98195-7360, U.S.A.).
This version was modified for inclusion in EGCG by Maria Jesus Martin (E-mail: martin@ebi.ac.uk; Post: EMBL Outstation Hinxton, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SQ or E-mail: martin@tdi.es; Post: Tecnologia para Diagnostico e Investigacion, Condes de Torreanaz 5, 28028 Madrid).
All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).
Here is a session with ESeqBoot:

  % eseqboot -options

   ESEQBOOT of what sequences file ?  fmdv.msf{*}

   What should I call the output file (* fmdv.eseqboot *) ?

   Resampling method :

      B) ootstrap.
      D) elete-half jackknife.
      P) ermute species for each character.

   Please choose one (* B *) ?

   How many replicates (* 100 *) ?

   Random number seed (must be odd) (* 239 *) ?

   Print out the data at start of run (* No *) ?

   completed replicate number   10
   completed replicate number   20
   completed replicate number   30
   completed replicate number   40
   completed replicate number   50
   completed replicate number   60
   completed replicate number   70
   completed replicate number   80
   completed replicate number   90
   completed replicate number  100

   Output written to dna.eseqboot

  %
The input file for ESeqBoot is either a GCG MSF protein or DNA sequence file or a PHYLIP protein or DNA sequence file.
In the PHYLIP format the first line contains the number of species and the number of amino acid or nucleic acid positions, separated by blanks. Next come the species data. Each sequence starts on a new line and has a ten-character species name, which must be blank-filled to that length, followed immediately by the species data in the one-letter code. The sequences must be in either the "interleaved" or the "sequential" format. In the interleaved format, some lines give the first part of each of the sequences, then further lines give the next part of each, and so on. Thus the sequences might look like this:
    9   50
FOSAVINK   ---------- ---------- ----------
FOSCHICK   MMYQGFAGEY EAPSSRCSSA SPAGDSLTYY
FOSMOUSE   MMFSGFNADY EASSSRCSSA SPAGDSLSYY
FOSRAT     MMFSGFNADY EASSSRCSSA SPAGDSLSYY
FOSHUMAN   MMFSGFNADY EASSSRCSSA SPAGDSLSYY
FOSXMSVFR  ---------- ---------- ----DSLSYY
FOSMSVFB   MMFSGFNADY EASSFRCSSA SPAGDSLSYY
FOSBMOUSE  -MFQAFPGDY DSGSRCSSSP SAESQ----Y
FOSBSTAEP  ---------- ---------- ----------

           ---------- -----SQDFC
           PSPADSFSSM GSPVNSQDFC
           HSPADSFSSM GSPVNTQDFC
           HSPADSFSSM GSPVNTQDFC
           HSPADSFSSM GSPVNAQDFC
           HSPADSFSSM GSPVNTQDFC
           HSPADSFSSM GSPVNTQDFC
           LSSVDSFGSP PTAAASQE-C
           ---------- ----------

The "sequential" format has all of the data for the first species, then all of the characters for the next species, and so on. Here is an example with a sequential PHYLIP format:
    9   50
FOSAVINK   ---------- ---------- ---------- ---------- -----SQDFC
FOSCHICK   MMYQGFAGEY EAPSSRCSSA SPAGDSLTYY PSPADSFSSM GSPVNSQDFC
FOSMOUSE   MMFSGFNADY EASSSRCSSA SPAGDSLSYY HSPADSFSSM GSPVNTQDFC
FOSRAT     MMFSGFNADY EASSSRCSSA SPAGDSLSYY HSPADSFSSM GSPVNTQDFC
FOSHUMAN   MMFSGFNADY EASSSRCSSA SPAGDSLSYY HSPADSFSSM GSPVNAQDFC
FOSXMSVFR  ---------- ---------- ----DSLSYY HSPADSFSSM GSPVNTQDFC
FOSMSVFB   MMFSGFNADY EASSFRCSSA SPAGDSLSYY HSPADSFSSM GSPVNTQDFC
FOSBMOUSE  -MFQAFPGDY DSGSRCSSSP SAESQ----Y LSSVDSFGSP PTAAASQE.C
FOSBSTAEP  ---------- ---------- ---------- ---------- ----------
For more information about the Phylip format, please see the "main.doc" file from PHYLIP (Phylogeny Inference Package) distribution Version 3.57c by Joseph Felsenstein, available by anonymous FTP at evolution.genetics.washington.edu in directory pub/phylip.
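The two PHYLIP layouts described above can be read with a few lines of code. The following Python sketch is only an illustration and is not part of ESeqBoot or PHYLIP; the function name parse_phylip and its interleaved argument are hypothetical. It assumes that names occupy exactly the first ten columns, that blanks inside the sequence data are only visual separators, and that blank lines may be ignored.

```python
def parse_phylip(text, interleaved=True):
    """Parse one PHYLIP data set into a dict of name -> sequence.

    Assumes ten-column names and treats blanks in the data as separators.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_species, n_sites = (int(x) for x in lines[0].split()[:2])
    body = lines[1:]
    names, seqs = [], []

    if interleaved:
        # First block: one named line per species.
        for ln in body[:n_species]:
            names.append(ln[:10].strip())
            seqs.append(ln[10:].replace(" ", ""))
        # Later blocks: unnamed lines, cycling through the species in order.
        for i, ln in enumerate(body[n_species:]):
            seqs[i % n_species] += ln.replace(" ", "")
    else:
        # Sequential: a named line starts each species; continuation lines
        # are read until n_sites characters have been collected.
        it = iter(body)
        for _ in range(n_species):
            ln = next(it)
            names.append(ln[:10].strip())
            seq = ln[10:].replace(" ", "")
            while len(seq) < n_sites:
                seq += next(it).replace(" ", "")
            seqs.append(seq)

    for name, seq in zip(names, seqs):
        if len(seq) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, got {len(seq)}")
    return dict(zip(names, seqs))
```

Calling parse_phylip(text) on an interleaved file and parse_phylip(text, interleaved=False) on the sequential version of the same alignment should give the same dictionary, which is a convenient check that a file is laid out as intended.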
The output file will contain the data sets generated by the resampling process. Each resulting data set has the same size as the original, except with the delete-half-jackknife resampling method, where it is half the size.
Here is an example with four five-species data sets:
    5    6
Alpha     ACCCAC
Beta      ACCCCC
Gamma     CCCCAC
Delta     CAAACA
Epsilon   CAAAAC
    5    6
Alpha     AAAACC
Beta      AACCCC
Gamma     ACAACC
Delta     CCCCAA
Epsilon   CCAACC
    5    6
Alpha     AAAAAC
Beta      AACCCC
Gamma     CCAAAC
Delta     CCCCCA
Epsilon   CCAAAC
    5    6
Alpha     AAAAAA
Beta      AACCCC
Gamma     ACAAAA
Delta     CCCCCC
Epsilon   CCAAAA
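Because the replicate data sets are simply written one after another, a script that post-processes the output has to split them apart again. The Python sketch below shows one way to do that. The function name split_data_sets is hypothetical, and the code assumes that every species occupies exactly one line, as in the example above; wrapped or interleaved output would need a fuller parser.

```python
def split_data_sets(text):
    """Split concatenated PHYLIP data sets into (n_species, n_sites, rows).

    Assumes one line per species with a ten-column name.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    data_sets = []
    pos = 0
    while pos < len(lines):
        # Each data set starts with its own "species sites" header line.
        n_species, n_sites = (int(x) for x in lines[pos].split()[:2])
        rows = {}
        for ln in lines[pos + 1 : pos + 1 + n_species]:
            rows[ln[:10].strip()] = ln[10:].replace(" ", "")
        if len(rows) != n_species:
            raise ValueError("truncated data set")
        data_sets.append((n_species, n_sites, rows))
        pos += 1 + n_species
    return data_sets
```

The number of tuples returned should match the number of replicates requested when ESeqBoot was run.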
PileUp creates a multiple sequence alignment from a group of related sequences using progressive pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment. LineUp creates and edits multiple sequence alignments. Pretty displays multiple sequence alignments.
ToPhylip writes GCG sequences into a single file in PHYLIP format. EDnaPars estimates phylogenies from nucleic acid sequences using the parsimony method. EProtPars estimates phylogenies from amino acid sequences using the parsimony method. EDnaDist computes four different distances between species from nucleic acid sequences. EProtDist computes a distance measure for protein sequences, using maximum likelihood estimates based on the Dayhoff PAM matrix, Kimura's 1983 approximation to it, or a model based on the genetic code plus a constraint on changing to a different category of amino acid. EDnaML estimates phylogenies from nucleotide sequences by maximum likelihood. EDnaMLK does the same as EDnaML but assumes a molecular clock. EConsense computes consensus trees by the majority-rule consensus tree method. It can be used as the final step in doing bootstrap analyses.
Three resampling methods are available:
1. Bootstrap. Bootstrapping was invented by Bradley Efron in 1979, and its use in phylogeny estimation was introduced by Joe Felsenstein (1985). It involves creating a new data set by sampling N characters randomly with replacement, so that the resulting data set has the same size as the original, but some characters have been left out and others are duplicated. The random variation of the results from analyzing these bootstrapped data sets can be shown statistically to be typical of the variation that you would get from collecting new data sets. The method assumes that the characters evolve independently, an assumption that may not be realistic for many kinds of data.
2. Delete-half-jackknifing. This alternative to the bootstrap involves sampling a random half of the characters, and including them in the data but dropping the others. The resulting data sets are half the size of the original, and no characters are duplicated. The random variation from doing this should be very similar to that obtained from the bootstrap. The method is advocated by Wu (1986).
3. Permuting species within characters. This method of resampling (well, OK, it may not be best to call it resampling) was introduced by Archie (1989) and Faith (1990; see also Faith and Cranston, 1991). It involves permuting the columns of the data matrix separately. This produces data matrices that have the same number and kinds of characters but no taxonomic structure. It is used for different purposes than the bootstrap, as it tests not the variation around an estimated tree but the hypothesis that there is no taxonomic structure in the data: if a statistic such as number of steps is significantly smaller in the actual data than it is in replicates that are permuted, then we can argue that there is some taxonomic structure in the data (though perhaps it might be just a pair of sibling species).
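The three resampling methods described above can be sketched in a few lines of Python. This is an illustration of the ideas only, not ESeqBoot's or SEQBOOT's actual code. All function names are hypothetical, an alignment is represented as a list of site columns, and two details are assumptions made for the sketch: sampled columns are written back in their original order, and the jackknife keeps N // 2 columns when N is odd.

```python
import random


def to_columns(rows):
    """Turn aligned sequences (one string per species) into site columns."""
    return ["".join(chars) for chars in zip(*rows)]


def to_rows(columns):
    """Inverse of to_columns: rebuild one string per species."""
    return ["".join(chars) for chars in zip(*columns)]


def bootstrap(columns, rng):
    """Sample N columns with replacement; the result has N columns."""
    n = len(columns)
    picks = sorted(rng.randrange(n) for _ in range(n))
    return [columns[i] for i in picks]


def delete_half_jackknife(columns, rng):
    """Keep a random half of the columns; none is duplicated."""
    n = len(columns)
    keep = sorted(rng.sample(range(n), n // 2))  # assumption: round down
    return [columns[i] for i in keep]


def permute_within_characters(columns, rng):
    """Shuffle the species order independently inside every column."""
    permuted = []
    for col in columns:
        states = list(col)
        rng.shuffle(states)
        permuted.append("".join(states))
    return permuted


if __name__ == "__main__":
    rows = ["ACGTAC", "ACGTTC", "TCGAAC"]  # toy alignment, three species
    rng = random.Random(239)               # fixed seed for repeatability
    cols = to_columns(rows)
    for method in (bootstrap, delete_half_jackknife, permute_within_characters):
        print(method.__name__, to_rows(method(cols, rng)))
```

Note that the first two methods choose whole columns, so each species keeps its own character states, whereas the permutation method keeps every column's composition but breaks the link between species across columns.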
ESeqBoot can be used as the first step in doing bootstrap analyses. To carry out a bootstrap (or jackknife, or permutation test) you can use this program together with one of the tree-making programs and the EConsense program. First, run ESeqBoot to take the original data set and produce a large number (say 100) of bootstrapped data sets. Then find the phylogeny estimate for each of these, using the particular method of interest. For parsimony use EDnaPars (nucleic acid sequences) or EProtPars (amino acid sequences), and for maximum likelihood use EDnaML or EDnaMLK (EDnaML with a molecular clock). The input file for these programs is the output file from ESeqBoot. All of these programs have a "multiple data" option that you can select by answering 'yes' to the 'Analyze multiple data sets ?' question or by using the -SETS=n command-line option (where n is the number of data sets). Each run produces a large output file containing an ASCII representation of the trees, and a second file containing the trees from the n data sets in nested-pairs parenthesis notation. This second file serves as the input for EConsense, which makes a majority-rule consensus tree from it.
If you are using the distance matrix programs (EDnaDist or EProtDist), you will have to add one extra step to this. For example:
(1) run ESeqBoot;
(2) run EDnaDist using the output of ESeqBoot as its input;
(3) run (say) ENeighbor using the output of EDnaDist as its input;
(4) run EConsense using the tree file from ENeighbor as its input.
You can tell the program the number of replicate data sets. This defaults to 100. Most statisticians would be happiest with 1000 to 10,000 replicates in a bootstrap, but 100 gives a good rough picture. You will have to decide this based on how long a running time you want.
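One way to see why 100 replicates gives only a rough picture is to look at the sampling error of a support value. If a group appears in a fraction p of the replicate trees, and the replicates are treated as independent trials, the standard error of that fraction is sqrt(p(1-p)/n) for n replicates. The sketch below evaluates this simple binomial formula; the function name is hypothetical, and the calculation covers only the Monte Carlo error of the resampling, not how well the bootstrap itself measures confidence.

```python
import math


def support_standard_error(p, replicates):
    """Binomial standard error of a support proportion p from n replicates."""
    return math.sqrt(p * (1.0 - p) / replicates)


if __name__ == "__main__":
    # Worst case is p = 0.5, where the binomial variance is largest.
    for n in (100, 1000, 10000):
        print(n, round(support_standard_error(0.5, n), 4))
```

At p = 0.5 this gives about 0.05 for 100 replicates, 0.0158 for 1000 and 0.005 for 10,000, so each tenfold increase in running time narrows the error by a factor of roughly three.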
All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
Minimum Syntax: % eseqboot [-INfile=]data.msf{*} -default

Prompted Parameters:

[-OUTfile=]data.eseqboot  output file.
-MENu=b                   menu for resampling method, where:
                             B) ootstrap.
                             D) elete-half jackknife.
                             P) ermute species for each character.
-INTERLeaved              interleaved PHYLIP formatted input file
                          (only for PHYLIP formatted input file).
-NOINTERLeaved            sequential PHYLIP formatted input file
                          (only for PHYLIP formatted input file).
-REPs=100                 number of replicate data sets (defaults to 100)

Optional Parameters:

-OPTions                  makes the program ask for further specific options.
-SEED=5                   random number seed (must be odd).
-SHOWData                 print data in the output file.
The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
-OPTions
makes the program ask for all specific options.

-SEED=5
random number seed. This should be an odd integer greater than zero and less than 32767. By default, the program generates a seed from the minutes and seconds of the system's internal clock (minutes * 100 + seconds).

-SHOWData
prints the original input data set on the output file before the resampled data sets.
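The seed rules given above (odd, greater than zero, less than 32767, default taken from minutes * 100 + seconds) can be sketched in Python as follows. Both function names are hypothetical and this is not the program's own code. In particular, the text does not say what happens when the clock value is even, so adding 1 in that case is purely an assumption made for this sketch.

```python
import time


def is_valid_seed(seed):
    """Check the documented constraints: odd, above 0, below 32767."""
    return 0 < seed < 32767 and seed % 2 == 1


def default_seed(now=None):
    """Build a seed as minutes * 100 + seconds from the local clock."""
    t = time.localtime() if now is None else now
    seed = t.tm_min * 100 + t.tm_sec
    # Assumption: the documentation requires an odd seed but does not say
    # how an even clock value is handled; this sketch simply adds 1.
    if seed % 2 == 0:
        seed += 1
    return seed


if __name__ == "__main__":
    seed = default_seed()
    print(seed, is_valid_seed(seed))
```

Since minutes and seconds never exceed 59, a clock-derived seed is at most 5959, well inside the allowed range. Passing an explicit -SEED value instead makes a run repeatable.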
Archie, J. W. 1989. A randomization test for phylogenetic information in systematic data. Systematic Zoology 38: 219-252.
Faith, D. P. 1990. Chance marsupial relationships. Nature 345: 393-394.
Faith, D. P. and P. S. Cranston. 1991. Could a cladogram this short have arisen by chance alone?: On permutation tests for cladistic structure. Cladistics 7: 1-28.
Felsenstein, J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39: 783-791.
Wu, C. F. J. 1986. Jackknife, bootstrap and other resampling plans in regression analysis. Annals of Statistics 14: 1261-1295.
Printed: November 15, 1996 11:47 (1162)