Eseqboot

ESEQBOOT*

FUNCTION

ESeqBoot produces multiple data sets from a molecular sequence data set by bootstrap, jackknife, or permutation resampling.

ESeqBoot is a modified version of SEQBOOT from PHYLIP version 3.572c, by Joseph Felsenstein, with command-line control added.

ESeqBoot is a general bootstrapping tool. It generates multiple data sets from one input data set by bootstrap, jackknife, or permutation resampling. Other programs such as EProtPars, EDnaPars, EProtDist, EDnaDist, EDnaML, and EDnaMLK can analyze these multiple data sets. Together with these programs and EConsense, ESeqBoot can be used to do bootstrap (or delete-half-jackknife) analyses with parsimony, distance, and maximum likelihood methods. ESeqBoot can handle only molecular sequences (DNA or protein sequences). By contrast, SEQBOOT from the PHYLIP package reads molecular sequences, binary characters, restriction sites, or gene frequencies.

The input file for ESeqBoot can be an MSF or PHYLIP formatted file.

AUTHOR

This program was originally written by Joe Felsenstein (E-mail: joe@evolution.genetics.washington.edu; Post: Department of Genetics, University of Washington, Box 357360, Seattle, Washington 98195-7360, U.S.A.).

This version was modified for inclusion in EGCG by Maria Jesus Martin (E-mail: martin@ebi.ac.uk; Post: EMBL Outstation Hinxton, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SQ or E-mail: martin@tdi.es; Post: Tecnologia para Diagnostico e Investigacion, Condes de Torreanaz 5, 28028 Madrid).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).

EXAMPLE

Here is a session with ESeqBoot:

  
  
  % eseqboot -options
  
   ESEQBOOT of what sequences file ?  fmdv.msf{*}
  
   What should I call the output file (* fmdv.eseqboot *) ?
  
   Resampling method :
  
         B) ootstrap.
         D) elete-half jackknife.
         P) ermute species for each character.
  
   Please choose one (* B *) ?
  
   How many replicates (* 100 *) ?
  
   Random number seed (must be odd) (* 239 *) ?
  
   Print out the data at start of run  (* No *) ?
  
  completed replicate number   10
  completed replicate number   20
  completed replicate number   30
  completed replicate number   40
  completed replicate number   50
  completed replicate number   60
  completed replicate number   70
  completed replicate number   80
  completed replicate number   90
  completed replicate number  100
  
  Output written to fmdv.eseqboot
  
  %

INPUT FILE

The input file for ESeqBoot is either a GCG MSF protein or DNA sequence file or a PHYLIP protein or DNA sequence file.

In the PHYLIP format, the first line contains the number of species and the number of amino acid or nucleic acid positions, separated by blanks. Next come the species data. Each sequence starts on a new line with a ten-character species name, which must be padded with blanks to that length, followed immediately by the species data in the one-letter code. The sequences must be in either the "interleaved" or the "sequential" format. In the interleaved format, some lines give the first part of each of the sequences, then further lines give the next part of each, and so on. Thus the sequences might look like this:

  
  
   9 50
  FOSAVINK   ---------- ---------- ----------
  FOSCHICK   MMYQGFAGEY EAPSSRCSSA SPAGDSLTYY
  FOSMOUSE   MMFSGFNADY EASSSRCSSA SPAGDSLSYY
  FOSRAT     MMFSGFNADY EASSSRCSSA SPAGDSLSYY
  FOSHUMAN   MMFSGFNADY EASSSRCSSA SPAGDSLSYY
  FOSXMSVFR  ---------- ---------- ----DSLSYY
  FOSMSVFB   MMFSGFNADY EASSFRCSSA SPAGDSLSYY
  FOSBMOUSE  -MFQAFPGDY DSGSRCSSSP SAESQ----Y
  FOSBSTAEP  ---------- ---------- ----------
  
  ---------- -----SQDFC
  PSPADSFSSM GSPVNSQDFC
  HSPADSFSSM GSPVNTQDFC
  HSPADSFSSM GSPVNTQDFC
  HSPADSFSSM GSPVNAQDFC
  HSPADSFSSM GSPVNTQDFC
  HSPADSFSSM GSPVNTQDFC
  LSSVDSFGSP PTAAASQE-C
  ---------- ----------
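
The interleaved layout described above can be read with a short Python sketch. This is an illustration only, not part of ESeqBoot: the function name parse_interleaved_phylip is hypothetical, and the sketch assumes that the name occupies exactly the first ten columns of each line in the first block and that blanks inside the sequence data are only for readability.

```python
def parse_interleaved_phylip(text):
    """Return a list of (name, sequence) pairs from interleaved PHYLIP text."""
    # Blank lines only separate blocks, so they can be dropped.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_species, n_sites = (int(tok) for tok in lines[0].split()[:2])
    body = lines[1:]
    if len(body) < n_species:
        raise ValueError("fewer sequence lines than species")

    names = []
    parts = []
    for i, line in enumerate(body):
        if i < n_species:
            # First block: ten-character name field, then the data.
            names.append(line[:10].strip())
            parts.append([line[10:]])
        else:
            # Later blocks: data only, in the same species order.
            parts[i % n_species].append(line)

    records = []
    for name, chunks in zip(names, parts):
        seq = "".join("".join(chunks).split())  # remove spacing blanks
        if len(seq) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} positions, got {len(seq)}")
        records.append((name, seq))
    return records
```

Checking each sequence length against the header catches the most common formatting mistake, a name that is not padded to ten characters.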

The "sequential" format has all of the data for the first species, then all of the data for the next species, and so on. Here is an example in the sequential PHYLIP format:

  
  
   9 50
  FOSAVINK    ---------- ---------- ----------
  ---------- -----SQDFC
  FOSCHICK    MMYQGFAGEY EAPSSRCSSA SPAGDSLTYY
  PSPADSFSSM GSPVNSQDFC
  FOSMOUSE    MMFSGFNADY EASSSRCSSA SPAGDSLSYY
  HSPADSFSSM GSPVNTQDFC
  FOSRAT      MMFSGFNADY EASSSRCSSA SPAGDSLSYY
  HSPADSFSSM GSPVNTQDFC
  FOSHUMAN    MMFSGFNADY EASSSRCSSA SPAGDSLSYY
  HSPADSFSSM GSPVNAQDFC
  FOSXMSVFR   ---------- ---------- ----DSLSYY
  HSPADSFSSM GSPVNTQDFC
  FOSMSVFB    MMFSGFNADY EASSFRCSSA SPAGDSLSYY
  HSPADSFSSM GSPVNTQDFC
  FOSBMOUSE   -MFQAFPGDY DSGSRCSSSP SAESQ----Y
  LSSVDSFGSP PTAAASQE.C
  FOSBSTAEP   ---------- ---------- ----------
  ---------- ----------

For more information about the PHYLIP format, please see the "main.doc" file in the PHYLIP (Phylogeny Inference Package) distribution, Version 3.57c, by Joseph Felsenstein, available by anonymous FTP at evolution.genetics.washington.edu in directory pub/phylip.

OUTPUT FILE

The output file contains the data sets generated by the resampling process. Each resulting data set has the same size as the original, except with the delete-half jackknife method, where it is half the size.

Here is an example with four five-species data sets:

  
 5    6
  Alpha        ACCCAC
  Beta         ACCCCC
  Gamma        CCCCAC
  Delta        CAAACA
  Epsilon      CAAAAC
 5    6
  Alpha        AAAACC
  Beta         AACCCC
  Gamma        ACAACC
  Delta        CCCCAA
  Epsilon      CCAACC
 5    6
  Alpha        AAAAAC
  Beta         AACCCC
  Gamma        CCAAAC
  Delta        CCCCCA
  Epsilon      CCAAAC
 5    6
  Alpha        AAAAAA
  Beta         AACCCC
  Gamma        ACAAAA
  Delta        CCCCCC
  Epsilon      CCAAAA
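
A downstream script can split such a file back into its replicate data sets. The Python sketch below is a hedged illustration: the function name read_data_sets is hypothetical, and it assumes, as in the small example above, that every sequence fits on a single line after a ten-column name field. Longer sequences that continue over several lines would need extra handling.

```python
def read_data_sets(text):
    """Split concatenated PHYLIP data sets into a list of replicates.

    Each replicate is a list of (name, sequence) pairs.
    Assumption: one line per species, name in the first ten columns.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    data_sets = []
    pos = 0
    while pos < len(lines):
        # Header line: number of species and number of positions.
        n_species, n_sites = (int(tok) for tok in lines[pos].split()[:2])
        pos += 1
        block = lines[pos:pos + n_species]
        if len(block) != n_species:
            raise ValueError("truncated data set")
        records = []
        for line in block:
            name = line[:10].strip()
            seq = "".join(line[10:].split())
            if len(seq) != n_sites:
                raise ValueError(f"{name}: expected {n_sites} positions")
            records.append((name, seq))
        data_sets.append(records)
        pos += n_species
    return data_sets
```

The number of replicates found, len(read_data_sets(text)), should equal the value given for the number of replicates.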

RELATED PROGRAMS

PileUp creates a multiple sequence alignment from a group of related sequences using progressive pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment. LineUp creates and edits multiple sequence alignments. Pretty displays multiple sequence alignments.

ToPhylip writes GCG sequences into a single file in PHYLIP format. EDnaPars estimates phylogenies from nucleic acid sequences using the parsimony method. EProtPars estimates phylogenies from amino acid sequences using the parsimony method. EDnaDist computes four different distances between species from nucleic acid sequences. EProtDist computes a distance measure for protein sequences, using maximum likelihood estimates based on the Dayhoff PAM matrix, Kimura's 1983 approximation to it, or a model based on the genetic code plus a constraint on changing to a different category of amino acid. EDnaML estimates phylogenies from nucleotide sequences by maximum likelihood. EDnaMLK does the same as EDnaML but assumes a molecular clock. EConsense computes consensus trees by the majority-rule method. It can be used as the final step in doing bootstrap analyses.

ALGORITHM

Three resampling methods are available:

1. Bootstrap. Bootstrapping was invented by Bradley Efron in 1979, and its use in phylogeny estimation was introduced by Joe Felsenstein (1985). It involves creating a new data set by sampling N characters randomly with replacement, so that the resulting data set has the same size as the original, but some characters have been left out and others are duplicated. The random variation of the results from analyzing these bootstrapped data sets can be shown statistically to be typical of the variation that you would get from collecting new data sets. The method assumes that the characters evolve independently, an assumption that may not be realistic for many kinds of data.

2. Delete-half-jackknifing. This alternative to the bootstrap involves sampling a random half of the characters, and including them in the data but dropping the others. The resulting data sets are half the size of the original, and no characters are duplicated. The random variation from doing this should be very similar to that obtained from the bootstrap. The method is advocated by Wu (1986).

3. Permuting species within characters. This method of resampling (well, OK, it may not be best to call it resampling) was introduced by Archie (1989) and Faith (1990; see also Faith and Cranston, 1991). It involves permuting the columns of the data matrix separately. This produces data matrices that have the same number and kinds of characters but no taxonomic structure. It is used for different purposes than the bootstrap, as it tests not the variation around an estimated tree but the hypothesis that there is no taxonomic structure in the data: if a statistic such as number of steps is significantly smaller in the actual data than it is in replicates that are permuted, then we can argue that there is some taxonomic structure in the data (though perhaps it might be just a pair of sibling species).
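
The three methods above can be sketched in Python as follows. This is an illustration only, not ESeqBoot's source code: the function names, the list-of-strings data layout, and the use of Python's random module are assumptions, and ESeqBoot's own random number generator will give different replicates for the same seed. Rounding down for an odd number of characters in the jackknife is also an assumption.

```python
import random


def bootstrap(seqs, rng):
    """Sample N characters (columns) with replacement; size is unchanged."""
    n = len(seqs[0])
    # Sorting only keeps the sampled columns in their original order.
    cols = sorted(rng.randrange(n) for _ in range(n))
    return ["".join(s[c] for c in cols) for s in seqs]


def delete_half_jackknife(seqs, rng):
    """Keep a random half of the characters, without duplication."""
    n = len(seqs[0])
    keep = sorted(rng.sample(range(n), n // 2))  # assumption: round down
    return ["".join(s[c] for c in keep) for s in seqs]


def permute_species(seqs, rng):
    """Shuffle each column separately across the species."""
    n = len(seqs[0])
    columns = []
    for c in range(n):
        col = [s[c] for s in seqs]
        rng.shuffle(col)  # independent permutation for this character
        columns.append(col)
    return ["".join(col[i] for col in columns) for i in range(len(seqs))]


if __name__ == "__main__":
    rng = random.Random(239)  # any odd seed; 239 is the one used in the session
    data = ["ACCCAC", "ACCCCC", "CCCCAC", "CAAACA", "CAAAAC"]
    for method in (bootstrap, delete_half_jackknife, permute_species):
        print(method.__name__, method(data, rng))
```

In all three functions the same column choice is applied to every species, so each resampled character keeps its states aligned across species, except in the permutation method, where breaking that alignment is the point.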

CONSIDERATIONS

ESeqBoot can be used as the first step in doing bootstrap analyses. To carry out a bootstrap (or jackknife, or permutation test) you can use this program together with one of the tree-making programs and the EConsense program. First, run ESeqBoot to take the original data set and produce a large number (say 100) of bootstrapped data sets. Then find the phylogeny estimate for each of these, using the particular method of interest. For parsimony use EDnaPars (nucleic acid sequences) or EProtPars (protein sequences), and for maximum likelihood use EDnaML or EDnaMLK (EDnaML with a molecular clock). The input file for these programs is the output file from ESeqBoot. All of these programs have a "multiple data" option that you can select by answering 'yes' to the 'Analyze multiple data sets ?' question or by using the -SETS=n command-line option (where n is the number of data sets). Each run produces a large output file containing an ASCII representation of the trees and a second file containing the trees from the n data sets in nested-parenthesis notation. This second file serves as the input for EConsense, which makes a majority-rule consensus tree from it.

If you are using the distance matrix programs (EDnaDist or EProtDist), you will have to add one extra step to this. For example: (1) run ESeqBoot; (2) run EDnaDist using the output of ESeqBoot as its input; (3) run (say) ENeighbor using the output of EDnaDist as its input; (4) run EConsense using the tree file from ENeighbor as its input.
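
The final consensus step can be illustrated with a small Python sketch of the majority-rule idea: a group of species is kept if it appears in more than half of the replicate trees. The representation is hypothetical. Each tree is given as a set of groups (frozensets of species names) rather than in the parenthesis notation that EConsense actually reads, and the function name majority_rule_groups is an assumption.

```python
from collections import Counter


def majority_rule_groups(trees):
    """Return {group: fraction of trees containing it} for majority groups.

    trees: list of sets, each set holding the groups (frozensets of
    species names) found in one replicate tree.
    """
    n_trees = len(trees)
    # Each tree is a set, so a group is counted at most once per tree.
    counts = Counter(group for tree in trees for group in tree)
    return {
        group: count / n_trees
        for group, count in counts.items()
        if count > n_trees / 2  # strictly more than half of the trees
    }
```

The fraction attached to each group is the value usually reported as its bootstrap support.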

You can tell the program the number of replicate data sets. This defaults to 100. Most statisticians would be happiest with 1000 to 10,000 replicates in a bootstrap, but 100 gives a good rough picture. You will have to decide this based on how long a running time you want.
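
One way to judge how many replicates are enough is the sampling error of a support value. If a group appears in a proportion p of the replicates, and the replicates are treated as independent draws, the standard error of p is the usual binomial one, sqrt(p(1-p)/n). The Python sketch below computes it; the function name is hypothetical.

```python
import math


def support_standard_error(p, replicates):
    """Binomial standard error of a proportion p estimated from n replicates."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if replicates <= 0:
        raise ValueError("replicates must be positive")
    return math.sqrt(p * (1.0 - p) / replicates)


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, support_standard_error(0.5, n))
```

At p = 0.5, the worst case, the standard error is 0.05 with 100 replicates and 0.005 with 10,000, so each tenfold gain in precision costs a hundredfold increase in running time.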

COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

  
  
  Minimum Syntax: % eseqboot [-INfile=]data.msf{*} -default
  
  Prompted Parameters:
  
  [-OUTfile=]data.eseqboot  output file.
  -MENu=b                   menu for resampling method, where:
             B) ootstrap.
             D) elete-half jackknife.
             P) ermute species for each character.
  
  -INTERLeaved              interleaved PHYLIP formatted input file
                             (only for PHYLIP formatted input file).
  -NOINTERLeaved            sequential PHYLIP formatted input file
                             (only for PHYLIP formatted input file).
  -REPs=100                 number of replicate data sets (defaults to 100)
  
  
  Optional Parameters:
  
  -OPTions                  makes the program ask for further specific
                             options.
  -SEED=5                   random number seed (must be odd).
  -SHOWData                 print data in the output file.

OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

-OPTions

makes the program ask for all specific options.

-SEED=4331

random number seed. This should be an odd integer greater than zero and less than 32767. By default, the program generates a seed by taking the minutes and seconds from the system's internal clock (minutes * 100 + seconds).

-SHOWData

prints the original input data set to the output file before the resampled data sets.
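
The constraints on -SEED can be checked before a run with a short Python sketch. The function names are hypothetical. The clock-based default follows the formula given above, but that formula can yield an even number and this document does not say how the program handles that case, so forcing the lowest bit on is an assumption made here only to satisfy the "must be odd" rule.

```python
def is_valid_seed(seed):
    """True if seed is an odd integer greater than 0 and less than 32767."""
    return 0 < seed < 32767 and seed % 2 == 1


def clock_seed(minutes, seconds):
    """Seed from the clock as minutes * 100 + seconds.

    Assumption: an even result is made odd by setting the lowest bit.
    """
    return (minutes * 100 + seconds) | 1
```

For example, is_valid_seed(239) and is_valid_seed(4331) hold, while is_valid_seed(4) does not because 4 is even.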

REFERENCES

Archie, J. W. 1989. A randomization test for phylogenetic information in systematic data. Systematic Zoology 38: 219-252.

Faith, D. P. 1990. Chance marsupial relationships. Nature 345: 393-394.

Faith, D. P. and P. S. Cranston. 1991. Could a cladogram this short have arisen by chance alone?: On permutation tests for cladistic structure. Cladistics 7: 1-28.

Felsenstein, J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39: 783-791.

Wu, C. F. J. 1986. Jackknife, bootstrap and other resampling plans in regression analysis. Annals of Statistics 14: 1261-1295.

Printed: November 15, 1996 11:47 (1162)