frestboot

 

Wiki

The master copies of EMBOSS documentation are available at http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.

Please help by correcting and extending the Wiki pages.

Function

Bootstrapped restriction sites algorithm

Description

Reads in a data set and produces multiple data sets from it by bootstrap resampling. Since most programs in the current version of the package allow processing of multiple data sets, this can be used together with the consensus tree program CONSENSE to do bootstrap (or delete-half-jackknife) analyses with most of the methods in this package. This program also allows the Archie/Faith technique of permutation of species within characters. It can also rewrite a data set to convert it between the PHYLIP Interleaved and Sequential forms, or into a preliminary version of a new XML sequence alignment format that is under development.
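As a rough illustration of the permutation of species within characters mentioned above, the following Python sketch shuffles each character (column) independently across the species. The function name and the list-of-strings representation are assumptions made for this example; it is not the code used by frestboot.

```python
import random


def permute_species_within_characters(rows, rng=None):
    """Shuffle every column independently across the species (rows).

    rows is a list of equal-length strings, one per species.
    This is an illustrative sketch, not the frestboot implementation.
    """
    rng = rng or random.Random()
    # Transpose so that each inner list holds one character for all species.
    columns = [list(column) for column in zip(*rows)]
    for column in columns:
        rng.shuffle(column)
    # Transpose back to one string per species.
    return ["".join(chars) for chars in zip(*columns)]


data = [
    "++-+-++--+++-",
    "++++--+--+++-",
    "-+--+-++-+-++",
    "++-+----++---",
    "++++----++---",
]
permuted = permute_species_within_characters(data, random.Random(3))
for row in permuted:
    print(row)
```

Each column keeps the same number of + and - states; only their assignment to species changes.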

Algorithm

FRESTBOOT is a restriction site specific version of SEQBOOT.

SEQBOOT is a general bootstrapping and data set translation tool. It is intended to allow you to generate multiple data sets that are resampled versions of the input data set. Since almost all programs in the package can analyze these multiple data sets, this allows almost anything in this package to be bootstrapped, jackknifed, or permuted. SEQBOOT can handle molecular sequences, binary characters, restriction sites, or gene frequencies. It can also convert data sets between Sequential and Interleaved format, and into the NEXUS format or into a new XML sequence alignment format.
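The idea of bootstrap resampling can be sketched in a few lines of Python. This is a minimal illustration, not the SEQBOOT code: the function name, the wrap-around handling of blocks, and the choice to write the drawn sites in their original order are all assumptions made for the example.

```python
import random


def bootstrap_sites(rows, block_size=1, rng=None):
    """Resample site columns with replacement.

    rows is a list of equal-length strings, one per species. The result
    has the same number of sites as the input. Illustrative sketch only.
    """
    rng = rng or random.Random()
    n_sites = len(rows[0])
    picked = []
    # Draw blocks of adjacent sites until n_sites columns are collected.
    while len(picked) < n_sites:
        start = rng.randrange(n_sites)
        for offset in range(block_size):
            if len(picked) < n_sites:
                # Assumption: blocks wrap around the end of the data.
                picked.append((start + offset) % n_sites)
    # Assumption: keep drawn sites in their original order, so that
    # repeated sites end up next to each other.
    picked.sort()
    return ["".join(row[i] for i in picked) for row in rows]


data = [
    "++-+-++--+++-",
    "++++--+--+++-",
    "-+--+-++-+-++",
    "++-+----++---",
    "++++----++---",
]
replicate = bootstrap_sites(data, rng=random.Random(3))
for row in replicate:
    print(row)
```

Because whole columns are drawn, every column of a replicate is a copy of some column of the original data set.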

To carry out a bootstrap (or jackknife, or permutation test) with some method in the package, you may need to use three programs. First, you need to run SEQBOOT to take the original data set and produce a large number of bootstrapped or jackknifed data sets (somewhere between 100 and 1000 is usually adequate). Then you need to find the phylogeny estimate for each of these, using the particular method of interest. For example, if you were using DNAPARS you would first run SEQBOOT and make a file with 100 bootstrapped data sets. Then you would give this file the proper name so that it becomes the input file for DNAPARS. Running DNAPARS with the M (Multiple Data Sets) menu choice and telling it to expect 100 data sets, you would generate a big output file as well as a treefile with the trees from the 100 data sets. This treefile could be renamed so that it would serve as the input for CONSENSE. When CONSENSE is run, the majority rule consensus tree will result, showing the outcome of the analysis.

This may sound tedious, but the run of CONSENSE is fast, and that of SEQBOOT is fairly fast, so that it will not actually take any longer than a run of a single bootstrap program with the same original data and the same number of replicates. This is not very hard and allows bootstrapping or jackknifing on many of the methods in this package. The same steps are necessary with all of them. Doing things this way, some of the intermediate files (the tree file from the DNAPARS run, for example) can be used to summarize the results of the bootstrap in ways other than the majority rule consensus method.

If you are using the Distance Matrix programs, you will have to add one extra step to this, calculating distance matrices from each of the replicate data sets, using DNADIST or GENDIST. So (for example) you would run SEQBOOT, then run DNADIST using the output of SEQBOOT as its input, then run (say) NEIGHBOR using the output of DNADIST as its input, and then run CONSENSE using the tree file from NEIGHBOR as its input.

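The resampling methods available, selected with the -test qualifier, are bootstrap (b), jackknife (j), permutation of species for each character (c), permutation of character order (o), and permutation within species (s). The remaining choice, r, rewrites the data set without resampling.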

Usage

Here is a sample session with frestboot:


% frestboot -seed 3 
Bootstrapped restriction sites algorithm
Input file: restboot.dat
Phylip seqboot_rest program output file [restboot.frestboot]: 


completed replicate number   10
completed replicate number   20
completed replicate number   30
completed replicate number   40
completed replicate number   50
completed replicate number   60
completed replicate number   70
completed replicate number   80
completed replicate number   90
completed replicate number  100

Output written to file "restboot.frestboot"

Done.
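For use from a script, the same kind of run can be assembled non-interactively. The Python sketch below only builds the command line and does not execute it. The helper name frestboot_command is hypothetical; the qualifiers it uses (-infile, -outfile, -test, -reps, -seed, -auto) are the ones listed in the command line arguments of this page.

```python
import shlex

VALID_TESTS = {"b", "j", "c", "o", "s", "r"}


def frestboot_command(infile, outfile, test="b", reps=100, seed=1):
    """Return the argument list for a non-interactive frestboot run.

    Hypothetical helper for illustration; it checks only the constraints
    stated in the qualifier descriptions.
    """
    if test not in VALID_TESTS:
        raise ValueError("test must be one of b, j, c, o, s, r")
    if reps < 1:
        raise ValueError("reps must be 1 or more")
    if not 1 <= seed <= 32767 or seed % 2 == 0:
        raise ValueError("seed must be an odd integer from 1 to 32767")
    return [
        "frestboot",
        "-infile", infile,
        "-outfile", outfile,
        "-test", test,
        "-reps", str(reps),
        "-seed", str(seed),
        "-auto",  # turn off prompts
    ]


command = frestboot_command("restboot.dat", "restboot.frestboot", seed=3)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run on a system where EMBOSS and the PHYLIPNEW package are installed.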



Command line arguments

Bootstrapped restriction sites algorithm
Version: EMBOSS:6.6.0.0

   Standard (Mandatory) qualifiers:
  [-infile]            discretestates File containing one or more sets of
                                  restriction data
  [-outfile]           outfile    [*.frestboot] Phylip seqboot_rest program
                                  output file

   Additional (Optional) qualifiers (* if not always prompted):
   -weights            properties Weights file
   -test               menu       [b] Choose test (Values: b (Bootstrap); j
                                  (Jackknife); c (Permute species for each
                                  character); o (Permute character order); s
                                  (Permute within species); r (Rewrite data))
*  -regular            toggle     [N] Altered sampling fraction
*  -fracsample         float      [100.0] Samples as percentage of sites
                                  (Number from 0.100 to 100.000)
*  -rewriteformat      menu       [p] Output format (Values: p (PHYLIP); n
                                  (NEXUS); x (XML))
*  -blocksize          integer    [1] Block size for bootstraping (Integer 1
                                  or more)
*  -reps               integer    [100] How many replicates (Integer 1 or
                                  more)
*  -justweights        menu       [d] Write out datasets or just weights
                                  (Values: d (Datasets); w (Weights))
   -enzymes            boolean    [N] Is the number of enzymes present in
                                  input file
*  -seed               integer    [1] Random number seed between 1 and 32767
                                  (must be odd) (Integer from 1 to 32767)
   -printdata          boolean    [N] Print out the data at start of run
*  -[no]dotdiff        boolean    [Y] Use dot-differencing
   -[no]progress       boolean    [Y] Print indications of progress of run

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit

Qualifier | Type | Description | Allowed values | Default

Standard (Mandatory) qualifiers
[-infile] (Parameter 1) | discretestates | File containing one or more sets of restriction data | Discrete states file |
[-outfile] (Parameter 2) | outfile | Phylip seqboot_rest program output file | Output file | <*>.frestboot

Additional (Optional) qualifiers
-weights | properties | Weights file | Property value(s) |
-test | list | Choose test | b (Bootstrap); j (Jackknife); c (Permute species for each character); o (Permute character order); s (Permute within species); r (Rewrite data) | b
-regular | toggle | Altered sampling fraction | Toggle value Yes/No | No
-fracsample | float | Samples as percentage of sites | Number from 0.100 to 100.000 | 100.0
-rewriteformat | list | Output format | p (PHYLIP); n (NEXUS); x (XML) | p
-blocksize | integer | Block size for bootstraping | Integer 1 or more | 1
-reps | integer | How many replicates | Integer 1 or more | 100
-justweights | list | Write out datasets or just weights | d (Datasets); w (Weights) | d
-enzymes | boolean | Is the number of enzymes present in input file | Boolean value Yes/No | No
-seed | integer | Random number seed between 1 and 32767 (must be odd) | Integer from 1 to 32767 | 1
-printdata | boolean | Print out the data at start of run | Boolean value Yes/No | No
-[no]dotdiff | boolean | Use dot-differencing | Boolean value Yes/No | Yes
-[no]progress | boolean | Print indications of progress of run | Boolean value Yes/No | Yes

Advanced (Unprompted) qualifiers
(none)

Associated qualifiers
"-outfile" associated outfile qualifiers
-odirectory2, -odirectory_outfile | string | Output directory | Any string |

General qualifiers
-auto | boolean | Turn off prompts | Boolean value Yes/No | N
-stdout | boolean | Write first file to standard output | Boolean value Yes/No | N
-filter | boolean | Read first file from standard input, write first file to standard output | Boolean value Yes/No | N
-options | boolean | Prompt for standard and additional values | Boolean value Yes/No | N
-debug | boolean | Write debug output to program.dbg | Boolean value Yes/No | N
-verbose | boolean | Report some/full command line options | Boolean value Yes/No | Y
-help | boolean | Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose | Boolean value Yes/No | N
-warning | boolean | Report warnings | Boolean value Yes/No | Y
-error | boolean | Report errors | Boolean value Yes/No | Y
-fatal | boolean | Report fatal errors | Boolean value Yes/No | Y
-die | boolean | Report dying program messages | Boolean value Yes/No | Y
-version | boolean | Report version number and exit | Boolean value Yes/No | N

Input file format

The data files read by SEQBOOT (and hence by frestboot) are the standard ones for the various kinds of data. For molecular sequences the sequences may be either interleaved or sequential, and similarly for restriction sites. Restriction sites data may either have or not have the third argument, the number of restriction enzymes used. Discrete morphological characters are always assumed to be in sequential format. Gene frequencies data start with the number of species and the number of loci, and then follow that by a line with the number of alleles at each locus. The data for each locus may either have one entry for each allele, or omit one allele at each locus. The details of the formats are given in the main documentation file, and in the documentation files for the groups of programs.

Input files for usage example

File: restboot.dat

   5   13   2
Alpha     ++-+-++--+++-
Beta      ++++--+--+++-
Gamma     -+--+-++-+-++
Delta     ++-+----++---
Epsilon   ++++----++---
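To make the layout of this example file concrete, here is a small Python sketch of a reader for it. It is not part of frestboot and covers only the simple case shown here: it assumes that species names occupy the first 10 columns (the usual PHYLIP convention) and that each species fits on a single line. The function name is hypothetical.

```python
SAMPLE = """\
   5   13   2
Alpha     ++-+-++--+++-
Beta      ++++--+--+++-
Gamma     -+--+-++-+-++
Delta     ++-+----++---
Epsilon   ++++----++---
"""


def parse_restriction_sites(text):
    """Parse a one-line-per-species restriction sites data set.

    Returns (n_species, n_sites, n_enzymes, sites_by_species), where
    n_enzymes is None when the header has no third number.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    n_species, n_sites = int(header[0]), int(header[1])
    n_enzymes = int(header[2]) if len(header) > 2 else None
    sites_by_species = {}
    for line in lines[1:1 + n_species]:
        # Assumption: the name is padded to 10 columns, sites follow.
        name = line[:10].strip()
        sites = line[10:].replace(" ", "")
        if len(sites) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, got {len(sites)}")
        sites_by_species[name] = sites
    return n_species, n_sites, n_enzymes, sites_by_species


n_species, n_sites, n_enzymes, sites_by_species = parse_restriction_sites(SAMPLE)
print(n_species, n_sites, n_enzymes)
```

In this file the header gives 5 species, 13 sites and 2 enzymes, so a run on it would need the -enzymes qualifier to tell the program that the third number is present.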

Output file format

The frestboot output file contains the data sets generated by the resampling process. Note that, when Gene Frequencies data is used or when Discrete Morphological characters with the Factors option are used, the number of characters in each data set may vary. It may also vary if there is an odd number of characters or sites and the Delete-Half-Jackknife resampling method is used, for then there will be a 50% chance of choosing (n+1)/2 characters and a 50% chance of choosing (n-1)/2 characters.
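The behaviour of the delete-half jackknife with an odd number of sites can be illustrated with a short Python sketch. It is written only to show the (n+1)/2 versus (n-1)/2 choice described above; the function name and data representation are assumptions, and the actual program may select sites differently.

```python
import random


def delete_half_jackknife(rows, rng=None):
    """Keep a random half of the site columns, without replacement.

    With an odd number of sites n, (n+1)/2 or (n-1)/2 sites are kept
    with equal probability. Illustrative sketch only.
    """
    rng = rng or random.Random()
    n_sites = len(rows[0])
    keep = n_sites // 2  # equals (n-1)/2 when n is odd
    if n_sites % 2 == 1 and rng.random() < 0.5:
        keep += 1  # equals (n+1)/2
    # Keep the chosen sites in their original order.
    kept = sorted(rng.sample(range(n_sites), keep))
    return ["".join(row[i] for i in kept) for row in rows]


data = [
    "++-+-++--+++-",
    "++++--+--+++-",
    "-+--+-++-+-++",
    "++-+----++---",
    "++++----++---",
]
sizes = {len(delete_half_jackknife(data, random.Random(s))[0]) for s in range(50)}
print(sorted(sizes))
```

With the 13 sites of the example data, each replicate therefore has either 6 or 7 sites.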

The Factors option causes the characters to be resampled together. If (say) three adjacent characters all have the same factors characters, so that they all are understood to be recoding one multistate character, they will be resampled together as a group.

The order of species in the data sets in the output file will vary randomly. This is a precaution to help the programs that analyze these data avoid any result which is sensitive to the input order of species from showing up repeatedly and thus appearing to have evidence in its favor.

The numerical options 1 and 2 in the original PHYLIP menu, which correspond to the -printdata and -[no]progress qualifiers here, also affect the output. If 1 is chosen (it is off by default) the program will print the original input data set to the output file before the resampled data sets. I cannot actually see why anyone would want to do this. Option 2 toggles the feature (on by default) that prints out up to 20 times during the resampling process a notification that the program has completed a certain number of data sets. Thus if 100 resampled data sets are being produced, every 5 data sets a line is printed saying which data set has just been completed. This option should be turned off if the program is running in the background and silence is desirable. At the end of execution the program will always (whatever the setting of option 2) print a couple of lines saying that output has been written to the output file.

Output files for usage example

File: restboot.frestboot

    5    13
Alpha     +--++-+++- -++
Beta      +++++----- -++
Gamma     -----+---+ +++
Delta     +--++----- -+-
Epsilon   +++++----- -+-
    5    13
Alpha     ++----+++- +++
Beta      +++-----+- +++
Gamma     -+-+++--++ ++-
Delta     ++-------- ++-
Epsilon   +++------- ++-
    5    13
Alpha     ++++++-+++ ---
Beta      ++++++-+++ ---
Gamma     --++-++--- +++
Delta     +++++----- ---
Epsilon   +++++----- ---
    5    13
Alpha     ++++-+++++ ---
Beta      ++-+-+++++ ---
Gamma     ---+-++--- +++
Delta     ++--+++--- ---
Epsilon   ++--+++--- ---
    5    13
Alpha     +-+++-++++ +--
Beta      +++++-++++ +--
Gamma     ----+----- +++
Delta     +-++-+---- ---
Epsilon   ++++-+---- ---
    5    13
Alpha     +++------- +++
Beta      +++------- +++
Gamma     +--++++++- +-+
Delta     +++------+ +--
Epsilon   +++------+ +--
    5    13
Alpha     ++++-+--++ ++-
Beta      ++++----++ ++-
Gamma     --+++---+- -++
Delta     ++++--+++- ---
Epsilon   ++++--+++- ---
    5    13
Alpha     +--+---+++ +--
Beta      ++++---+++ +--
Gamma     ----++-+-+ +++
Delta     +--+--++-- ---
Epsilon   ++++--++-- ---
    5    13
Alpha     +++--++--+ ++-


  [Part of this file has been deleted for brevity]

Gamma     -+--++-+++ -++
Delta     ++++------ +++
Epsilon   ++++------ +++
    5    13
Alpha     +++---+-++ +++
Beta      +++---+-++ +++
Gamma     ---++++-++ -++
Delta     +++----+++ ---
Epsilon   +++----+++ ---
    5    13
Alpha     ++++--+--+ +--
Beta      +++++----+ +--
Gamma     ---+-+-+++ +++
Delta     ++++-----+ +--
Epsilon   +++++----+ +--
    5    13
Alpha     +-----+--- +++
Beta      +++++++--- +++
Gamma     +------+++ +--
Delta     +-----+--- +--
Epsilon   +++++++--- +--
    5    13
Alpha     +-++--+-++ +--
Beta      ++++--+-++ +--
Gamma     +---++++-- -++
Delta     +-++------ ---
Epsilon   ++++------ ---
    5    13
Alpha     +++-+-++++ ++-
Beta      +++---++++ ++-
Gamma     --++--++-- +++
Delta     +++--+++-- ---
Epsilon   +++--+++-- ---
    5    13
Alpha     ++-+++--++ +--
Beta      +++-++--++ +--
Gamma     ----+++--- +++
Delta     ++-----+-- ---
Epsilon   +++----+-- ---
    5    13
Alpha     +---++---- ++-
Beta      ++---+---- ++-
Gamma     --++-+---- -++
Delta     +-----++++ ---
Epsilon   ++----++++ ---
    5    13
Alpha     +++-++++-+ +++
Beta      +++++----+ +++
Gamma     -++------+ +++
Delta     +++-+---++ ---
Epsilon   +++++---++ ---
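Downstream scripts sometimes need to split such a file back into its individual replicates. The Python sketch below does this for output shaped like the example above. It assumes one line per species, names padded to 10 columns, and sites printed in space-separated groups; the function name is hypothetical, and longer data sets that wrap over several lines are not handled.

```python
OUTPUT = """\
    5    13
Alpha     +--++-+++- -++
Beta      +++++----- -++
Gamma     -----+---+ +++
Delta     +--++----- -+-
Epsilon   +++++----- -+-
    5    13
Alpha     ++----+++- +++
Beta      +++-----+- +++
Gamma     -+-+++--++ ++-
Delta     ++-------- ++-
Epsilon   +++------- ++-
"""


def read_replicates(text):
    """Split concatenated data sets into a list of {name: sites} dicts."""
    replicates = []
    current = None
    remaining = 0  # species lines still expected in the current data set
    for line in text.splitlines():
        if not line.strip():
            continue
        if remaining == 0:
            # Header line: number of species, then number of sites.
            remaining = int(line.split()[0])
            current = {}
            replicates.append(current)
        else:
            name = line[:10].strip()
            # Remove the spaces that separate groups of sites.
            current[name] = line[10:].replace(" ", "")
            remaining -= 1
    return replicates


replicates = read_replicates(OUTPUT)
print(len(replicates), replicates[0]["Alpha"])
```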

Data files

None

Notes

None.

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.

Known bugs

None.

See also

Program name Description
distmat Create a distance matrix from a multiple sequence alignment
ednacomp DNA compatibility algorithm
ednadist Nucleic acid sequence distance matrix program
ednainvar Nucleic acid sequence invariants method
ednaml Phylogenies from nucleic acid maximum likelihood
ednamlk Phylogenies from nucleic acid maximum likelihood with clock
ednapars DNA parsimony algorithm
ednapenny Penny algorithm for DNA
eprotdist Protein distance algorithm
eprotpars Protein parsimony algorithm
erestml Restriction site maximum likelihood method
eseqboot Bootstrapped sequences algorithm
fdiscboot Bootstrapped discrete sites algorithm
fdnacomp DNA compatibility algorithm
fdnadist Nucleic acid sequence distance matrix program
fdnainvar Nucleic acid sequence invariants method
fdnaml Estimate nucleotide phylogeny by maximum likelihood
fdnamlk Estimates nucleotide phylogeny by maximum likelihood
fdnamove Interactive DNA parsimony
fdnapars DNA parsimony algorithm
fdnapenny Penny algorithm for DNA
fdolmove Interactive Dollo or polymorphism parsimony
ffreqboot Bootstrapped genetic frequencies algorithm
fproml Protein phylogeny by maximum likelihood
fpromlk Protein phylogeny by maximum likelihood
fprotdist Protein distance algorithm
fprotpars Protein parsimony algorithm
frestdist Calculate distance matrix from restriction sites or fragments
frestml Restriction site maximum likelihood method
fseqboot Bootstrapped sequences algorithm
fseqbootall Bootstrapped sequences algorithm

Author(s)

This program is an EMBOSS conversion of a program written by Joe Felsenstein as part of his PHYLIP package.

Please report all bugs to the EMBOSS bug team (emboss-bug © emboss.open-bio.org), not to the original author.

History

Written (2004) - Joe Felsenstein, University of Washington.

Converted (August 2004) to an EMBASSY program by the EMBOSS team.

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments

None