frestdist |
Please help by correcting and extending the Wiki pages.
These two distances assume that the restriction sites are accidental byproducts of random change of nucleotide sequences. For my restriction sites distance the DNA sequences are assumed to be changing according to the Kimura 2-parameter model of DNA change (Kimura, 1980). The user can set the transition/transversion rate for the model. For my restriction fragments distance there is there is an implicit assumption of a Jukes-Cantor (1969) model of change, The user can also set the parameter of a correction for unequal rates of evolution between sites in the DNA sequences, using a Gamma distribution of rates among sites. The Jukes-Cantor model is also implicit in the restriction fragments distance of Nei and Li(1979). It does not allow us to correct for a Gamma distribution of rates among sites.
The restriction sites distances use data coded for the presence of absence of individual restriction sites (usually as + and - or 0 and 1). My distance is based on the proportion, out of all sites observed in one species or the other, which are present in both species. This is done to correct for the ascertainment of sites, for the fact that we are not aware of many sites because they do not appear in any species.
My distance starts by computing from the particular pair of species the fraction
n++ f = --------------------- n++ + 1/2 (n+- + n-+)
where n++ is the number of sites contained in both species, n+- is the number of sites contained in the first of the two species but not in the second, and n-+ is the number of sites contained in the second of the two species but not in the first. This is the fraction of sites that are present in one species which are present in both. Since the number of sites present in the two species will often differ, the denominator is the average of the number of sites found in the two species.
If each restriction site is s nucleotides long, the probability that a restriction site is present in the other species, given that it is present in a species, is
Qs,
`where Q is the probability that a nucleotide has no net change as one goes from the one species to the other. It may have changed in between; we are interested in the probability that that nucleotide site is in the same base in both species, irrespective of what has happened in between. The distance is then computed by finding the branch length of a two-species tree (connecting these two species with a single branch) such that Q equals the s-th root of f. For this the program computes Q for various values of branch length, iterating them by a Newton-Raphson algorithm until the two quantities are equal.
The resulting distance should be numerically close to the original restriction sites distance of Nei and Li (1979) when divergence is small. Theirs computes the probability of retention of a site in a way that assumes that the site is present in the common ancestor of the two species. Ours does not make this assumption. It is inspired by theirs, but differs in this detail. Their distance also assumes a Jukes-Cantor (1969) model of base change, and does not allow for transitions being more frequent than transversions. In this sense mine generalizes theres somewhat. Their distance does include, as mine does as well, a correction for Gamma distribution of rate of change among nucleotide sites.
I have made their original distance available here
For restriction fragments data we use a different distance. If we average over all restriction fragment lengths, each at its own expected frequency, the probability that the fragment will still be in existence after a certain amount of branch length, we must take into account the probability that the two restriction sites at the ends of the fragment do not mutate, and the probability that no new restriction site occurs within the fragment in that amount of branch length. The result for a restriction site length of s is:
Q2s f = -------- 2 - Qs
(The details of the derivation will be given in my forthcoming book Inferring Phylogenies (to be published by Sinauer Associates in 2001). Given the observed fraction of restriction sites retained, f, we can solve a quadratic equation from the above expression for Qs. That makes it easy to obtain a value of Q, and the branch length can then be estimated by adjusting it so the probability of a base not changing is equal to that value. Alternatively, if we use the Nei and Li (1979) restriction fragments distance, this involves solving for g in the nonlinear equation
g = [ f (3 - 2g) ]1/4
and then the distance is given by
d = - (2/r) loge(g)
where r is the length of the restriction site.
Comparing these two restriction fragments distances in a case where their underlying DNA model is the same (which is when the transition/transversion ratio of the modified model is set to 0.5), you will find that they are very close to each other, differing very little at small distances, with the modified distance become smaller than the Nei/Li distance at larger distances. It will therefore matter very little which one you use.
Although these distances are designed for restriction sites and restriction fragments data, they can be applied to RAPD and AFLP data as well. RAPD (Randomly Amplified Polymorphic DNA) and AFLP (Amplified Fragment Length Polymorphism) data consist of presence or absence of individual bands on a gel. The bands are segments of DNA with PCR primers at each end. These primers are defined sequences of known length (often about 10 nucleotides each). For AFLPs the reolevant length is the primer length, plus three nucleotides. Mutation in these sequences makes them no longer be primers, just as in the case of restriction sites. Thus a pair of 10-nucleotide primers will behave much the same as a 20-nucleotide restriction site, for RAPDs (26 for AFLPs). You can use the restriction sites distance as the distance between RAPD or AFLP patterns if you set the proper value for the total length of the site to the total length of the primers (plus 6 in the case of AFLPs). Of course there are many possible sources of noise in these data, including confusing fragments of similar length for each other and having primers near each other in the genome, and these are not taken into account in the statistical model used here.
% frestdist Calculate distance matrix from restriction sites or fragments Input file: restdist.dat Phylip restdist program output file [restdist.frestdist]: Restriction site or fragment distances, version 3.69.650 Distances calculated for species Alpha .... Beta ... Gamma .. Delta . Epsilon Distances written to file "restdist.frestdist" Done. |
Go to the input files for this example
Go to the output files for this example
Calculate distance matrix from restriction sites or fragments Version: EMBOSS:6.6.0.0 Standard (Mandatory) qualifiers: [-data] discretestates File containing one or more sets of restriction data [-outfile] outfile [*.frestdist] Phylip restdist program output file Additional (Optional) qualifiers (* if not always prompted): -[no]restsites boolean [Y] Restriction sites (put N if you want restriction fragments) -neili boolean [N] Use original Nei/Li model (default uses modified Nei/Li model) * -gammatype boolean [N] Gamma distributed rates among sites * -gammacoefficient float [1] Coefficient of variation of substitution rate among sites (Number 0.001 or more) -ttratio float [2.0] Transition/transversion ratio (Number 0.001 or more) -sitelength integer [6] Site length (Integer 1 or more) -lower boolean [N] Lower triangular distance matrix -printdata boolean [N] Print data at start of run -[no]progress boolean [Y] Print indications of progress of run Advanced (Unprompted) qualifiers: (none) Associated qualifiers: "-outfile" associated qualifiers -odirectory2 string Output directory General qualifiers: -auto boolean Turn off prompts -stdout boolean Write first file to standard output -filter boolean Read first file from standard input, write first file to standard output -options boolean Prompt for standard and additional values -debug boolean Write debug output to program.dbg -verbose boolean Report some/full command line options -help boolean Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose -warning boolean Report warnings -error boolean Report errors -fatal boolean Report fatal errors -die boolean Report dying program messages -version boolean Report version number and exit |
Qualifier | Type | Description | Allowed values | Default |
---|---|---|---|---|
Standard (Mandatory) qualifiers | ||||
[-data] (Parameter 1) |
discretestates | File containing one or more sets of restriction data | Discrete states file | |
[-outfile] (Parameter 2) |
outfile | Phylip restdist program output file | Output file | <*>.frestdist |
Additional (Optional) qualifiers | ||||
-[no]restsites | boolean | Restriction sites (put N if you want restriction fragments) | Boolean value Yes/No | Yes |
-neili | boolean | Use original Nei/Li model (default uses modified Nei/Li model) | Boolean value Yes/No | No |
-gammatype | boolean | Gamma distributed rates among sites | Boolean value Yes/No | No |
-gammacoefficient | float | Coefficient of variation of substitution rate among sites | Number 0.001 or more | 1 |
-ttratio | float | Transition/transversion ratio | Number 0.001 or more | 2.0 |
-sitelength | integer | Site length | Integer 1 or more | 6 |
-lower | boolean | Lower triangular distance matrix | Boolean value Yes/No | No |
-printdata | boolean | Print data at start of run | Boolean value Yes/No | No |
-[no]progress | boolean | Print indications of progress of run | Boolean value Yes/No | Yes |
Advanced (Unprompted) qualifiers | ||||
(none) | ||||
Associated qualifiers | ||||
"-outfile" associated outfile qualifiers | ||||
-odirectory2 -odirectory_outfile |
string | Output directory | Any string | |
General qualifiers | ||||
-auto | boolean | Turn off prompts | Boolean value Yes/No | N |
-stdout | boolean | Write first file to standard output | Boolean value Yes/No | N |
-filter | boolean | Read first file from standard input, write first file to standard output | Boolean value Yes/No | N |
-options | boolean | Prompt for standard and additional values | Boolean value Yes/No | N |
-debug | boolean | Write debug output to program.dbg | Boolean value Yes/No | N |
-verbose | boolean | Report some/full command line options | Boolean value Yes/No | Y |
-help | boolean | Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose | Boolean value Yes/No | N |
-warning | boolean | Report warnings | Boolean value Yes/No | Y |
-error | boolean | Report errors | Boolean value Yes/No | Y |
-fatal | boolean | Report fatal errors | Boolean value Yes/No | Y |
-die | boolean | Report dying program messages | Boolean value Yes/No | Y |
-version | boolean | Report version number and exit | Boolean value Yes/No | N |
10 35 4
The site data are in standard form. Each species starts with a species name whose maximum length is given by the constant "nmlngth" (whose value in the program as distributed is 10 characters). The name should, as usual, be padded out to that length with blanks if necessary. The sites data then follows, one character per site (any blanks will be skipped and ignored). Like the DNA and protein sequence data, the restriction sites data may be either in the "interleaved" form or the "sequential" form. Note that if you are analyzing restriction sites data with the programs DOLLOP or MIX or other discrete character programs, at the moment those programs do not use the "aligned" or "interleaved" data format. Therefore you may want to avoid that format when you have restriction sites data that you will want to feed into those programs.
The presence of a site is indicated by a "+" and the absence by a "-". I have also allowed the use of "1" and "0" as synonyms for "+" and "-", for compatibility with MIX and DOLLOP which do not allow "+" and "-". If the presence of the site is unknown (for example, if the DNA containing it has been deleted so that one does not know whether it would have contained the site) then the state "?" can be used to indicate that the state of this site is unknown.
5 13 2 Alpha ++-+-++--+++- Beta ++++--+--+++- Gamma -+--+-++-+-++ Delta ++-+----++--- Epsilon ++++----++--- |
If the option to print out the data is selected, the output file will precede the data by more complete information on the input and the menu selections. The output file begins by giving the number of species and the number of characters.
The distances printed out are scaled in terms of expected numbers of substitutions per DNA site, counting both transitions and transversions but not replacements of a base by itself, and scaled so that the average rate of change, averaged over all sites analyzed, is set to 1.0. Thus when the G option is used, the rate of change at one site may be higher than at another, but their mean is expected to be 1.
5 Alpha 0.000000 0.022368 0.107681 0.082639 0.095581 Beta 0.022368 0.000000 0.107681 0.082639 0.056895 Gamma 0.107681 0.107681 0.000000 0.192466 0.207319 Delta 0.082639 0.082639 0.192466 0.000000 0.015945 Epsilon 0.095581 0.056895 0.207319 0.015945 0.000000 |
Program name | Description |
---|---|
distmat | Create a distance matrix from a multiple sequence alignment |
ednacomp | DNA compatibility algorithm |
ednadist | Nucleic acid sequence distance matrix program |
ednainvar | Nucleic acid sequence invariants method |
ednaml | Phylogenies from nucleic acid maximum likelihood |
ednamlk | Phylogenies from nucleic acid maximum likelihood with clock |
ednapars | DNA parsimony algorithm |
ednapenny | Penny algorithm for DNA |
eprotdist | Protein distance algorithm |
eprotpars | Protein parsimony algorithm |
erestml | Restriction site maximum likelihood method |
eseqboot | Bootstrapped sequences algorithm |
fdiscboot | Bootstrapped discrete sites algorithm |
fdnacomp | DNA compatibility algorithm |
fdnadist | Nucleic acid sequence distance matrix program |
fdnainvar | Nucleic acid sequence invariants method |
fdnaml | Estimate nucleotide phylogeny by maximum likelihood |
fdnamlk | Estimates nucleotide phylogeny by maximum likelihood |
fdnamove | Interactive DNA parsimony |
fdnapars | DNA parsimony algorithm |
fdnapenny | Penny algorithm for DNA |
fdolmove | Interactive Dollo or polymorphism parsimony |
ffreqboot | Bootstrapped genetic frequencies algorithm |
fproml | Protein phylogeny by maximum likelihood |
fpromlk | Protein phylogeny by maximum likelihood |
fprotdist | Protein distance algorithm |
fprotpars | Protein parsimony algorithm |
frestboot | Bootstrapped restriction sites algorithm |
frestml | Restriction site maximum likelihood method |
fseqboot | Bootstrapped sequences algorithm |
fseqbootall | Bootstrapped sequences algorithm |
Please report all bugs to the EMBOSS bug team (emboss-bug © emboss.open-bio.org) not to the original author.
Converted (August 2004) to an EMBASSY program by the EMBOSS team.
None