fdnainvar |
This program reads in nucleotide sequences for four species and computes the phylogenetic invariants discovered by James Cavender (Cavender and Felsenstein, 1987) and James Lake (1987). Lake's method is also called by him "evolutionary parsimony". I prefer Cavender's more mathematically precise term "invariants", as the method bears somewhat more relationship to likelihood methods than to parsimony. The invariants are mathematical formulas (in the present case linear or quadratic) in the EXPECTED frequencies of site patterns which are zero for all trees of a given tree topology, irrespective of branch lengths. We can consider at a given site that if there are no ambiguities, we could have for four species the nucleotide patterns (considering the same site across all four species) AAAA, AAAC, AAAG, ... through TTTT, 256 patterns in all.
The invariants are formulas in the expected pattern frequencies, not the observed pattern frequencies. When they are computed using the observed pattern frequencies, we will usually find that they are not precisely zero even when the model is correct and we have the correct tree topology. Only as the number of nucleotides scored becomes infinite will the observed pattern frequencies approach their expectations; otherwise, we must do a statistical test of the invariants.
Some explanation of invariants will be found in the above papers, and also in my recent review article on statistical aspects of inferring phylogenies (Felsenstein, 1988b). Although invariants have some important advantages, their validity also depends on symmetry assumptions that may not be satisfied. In the discussion below suppose that the possible unrooted phylogenies are I: ((A,B),(C,D)), II: ((A,C),(B,D)), and III: ((A,D),(B,C)).
Lake's invariants are fairly simple to describe: the patterns involved are only those in which there are two purines and two pyrimidines at a site. Thus a site with AACT would affect the invariants, but a site with AAGG would not. Let us use (as Lake does) the symbols 1, 2, 3, and 4, with the proviso that 1 and 2 are either both of the purines or both of the pyrimidines; 3 and 4 are the other two nucleotides. Thus 1 and 2 always differ by a transition; so do 3 and 4. Lake's invariants, expressed in terms of expected frequencies, are the three quantities:
(1) P(1133) + P(1234) - P(1134) - P(1233), (2) P(1313) + P(1324) - P(1314) - P(1323), (3) P(1331) + P(1342) - P(1341) - P(1332),
He showed that invariants (2) and (3) are zero under Topology I, (1) and (3) are zero under topology II, and (1) and (2) are zero under Topology III. If, for example, we see a site with pattern ACGC, we can start by setting 1=A. Then 2 must be G. We can then set 3=C (so that 4 is T). Thus its pattern type, making those substitutions, is 1323. P(1323) is the expected probability of the type of pattern which includes ACGC, TGAG, GTAT, etc.
Lake's invariants are easily tested with observed frequencies. For example, the first of them is a test of whether there are as many sites of types 1133 and 1234 as there are of types 1134 and 1233; this is easily tested with a chi-square test or, as in this program, with an exact binomial test. Note that with several invariants to test, we risk overestimating the significance of results if we simply accept the nominal 95% levels of significance (Li and Guoy, 1990).
Lake's invariants assume that each site is evolving independently, and that starting from any base a transversion is equally likely to end up at each of the two possible bases (thus, an A undergoing a transversion is equally likely to end up as a C or a T, and similarly for the other four bases from which one could start. Interestingly, Lake's results do not assume that rates of evolution are the same at all sites. The result that the total of 1133 and 1234 is expected to be the same as the total of 1134 and 1233 is unaffected by the fact that we may have aggregated the counts over classes of sites evolving at different rates.
Cavender's invariants (Cavender and Felsenstein, 1987) are for the case of a character with two states. In the nucleic acid case we can classify nucleotides into two states, R and Y (Purine and Pyrimidine) and then use the two-state results. Cavender starts, as before, with the pattern frequencies. Coding purines as R and pyrimidines as Y, the patterns types are RRRR, RRRY, and so on until YYYY, a total of 16 types. Cavender found quadratic functions of the expected frequencies of these 16 types that were expected to be zero under a given phylogeny, irrespective of branch lengths. Two invariants (called K and L) were found for each tree topology. The L invariants are particularly easy to understand. If we have the tree topology ((A,B),(C,D)), then in the case of two symmetric states, the event that A and B have the same state should be independent of whether C and D have the same state, as the events determining these happen in different parts of the tree. We can set up a contingency table:
C = D C =/= D ------------------------------ | A = B | YYYY, YYRR, YYYR, YYRY, | RRRR, RRYY RRYR, RRRY | A =/= B | YRYY, YRRR, YRYR, YRRY, | RYYY, RYRR RYYR, RYRY
and we expect that the events C = D and A = B will be independent. Cavender's L invariant for this tree topology is simply the negative of the crossproduct difference,
P(A=/=B and C=D) P(A=B and C=/=D) - P(A=B and C=D) P(A=/=B and C=/=D).
One of these L invariants is defined for each of the three tree topologies. They can obviously be tested simply by doing a chi-square test on the contingency table. The one corresponding to the correct topology should be statistically indistinguishable from zero. Again, there is a possible multiple tests problem if all three are tested at a nominal value of 95%.
The K invariants are differences between the L invariants. When one of the tables is expected to have crossproduct difference zero, the other two are expected to be nonzero, and also to be equal. So the difference of their crossproduct differences can be taken; this is the K invariant. It is not so easily tested.
The assumptions of Cavender's invariants are different from those of Lake's. One obviously need not assume anything about the frequencies of, or transitions among, the two different purines or the two different pyrimidines. However one does need to assume independent events at each site, and one needs to assume that the Y and R states are symmetric, that the probability per unit time that a Y changes into an R is the same as the probability that an R changes into a Y, so that we expect equal frequencies of the two states. There is also an assumption that all sites are changing between these two states at the same expected rate. This assumption is not needed for Lake's invariants, since expectations of sums are equal to sums of expectations, but for Cavender's it is, since products of expectations are not equal to expectations of products.
It is helpful to have both sorts of invariants available; with further work we may appreciate what other invaraints there are for various models of nucleic acid change.
% fdnainvar -printdata Nucleic acid sequence Invariants method Input (aligned) nucleotide sequence set(s): dnainvar.dat Phylip weights file (optional): Phylip dnainvar program output file [dnainvar.fdnainvar]: Output written to output file "dnainvar.fdnainvar" Done. |
Go to the input files for this example
Go to the output files for this example
Standard (Mandatory) qualifiers: [-sequence] seqsetall File containing one or more sequence alignments -weights properties Phylip weights file (optional) [-outfile] outfile [*.fdnainvar] Phylip dnainvar program output file Additional (Optional) qualifiers (* if not always prompted): -printdata boolean [N] Print data at start of run * -[no]dotdiff boolean [Y] Use dot-differencing to display results -[no]printpattern boolean [Y] Print counts of patterns -[no]printinvariant boolean [Y] Print invariants -[no]progress boolean [Y] Print indications of progress of run Advanced (Unprompted) qualifiers: (none) Associated qualifiers: "-sequence" associated qualifiers -sbegin1 integer Start of each sequence to be used -send1 integer End of each sequence to be used -sreverse1 boolean Reverse (if DNA) -sask1 boolean Ask for begin/end/reverse -snucleotide1 boolean Sequence is nucleotide -sprotein1 boolean Sequence is protein -slower1 boolean Make lower case -supper1 boolean Make upper case -sformat1 string Input sequence format -sdbname1 string Database name -sid1 string Entryname -ufo1 string UFO features -fformat1 string Features format -fopenfile1 string Features file name "-outfile" associated qualifiers -odirectory2 string Output directory General qualifiers: -auto boolean Turn off prompts -stdout boolean Write first file to standard output -filter boolean Read first file from standard input, write first file to standard output -options boolean Prompt for standard and additional values -debug boolean Write debug output to program.dbg -verbose boolean Report some/full command line options -help boolean Report command line options. More information on associated and general qualifiers can be found with -help -verbose -warning boolean Report warnings -error boolean Report errors -fatal boolean Report fatal errors -die boolean Report dying program messages |
Standard (Mandatory) qualifiers | Allowed values | Default | |
---|---|---|---|
[-sequence] (Parameter 1) |
File containing one or more sequence alignments | Readable sets of sequences | Required |
-weights | Phylip weights file (optional) | Property value(s) | |
[-outfile] (Parameter 2) |
Phylip dnainvar program output file | Output file | <*>.fdnainvar |
Additional (Optional) qualifiers | Allowed values | Default | |
-printdata | Print data at start of run | Boolean value Yes/No | No |
-[no]dotdiff | Use dot-differencing to display results | Boolean value Yes/No | Yes |
-[no]printpattern | Print counts of patterns | Boolean value Yes/No | Yes |
-[no]printinvariant | Print invariants | Boolean value Yes/No | Yes |
-[no]progress | Print indications of progress of run | Boolean value Yes/No | Yes |
Advanced (Unprompted) qualifiers | Allowed values | Default | |
(none) |
4 13 Alpha AACGTGGCCAAAT Beta AAGGTCGCCAAAC Gamma CATTTCGTCACAA Delta GGTATTTCGGCCT |
If option 3 is on the invariants are then printed out, together with their statistical tests. For Lake's invariants the two sums which are expected to be equal are printed out, and then the result of an one-tailed exact binomial test which tests whether the difference is expected to be this positive or more. The P level is given (but remember the multiple-tests problem!).
For Cavender's L invariants the contingency tables are given. Each is tested with a one-tailed chi-square test. It is possible that the expected numbers in some categories could be too small for valid use of this test; the program does not check for this. It is also possible that the chi-square could be significant but in the wrong direction; this is not tested in the current version of the program. To check it beware of a chi-square greater than 3.841 but with a positive invariant. The invariants themselves are computed, as the difference of cross-products. Their absolute magnitudes are not important, but which one is closest to zero may be indicative. Significantly nonzero invariants should be negative if the model is valid. The K invariants, which are simply differences among the L invariants, are also printed out without any test on them being conducted. Note that it is possible to use the bootstrap utility SEQBOOT to create multiple data sets, and from the output from sunning all of these get the empirical variability of these quadratic invariants.
Nucleic acid sequence Invariants method, version 3.67 4 species, 13 sites Name Sequences ---- --------- Alpha AACGTGGCCA AAT Beta ..G..C.... ..C Gamma C.TT.C.T.. C.A Delta GGTA.TT.GG CC. Pattern Number of times AAAC 1 AAAG 2 AACC 1 AACG 1 CCCG 1 CCTC 1 CGTT 1 GCCT 1 GGGT 1 GGTA 1 TCAT 1 TTTT 1 Symmetrized patterns (1, 2 = the two purines and 3, 4 = the two pyrimidines or 1, 2 = the two pyrimidines and 3, 4 = the two purines) 1111 1 1112 2 1113 3 1121 1 1132 2 1133 1 1231 1 1322 1 1334 1 Tree topologies (unrooted): I: ((Alpha,Beta),(Gamma,Delta)) II: ((Alpha,Gamma),(Beta,Delta)) III: ((Alpha,Delta),(Beta,Gamma)) [Part of this file has been deleted for brevity] different purine:pyrimidine ratios from 1:1. Tree I: Contingency Table 2 8 1 2 Quadratic invariant = 4.0 Chi-square = 0.23111 (not significant) Tree II: Contingency Table 1 5 1 6 Quadratic invariant = -1.0 Chi-square = 0.01407 (not significant) Tree III: Contingency Table 1 2 6 4 Quadratic invariant = 8.0 Chi-square = 0.66032 (not significant) Cavender's quadratic invariants (type K) using purines vs. pyrimidines (these are expected to be zero for the correct tree topology) They will be misled if there are substantially different evolutionary rate between sites, or different purine:pyrimidine ratios from 1:1. No statistical test is done on them here. Tree I: -9.0 Tree II: 4.0 Tree III: 5.0 |
Program name | Description |
---|---|
distmat | Create a distance matrix from a multiple sequence alignment |
ednacomp | DNA compatibility algorithm |
ednadist | Nucleic acid sequence Distance Matrix program |
ednainvar | Nucleic acid sequence Invariants method |
ednaml | Phylogenies from nucleic acid Maximum Likelihood |
ednamlk | Phylogenies from nucleic acid Maximum Likelihood with clock |
ednapars | DNA parsimony algorithm |
ednapenny | Penny algorithm for DNA |
eprotdist | Protein distance algorithm |
eprotpars | Protein parsimony algorithm |
erestml | Restriction site Maximum Likelihood method |
eseqboot | Bootstrapped sequences algorithm |
fdiscboot | Bootstrapped discrete sites algorithm |
fdnacomp | DNA compatibility algorithm |
fdnadist | Nucleic acid sequence Distance Matrix program |
fdnaml | Estimates nucleotide phylogeny by maximum likelihood |
fdnamlk | Estimates nucleotide phylogeny by maximum likelihood |
fdnamove | Interactive DNA parsimony |
fdnapars | DNA parsimony algorithm |
fdnapenny | Penny algorithm for DNA |
fdolmove | Interactive Dollo or Polymorphism Parsimony |
ffreqboot | Bootstrapped genetic frequencies algorithm |
fproml | Protein phylogeny by maximum likelihood |
fpromlk | Protein phylogeny by maximum likelihood |
fprotdist | Protein distance algorithm |
fprotpars | Protein parsimony algorithm |
frestboot | Bootstrapped restriction sites algorithm |
frestdist | Distance matrix from restriction sites or fragments |
frestml | Restriction site maximum Likelihood method |
fseqboot | Bootstrapped sequences algorithm |
fseqbootall | Bootstrapped sequences algorithm |
Although we take every care to ensure that the results of the EMBOSS version are identical to those from the original package, we recommend that you check your inputs give the same results in both versions before publication.
Please report all bugs in the EMBOSS version to the EMBOSS bug team, not to the original author.
Converted (August 2004) to an EMBASSY program by the EMBOSS team.