fkitsch |
Please help by correcting and extending the Wiki pages.
The programs FITCH, KITSCH, and NEIGHBOR are for dealing with data which comes in the form of a matrix of pairwise distances between all pairs of taxa, such as distances based on molecular sequence data, gene frequency genetic distances, amounts of DNA hybridization, or immunological distances. In analyzing these data, distance matrix programs implicitly assume that:
These assumptions can be traced in the least squares methods of programs FITCH and KITSCH but it is not quite so easy to see them in operation in the Neighbor-Joining method of NEIGHBOR, where the independence assumptions is less obvious.
THESE TWO ASSUMPTIONS ARE DUBIOUS IN MOST CASES: independence will not be expected to be true in most kinds of data, such as genetic distances from gene frequency data. For genetic distance data in which pure genetic drift without mutation can be assumed to be the mechanism of change CONTML may be more appropriate. However, FITCH, KITSCH, and NEIGHBOR will not give positively misleading results (they will not make a statistically inconsistent estimate) provided that additivity holds, which it will if the distance is computed from the original data by a method which corrects for reversals and parallelisms in evolution. If additivity is not expected to hold, problems are more severe. A short discussion of these matters will be found in a review article of mine (1984a). For detailed, if sometimes irrelevant, controversy see the papers by Farris (1981, 1985, 1986) and myself (1986, 1988b).
For genetic distances from gene frequencies, FITCH, KITSCH, and NEIGHBOR may be appropriate if a neutral mutation model can be assumed and Nei's genetic distance is used, or if pure drift can be assumed and either Cavalli-Sforza's chord measure or Reynolds, Weir, and Cockerham's (1983) genetic distance is used. However, in the latter case (pure drift) CONTML should be better.
Restriction site and restriction fragment data can be treated by distance matrix methods if a distance such as that of Nei and Li (1979) is used. Distances of this sort can be computed in PHYLIp by the program RESTDIST.
For nucleic acid sequences, the distances computed in DNADIST allow correction for multiple hits (in different ways) and should allow one to analyse the data under the presumption of additivity. In all of these cases independence will not be expected to hold. DNA hybridization and immunological distances may be additive and independent if transformed properly and if (and only if) the standards against which each value is measured are independent. (This is rarely exactly true).
FITCH and the Neighbor-Joining option of NEIGHBOR fit a tree which has the branch lengths unconstrained. KITSCH and the UPGMA option of NEIGHBOR, by contrast, assume that an "evolutionary clock" is valid, according to which the true branch lengths from the root of the tree to each tip are the same: the expected amount of evolution in any lineage is proportional to elapsed time.
The method may be considered as providing an estimate of the phylogeny. Alternatively, it can be considered as a phenetic clustering of the tip species. This method minimizes an objective function, the sum of squares, not only setting the levels of the clusters so as to do so, but rearranging the hierarchy of clusters to try to find alternative clusterings that give a lower overall sum of squares. When the power option P is set to a value of P = 0.0, so that we are minimizing a simple sum of squares of the differences between the observed distance matrix and the expected one, the method is very close in spirit to Unweighted Pair Group Arithmetic Average Clustering (UPGMA), also called Average-Linkage Clustering. If the topology of the tree is fixed and there turn out to be no branches of negative length, its result should be the same as UPGMA in that case. But since it tries alternative topologies and (unless the N option is set) it combines nodes that otherwise could result in a reversal of levels, it is possible for it to give a different, and better, result than simple sequential clustering. Of course UPGMA itself is available as an option in program NEIGHBOR.
An important use of this method will be to do a formal statistical test of the evolutionary clock hypothesis. This can be done by comparing the sums of squares achieved by FITCH and by KITSCH, BUT SOME CAVEATS ARE NECESSARY. First, the assumption is that the observed distances are truly independent, that no original data item contributes to more than one of them (not counting the two reciprocal distances from i to j and from j to i). THIS WILL NOT HOLD IF THE DISTANCES ARE OBTAINED FROM GENE FREQUENCIES, FROM MORPHOLOGICAL CHARACTERS, OR FROM MOLECULAR SEQUENCES. It may be invalid even for immunological distances and levels of DNA hybridization, provided that the use of common standard for all members of a row or column allows an error in the measurement of the standard to affect all these distances simultaneously. It will also be invalid if the numbers have been collected in experimental groups, each measured by taking differences from a common standard which itself is measured with error. Only if the numbers in different cells are measured from independent standards can we depend on the statistical model. The details of the test and the assumptions are discussed in my review paper on distance methods (Felsenstein, 1984a). For further and sometimes irrelevant controversy on these matters see the papers by Farris (1981, 1985, 1986) and myself (Felsenstein, 1986, 1988b).
A second caveat is that the distances must be expected to rise linearly with time, not according to any other curve. Thus it may be necessary to transform the distances to achieve an expected linearity. If the distances have an upper limit beyond which they could not go, this is a signal that linearity may not hold. It is also VERY important to choose the power P at a value that results in the standard deviation of the variation of the observed from the expected distances being the P/2-th power of the expected distance.
To carry out the test, fit the same data with both FITCH and KITSCH, and record the two sums of squares. If the topology has turned out the same, we have N = n(n-1)/2 distances which have been fit with 2n-3 parameters in FITCH, and with n-1 parameters in KITSCH. Then the difference between S(K) and S(F) has d1 = n-2 degrees of freedom. It is statistically independent of the value of S(F), which has d2 = N-(2n-3) degrees of freedom. The ratio of mean squares
[S(K)-S(F)]/d1 ---------------- S(F)/d2
should, under the evolutionary clock, have an F distribution with n-2 and N-(2n-3) degrees of freedom respectively. The test desired is that the F ratio is in the upper tail (say the upper 5%) of its distribution. If the S (subreplication) option is in effect, the above degrees of freedom must be modified by noting that N is not n(n-1)/2 but is the sum of the numbers of replicates of all cells in the distance matrix read in, which may be either square or triangular. A further explanation of the statistical test of the clock is given in a paper of mine (Felsenstein, 1986).
The program uses a similar tree construction method to the other programs in the package and, like them, is not guaranteed to give the best-fitting tree. The assignment of the branch lengths for a given topology is a least squares fit, subject to the constraints against negative branch lengths, and should not be able to be improved upon. KITSCH runs more quickly than FITCH.
% fkitsch Fitch-Margoliash method with contemporary tips Phylip distance matrix file: kitsch.dat Phylip tree file (optional): Phylip kitsch program output file [kitsch.fkitsch]: Adding species: 1. Bovine 2. Mouse 3. Gibbon 4. Orang 5. Gorilla 6. Chimp 7. Human Doing global rearrangements !-------------! ............. Output written to file "kitsch.fkitsch" Tree also written onto file "kitsch.treefile" Done. |
Go to the input files for this example
Go to the output files for this example
Standard (Mandatory) qualifiers: [-datafile] distances File containing one or more distance matrices [-intreefile] tree Phylip tree file (optional) [-outfile] outfile [*.fkitsch] Phylip kitsch program output file Additional (Optional) qualifiers (* if not always prompted): -matrixtype menu [s] Type of data matrix (Values: s (Square); u (Upper triangular); l (Lower triangular)) -minev boolean [N] Minimum evolution * -njumble integer [0] Number of times to randomise (Integer 0 or more) * -seed integer [1] Random number seed between 1 and 32767 (must be odd) (Integer from 1 to 32767) -power float [2.0] Power (Any numeric value) -negallowed boolean [N] Negative branch lengths allowed -replicates boolean [N] Subreplicates -[no]trout toggle [Y] Write out trees to tree file * -outtreefile outfile [*.fkitsch] Phylip tree output file (optional) -printdata boolean [N] Print data at start of run -[no]progress boolean [Y] Print indications of progress of run -[no]treeprint boolean [Y] Print out tree Advanced (Unprompted) qualifiers: (none) Associated qualifiers: "-outfile" associated qualifiers -odirectory3 string Output directory "-outtreefile" associated qualifiers -odirectory string Output directory General qualifiers: -auto boolean Turn off prompts -stdout boolean Write first file to standard output -filter boolean Read first file from standard input, write first file to standard output -options boolean Prompt for standard and additional values -debug boolean Write debug output to program.dbg -verbose boolean Report some/full command line options -help boolean Report command line options. More information on associated and general qualifiers can be found with -help -verbose -warning boolean Report warnings -error boolean Report errors -fatal boolean Report fatal errors -die boolean Report dying program messages |
Standard (Mandatory) qualifiers | Allowed values | Default | |||||||
---|---|---|---|---|---|---|---|---|---|
[-datafile] (Parameter 1) |
File containing one or more distance matrices | Distance matrix | |||||||
[-intreefile] (Parameter 2) |
Phylip tree file (optional) | Phylogenetic tree | |||||||
[-outfile] (Parameter 3) |
Phylip kitsch program output file | Output file | <*>.fkitsch | ||||||
Additional (Optional) qualifiers | Allowed values | Default | |||||||
-matrixtype | Type of data matrix |
|
s | ||||||
-minev | Minimum evolution | Boolean value Yes/No | No | ||||||
-njumble | Number of times to randomise | Integer 0 or more | 0 | ||||||
-seed | Random number seed between 1 and 32767 (must be odd) | Integer from 1 to 32767 | 1 | ||||||
-power | Power | Any numeric value | 2.0 | ||||||
-negallowed | Negative branch lengths allowed | Boolean value Yes/No | No | ||||||
-replicates | Subreplicates | Boolean value Yes/No | No | ||||||
-[no]trout | Write out trees to tree file | Toggle value Yes/No | Yes | ||||||
-outtreefile | Phylip tree output file (optional) | Output file | <*>.fkitsch | ||||||
-printdata | Print data at start of run | Boolean value Yes/No | No | ||||||
-[no]progress | Print indications of progress of run | Boolean value Yes/No | Yes | ||||||
-[no]treeprint | Print out tree | Boolean value Yes/No | Yes | ||||||
Advanced (Unprompted) qualifiers | Allowed values | Default | |||||||
(none) |
((D,E),(C,(A,B)));
If a tree with a trifurcation at the base is by mistake fed into the U option of KITSCH then some of its species (the entire rightmost furc, in fact) will be ignored and too small a tree read in. This should result in an error message and the program should stop. It is important to understand the difference between the User Tree formats for KITSCH and FITCH. You may want to use RETREE to convert a user tree that is suitable for FITCH into one suitable for KITSCH or vice versa.
7 Bovine 0.0000 1.6866 1.7198 1.6606 1.5243 1.6043 1.5905 Mouse 1.6866 0.0000 1.5232 1.4841 1.4465 1.4389 1.4629 Gibbon 1.7198 1.5232 0.0000 0.7115 0.5958 0.6179 0.5583 Orang 1.6606 1.4841 0.7115 0.0000 0.4631 0.5061 0.4710 Gorilla 1.5243 1.4465 0.5958 0.4631 0.0000 0.3484 0.3083 Chimp 1.6043 1.4389 0.6179 0.5061 0.3484 0.0000 0.2692 Human 1.5905 1.4629 0.5583 0.4710 0.3083 0.2692 0.0000 |
7 Populations Fitch-Margoliash method with contemporary tips, version 3.68 __ __ 2 \ \ (Obs - Exp) Sum of squares = /_ /_ ------------ 2 i j Obs negative branch lengths not allowed +-------Human +-6 +----5 +-------Chimp ! ! +---4 +---------Gorilla ! ! +------------------------3 +--------------Orang ! ! +----2 +------------------Gibbon ! ! --1 +-------------------------------------------Mouse ! +------------------------------------------------Bovine Sum of squares = 0.107 Average percent standard deviation = 5.16213 From To Length Height ---- -- ------ ------ 6 Human 0.13460 0.81285 5 6 0.02836 0.67825 6 Chimp 0.13460 0.81285 4 5 0.07638 0.64990 5 Gorilla 0.16296 0.81285 3 4 0.06639 0.57352 4 Orang 0.23933 0.81285 2 3 0.42923 0.50713 3 Gibbon 0.30572 0.81285 1 2 0.07790 0.07790 2 Mouse 0.73495 0.81285 1 Bovine 0.81285 0.81285 |
((((((Human:0.13460,Chimp:0.13460):0.02836,Gorilla:0.16296):0.07638, Orang:0.23933):0.06639,Gibbon:0.30572):0.42923,Mouse:0.73495):0.07790, Bovine:0.81285); |
Program name | Description |
---|---|
efitch | Fitch-Margoliash and Least-Squares Distance Methods |
ekitsch | Fitch-Margoliash method with contemporary tips |
eneighbor | Phylogenies from distance matrix by N-J or UPGMA method |
ffitch | Fitch-Margoliash and Least-Squares Distance Methods |
fneighbor | Phylogenies from distance matrix by N-J or UPGMA method |
Although we take every care to ensure that the results of the EMBOSS version are identical to those from the original package, we recommend that you check your inputs give the same results in both versions before publication.
Please report all bugs in the EMBOSS version to the EMBOSS bug team, not to the original author.
Converted (August 2004) to an EMBASSY program by the EMBOSS team.