EMBOSS: fcontml

fcontml

Function

Gene frequency and continuous character Maximum Likelihood

Description

Estimates phylogenies from gene frequency data by maximum likelihood under a model in which all divergence is due to genetic drift in the absence of new mutations. Does not assume a molecular clock. An alternative method of analyzing this data is to compute Nei's genetic distance and use one of the distance matrix programs. This program can also do maximum likelihood analysis of continuous characters that evolve by a Brownian Motion model, but it assumes that the characters evolve at equal rates and in an uncorrelated fashion, so that it does not take into account the usual correlations of characters.

Algorithm

This program estimates phylogenies by the restricted maximum likelihood method based on the Brownian motion model. It is based on the model of Edwards and Cavalli-Sforza (1964; Cavalli-Sforza and Edwards, 1967). Gomberg (1966), Felsenstein (1973b, 1981c) and Thompson (1975) have done extensive further work leading to efficient algorithms. CONTML uses restricted maximum likelihood estimation (REML), which is the criterion used by Felsenstein (1973b). The actual algorithm is an iterative EM Algorithm (Dempster, Laird, and Rubin, 1977) which is guaranteed to always give increasing likelihoods. The algorithm is described in detail in a paper of mine (Felsenstein, 1981c), which you should definitely consult if you are going to use this program. Some simulation tests of it are given by Rohlf and Wooten (1988) and Kim and Burgman (1988).

The default (gene frequency) mode treats the input as gene frequencies at a series of loci, and square-root-transforms the allele frequencies (constructing the frequency of the missing allele at each locus first). This enables us to use the Brownian motion model on the resulting coordinates, in an approximation equivalent to using Cavalli-Sforza and Edwards's (1967) chord measure of genetic distance and taking that to give distance between particles undergoing pure Brownian motion. It assumes that each locus evolves independently by pure genetic drift.

The alternative continuous characters mode (menu option C) treats the input as a series of coordinates of each species in N dimensions. It assumes that we have transformed the characters to remove correlations and to standardize their variances.

A word about microsatellite data

Many current users of CONTML use it to analyze microsatellite data. There are three ways to do this:

Coding each copy number as an allele, and feeding in the frequencies of these alleles. As CONTML's gene frequency mode assumes that all change is by genetic drift, this means that no copy number arises by mutation during the divergence of the populations. Since microsatellite loci have very high mutation rates, this is questionable.
Use some other program, one not in the PHYLIP package, to compute distances among the populations. Some of the programs that can do this are RSTCalc, poptrfdos, Microsat, and Populations. Links to them can be found at my Phylogeny Programs web site at http://evolution.gs.washington.edu/phylip/software.html.
Those distance measures allow for mutation during the divergence of the populations. But even they are not perfect -- they do not allow us to use all the information contained in the gene frequency differences of within a copy number allele. There is a need for a more complete statistical treatment of inference of phylogenies from microsatellite models, ones that take both mutation and genetic drift fully into account.
Alternatively, there is the Brownian motion approximation to mean population copy number. This is described in my book (Felsenstein, 2004, Chapter 15, pp. 242-245), and it is implicit also in the microsatellite distances. Each locus is coded as a single continuous character, the mean of the copy number in at that microsatellite locus in that species. Thus if the species (or population) has frequencies 0.10, 0.24, 0.60, and 0.06 of alleles that have 18, 19, 20, and 21 copies, it is coded as having
```
0.10 X 18 + 0.24 X 19 + 0.60 X 20 + 0.06 X 21   =  19.62 
```
copies. These values can, I believe, be calculated by a spreadsheet program. Each microsatellite is represented by one character, and the continuous character mode of CONTML is used (not the gene frequencies mode). This coding allows for mutation that changes copy number. It does not make complete use of all data, but neither does the treatment of microsatellite gene frequencies as changing only genetic drift. frequency

Usage

Here is a sample session with fcontml

% fcontml -printdata Gene frequency and continuous character Maximum Likelihood Input file: contml.dat Phylip tree file (optional): Phylip contml program output file [contml.fcontml]: Adding species: 1. European 2. African 3. Chinese 4. American 5. Australian Output written to file "contml.fcontml" Tree also written onto file "contml.treefile" Done.

Go to the input files for this example
Go to the output files for this example

Command line arguments

   Standard (Mandatory) qualifiers:
  [-infile]            frequencies File containing one or more sets of data
  [-intreefile]        tree       Phylip tree file (optional)
  [-outfile]           outfile    [*.fcontml] Phylip contml program output
                                  file

   Additional (Optional) qualifiers (* if not always prompted):
   -datatype           menu       [g] Input type in infile (Values: g (Gene
                                  frequencies); i (Continuous characters))
*  -lengths            boolean    [N] Use branch lengths from user trees
*  -njumble            integer    [0] Number of times to randomise (Integer 0
                                  or more)
*  -seed               integer    [1] Random number seed between 1 and 32767
                                  (must be odd) (Integer from 1 to 32767)
*  -global             boolean    [N] Global rearrangements
   -outgrno            integer    [0] Species number to use as outgroup
                                  (Integer 0 or more)
   -[no]trout          toggle     [Y] Write out trees to tree file
*  -outtreefile        outfile    [*.fcontml] Phylip tree output file
                                  (optional)
   -printdata          boolean    [N] Print data at start of run
   -[no]progress       boolean    [Y] Print indications of progress of run
   -[no]treeprint      boolean    [Y] Print out tree

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-outfile" associated qualifiers
   -odirectory3        string     Output directory

   "-outtreefile" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

acdDelPhyloFreq '0'

Standard (Mandatory) qualifiers Allowed values Default

[-infile]
(Parameter 1) File containing one or more sets of data Frequency value(s)

[-intreefile]
(Parameter 2) Phylip tree file (optional) Phylogenetic tree

[-outfile]
(Parameter 3) Phylip contml program output file Output file <*>.fcontml

Additional (Optional) qualifiers Allowed values Default

-datatype Input type in infile
g (Gene frequencies)
i (Continuous characters)
g

-lengths Use branch lengths from user trees Boolean value Yes/No No

-njumble Number of times to randomise Integer 0 or more 0

-seed Random number seed between 1 and 32767 (must be odd) Integer from 1 to 32767 1

-global Global rearrangements Boolean value Yes/No No

-outgrno Species number to use as outgroup Integer 0 or more 0

-[no]trout Write out trees to tree file Toggle value Yes/No Yes

-outtreefile Phylip tree output file (optional) Output file <*>.fcontml

-printdata Print data at start of run Boolean value Yes/No No

-[no]progress Print indications of progress of run Boolean value Yes/No Yes

-[no]treeprint Print out tree Boolean value Yes/No Yes

Advanced (Unprompted) qualifiers Allowed values Default

(none)

acdDelPhyloFreq '0'

Input file format

fcontml reads continuous character data.

Continuous character data

The programs in this group use gene frequencies and quantitative character values. One (CONTML) constructs maximum likelihood estimates of the phylogeny, another (GENDIST) computes genetic distances for use in the distance matrix programs, and the third (CONTRAST) examines correlation of traits as they evolve along a given phylogeny.

When the gene frequencies data are used in CONTML or GENDIST, this involves the following assumptions:

Different lineages evolve independently.
After two lineages split, their characters change independently.
Each gene frequency changes by genetic drift, with or without mutation (this varies from method to method).
Different loci or characters drift independently.

How these assumptions affect the methods will be seen in my papers on inference of phylogenies from gene frequency and continuous character data (Felsenstein, 1973b, 1981c, 1985c).

The input formats are fairly similar to the discrete-character programs, but with one difference. When CONTML is used in the gene-frequency mode (its usual, default mode), or when GENDIST is used, the first line contains the number of species (or populations) and the number of loci and the options information. There then follows a line which gives the numbers of alleles at each locus, in order. This must be the full number of alleles, not the number of alleles which will be input: i. e. for a two-allele locus the number should be 2, not 1. There then follow the species (population) data, each species beginning on a new line. The first 10 characters are taken as the name, and thereafter the values of the individual characters are read free-format, preceded and separated by blanks. They can go to a new line if desired, though of course not in the middle of a number. Missing data is not allowed - an important limitation. In the default configuration, for each locus, the numbers should be the frequencies of all but one allele. The menu option A (All) signals that the frequencies of all alleles are provided in the input data -- the program will then automatically ignore the last of them. So without the A option, for a three-allele locus there should be two numbers, the frequencies of two of the alleles (and of course it must always be the same two!). Here is a typical data set without the A option:

     5    3
2 3 2
Alpha      0.90 0.80 0.10 0.56
Beta       0.72 0.54 0.30 0.20
Gamma      0.38 0.10 0.05  0.98
Delta      0.42 0.40 0.43 0.97
Epsilon    0.10 0.30 0.70 0.62

whereas here is what it would have to look like if the A option were invoked:

     5    3
2 3 2
Alpha      0.90 0.10 0.80 0.10 0.10 0.56 0.44
Beta       0.72 0.28 0.54 0.30 0.16 0.20 0.80
Gamma      0.38 0.62 0.10 0.05 0.85  0.98 0.02
Delta      0.42 0.58 0.40 0.43 0.17 0.97 0.03
Epsilon    0.10 0.90 0.30 0.70 0.00 0.62 0.38

The first line has the number of species (or populations) and the number of loci. The second line has the number of alleles for each of the 3 loci. The species lines have names (filled out to 10 characters with blanks) followed by the gene frequencies of the 2 alleles for the first locus, the 3 alleles for the second locus, and the 2 alleles for the third locus. You can start a new line after any of these allele frequencies, and continue to give the frequencies on that line (without repeating the species name).

If all alleles of a locus are given, it is important to have them add up to 1. Roundoff of the frequencies may cause the program to conclude that the numbers do not sum to 1, and stop with an error message.

While many compilers may be more tolerant, it is probably wise to make sure that each number, including the first, is preceded by a blank, and that there are digits both preceding and following any decimal points.

CONTML and CONTRAST also treat quantitative characters (the continuous-characters mode in CONTML, which is option C). It is assumed that each character is evolving according to a Brownian motion model, at the same rate, and independently. In reality it is almost always impossible to guarantee this. The issue is discussed at length in my review article in Annual Review of Ecology and Systematics (Felsenstein, 1988a), where I point out the difficulty of transforming the characters so that they are not only genetically independent but have independent selection acting on them. If you are going to use CONTML to model evolution of continuous characters, then you should at least make some attempt to remove genetic correlations between the characters (usually all one can do is remove phenotypic correlations by transforming the characters so that there is no within-population covariance and so that the within-population variances of the characters are equal -- this is equivalent to using Canonical Variates). However, this will only guarantee that one has removed phenotypic covariances between characters. Genetic covariances could only be removed by knowing the coheritabilities of the characters, which would require genetic experiments, and selective covariances (covariances due to covariation of selection pressures) would require knowledge of the sources and extent of selection pressure in all variables.

CONTRAST is a program designed to infer, for a given phylogeny that is provided to the program, the covariation between characters in a data set. Thus we have a program in this set that allow us to take information about the covariation and rates of evolution of characters and make an estimate of the phylogeny (CONTML), and a program that takes an estimate of the phylogeny and infers the variances and covariances of the character changes. But we have no program that infers both the phylogenies and the character covariation from the same data set.

In the quantitative characters mode, a typical small data set would be:

     5   6
Alpha      0.345 0.467 1.213  2.2  -1.2 1.0
Beta       0.457 0.444 1.1    1.987 -0.2 2.678
Gamma      0.6 0.12 0.97 2.3  -0.11 1.54
Delta      0.68  0.203 0.888 2.0  1.67
Epsilon    0.297  0.22 0.90 1.9 1.74

Note that in the latter case, there is no line giving the numbers of alleles at each locus. In this latter case no square-root transformation of the coordinates is done: each is assumed to give directly the position on the Brownian motion scale.

For further discussion of options and modifiable constants in CONTML, GENDIST, and CONTRAST see the documentation files for those programs.

Input files for usage example

File: contml.dat

    5    10
2 2 2 2 2 2 2 2 2 2
European   0.2868 0.5684 0.4422 0.4286 0.3828 0.7285 0.6386 0.0205
0.8055 0.5043
African    0.1356 0.4840 0.0602 0.0397 0.5977 0.9675 0.9511 0.0600
0.7582 0.6207
Chinese    0.1628 0.5958 0.7298 1.0000 0.3811 0.7986 0.7782 0.0726
0.7482 0.7334
American   0.0144 0.6990 0.3280 0.7421 0.6606 0.8603 0.7924 0.0000
0.8086 0.8636
Australian 0.1211 0.2274 0.5821 1.0000 0.2018 0.9000 0.9837 0.0396
0.9097 0.2976

Output file format

fcontml output has a standard appearance. The topology of the tree is given by an unrooted tree diagram. The lengths (in time or in expected amounts of variance) are given in a table below the topology, and a rough confidence interval given for each length. Negative lower bounds on length indicate that rearrangements may be acceptable.

The units of length are amounts of expected accumulated variance (not time). The log likelihood (natural log) of each tree is also given, and it is indicated how many topologies have been tried. The tree does not necessarily have all tips contemporary, and the log likelihood may be either positive or negative (this simply corresponds to whether the density function does or does not exceed 1) and a negative log likelihood does not indicate any error. The log likelihood allows various formal likelihood ratio hypothesis tests. The description of the tree includes approximate standard errors on the lengths of segments of the tree. These are calculated by considering only the curvature of the likelihood surface as the length of the segment is varied, holding all other lengths constant. As such they are most probably underestimates of the variance, and hence may give too much confidence in the given tree.

One should use caution in interpreting the likelihoods that are printed out. If the model is wrong, it will not be possible to use the likelihoods to make formal statistical statements. Thus, if gene frequencies are being analyzed, but the gene frequencies change not only by genetic drift, but also by mutation, the model is not correct. It would be as well-justified in this case to use GENDIST to compute the Nei (1972) genetic distance and then use FITCH, KITSCH or NEIGHBOR to make a tree. If continuous characters are being analyzed, but if the characters have not been transformed to new coordinates that evolve independently and at equal rates, then the model is also violated and no statistical analysis is possible. Doing such a transformation is not easy, and usually not even possible.

If the U (User Tree) option is used and more than one tree is supplied, the program also performs a statistical test of each of these trees against the one with highest likelihood. If there are two user trees, the test done is one which is due to Kishino and Hasegawa (1989), a version of a test originally introduced by Templeton (1983). In this implementation it uses the mean and variance of log-likelihood differences between trees, taken across loci. If the two trees means are more than 1.96 standard deviations different then the trees are declared significantly different. This use of the empirical variance of log-likelihood differences is more robust and nonparametric than the classical likelihood ratio test, and may to some extent compensate for the any lack of realism in the model underlying this program.

If there are more than two trees, the test done is an extension of the KHT test, due to Shimodaira and Hasegawa (1999). They pointed out that a correction for the number of trees was necessary, and they introduced a resampling method to make this correction. The version used here is a multivariate normal approximation to their test; it is due to Shimodaira (1998). The variances and covariances of the sum of log likelihoods across loci are computed for all pairs of trees. To test whether the difference between each tree and the best one is larger than could have been expected if they all had the same expected log-likelihood, log-likelihoods for all trees are sampled with these covariances and equal means (Shimodaira and Hasegawa's "least favorable hypothesis"), and a P value is computed from the fraction of times the difference between the tree's value and the highest log-likelihood exceeds that actually observed. Note that this sampling needs random numbers, and so the program will prompt the user for a random number seed if one has not already been supplied. With the two-tree KHT test no random numbers are used.

In either the KHT or the SH test the program prints out a table of the log-likelihoods of each tree, the differences of each from the highest one, the variance of that quantity as determined by the log-likelihood differences at individual sites, and a conclusion as to whether that tree is or is not significantly worse than the best one.

One problem which sometimes arises is that the program is fed two species (or populations) with identical transformed gene frequencies: this can happen if sample sizes are small and/or many loci are monomorphic. In this case the program "gets its knickers in a twist" and can divide by zero, usually causing a crash. If you suspect that this has happened, check for two species with identical coordinates. If you find them, eliminate one from the problem: the two must always show up as being at the same point on the tree anyway.

Output files for usage example

File: contml.fcontml


Continuous character Maximum Likelihood method version 3.67


   5 Populations,   10 Loci

Numbers of alleles at the loci:
------- -- ------- -- --- -----

   2   2   2   2   2   2   2   2   2   2

Name                 Gene Frequencies
----                 ---- -----------

  locus:         1         1         2         2         3         3
                 4         4         5         5         6         6
                 7         7         8         8         9         9
                10        10

European     0.28680   0.71320   0.56840   0.43160   0.44220   0.55780
             0.42860   0.57140   0.38280   0.61720   0.72850   0.27150
             0.63860   0.36140   0.02050   0.97950   0.80550   0.19450
             0.50430   0.49570
African      0.13560   0.86440   0.48400   0.51600   0.06020   0.93980
             0.03970   0.96030   0.59770   0.40230   0.96750   0.03250
             0.95110   0.04890   0.06000   0.94000   0.75820   0.24180
             0.62070   0.37930
Chinese      0.16280   0.83720   0.59580   0.40420   0.72980   0.27020
             1.00000   0.00000   0.38110   0.61890   0.79860   0.20140
             0.77820   0.22180   0.07260   0.92740   0.74820   0.25180
             0.73340   0.26660
American     0.01440   0.98560   0.69900   0.30100   0.32800   0.67200
             0.74210   0.25790   0.66060   0.33940   0.86030   0.13970
             0.79240   0.20760   0.00000   1.00000   0.80860   0.19140
             0.86360   0.13640
Australian   0.12110   0.87890   0.22740   0.77260   0.58210   0.41790
             1.00000   0.00000   0.20180   0.79820   0.90000   0.10000
             0.98370   0.01630   0.03960   0.96040   0.90970   0.09030
             0.29760   0.70240


  +-----------------------------------------------------------African   
  !  
  !             +-------------------------------Australian
  1-------------3  
  !             !     +-----------------------American  
  !             +-----2  
  !                   +Chinese   
  !  
  +European  


remember: this is an unrooted tree!

Ln Likelihood =    38.71914

Between     And             Length      Approx. Confidence Limits
-------     ---             ------      ------- ---------- ------
  1       African        0.09693444   (  0.03123910,  0.19853604)
  1          3           0.02252816   (  0.00089799,  0.05598045)
  3       Australian     0.05247405   (  0.01177094,  0.11542374)
  3          2           0.00945315   ( -0.00897717,  0.03795670)
  2       American       0.03806240   (  0.01095938,  0.07997877)
  2       Chinese        0.00208822   ( -0.00960622,  0.02017433)
  1       European       0.00000000   ( -0.01627246,  0.02516630)

File: contml.treefile

(African:0.09693444,(Australian:0.05247405,(American:0.03806240,Chinese:0.00208822):0.00945315):0.02252816,
European:0.00000000);

Data files

None

Notes

None.

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.

Known bugs

None.

Program name	Description
egendist	Genetic Distance Matrix program
fgendist	Compute genetic distances from gene frequencies

Author(s)

This program is an EMBOSS conversion of a program written by Joe Felsenstein as part of his PHYLIP package.

Although we take every care to ensure that the results of the EMBOSS version are identical to those from the original package, we recommend that you check your inputs give the same results in both versions before publication.

Please report all bugs in the EMBOSS version to the EMBOSS bug team, not to the original author.

History

Written (2004) - Joe Felsenstein, University of Washington.

Converted (August 2004) to an EMBASSY program by the EMBOSS team.

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Function

Description

Algorithm

A word about microsatellite data

Usage

Command line arguments

Input file format

Continuous character data

Input files for usage example

File: contml.dat

Output file format

Output files for usage example

File: contml.fcontml

File: contml.treefile

Data files

Notes

References

Warnings

Diagnostic Error Messages

Exit status

Known bugs

See also

Author(s)

History

Target users