ffactor

 

Wiki

The master copies of EMBOSS documentation are available at http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.

Please help by correcting and extending the Wiki pages.

Function

Multistate to binary recoding program

Description

Takes discrete multistate data with character state trees and produces the corresponding data set with two states (0 and 1). Written by Christopher Meacham. This program was formerly used to accomodate multistate characters in MIX, but this is less necessary now that PARS is available.

Algorithm

This program factors a data set that contains multistate characters, creating a data set consisting entirely of binary (0,1) characters that, in turn, can be used as input to any of the other discrete character programs in this package, except for PARS. Besides this primary function, FACTOR also provides an easy way of deleting characters from a data set. The input format for FACTOR is very similar to the input format for the other discrete character programs except for the addition of character-state tree descriptions.

Note that this program has no way of converting an unordered multistate character into binary characters. Fortunately, PARS has joined the package, and it enables unordered multistate characters, in which any state can change to any other in one step, to be analyzed with parsimony.

FACTOR is really for a different case, that in which there are multiple states related on a "character state tree", which specifies for each state which other states it can change to. That graph of states is assumed to be a tree, with no loops in it.

Usage

Here is a sample session with ffactor


% ffactor 
Multistate to binary recoding program
Phylip factor program input file: factor.dat
Phylip factor program output file [factor.ffactor]: 


Data matrix written on file "factor.ffactor"

Done.


Go to the input files for this example
Go to the output files for this example

Command line arguments

   Standard (Mandatory) qualifiers:
  [-infile]            infile     Phylip factor program input file
  [-outfile]           outfile    [*.ffactor] Phylip factor program output
                                  file

   Additional (Optional) qualifiers:
   -anc                boolean    [N] Put ancestral states in output file
   -factors            boolean    [N] Put factors information in output file
   -[no]progress       boolean    [Y] Print indications of progress of run

   Advanced (Unprompted) qualifiers:
   -outfactorfile      outfile    [*.ffactor] Phylip factor data output file
                                  (optional)
   -outancfile         outfile    [*.ffactor] Phylip ancestor data output file
                                  (optional)

   Associated qualifiers:

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   "-outfactorfile" associated qualifiers
   -odirectory         string     Output directory

   "-outancfile" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Standard (Mandatory) qualifiers Allowed values Default
[-infile]
(Parameter 1)
Phylip factor program input file Input file Required
[-outfile]
(Parameter 2)
Phylip factor program output file Output file <*>.ffactor
Additional (Optional) qualifiers Allowed values Default
-anc Put ancestral states in output file Boolean value Yes/No No
-factors Put factors information in output file Boolean value Yes/No No
-[no]progress Print indications of progress of run Boolean value Yes/No Yes
Advanced (Unprompted) qualifiers Allowed values Default
-outfactorfile Phylip factor data output file (optional) Output file <*>.ffactor
-outancfile Phylip ancestor data output file (optional) Output file <*>.ffactor

Input file format

ffactor reads character state tree data.

This program factors a data set that contains multistate characters, creating a data set consisting entirely of binary (0,1) characters that, in turn, can be used as input to any of the other discrete character programs in this package, except for PARS. Besides this primary function, FACTOR also provides an easy way of deleting characters from a data set. The input format for FACTOR is very similar to the input format for the other discrete character programs except for the addition of character-state tree descriptions.

Note that this program has no way of converting an unordered multistate character into binary characters. Fortunately, PARS has joined the package, and it enables unordered multistate characters, in which any state can change to any other in one step, to be analyzed with parsimony.

FACTOR is really for a different case, that in which there are multiple states related on a "character state tree", which specifies for each state which other states it can change to. That graph of states is assumed to be a tree, with no loops in it.

First line

The first line of the input file should contain the number of species and the number of multistate characters. This first line is followed by the lines describing the character-state trees, one description per line. The species information constitutes the last part of the file. Any number of lines may be used for a single species.

The first line is free format with the number of species first, separated by at least one blank (space) from the number of multistate characters, which in turn is separated by at least one blank from the options, if present.

Character-state tree descriptions

The character-state trees are described in free format. The character number of the multistate character is given first followed by the description of the tree itself. Each description must be completed on a single line. Each character that is to be factored must have a description, and the characters must be described in the order that they occur in the input, that is, in numerical order.

The tree is described by listing the pairs of character states that are adjacent to each other in the character-state tree. The two character states in each adjacent pair are separated by a colon (":"). If character fifteen has this character state tree for possible states "A", "B", "C", and "D":

                         A ---- B ---- C
                                |
                                |
                                |
                                D

then the character-state tree description would be

                        15  A:B B:C D:B

Note that either symbol may appear first. The ancestral state is identified, if desired, by putting it "adjacent" to a period. If we wanted to root character fifteen at state C:

                         A <--- B <--- C
                                |
                                |
                                V
                                D

we could write

                      15  B:D A:B C:B .:C

Both the order in which the pairs are listed and the order of the symbols in each pair are arbitrary. However, each pair may only appear once in the list. Any symbols may be used for a character state in the input except the character that signals the connection between two states (in the distribution copy this is set to ":"), ".", and, of course, a blank. Blanks are ignored completely in the tree description so that even B:DA:BC:B.:C or B : DA : BC : B. : C would be equivalent to the above example. However, at least one blank must separate the character number from the tree description.

Deleting characters from a data set

If no description line appears in the input for a particular character, then that character will be omitted from the output. If the character number is given on the line, but no character-state tree is provided, then the symbol for the character in the input will be copied directly to the output without change. This is useful for characters that are already coded "0" and "1". Characters can be deleted from a data set simply by listing only those that are to appear in the output.

Terminating the list of tree descriptions

The last character-state tree description should be followed by a line containing the number "999". This terminates processing of the trees and indicates the beginning of the species information.

Species information

The format for the species information is basically identical to the other discrete character programs. The first ten character positions are allotted to the species name (this value may be changed by altering the value of the constant nmlngth at the beginning of the program). The character states follow and may be continued to as many lines as desired. There is no current method for indicating polymorphisms. It is possible to either put blanks between characters or not.

There is a method for indicating uncertainty about states. There is one character value that stands for "unknown". If this appears in the input data then "?" is written out in all the corresponding positions in the output file. The character value that designates "unknown" is given in the constant unkchar at the beginning of the program, and can be changed by changing that constant. It is set to "?" in the distribution copy.

Input files for usage example

File: factor.dat

   4   6
1 A:B B:C
2 A:B B:.
4
5 0:1 1:2 .:0
6 .:# #:$ #:%
999
Alpha     CAW00#
Beta      BBX01%
Gamma     ABY12#
Epsilon   CAZ01$

Output file format

The first line of ffactor output will contain the number of species and the number of binary characters in the factored data set followed by the letter "A" if the A option was specified in the input. If option F was specified, the next line will begin "FACTORS". If option A was specified, the line describing the ancestor will follow next. Finally, the factored characters will be written for each species in the format required for input by the other discrete programs in the package. The maximum length of the output lines is 80 characters, but this maximum length can be changed prior to compilation.

In fact, the format of the output file for the A and F options is not correct for the current release of PHYLIP. We need to change their output to write a factors file and an ancestors file instead of putting the Factors and Ancestors information into the data file.

Output files for usage example

File: factor.ffactor

    4    5
Alpha     CA00#
Beta      BB01%
Gamma     AB12#
Epsilon   CA01$

File: factor.factor


File: factor.ancestor


Data files

None

Notes

None.

References

None.

Warnings

None.

Diagnostic Error Messages

The output should be checked for error messages. Errors will occur in the character-state tree descriptions if the format is incorrect (colons in the wrong place, etc.), if more than one root is specified, if the tree contains loops (and hence is not a tree), and if the tree is not connected, e.g.


                             A:B B:C D:E

describes

                  A ---- B ---- C          D ---- E

This "tree" is in two unconnected pieces. An error will also occur if a symbol appears in the data set that is not in the tree description for that character. Blanks at the end of lines when the species information is continued to a new line will cause this kind of error.

Exit status

It always exits with status 0.

Known bugs

None.

See also

Program name Description
eclique Largest clique program
edollop Dollo and polymorphism parsimony algorithm
edolpenny Penny algorithm Dollo or polymorphism
efactor Multistate to binary recoding program
emix Mixed parsimony algorithm
epenny Penny algorithm, branch-and-bound
fclique Largest clique program
fdollop Dollo and polymorphism parsimony algorithm
fdolpenny Penny algorithm Dollo or polymorphism
fmix Mixed parsimony algorithm
fmove Interactive mixed method parsimony
fpars Discrete character parsimony
fpenny Penny algorithm, branch-and-bound

Author(s)

This program is an EMBOSS conversion of a program written by Joe Felsenstein as part of his PHYLIP package.

Although we take every care to ensure that the results of the EMBOSS version are identical to those from the original package, we recommend that you check your inputs give the same results in both versions before publication.

Please report all bugs in the EMBOSS version to the EMBOSS bug team, not to the original author.

History

Written (2004) - Joe Felsenstein, University of Washington.

Converted (August 2004) to an EMBASSY program by the EMBOSS team.

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.