EMBOSS: Sequence Features

We recently added support for feature tables to EMBOSS.

EMBOSS can read a feature table together with a sequence, and can write results either to a feature table on its own or as a sequence with a feature table.

The first format supported is "General Feature Format" (GFF), which is used at the Sanger Centre and other institutes to exchange the results of gene finding and other programs.

GFF is a tab-delimited format in which each line has the following fields:

seqname: The name of the sequence.
source: The source of this feature. This field is normally used to indicate the program that made the prediction.
feature: The feature type name.
start: Integer. Sequence numbering starts at 1.
end: Integer. Start must be less than or equal to end.
score: A floating point value. Use '.' when there is no score.
strand: One of '+', '-' or '.'. Use '.' when strand is not relevant.
frame: One of '0', '1', '2' or '.'. Use '.' when frame is not relevant.
group: An optional string-valued field. It must have a tag-value structure following the syntax used within objects in a .ace file, flattened onto one line with semicolon separators.

For example:

seq1     BLASTX  similarity   101  235 87.1 + 0 Target "HBA_HUMAN" 11 55 ; E_value 0.0003
dJ102G20 GD_mRNA coding_exon 7105 7201   .  - 2 Sequence "dJ102G20.C1.1"
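
The field list above can be sketched as a small parser. This is only an illustration in Python, not the EMBOSS implementation: the names GffRecord and parse_gff_line are made up for this example, and the input is assumed to be strictly tab-delimited (the example lines above are shown with aligned spaces for readability).

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    """One GFF line. Hypothetical container, not an EMBOSS type."""
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]  # None when the field is '.'
    strand: str             # '+', '-' or '.'
    frame: str              # '0', '1', '2' or '.'
    group: str              # raw text of the optional ninth field


def parse_gff_line(line: str) -> GffRecord:
    """Split one tab-delimited GFF line and check the documented constraints."""
    # At most 9 fields; anything after the eighth tab belongs to "group".
    fields = line.rstrip("\n").split("\t", 8)
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    group = fields[8] if len(fields) > 8 else ""

    start_pos, end_pos = int(start), int(end)
    if start_pos < 1 or start_pos > end_pos:
        raise ValueError(f"bad coordinates: {start_pos}..{end_pos}")
    if strand not in ("+", "-", "."):
        raise ValueError(f"bad strand: {strand!r}")
    if frame not in ("0", "1", "2", "."):
        raise ValueError(f"bad frame: {frame!r}")

    return GffRecord(seqname, source, feature, start_pos, end_pos,
                     None if score == "." else float(score),
                     strand, frame, group)


# The two example lines from the text, rebuilt with real tab separators.
blast_line = "\t".join(["seq1", "BLASTX", "similarity", "101", "235", "87.1",
                        "+", "0", 'Target "HBA_HUMAN" 11 55 ; E_value 0.0003'])
exon_line = "\t".join(["dJ102G20", "GD_mRNA", "coding_exon", "7105", "7201",
                       ".", "-", "2", 'Sequence "dJ102G20.C1.1"'])

blast_record = parse_gff_line(blast_line)
exon_record = parse_gff_line(exon_line)
print(blast_record)
print(exon_record)
```

The score is stored as None rather than 0.0 when the field is '.', so that "no score" stays distinguishable from a real score of zero.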

The "Sequence" tag is used to group a set of start/end positions, as for "join" in the EMBL feature table.
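
As a rough illustration of this grouping, the Python sketch below collects start/end pairs that share a "Sequence" tag value. The function names parse_group and join_by_sequence are invented for this example, and the second exon in the sample data is hypothetical. The group field is split on ';' and tokenised with shlex, which is a simplification: it assumes no semicolons or stray quote characters inside quoted values.

```python
import shlex
from collections import defaultdict


def parse_group(group):
    """Turn 'Tag "value" ... ; Tag2 value' into {tag: [values]}."""
    tags = {}
    for part in group.split(";"):
        tokens = shlex.split(part)  # strips the double quotes around values
        if tokens:
            tags[tokens[0]] = tokens[1:]
    return tags


def join_by_sequence(rows):
    """Group (start, end) pairs by the value of their "Sequence" tag.

    rows is an iterable of (start, end, group) tuples.
    Rows without a "Sequence" tag are ignored.
    """
    joined = defaultdict(list)
    for start, end, group in rows:
        names = parse_group(group).get("Sequence")
        if names:
            joined[names[0]].append((start, end))
    # Sort each set of spans by position, as a "join" would list them.
    return {name: sorted(spans) for name, spans in joined.items()}


rows = [
    (7105, 7201, 'Sequence "dJ102G20.C1.1"'),  # from the example above
    (7301, 7400, 'Sequence "dJ102G20.C1.1"'),  # hypothetical second exon
    (101, 235, 'Target "HBA_HUMAN" 11 55 ; E_value 0.0003'),  # no Sequence tag
]
spans = join_by_sequence(rows)
print(spans)
```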

In EMBOSS, features are specified by a Uniform Feature Object (UFO), whose syntax looks like that of a sequence USA (Uniform Sequence Address). By default, feature reading and writing use a file called "seqname.gff".

The GFF maintainers have agreed that GFF can be used for protein features by ignoring the "strand" and "frame" fields.

GFF has the advantage that each line includes the sequence ID, so a feature table can always be linked back to its sequence. This allows many sequences in one GFF file, or many GFF files for one sequence (for example, the results of several programs that are to be merged).
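
A minimal Python sketch of such a merge is given below. It is not how EMBOSS does it: merge_gff is a name made up for this example, the two in-memory "files" stand in for the output of two programs, and the third feature line is hypothetical. Skipping blank lines and lines starting with '#' is an assumption about the input files.

```python
import io
from collections import defaultdict


def merge_gff(streams):
    """Collect feature lines from several GFF streams, keyed by seqname."""
    by_sequence = defaultdict(list)
    for stream in streams:
        for line in stream:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue  # assumed: blank and comment lines carry no features
            seqname = line.split("\t", 1)[0]  # first tab-delimited field
            by_sequence[seqname].append(line)
    return dict(by_sequence)


def gff_line(*fields):
    """Join fields with tabs to build one GFF line for the demonstration."""
    return "\t".join(str(f) for f in fields) + "\n"


# Output of one program: features on two different sequences in one file.
file_a = io.StringIO(
    gff_line("seq1", "BLASTX", "similarity", 101, 235, 87.1, "+", 0,
             'Target "HBA_HUMAN" 11 55 ; E_value 0.0003')
    + gff_line("dJ102G20", "GD_mRNA", "coding_exon", 7105, 7201, ".", "-", 2,
               'Sequence "dJ102G20.C1.1"')
)
# Output of a second, hypothetical program: another feature on seq1.
file_b = io.StringIO(
    gff_line("seq1", "other_program", "exon", 150, 200, ".", "+", 0)
)

merged = merge_gff([file_a, file_b])
for name, lines in merged.items():
    print(name, len(lines))
```

Because the key is taken from the line itself, the same function works whether the input is one file with many sequences or many files for one sequence.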

EMBOSS reads and writes feature tables together with sequences automatically. Adding "feature: Y" to the input and output of "seqret" in the ACD file creates an application that can read and write any supported feature table format, with no changes to its 10 lines of source code.

Issues:

Is the UFO syntax suitable?
How should upper and lower case in sequence names be handled when setting file names?
What other feature formats would be most useful?
- EMBL/GenBank feature table
- SwissProt feature table
How easy is it to inter-convert EMBL and GFF feature types?
How can we mark up the tag value fields for EMBOSS output?