EMBOSS: Sequence Features


We recently added support for feature tables to EMBOSS.

An application can now read a feature table along with a sequence, and can write its results either as a feature table on its own or as a sequence with an accompanying feature table.

The first format supported is the "General Feature Format" (GFF), which is used at the Sanger Centre and other institutes to exchange the results of gene-finding and other programs.

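GFF is a tab-delimited format in which each line has the following fields:

  seqname, source, feature, start, end, score, strand, frame,
  followed by an optional group of tag/value fields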

For example:

seq1     BLASTX  similarity   101  235 87.1 + 0 Target "HBA_HUMAN" 11 55 ; E_value 0.0003
dJ102G20 GD_mRNA coding_exon 7105 7201   .  - 2 Sequence "dJ102G20.C1.1"

Features that share the same "Sequence" tag value are grouped into a single set of start/end positions, as "join" does in the EMBL feature table.
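
As a rough illustration of the line layout and of grouping by the "Sequence" tag, the Python sketch below splits a GFF line into its fields and collects the start/end positions that share a tag value. It is not part of EMBOSS: the names GffFeature, parse_gff_line and group_by_sequence_tag are hypothetical, and splitting on any whitespace is an assumption made only so that the space-aligned example lines can be used as input (real GFF is tab delimited).

```python
import re
from collections import defaultdict
from typing import NamedTuple, Optional


class GffFeature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]  # None when the score field is "."
    strand: str
    frame: str
    group: str              # free-text tag/value field, may be empty


# The two example lines from this page, kept space-aligned as shown.
EXAMPLE_LINES = [
    'seq1     BLASTX  similarity   101  235 87.1 + 0 '
    'Target "HBA_HUMAN" 11 55 ; E_value 0.0003',
    'dJ102G20 GD_mRNA coding_exon 7105 7201   .  - 2 '
    'Sequence "dJ102G20.C1.1"',
]


def parse_gff_line(line: str) -> GffFeature:
    # Assumption: split on any whitespace, at most 8 times, so that the
    # remainder of the line is kept whole as the tag/value field.
    parts = line.rstrip("\r\n").split(None, 8)
    if len(parts) < 8:
        raise ValueError(f"expected at least 8 fields, got {len(parts)}")
    seqname, source, feature, start, end, score, strand, frame = parts[:8]
    group = parts[8] if len(parts) > 8 else ""
    return GffFeature(
        seqname, source, feature, int(start), int(end),
        None if score == "." else float(score), strand, frame, group,
    )


# Matches a tag/value field that begins with: Sequence "name"
_SEQUENCE_TAG = re.compile(r'Sequence\s+"([^"]*)"')


def group_by_sequence_tag(features):
    """Map each "Sequence" tag value to its list of (start, end) pairs."""
    groups = defaultdict(list)
    for feat in features:
        match = _SEQUENCE_TAG.match(feat.group)
        if match:
            groups[match.group(1)].append((feat.start, feat.end))
    return dict(groups)


if __name__ == "__main__":
    parsed = [parse_gff_line(line) for line in EXAMPLE_LINES]
    print(group_by_sequence_tag(parsed))
```

Lines without a "Sequence" tag, such as the BLASTX similarity line, are simply left out of the grouping.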

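In EMBOSS, features are addressed through the Uniform Feature Object (UFO), whose syntax looks like that of the Uniform Sequence Address (USA) used for sequences. By default, features are read from and written to a file called "seqname.gff", where "seqname" is the name of the sequence.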

The GFF maintainers have agreed that GFF can be used for protein features by ignoring the "strand" and "frame" fields.
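
A minimal sketch of what writing such a protein feature line might look like is given below. It assumes that "." is written as the placeholder in the strand and frame columns, which is one way of marking them as ignored; the function name format_protein_feature and the sample values are hypothetical and do not come from EMBOSS.

```python
def format_protein_feature(seqname, source, feature, start, end,
                           score=None, group=""):
    """Build one tab-delimited GFF line for a protein feature."""
    fields = [
        seqname, source, feature, str(start), str(end),
        "." if score is None else str(score),
        ".",  # strand: not meaningful for protein (assumed placeholder)
        ".",  # frame: not meaningful for protein (assumed placeholder)
    ]
    if group:
        fields.append(group)
    return "\t".join(fields)


if __name__ == "__main__":
    # Hypothetical protein feature, for illustration only.
    print(format_protein_feature("prot1", "demo", "domain", 10, 50))
```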

GFF has the advantage that each line includes the sequence ID, so a feature table can always be linked back to its sequence. This allows one GFF file to hold features for many sequences, and allows many GFF files to describe one sequence (for example, when the results of several programs are to be merged).
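
Because the sequence ID is the first field of every line, merging the output of several programs can be as simple as collecting lines under that ID. The Python sketch below shows the idea under stated assumptions: the function name merge_gff_by_sequence is hypothetical, input is passed as text rather than file names, and lines starting with "#" are treated as comments and skipped.

```python
from collections import defaultdict


def merge_gff_by_sequence(*gff_texts):
    """Collect feature lines from several GFF texts, keyed by sequence ID."""
    merged = defaultdict(list)
    for text in gff_texts:
        for line in text.splitlines():
            line = line.rstrip("\r\n")
            # Skip blank lines and comment lines.
            if not line.strip() or line.startswith("#"):
                continue
            seqname = line.split(None, 1)[0]  # first field is the sequence ID
            merged[seqname].append(line)
    return dict(merged)


if __name__ == "__main__":
    # Hypothetical output of two programs, for illustration only.
    prog_a = "seq1\tprogA\texon\t1\t10\t.\t+\t0\nseq2\tprogA\texon\t5\t20\t.\t-\t1\n"
    prog_b = "# produced by progB\nseq1\tprogB\tsimilarity\t3\t8\t12.5\t+\t.\n"
    for seqname, lines in merge_gff_by_sequence(prog_a, prog_b).items():
        print(seqname, len(lines))
```

Features keep the order in which they were read, so the source field (column 2) still identifies which program produced each line.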

EMBOSS reads and writes feature tables with sequences automatically. Adding "feature: Y" to the input and output of "seqret" in the ACD file creates an application that can automatically read and write any feature table format, with no changes to the 10 lines of source code.

Issues:

  1. Is the UFO syntax suitable?
  2. How should upper and lower case sequence names be handled in setting file names?
  3. What other feature formats would be most useful?
  4. How easy is it to inter-convert EMBL and GFF feature types?
  5. How can we mark up the tag value fields for EMBOSS output?