EMBOSS: C2 Input Formats


Input formats are specified as part of a "Uniform Sequence Address" (USA), written in the form "format::filename".
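As a rough illustration of the "format::filename" form, the sketch below splits such an address into its two parts. It is written in Python for brevity and is not EMBOSS code: the function name split_usa is hypothetical, and the sketch assumes the simplest case only, an optional format name followed by "::" and a filename.

```python
def split_usa(usa):
    """Split a simplified "format::filename" address.

    Hypothetical helper for illustration only. Returns a tuple
    (format, filename); format is None when no "::" prefix is present.
    """
    prefix, separator, rest = usa.partition("::")
    if separator == "":
        # No "::" found, so the whole string is treated as the filename.
        return None, usa
    return prefix.lower(), rest


print(split_usa("fasta::myseq.fa"))
print(split_usa("myseq.fa"))
```

When the format part is absent, the caller has to fall back on trying formats in turn, as described under Issues.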

Simple formats:

The following formats are currently supported, reading sequences one at a time:

Multiple sequence formats:

The aim is to include all formats that readseq can accept, plus some other recent additions such as ACEDB.

The readseq formats not yet implemented are DNAStrider, Fitch, Zuker, Olsen, Phylip3.2, ASN.1 and PAUP/NEXUS.

The latest version of readseq supports more formats, but is implemented in Java. EMBOSS aims to cover these additional formats as well.

Issues:

  1. EMBOSS has a list of known formats, and a list of formats to be tried in turn when the format of a sequence is unknown. There is also a default list of formats in the variable EMBOSS_FORMAT.
  2. Any specified format must work. A failure should produce an error message and abort the application.
  3. Only if no format is specified should EMBOSS try alternatives until it succeeds.
  4. GCG format must be tried before the other formats, so that the ".." line is checked for first.
  5. For database definitions, a format must always be specified. A missing format is an error.
  6. GCG 9.x has a special first line. This can be used if found, but is not required. It is used to set the sequence type, but the type in the GCG ".." line overrides it.
  7. The length on the GCG ".." line is trusted. If a GCG format sequence is edited, it should be reformatted before use. EMBOSS reads until the expected number of bases has been read, then stops.
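
The selection rules in points 2, 3, 4 and 7 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the EMBOSS implementation: the names read_gcg, read_fasta, read_sequence, READERS, TRY_ORDER and SequenceFormatError are all hypothetical, only two toy readers are included, and the GCG reader simply counts alphabetic characters after the ".." line.

```python
import re


class SequenceFormatError(Exception):
    """Raised when the input cannot be read in the required format."""


def read_gcg(text):
    """Toy GCG reader: locate the '..' line and trust its Length field."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            match = re.search(r"Length:\s*(\d+)", line)
            if match is None:
                return None
            expected = int(match.group(1))
            residues = []
            for seq_line in lines[i + 1:]:
                for ch in seq_line:
                    if ch.isalpha():  # skip position numbers and spaces
                        residues.append(ch)
                        if len(residues) == expected:
                            # Stop as soon as the stated length is reached.
                            return "".join(residues)
            return None  # fewer residues than the '..' line stated
    return None  # no '..' line, so this is not GCG format


def read_fasta(text):
    """Toy FASTA reader: first record only."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith(">"):
        return None
    residues = []
    for line in lines[1:]:
        if line.startswith(">"):
            break
        residues.append(line.strip())
    return "".join(residues)


READERS = {"gcg": read_gcg, "fasta": read_fasta}
TRY_ORDER = ["gcg", "fasta"]  # GCG is tried first (point 4)


def read_sequence(text, fmt=None):
    """Return (format, sequence), or raise SequenceFormatError."""
    if fmt is not None:
        # A specified format must work; no alternatives are tried (point 2).
        reader = READERS.get(fmt)
        if reader is None:
            raise SequenceFormatError("unknown format: " + fmt)
        sequence = reader(text)
        if sequence is None:
            raise SequenceFormatError("input is not valid " + fmt)
        return fmt, sequence
    # Only with no format given are alternatives tried in turn (point 3).
    for name in TRY_ORDER:
        sequence = READERS[name](text)
        if sequence is not None:
            return name, sequence
    raise SequenceFormatError("no known format matched")
```

In this sketch a caller that passes fmt="gcg" for a FASTA file gets an error rather than a silent fallback, while a caller that passes no format gets whichever reader succeeds first. Because the GCG reader stops at the stated length, any residues added after that point by hand editing are ignored.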

Other points