EMBOSS: C2 Input Formats

Input formats are specified in "Uniform Sequence Addresses" (USAs) of the form "format::filename".
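As an illustration only, the "format::filename" form can be split as sketched below. The function name split_usa is hypothetical and is not part of EMBOSS; the sketch handles only the form described above and assumes that a string without "::" is a bare filename with no format given.

```python
def split_usa(usa):
    """Split 'format::filename' into (format, filename).

    Returns (None, usa) when no '::' separator is present, which is
    taken here to mean that no format was specified (an assumption).
    """
    fmt, sep, filename = usa.partition("::")
    if sep:
        return fmt, filename
    return None, usa


print(split_usa("fasta::myseq.fa"))
print(split_usa("myseq.fa"))
```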

The following formats are currently supported, reading sequences one at a time:

FASTA with optional accession number and description
NCBI a version of FASTA with NCBI entry/accession convention and description
GCG using the "Length:" value to read multiple files
EMBL
SWISS (SW)
GENBANK
NBRF
IG
CODATA
STRIDER
ACEDB
STADEN currently the same as TEXT but needs extensions for sequence ambiguity codes and optional identifier
TEXT (also called PLAIN or RAW to match readseq)

The aim is to include all formats that readseq can accept, plus some other recent additions such as ACEDB.

The readseq formats not yet implemented are: DNAStrider, Fitch, Zuker, Olsen, Phylip3.2, ASN.1 and PAUP/NEXUS.

The latest version of readseq supports more formats, but is implemented in Java. EMBOSS aims to support these additional formats as well.

EMBOSS has a list of known formats, and a list of formats to be tried in turn for a sequence of unknown format. There is also a default list of formats in the variable EMBOSS_FORMAT.
Any specified format must work. A failure should produce an error message and abort the application.
Only if no format is specified should EMBOSS try alternatives until it succeeds.
GCG format must be tested before other formats, by checking for the ".." line.
For database definitions, a format must always be specified. A missing format is an error.
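The selection rules above can be sketched roughly as follows. This is not the EMBOSS implementation: the names choose_format, DETECTORS, TRY_ORDER and FormatError are hypothetical, the detector functions are deliberately crude stand-ins for real format readers, and only three formats are shown. An exception stands in for the error message and abort.

```python
class FormatError(Exception):
    """Raised where a real application would report an error and abort."""


def looks_like_gcg(text):
    # GCG is recognised here by a header line ending in ".."
    return any(line.rstrip().endswith("..") for line in text.splitlines())


def looks_like_fasta(text):
    return text.lstrip().startswith(">")


def looks_like_text(text):
    # Plain text accepts anything, so it must be tried last.
    return True


DETECTORS = {
    "gcg": looks_like_gcg,
    "fasta": looks_like_fasta,
    "text": looks_like_text,
}

# GCG comes first so that its ".." line is checked before other formats.
TRY_ORDER = ["gcg", "fasta", "text"]


def choose_format(text, specified=None):
    """Return the format name to use for text."""
    if specified is not None:
        # A specified format must work; no alternatives are tried.
        detector = DETECTORS.get(specified)
        if detector is None or not detector(text):
            raise FormatError("cannot read input as " + specified)
        return specified
    # Only when no format is given are alternatives tried in order.
    for name in TRY_ORDER:
        if DETECTORS[name](text):
            return name
    raise FormatError("no known format matched")
```

For a database definition, a caller following these rules would treat specified=None as an error rather than falling through to the trial list.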
GCG 9.x has a special first line. This can be used if found, but is not required. It is used to set the sequence type, but the type in the GCG ".." line overrides it.
The length on the GCG ".." line is trusted. EMBOSS reads until the expected number of bases has been read, then stops. For this reason, a GCG format sequence that has been edited should be reformatted before use.
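A minimal sketch of trusting the "Length:" value is shown below. The function name read_gcg is hypothetical, and the sketch makes simplifying assumptions: only alphabetic characters after the ".." line are counted as residues, and the type and checksum fields are ignored.

```python
import re


def read_gcg(text):
    """Return the sequence from GCG-style text, trusting 'Length:'."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            break
    else:
        raise ValueError("no '..' line found")

    match = re.search(r"Length:\s*(\d+)", line)
    if match is None:
        raise ValueError("no Length: value on the '..' line")
    expected = int(match.group(1))

    # Collect residues after the ".." line, skipping position numbers
    # and spaces (assumption: residues are alphabetic characters only).
    residues = "".join(
        ch for body in lines[index + 1:] for ch in body if ch.isalpha()
    )
    if len(residues) < expected:
        raise ValueError("fewer residues than the stated length")
    # Anything beyond the stated length is ignored.
    return residues[:expected]


example = (
    "demo sequence\n"
    "\n"
    "demo  Length: 8  Type: N  Check: 0  ..\n"
    "\n"
    "       1  ACGTAC GTTTTT\n"
)
print(read_gcg(example))  # prints ACGTACGT
```

The example input states a length of 8 but carries 12 residues, as might happen after hand editing; the extra residues are silently dropped, which is why reformatting before use is advised.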

Staden format is now known as "experiment file format" and has been renamed to "Experiment", although "Staden" still exists as an alternative name.
Most formats (including GCG for EMBOSS) allow multiple sequences to be saved in one file. There is a command line option "-[no]ossingle" to control writing to separate files where there is a choice.
Additional multiple sequence input formats should include: Pfam, ProDom, HMMER 2.0 and Geoff Barton's block format.
Files of USAs are useful. These work like VMS lists with an "@filename" syntax, as also used by GCG. You can also say "list::filename", as "list" is treated as a special format. There is a known problem: a format given on the command line is currently ignored during list file processing.
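List file handling might look roughly like the sketch below. The function name expand_usa is hypothetical. The sketch assumes one USA per line, that blank lines are skipped, and that a list file may itself name further list files; none of these details are confirmed by the text above. It does not guard against a list file that refers to itself.

```python
def expand_usa(usa):
    """Expand '@filename' or 'list::filename' into the USAs it lists."""
    if usa.startswith("@"):
        path = usa[1:]
    elif usa.startswith("list::"):
        path = usa[len("list::"):]
    else:
        # Not a list file: return the USA unchanged.
        return [usa]

    expanded = []
    with open(path) as handle:
        for line in handle:
            entry = line.strip()
            if entry:
                # Assumption: entries may themselves be list files.
                expanded.extend(expand_usa(entry))
    return expanded
```

With a file "seqs.list" containing the lines "fasta::a.fa" and "gcg::b.seq", both expand_usa("@seqs.list") and expand_usa("list::seqs.list") would give those two USAs in order.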
USAs are useful. The format syntax is reasonable but can be changed if a good alternative is suggested. We are still considering allowing a list of formats with "," delimiters to be used as alternatives. This would also be available for use as a value for "-sformat" and for format control variables.