EMBOSS: C2 Input Formats
Input formats are specified through "Uniform Sequence Addresses" (USAs)
of the form "format::filename".
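As a rough illustration of the "format::filename" form, the sketch below
splits a USA into its two parts. The function name split_usa is
hypothetical, and lowercasing the format name is an assumption made for
this example; neither is taken from the EMBOSS source.

```python
def split_usa(usa):
    """Split a USA of the form 'format::filename' into (format, filename).

    Returns (None, usa) when there is no 'format::' prefix.
    Lowercasing the format name is an assumption of this sketch.
    """
    fmt, sep, rest = usa.partition("::")
    if sep and fmt:
        return fmt.lower(), rest
    return None, usa


print(split_usa("fasta::myseq.fa"))  # ('fasta', 'myseq.fa')
print(split_usa("myseq.fa"))         # (None, 'myseq.fa')
```

A USA with no format prefix leaves the format undecided, which is the
case where the format has to be worked out from the file contents.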
Simple formats:
The following formats are currently supported, reading sequences
one at a time:
- FASTA with optional accession number and description
- NCBI, a version of FASTA with the NCBI entry/accession convention and
description
- GCG using the "Length:" value to read multiple files
- EMBL
- SWISS (SW)
- GENBANK
- NBRF
- IG
- CODATA
- STRIDER
- ACEDB
- STADEN, currently the same as TEXT, but it needs extensions for
sequence ambiguity codes and an optional identifier
- TEXT (also called PLAIN or RAW to match readseq)
Multiple sequence formats:
- MSF
- CLUSTAL (ALN files)
- PHYLIP
The aim is to include all formats that readseq can accept, plus
some other recent additions such as ACEDB.
The readseq formats not yet implemented are DNAStrider, Fitch, Zuker,
Olsen, Phylip3.2, ASN.1 and PAUP/NEXUS.
The latest version of readseq supports more formats, but is
implemented in Java. EMBOSS aims to support all formats that readseq
covers.
Issues:
- EMBOSS has a list of known formats, and a list of formats to be
tested for an unknown sequence. There is also a default list of formats
in the variable EMBOSS_FORMAT.
- Any specified format must work. A failure should produce an error
message and abort the application.
- Only if no format is specified should EMBOSS try alternatives
until it succeeds.
- GCG format must be tested before other formats, so that the ".."
line is checked for first.
- For database definitions, a format must always be specified.
A missing format is an error.
- GCG 9.x has a special first line. This can be used if found, but
is not required. It is used to set the sequence type, but the type in
the GCG ".." line overrides it.
- The length on the GCG ".." line is trusted. If a GCG format
sequence is edited, it should be reformatted before use. EMBOSS reads
until the expected number of bases has been read, then stops.
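The format selection rules in the list above can be sketched roughly as
follows. This is only an illustration under stated assumptions: the
probe functions, the FormatError class and the three-format table are
hypothetical stand-ins, and the probes are far cruder than real format
readers. What the sketch keeps from the text is the control flow: a
specified format must work or the run fails, alternatives are tried only
when no format is given, and GCG is tried first.

```python
class FormatError(Exception):
    """Raised when the input cannot be read in the required format."""


def looks_like_gcg(text):
    # Crude probe: a GCG file has a header line containing "..".
    return any(".." in line for line in text.splitlines())


def looks_like_fasta(text):
    # Crude probe: FASTA entries start with ">".
    return text.lstrip().startswith(">")


def looks_like_text(text):
    # Plain text accepts anything that is not empty.
    return bool(text.strip())


# Formats tried for an unknown sequence, in order. GCG comes first.
PROBES = [
    ("gcg", looks_like_gcg),
    ("fasta", looks_like_fasta),
    ("text", looks_like_text),
]
KNOWN = dict(PROBES)


def resolve_format(text, specified=None):
    """Return the format name to use for text, or raise FormatError."""
    if specified is not None:
        name = specified.lower()
        probe = KNOWN.get(name)
        if probe is None:
            raise FormatError("unknown format: " + specified)
        if not probe(text):
            # A specified format must work; no alternatives are tried.
            raise FormatError("input is not in format: " + specified)
        return name
    for name, probe in PROBES:
        if probe(text):
            return name
    raise FormatError("no known format matched")


print(resolve_format(">id description\nACGT\n"))           # fasta
print(resolve_format("myseq  Length: 4  Check: 1 ..\nACGT"))  # gcg
```

An application would catch FormatError at the top level, print the
message and exit, which corresponds to the "error message and abort"
behaviour described above.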
Other points:
- Staden format is now known as "experiment file format" and has been
renamed "Experiment", although "Staden" still exists as an alternative
name.
- Most formats (including GCG for EMBOSS) allow multiple sequences
to be saved in one file. There is a command line option "-[no]ossingle"
to control writing to separate files where there is a choice.
- Additional multiple sequence input formats should include: Pfam,
ProDom, HMMER 2.0 and Geoff Barton's block format.
- Files of USAs are useful. These work like VMS lists with an
"@filename" syntax, as also used by GCG. You can also say
"list::filename", as "list" is treated as a special format. There is
a known problem: a format given on the command line is currently
ignored during list file processing.
- USAs are useful. The format syntax is reasonable but can be
changed if a good alternative is suggested.
We are still considering allowing a list of formats
with "," delimiters to be used as alternatives. This would also be
available for use as a value for "-sformat" and for format
control variables.
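The list file behaviour described above ("@filename" and
"list::filename") might be expanded along these lines. The names
expand_usa and read_file_lines are hypothetical, and two details are
assumptions of this sketch rather than documented behaviour: blank lines
are skipped, and a list file may itself refer to another list file.
There is no protection against a list file that refers to itself.

```python
def read_file_lines(path):
    """Default reader: return the lines of a list file on disk."""
    with open(path) as handle:
        return handle.readlines()


def expand_usa(usa, read_lines=read_file_lines):
    """Yield individual USAs, expanding '@filename' and 'list::filename'."""
    if usa.startswith("@"):
        listfile = usa[1:]
    elif usa.lower().startswith("list::"):
        listfile = usa[len("list::"):]
    else:
        # An ordinary USA is passed through unchanged.
        yield usa
        return
    for line in read_lines(listfile):
        entry = line.strip()
        if entry:  # skip blank lines (assumption)
            yield from expand_usa(entry, read_lines)


# Usage with an in-memory stand-in for the file system.
files = {
    "seqs.list": ["fasta::a.fa\n", "\n", "@more.list\n"],
    "more.list": ["embl::b.dat\n"],
}
print(list(expand_usa("list::seqs.list", files.__getitem__)))
# ['fasta::a.fa', 'embl::b.dat']
```

Passing the reader in as a parameter keeps the example runnable without
real files; by default it reads the named list file from disk.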