EMBOSS: C2 Output Formats
Any format can be used for output. Some formats simply write each
sequence as it arrives and have no problem if the output file is only
closed at the end of the program. Others, e.g. MSF, store multiple
sequences and write them only when the file is closed by a FileClose call.
This means that all applications should use the ajSeqWriteClose call
to make sure sequence output is flushed.
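The difference between the two kinds of format can be sketched in a
few lines of Python. This is only an illustration of the buffering
behaviour described above, not the EMBOSS implementation: the class
SeqWriter, its methods and the output layouts are hypothetical, and
treating CLUSTAL and PHYLIP as buffered alongside MSF is an assumption.

```python
import io


class SeqWriter:
    """Toy sequence writer (hypothetical; not the EMBOSS API)."""

    # Assumption: these formats hold every sequence until close().
    BUFFERED_FORMATS = {"msf", "clustal", "phylip"}

    def __init__(self, fmt, stream):
        self.fmt = fmt.lower()
        self.stream = stream
        self.pending = []  # sequences waiting for close()

    def write(self, name, residues):
        if self.fmt in self.BUFFERED_FORMATS:
            # Multiple-sequence format: nothing reaches the stream yet.
            self.pending.append((name, residues))
        else:
            # Simple format: write immediately (FASTA-like layout).
            self.stream.write(f">{name}\n{residues}\n")

    def close(self):
        # Flush buffered sequences together; harmless for simple formats.
        if self.pending:
            width = max(len(name) for name, _ in self.pending)
            for name, residues in self.pending:
                self.stream.write(f"{name:<{width}}  {residues}\n")
            self.pending = []


out = io.StringIO()
writer = SeqWriter("msf", out)
writer.write("seq1", "ACGT")
writer.write("sequence2", "ACGA")
before_close = out.getvalue()  # still empty: output is buffered
writer.close()
after_close = out.getvalue()   # both sequences, written together
```

If close() were never called in this sketch, the buffered sequences
would be lost, which is why the close call matters for every
application regardless of the format the user picks.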
Simple output formats:
- GCG. EMBOSS uses code from readseq for the checksum. GCG have
said that the algorithm is in the public domain.
- FASTA (also called PEARSON) with accession number and description
- NCBI (FASTA with NCBI formatted id and accession)
- PIR
- EMBL
- SWISS
- GENBANK
- ASN.1 (readseq's shortened version - perhaps not available as an
input format)
- FITCH
- IG
- NBRF (also called PIR)
- CODATA
- EXPERIMENT (also known as STADEN)
- TEXT
- STRIDER
- DEBUG (was TRACE), a detailed report of the sequence object, mainly
for debugging purposes.
Multiple output formats:
- MSF
- CLUSTAL or ALN
- PHYLIP
The aim is to include all formats that EMBOSS accepts for input
plus some formats intended for output only.
Issues:
- Some formats require data which may be unavailable. For example,
TEXT input from standard input has no ID, but EMBL output requires an
ID. The ID could take a default value or could be provided on the
command line as "-sentry" for use on output.
- The default "id:" can be set to EMBOSS or EMBOSS_00n for output
formats that need one.
- The default accession number can be any deleted or never used
accession number, to be agreed with the EBI. Obvious candidates are
M12345 and X00000.
It is not clear whether any format absolutely requires an accession
number.
- The "id" can be specified in an output USA. There is currently no
way to specify an accession number this way, unless we try NCBI
syntax of "id|accnum", but this seems more trouble than it is
worth.
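One way the default ID could be filled in is sketched below in Python.
This is an assumption-laden illustration rather than EMBOSS code: the
helper make_id_source is hypothetical, and the three-digit numbering is
only one reading of the "EMBOSS_00n" pattern mentioned above.

```python
import itertools


def make_id_source(prefix="EMBOSS"):
    """Return a function that fills in missing IDs (hypothetical helper)."""
    counter = itertools.count(1)

    def resolve(seq_id):
        # Keep an ID that came from the input or the command line.
        if seq_id:
            return seq_id
        # Otherwise invent a numbered placeholder such as EMBOSS_001.
        return f"{prefix}_{next(counter):03d}"

    return resolve


resolve_id = make_id_source()
ids = [resolve_id(name) for name in ["", "myseq", None, ""]]
```

Only sequences that arrive without an ID consume a number, so IDs
supplied by the user pass through unchanged.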
Other points:
- "debug" format is a report of the internal sequence object structure.
It is used for debugging and is not intended as a new sequence format.
- The output can be sent to a file or to stdout.
- Output specifications should use USA syntax to specify the
format and filename. Output specs now allow "format::file". We
tried to allow "format::file:id", but this has the problem that
"format:file" is also valid, with "format" read as the filename and
"file" as the entry ID. This is a very easy mistake to make. The
"id" is therefore not allowed on output, and the USA can be just
"format::file" (recommended) or "format:file" (to rescue careless
users).
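The two accepted spellings can be sketched as a small Python parser.
This is a minimal illustration, not the EMBOSS parser: the function
parse_output_usa is hypothetical, and both the handling of a bare
filename (format left unset) and the treatment of anything after the
first separator as part of the filename are assumptions.

```python
def parse_output_usa(usa):
    """Split an output USA into (format, filename).

    Accepts "format::file" and the lenient "format:file". An entry ID
    is never extracted; text after the first separator is the filename.
    """
    if "::" in usa:
        # Recommended form.
        fmt, filename = usa.split("::", 1)
    elif ":" in usa:
        # Lenient form for a single colon.
        fmt, filename = usa.split(":", 1)
    else:
        # Assumption: a bare filename leaves the format unset.
        fmt, filename = None, usa
    return fmt, filename


recommended = parse_output_usa("fasta::out.seq")
lenient = parse_output_usa("fasta:out.seq")
```

Because no ID is ever parsed, both spellings give the same format and
filename, so the single-colon form cannot be misread as "file:id".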