Sequence Formats


Sequences

Before reading the rest of this document, please note:
Microsoft WORD format is not a sequence format.

Sequences can be read and written in a variety of formats. These can be very confusing for users, but EMBOSS aims to make life easier by automatically recognising the sequence format on input.

That means that if you are converting from using another sequencing package to EMBOSS and you have your existing sequences in a format that is specific for that package, for example GCG format, you will have no problem reading them in.

If you don't hold your sequence in a recognised standard format, you will not be able to analyse your sequence easily.


What a sequence format is NOT

When we talk about 'sequence format' we are NOT talking about any sort of program-specific format like a word processor format or text formatting language, so we are not talking about things like: 'NOTEPAD', 'WORD', 'WORDPAD', 'PostScript', 'PDF', 'RTF', 'TeX', 'HTML'.

If you have somehow managed to type a sequence into a word-processor (!) you should:

Now, repeat after me:
Microsoft WORD format is not a sequence format

EMBOSS programs will not read in anything which is held in Microsoft WORD files.


What a sequence format IS

Sequence formats are ASCII TEXT.

They are the required arrangement of characters, symbols and keywords that specify what things such as the sequence, ID name, comments, etc. look like in the sequence entry and where in the entry the program should look to find them.

There are generally no hidden, unprintable 'control' characters in any sequence format (there are none in those that EMBOSS supports). All standard sequence formats can be printed out or viewed simply by displaying their file.


Why so many formats?

There are at least a couple of dozen sequence formats in existence at the moment. Some are much more common than others.

Formats were designed so as to be able to hold the sequence data and other information about the sequence.

Nearly every sequence analysis package written since programs were first used to read and write sequences has invented its own format. Except for EMBOSS.

Nearly every collection of sequences that dares call itself a database has stored its data in its own format.


Identification

A sequence does not require any sort of identification, but it certainly helps!

Most sequence formats include at least one form of ID name, usually placed somewhere at the top of the sequence format.

The simple format fasta has the ID name as the first word on its title line. For example the ID name 'xyz':

>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcatt
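
As a rough illustration of the rule above, the following Python sketch pulls the ID name out of a fasta title line. The function name parse_fasta_title is made up for this example and is not part of EMBOSS; it simply treats the first word after the '>' as the ID name and everything else as the comment.

```python
def parse_fasta_title(line):
    """Split a fasta title line into (ID name, comment).

    Hypothetical helper for illustration only; not an EMBOSS function.
    """
    if not line.startswith(">"):
        raise ValueError("a fasta title line must start with '>'")
    # Drop the '>' and split once on the first run of whitespace.
    parts = line[1:].strip().split(None, 1)
    id_name = parts[0] if parts else ""
    comment = parts[1] if len(parts) > 1 else ""
    return id_name, comment


id_name, comment = parse_fasta_title(">xyz some other comment")
print(id_name)
print(comment)
```

A title line with no comment, such as '>xyz', gives an empty comment string.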

IDs and Accessions

An entry in a database must have some way of being uniquely identified in that database. Most sequence databases have two such identifiers for each sequence - an ID name and an Accession number.

Why are there two such identifiers? The ID name was originally intended to be a human-readable name that had some indication of the function of its sequence. In EMBL and GenBank the first two (or three) letters indicated the species and the rest indicated the function, for example 'hsfau' is the 'Homo Sapiens FAU pseudogene'. This naming scheme started to be a problem when the number of entries added each day was so vast that people could not make up the ID names fast enough. Instead, the Accession numbers were used as the ID name. Therefore you will now find ID names like 'AF061303', the same as the Accession number for that sequence in EMBL.

ID names are not guaranteed to remain the same between different versions of a database (although in practice they usually do).

Accession numbers are unique alphanumeric identifiers that are guaranteed to remain with that sequence through the rest of the life of the database. If two sequences are merged into one, then the new sequence will get a new Accession number and the Accession numbers of the merged sequences will be retained as 'secondary' Accession numbers.

EMBL, GenBank and SwissProt share an Accession numbering scheme - an Accession number uniquely identifies a sequence within these three databases.


Annotation and Features

Most formats allow you to hold other description, annotation and comments, for example fasta format holds comments in the title line:

>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcatt

Other formats have specific fields for holding information such as references, keywords, associated entries in other databases and feature tables


The Sequence

Nucleotide (DNA or RNA) sequences are usually stored in the IUBMB standard codes.

Similarly, protein sequences are usually stored in the IUPAC standard one-letter codes.

For example, fasta format holds the sequence as anything after the '>' line until the next entry starts:

>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcatt

There are exceptions to these codes; for example, staden format uses non-standard ambiguity codes.
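
The fasta rule described above (the sequence is everything after the '>' line until the next entry starts) can be sketched in a few lines of Python. This is a simplified reader written for this document, not the code EMBOSS uses; the name read_fasta is an assumption, and no checking of the sequence characters is attempted.

```python
def read_fasta(text):
    """Return a list of (title, sequence) pairs from fasta-formatted text.

    Illustrative sketch only: each '>' line starts a new entry, and all
    following lines up to the next '>' line are joined into the sequence.
    """
    records = []
    title = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if title is not None:
                records.append((title, "".join(chunks)))
            title = line[1:]
            chunks = []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


example = """>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcatt
"""
for title, sequence in read_fasta(example):
    print(title, len(sequence))
```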


Sequence Database Formats

Some of the most widespread sequence formats apart from fasta are those used by the major sequence databases.


Sequence Files

Files can hold sequences in standard recognised formats.

Files can also hold sequences in non-standard unrecognisable ways. Do not expect EMBOSS to be able to read your sequences held in a word-processor format file. EMBOSS is not a word-processor!


Multiple sequences

Some sequence formats can hold multiple sequences in one file. The details of how multiple sequences are held in one file differ between formats, but they either allow many sequences to be concatenated one after the other, or they hold the sequences together in some sort of aligned set of sequences.

Other formats, such as gcg, plain and staden formats, can only hold one sequence per file. An attempt to concatenate several sequences in one file leaves a mess in which it is impossible to decide where the sequences start and end, or what is annotation and what is sequence.

These single-sequence formats therefore cause problems when there are multiple sequences to write out, because a single file containing multiple sequences in that format is invalid. When these formats are specified for output, an EMBOSS program will allow you to write many sequences to one file, but EMBOSS programs will not be able to reliably read in the resulting mess.

If you really wish to write multiple sequences out in formats that cannot cope with multiple sequences, you are advised to add the global qualifier -ossingle on the command line. This will force the EMBOSS program to ignore the given output file name and generate its own file names. One sequence will be written to each such file. These file names are made from the sequence ID name, with the name of the format as the extension (e.g. hsfau.gcg).

This is not ideal. Preferably, you should stay away from formats that can't cope with multiple sequences in a file.
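
To make the -ossingle naming rule concrete, here is a small Python sketch of how such file names could be derived from ID names. The function single_file_names is hypothetical, and lower-casing the ID name is an assumption based only on the 'hsfau.gcg' example above; the names EMBOSS actually generates may differ in detail.

```python
def single_file_names(id_names, fmt):
    """Build one output file name per sequence: <ID name>.<format>.

    Hypothetical helper. Lower-casing the ID name is an assumption
    made for this sketch, not documented EMBOSS behaviour.
    """
    return [f"{name.lower()}.{fmt}" for name in id_names]


for file_name in single_file_names(["hsfau", "AF061303"], "gcg"):
    print(file_name)
```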


Input Sequence Formats

To date, the following sequence formats are accepted as input.

By default (i.e. if no format is explicitly specified), EMBOSS tries each format in turn until one succeeds.
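
The idea of trying each format in turn can be sketched as follows. This is not how EMBOSS is implemented: the three recogniser functions are deliberately crude stand-ins invented for this example (they only look at how the text begins), whereas EMBOSS runs its real format readers. The point is only the control flow: the first format that succeeds wins.

```python
def is_fasta(text):
    # fasta entries start with a '>' title line.
    return text.lstrip().startswith(">")


def is_embl(text):
    # EMBL entries start with an "ID" line.
    return text.lstrip().startswith("ID ")


def is_genbank(text):
    # GenBank entries start with a "LOCUS" line.
    return text.lstrip().startswith("LOCUS")


# Order matters: each recogniser is tried in turn. These names and this
# ordering are assumptions for the sketch, not the EMBOSS internals.
RECOGNISERS = [("fasta", is_fasta), ("embl", is_embl), ("genbank", is_genbank)]


def guess_format(text):
    """Return the name of the first format that accepts the text, or None."""
    for name, recogniser in RECOGNISERS:
        if recogniser(text):
            return name
    return None


print(guess_format(">xyz some other comment\nttcctctttc\n"))
```

Because the first match wins, a file that happens to satisfy an early, permissive test is never offered to the later ones. That is one reason to name the format explicitly when the automatic choice is not the one you want.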

Input Format Comments
abi ABI trace file format. This is the format of file produced by ABI sequencing machines. It contains the 'trace data' i.e. the probabilities of the 4 bases along the sequencing run, together with the sequence, as deduced from that data. The sequence information is what is normally read in and used by EMBOSS programs, although the trace data is available and may be utilised by some specialised EMBOSS programs.
The code for this is heavily based on David Mathog's fortran library with a description of ABI trace file format (abi.txt):
ftp://saf.bio.caltech.edu/pub/software/molbio/abitools.zip
acedb ACeDB format
clustal
aln
ClustalW ALN (multiple alignment) format.
codata CODATA format.
dbid Odd FASTA format with Database name first, then ID name then an optional accession number eg:
>database name description
or
>database name accession description
embl
em
EMBL entry format, or at least a minimal subset of the fields. The Staden package and others use EMBL or similar formats for sequence data.
experiment The Staden package stores single sequencing experiment reads in a format derived from EMBL. All EMBL tags are allowed, plus many extras. Unusually, the extra tags are allowed to continue beyond the '//' line which only marks the end of the sequence. The "EX" experiment line is used to create a sequence description. Accuracy values are stored, or at least the largest value for each sequence position. To date no EMBOSS program is using these values.
fasta
ncbi
FASTA format with optional accession number and database name in NCBI style included as part of the sequence identifier. eg
>database|accession|id description
or
>name description
or
>name accession description

(and other variants on this theme!)
gcg
gcg8
GCG 9.x and 10.x format with the format and sequence type identified on the first line of the file. GCG 8.x format where anything up to the first line containing ".." is considered as heading, and the remainder is sequence data.
genbank
gb
ddbj
GENBANK entry format, including the feature table.
gff GFF format. Normally used as a pure feature format, but can hold the sequence as part of the structured header.
hennig86 Hennig86 format
ig IntelliGenetics format.
jackknifer Jackknifer format
jackknifernon Jackknifernon format
mega Mega format
meganon Meganon format
msf Wisconsin Package GCG's MSF multiple sequence format.
nbrf
pir
NBRF (PIR) format, as used in the PIR database sequence files. This format was used for some years as an interchange format with the reference data followed by the sequence data. This unofficial PIR format is what EMBOSS supports. If there is enough interest, we can also use NBRF database format with separate files for sequence (the main EMBOSS input/output) and for features. Documentation of this format is hard to find, but we do have a copy from PIR. The sequence files include the ID and description but no citation or feature information.
nexus
paup
Nexus/PAUP format
nexusnon
paupnon
Nexusnon/PAUPnon format
pearson FASTA format with no further processing of the "ID" eg:
>name description
Used where fasta or ncbi format interprets the ID in an unwanted way, this format skips the further ID parsing stage of reading these files.
pfam
stockholm
Pfam format
phylip
phylipnon
PHYLIP interleaved multiple alignment format.
raw Like text/plain format except that it removes any whitespace or digits, accepts only alphabetic characters and rejects anything else. This means that it is safer to use this format than plain format. If you have digits and spaces or TAB characters, these are removed and ignored. If you have other non-alphabetic characters (for example, punctuation characters), then the sequence will be rejected as erroneous. Gap characters, '-', and translated STOP codon characters '*' are legal.
selex SELEX format is used by Sean Eddy's HMMER package. It can store RNA secondary structure as part of the sequence annotation.
staden This format is actually obsolete, the latest version of the Staden package does not support it anymore (see "experiment" format for the new Staden package format). Staden format was just the sequence in simple text with, optionally, comments at any position in the sequence. When EMBOSS reads in "staden" format, it recognizes a comment at the top of the sequence as the sequence identifier and removes any comments inside the sequence. Some alternative nucleotide ambiguity codes are used and should be converted.
strider DNA Strider format
swissprot
swiss
sw
SWISSPROT entry format, or at least a minimal subset of the fields.
text
plain
Plain text. This is the format with no format. The whole of the file is read in as a sequence. No attempt is made to parse the file contents in any way.

Anything is acceptable in this format. This means that any character will be included in the sequence, even digits and punctuation. Use this format only when you are sure that the input sequence file is correct and contains only what you want to be considered as your 'sequence'.

treecon Treecon format
asis This is not so much a sequence format as a quick way of entering a sequence on the command line, but it is included here for completeness. Where a filename would normally be given, in asis format there is the sequence itself. An example would be:
asis::atacgcagttatctgaccat
In 'asis' format the name is the sequence so no file needs to be opened. This is a special case. It was intended as a joke, but could be quite useful for generating command lines.
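
The cleaning rule given for raw format in the table above (whitespace and digits are dropped, letters plus the gap '-' and STOP '*' characters are kept, anything else is an error) can be written out as a short Python sketch. The function name read_raw is made up for this example, and restricting letters to ASCII is an assumption; it shows the rule as described here rather than the actual EMBOSS code.

```python
def read_raw(text):
    """Apply the 'raw' format rule to a block of text.

    Illustrative sketch: whitespace and digits are removed, ASCII letters,
    '-' and '*' are kept, and any other character rejects the sequence.
    """
    kept = []
    for ch in text:
        if ch.isspace() or ch.isdigit():
            continue  # removed and ignored
        if (ch.isascii() and ch.isalpha()) or ch in "-*":
            kept.append(ch)
        else:
            raise ValueError(f"illegal character {ch!r} in raw sequence")
    return "".join(kept)


print(read_raw("  1 ttcctctttc tcgactccat\n 21 cttcgcgg-*\n"))
```

Text containing punctuation, for example 'acgt,acgt', raises an error instead of silently becoming part of the sequence, which is why raw is the safer choice compared with plain.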

Output Sequence Formats

To date, the following sequence formats are available as output.

Some sequence formats can hold multiple sequences in one file; these are marked as multiple in the following table.

Other formats, such as GCG, plain and staden formats, can only hold one sequence per file; these are marked as single.

Output Format Single/Multiple Comments
acedb multiple ACeDB format
asn1 multiple A subset of ASN.1 containing entry name, accession number, description and sequence, similar to the current ASN.1 output of readseq
clustal
aln
multiple Clustal multiple sequence format.
codata multiple CODATA format.
debug multiple EMBOSS sequence object report for debugging showing all available fields. Not all fields will contain data - this depends very much on the input format used.
embl
em
multiple EMBL entry format with available fields filled in and others with no information omitted. The EMBOSS command line allows missing data such as accession numbers to be provided if they are not obtainable from the input sequence.
fasta
pearson
multiple Standard Pearson FASTA format, but with the accession number included after the identifier if available.
fitch multiple Fitch format
gcg
gcg8
single Wisconsin Package GCG 9.x and 10.x format with the sequence type on the first line of the file. GCG 8.x format where anything up to the first line containing ".." is considered as heading, and the remainder is sequence data.
genbank
gb
multiple GENBANK entry format with available fields filled in and others with no information omitted. The EMBOSS command line allows missing data such as accession numbers to be provided if they are not obtainable from the input sequence.
gff multiple GFF format. Normally used as a pure feature format, but can hold the sequence as part of the structured header.
hennig86 multiple Hennig86 format
ig multiple Intelligenetics format, as used by the Intelligenetics package
jackknifer multiple Jackknifer format
jackknifernon multiple Jackknifernon format
mega multiple Mega format
meganon multiple Meganon format
msf multiple Wisconsin Package GCG's MSF multiple sequence format.
nbrf
pir
multiple NBRF (PIR) format, as used in the PIR database sequence files.
ncbi multiple NCBI style FASTA format with the database name, entry name and accession number separated by pipe ("|") characters.
nexus
paup
multiple Nexus/PAUP format
nexusnon
paupnon
multiple Nexusnon/PAUPnon format
phylip multiple PHYLIP interleaved format.
phylipnon multiple PHYLIP non-interleaved format that was used in Phylip version 3.2. Also called phylip3 for back compatibility with earlier EMBOSS versions.
selex multiple SELEX format.
staden single This format is actually obsolete, the latest version of the Staden package does not support it anymore. Staden format is just the sequence in simple text with, optionally, comments at any position in the sequence. When EMBOSS reads in "staden" format, it recognizes only a comment at the top of the sequence but considers comments inside the sequence as part of the sequence. Some alternative nucleotide ambiguity codes are used and must be converted.
strider multiple DNA strider format
swiss
sw
multiple SwissProt entry format with available fields filled in and others with no information omitted. The EMBOSS command line allows missing data such as accession numbers to be provided if they are not obtainable from the input sequence.
text
plain
raw
single Plain sequence, no annotation or heading.
treecon multiple Treecon format

Creating a sequence

When typing a sequence in to a sequence editor, such as mse, the sequence editor should save the sequence to a file in a recognised format.

If you are creating a sequence by typing it into a text editor, then the best format is probably fasta format. Simply start the entry with a title line. This title line starts with a > character followed by the ID name of the sequence then any other comments. Subsequent lines contain the sequence. Many sequence entries can follow each other in a single file.

If you are truly masochistic, you will have typed your sequence into a word-processor. Don't do it again! If you click on the 'File' button and then on 'Save As..' you should be able to save your sequence as 'Text'. If you are lucky, you now have a sequence in 'plain' format.


Changing the format

To convert the sequences in the file 'myfile.seq' into the format 'embl' in the new file 'myfile2.seq', run either:

seqret myfile.seq embl::myfile2.seq
or
seqret myfile.seq myfile2.seq -osf embl
('-osf' is an abbreviation for '-osformat')

These two commands are exactly equivalent.
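
If you drive EMBOSS from a script, the two forms can be assembled programmatically. The Python sketch below only builds and prints the two command lines shown above; the helper name seqret_commands is invented for this example. Actually running them (for instance with subprocess.run) assumes that seqret is installed and on your PATH, which the sketch does not check.

```python
import shlex


def seqret_commands(infile, outfile, fmt):
    """Return the two equivalent seqret command lines as argument lists.

    Hypothetical helper: it builds the commands but does not run them.
    """
    # Form 1: the format is given as a 'format::' prefix on the output file.
    prefix_form = ["seqret", infile, f"{fmt}::{outfile}"]
    # Form 2: the format is given with the -osformat qualifier.
    qualifier_form = ["seqret", infile, outfile, "-osformat", fmt]
    return prefix_form, qualifier_form


for argv in seqret_commands("myfile.seq", "myfile2.seq", "embl"):
    print(shlex.join(argv))
```

Keeping the arguments in a list rather than a single string avoids quoting problems if a file name contains spaces.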

Input sequence command-line qualifiers

There are other command-line qualifiers that change the behaviour of the sequence input.

  -sbegin	integer		first base used
  -send		integer		last base used, default=seq length
  -sreverse	boolean		reverse (if DNA)
  -sask		boolean		ask for begin/end/reverse
  -snucleotide	boolean		sequence is nucleotide
  -sprotein	boolean		sequence is protein
  -slower	boolean		make lower case
  -supper	boolean		make upper case
  -sformat	string		input sequence format
  -sopenfile	string		input filename
  -sdbname	string		database name
  -sid		string		entryname
  -ufo		string		UFO features
  -fformat	string		features format
  -fopenfile	string		features file name

Output sequence command-line qualifiers

There are other command-line qualifiers that change the behaviour of the sequence output.

  -osformat           string     output sequence file format
  -osextension        string     file name extension
  -osname             string     base file name
  -osdirectory        string     output sequence file directory
  -osdbname           string     database name to add
  -ossingle           boolean    create a separate output file for each entry
  -oufo               string     feature file to create
  -offormat           string     features format
  -ofname             string     features file name
  -ofdirectory        string     features output directory

Future directions

More formats, both for input and for output, can be easily added, so suggestions are always welcome.