Sequence Formats

Sequences

Before reading the rest of this document, please note:
Microsoft WORD format is not a sequence format.

Sequences can be read and written in a variety of formats. These can be very confusing for users, but EMBOSS aims to make life easier by automatically recognising the sequence format on input.

That means that if you are moving to EMBOSS from another sequence analysis package and your existing sequences are in a format specific to that package, for example GCG format, you will have no problem reading them in.

If you don't hold your sequence in a recognised standard format, you will not be able to analyse your sequence easily.


What a sequence format is NOT

When we talk about 'sequence format' we are NOT talking about any sort of program-specific format like a word processor format or text formatting language, so we are not talking about things like 'NOTEPAD', 'WORD', 'WORDPAD', 'PostScript', 'PDF', 'RTF', 'TeX' or 'HTML'.

If you have somehow managed to type a sequence into a word-processor (!) you should save it as plain text before trying to use it (see 'Creating a sequence' below).

Now, repeat after me:
Microsoft WORD format is not a sequence format

EMBOSS programs will not read in anything which is held in Microsoft WORD files.


What a sequence format IS

Sequence formats are ASCII TEXT.

They are the required arrangement of characters, symbols and keywords that specify what things such as the sequence, ID name, comments, etc. look like in the sequence entry and where in the entry the program should look to find them.

There are generally no hidden, unprintable 'control' characters in any sequence format (there are none in those that EMBOSS supports). All standard sequence formats can be printed out or viewed simply by displaying their file.
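
As a rough illustration of the "ASCII text, no control characters" rule, the Python sketch below checks whether a file's bytes are all printable ASCII or ordinary whitespace. The function name is hypothetical and this is not how EMBOSS itself validates input; it is only a quick sanity check you could run before blaming the sequence format.

```python
def is_plain_ascii_text(data: bytes) -> bool:
    """Return True if every byte is printable ASCII or common whitespace."""
    allowed_whitespace = {0x09, 0x0A, 0x0D}  # tab, line feed, carriage return
    return all(0x20 <= b <= 0x7E or b in allowed_whitespace for b in data)


# A FASTA entry is plain text; a binary word-processor file is not.
print(is_plain_ascii_text(b">xyz some other comment\nttcctctttctc\n"))
print(is_plain_ascii_text(b"\xd0\xcf\x11\xe0 binary document bytes"))
```

A file that fails this check is almost certainly not in any of the formats described here.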


Why so many formats?

There are at least a couple of dozen sequence formats in existence at the moment. Some are much more common than others.

Formats were designed to hold the sequence data together with other information about the sequence.

Nearly every sequence analysis package written since programs were first used to read and write sequences has invented its own format. EMBOSS is the exception.

Nearly every collection of sequences that dares call itself a database has stored its data in its own format.


Identification

A sequence does not require any sort of identification, but it certainly helps!

Most sequence formats include at least one form of ID name, usually placed somewhere at the top of the sequence format.

The simple format fasta has the ID name as the first word on its title line. For example the ID name 'xyz':

>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcatt
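
A minimal Python sketch of the rule above: the ID name is the first word after the '>' and the rest of the line is comment. The function name is made up for illustration, and real readers (including EMBOSS) may apply further parsing to the ID.

```python
def parse_fasta_title(line: str) -> tuple[str, str]:
    """Split a FASTA title line into (ID name, remaining comment)."""
    if not line.startswith(">"):
        raise ValueError("a FASTA title line must start with '>'")
    # The first whitespace-delimited word is the ID name; the rest is comment.
    parts = line[1:].strip().split(maxsplit=1)
    id_name = parts[0] if parts else ""
    comment = parts[1] if len(parts) > 1 else ""
    return id_name, comment


print(parse_fasta_title(">xyz some other comment"))  # ('xyz', 'some other comment')
```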

IDs and Accessions

An entry in a database must have some way of being uniquely identified in that database. Most sequence databases have two such identifiers for each sequence - an ID name and an Accession number.

Why are there two such identifiers? The ID name was originally intended to be a human-readable name that had some indication of the function of its sequence. In EMBL and GenBank the first two (or three) letters indicated the species and the rest indicated the function, for example 'hsfau' is the 'Homo sapiens FAU pseudogene'. This naming scheme started to be a problem when the number of entries added each day was so vast that people could not make up the ID names fast enough. Instead, the Accession numbers were used as the ID name. Therefore you will now find ID names like 'AF061303', the same as the Accession number for that sequence in EMBL.

ID names are not guaranteed to remain the same between different versions of a database (although in practice they usually do).

Accession numbers are unique alphanumeric identifiers that are guaranteed to remain with that sequence through the rest of the life of the database. If two sequences are merged into one, then the new sequence will get a new Accession number and the Accession numbers of the merged sequences will be retained as 'secondary' Accession numbers.

EMBL, GenBank and SwissProt share an Accession numbering scheme - an Accession number uniquely identifies a sequence within these three databases.


Annotation and Features

Most formats can also hold description, annotation and comments. For example, fasta format holds comments in the title line:

>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcatt

Other formats have specific fields for holding information such as references, keywords, associated entries in other databases and feature tables.


The Sequence

Nucleotide (DNA or RNA) sequences are usually stored in the IUBMB standard codes.

Similarly, protein sequences are usually stored in the IUPAC standard one-letter codes.

For example, fasta format holds the sequence as anything after the '>' line until the next entry starts:

>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcatt
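
The "everything after the '>' line until the next entry" rule can be sketched in Python as follows. This is an illustrative reader with a hypothetical name, not the EMBOSS implementation; it assumes blank lines and any text before the first '>' can be ignored.

```python
from typing import Iterator


def read_fasta(text: str) -> Iterator[tuple[str, str]]:
    """Yield (title, sequence) for each entry in FASTA-formatted text."""
    title = None
    chunks: list[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no information
        if line.startswith(">"):
            # A new title line ends the previous entry, if there was one.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:]
            chunks = []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


example = ">xyz some other comment\nttcctctttc\ntcgactccat\n>abc\nacgt\n"
for entry_title, sequence in read_fasta(example):
    print(entry_title, sequence)
```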

There are exceptions to these codes: for example, staden format uses non-standard ambiguity codes.


Sequence Database Formats

Some of the most widespread sequence formats apart from fasta are those used by the major sequence databases.


Sequence Files

Files can hold sequences in standard recognised formats.

Files can also hold sequences in non-standard unrecognisable ways. Do not expect EMBOSS to be able to read your sequences held in a word-processor format file. EMBOSS is not a word-processor!


Multiple sequences

Some sequence formats can hold multiple sequences in one file. The details of how many sequences are held in one file differ between formats, but they either allow many sequences to be concatenated one after the other, or they hold the sequences together in some sort of aligned set of sequences.

Other formats, such as gcg, plain and staden formats can only hold one sequence per file. An attempt to concatenate several sequences in one file leaves the results as a mess that makes it impossible to decide where the sequences start and end or what is annotation and what is sequence.

These single-sequence formats therefore cause problems when there are multiple sequences to write out, because a single file containing multiple sequences in that format is invalid. When these formats are specified for output, an EMBOSS program will allow you to write many sequences to one file, but EMBOSS programs will not be able to reliably read in the resulting mess.

If you really wish to write multiple sequences out in formats that cannot cope with multiple sequences, you are advised to add the global qualifier -ossingle on the command line. This will force the EMBOSS program to ignore the given output file name and will generate its own file names. One sequence will be written to each such file. These file names are made from the sequence ID name, with the name of the format as the extension (e.g. hsfau.gcg).
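
The naming rule for these files can be sketched in Python as below. The function name is hypothetical, and the sketch assumes the ID name is used unchanged; whether EMBOSS alters case or replaces awkward characters is not stated here.

```python
def ossingle_filenames(id_names: list[str], fmt: str) -> list[str]:
    """Build one output file name per sequence: '<ID name>.<format>'."""
    return [f"{id_name}.{fmt}" for id_name in id_names]


print(ossingle_filenames(["hsfau", "xyz"], "gcg"))  # ['hsfau.gcg', 'xyz.gcg']
```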

This is not ideal. Preferably, you should stay away from formats that can't cope with multiple sequences in a file.


Input Sequence Formats

To date, the following sequence formats are accepted as input.

By default (i.e. if no format is explicitly specified), EMBOSS tries each format in turn until one succeeds.
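
The "try each format in turn" idea can be sketched in Python as follows. Everything here is an assumption made for illustration: the two toy parsers, their names and their order are hypothetical, and the real tests EMBOSS applies to each format are not described in this document.

```python
from typing import Callable, Optional

# A parser returns the sequence on success, or None when the text does not fit.
Parser = Callable[[str], Optional[str]]


def try_fasta(text: str) -> Optional[str]:
    """Toy FASTA check: a '>' title line followed by sequence lines (first entry only)."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        return None
    seq_lines = []
    for line in lines[1:]:
        if line.startswith(">"):
            break
        seq_lines.append(line.strip())
    return "".join(seq_lines)


def try_letters_only(text: str) -> Optional[str]:
    """Toy fallback: accept text that is only letters once whitespace is removed."""
    seq = "".join(text.split())
    return seq if seq.isalpha() else None


# Hypothetical order; stricter formats go first so looser ones cannot shadow them.
PARSERS: list[tuple[str, Parser]] = [
    ("fasta", try_fasta),
    ("letters-only", try_letters_only),
]


def detect_and_read(text: str, parsers: list[tuple[str, Parser]] = PARSERS) -> tuple[str, str]:
    """Try each parser in turn and return (format name, sequence) for the first success."""
    for name, parser in parsers:
        sequence = parser(text)
        if sequence is not None:
            return name, sequence
    raise ValueError("input did not match any known format")


print(detect_and_read(">xyz some other comment\nttcctc\n"))
```

The order matters in a scheme like this, which is one reason for naming the format explicitly when automatic recognition picks the wrong one.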

Input Format          Auto  Nuc  Pro  Feat  Gap  Multi  Description
gcg, gcg8             Yes   Yes  Yes  No    Yes  No     GCG 9.x and 10.x format with the format and sequence type identified on the first line of the file. GCG 8.x format where anything up to the first line containing ".." is considered as heading, and the remainder is sequence data.
embl, em              Yes   Yes  No   Yes   Yes  No     EMBL entry format, including all the fields in the latest release format. The Staden package and others use EMBL or similar formats for sequence data.
swiss, sw, swissprot  Yes   No   Yes  Yes   Yes  No     SWISSPROT entry format, including all the fields in the latest release format.
nbrf, pir             Yes   Yes  Yes  Yes   Yes  No     NBRF (PIR) format, as used in the PIR database sequence files. This format was used for some years as an interchange format with the reference data followed by the sequence data. This unofficial PIR format is what EMBOSS supports. If there is enough interest, we can also use NBRF database format with separate files for sequence (the main EMBOSS input/output) and for features. Documentation of this format is hard to find, but we do have a copy from PIR. The sequence files include the ID and description but no citation or feature information.
pdb                   Yes   No   Yes  No    No   No     PDB protein databank format ATOM lines
pdbseq                Yes   No   Yes  No    No   No     PDB protein databank format SEQRES lines
pdbnuc                No    Yes  No   No    No   No     PDB protein databank format nucleotide ATOM lines
pdbnucseq             No    Yes  No   No    No   No     PDB protein databank format nucleotide SEQRES lines
fasta, ncbi           Yes   Yes  Yes  No    Yes  No     FASTA format with optional accession number and database name in NCBI style included as part of the sequence identifier, e.g.
                                                        >database|accession|id description
                                                        or
                                                        >name description
                                                        or
                                                        >name accession description
gifasta               No    Yes  Yes  No    Yes  No     FASTA format including NCBI-style GIs (alias)
pearson               Yes   Yes  Yes  No    Yes  No     FASTA format with no further processing of the "ID", e.g.
                                                        >name description
                                                        Used where fasta or ncbi format interprets the ID in an unwanted way; this format skips the further ID parsing stage of reading these files.
fastq                 Yes   Yes  No   No    No   No     FASTQ short read format ignoring quality scores
fastq-sanger          No    Yes  No   No    No   No     FASTQ short read format with phred quality
fastq-illumina        No    Yes  No   No    No   No     FASTQ Illumina 1.3 short read format
fastq-solexa          No    Yes  No   No    No   No     FASTQ Solexa/Illumina 1.0 short read format
genbank, gb, ddbj     Yes   Yes  No   Yes   Yes  No     GENBANK entry format, including the feature table.
refseqp               No    No   Yes  Yes   Yes  No     Refseq protein entry format
genpept               No    No   Yes  Yes   Yes  No     Refseq protein entry format (alias)
codata                Yes   Yes  Yes  Yes   Yes  No     Codata entry format
strider               Yes   Yes  Yes  No    Yes  No     DNA strider output format
clustal, aln          Yes   Yes  Yes  No    Yes  No     ClustalW ALN (multiple alignment) format.
phylip                Yes   Yes  Yes  No    Yes  Yes    Phylip interleaved and non-interleaved formats
phylipnon             No    Yes  Yes  No    Yes  Yes    Phylip non-interleaved format
acedb                 Yes   Yes  Yes  No    Yes  No     ACEDB sequence format
dbid                  No    Yes  Yes  No    Yes  No     FASTA format variant with database name first, then ID name, then an optional accession number, e.g.
                                                        >database name description
                                                        or
                                                        >database name accession description
msf                   Yes   Yes  Yes  No    Yes  No     Wisconsin Package GCG MSF (multiple sequence file) file format
hennig86              Yes   Yes  Yes  No    Yes  No     Hennig86 output format
jackknifer            Yes   Yes  Yes  No    Yes  No     Jackknifer interleaved and non-interleaved formats
nexus, paup           Yes   Yes  Yes  No    Yes  No     Nexus/paup interleaved format
treecon               Yes   Yes  Yes  No    Yes  No     Treecon output format
mega                  Yes   Yes  Yes  No    Yes  No     Mega interleaved and non-interleaved formats
igstrict              Yes   Yes  Yes  No    Yes  No     Intelligenetics sequence format strict parser
ig                    No    Yes  Yes  No    Yes  No     Intelligenetics sequence format
staden                No    Yes  Yes  No    Yes  No     This format is actually obsolete; the latest version of the Staden package does not support it anymore (see "experiment" format for the new Staden package format). Staden format was just the sequence in simple text with, optionally, comments at any position in the sequence. When EMBOSS reads in "staden" format, it recognizes a comment at the top of the sequence as the sequence identifier and removes any comments inside the sequence. Some alternative nucleotide ambiguity codes are used and should be converted.
text, plain           No    Yes  Yes  No    Yes  No     Plain text. This is the format with no format. The whole of the file is read in as a sequence. No attempt is made to parse the file contents in any way.
                                                        Anything is acceptable in this format. This means that any character will be included in the sequence, even digits and punctuation. Use this format only when you are sure that the input sequence file is correct and contains only what you want to be considered as your 'sequence'.
gff2                  Yes   Yes  Yes  Yes   Yes  No     GFF feature file with sequence in the header. Normally used as a pure feature format, but can hold the sequence as part of the structured header.
gff3, gff             Yes   Yes  Yes  Yes   Yes  No     GFF3 feature file with sequence
stockholm, pfam       Yes   Yes  Yes  No    Yes  No     Stockholm (pfam) format
selex                 No    Yes  Yes  No    Yes  No     SELEX format is used by Sean Eddy's HMMER package. It can store RNA secondary structure as part of the sequence annotation.
fitch                 Yes   Yes  Yes  No    Yes  No     Fitch program format
mase                  No    Yes  Yes  No    Yes  No     Mase program format
raw                   Yes   Yes  Yes  No    No   No     Like text/plain format except that it removes any whitespace or digits, accepts only alphabetic characters and rejects anything else. This means that it is safer to use this format than plain format. If you have digits and spaces or TAB characters, these are removed and ignored. If you have other non-alphabetic characters (for example, punctuation characters), then the sequence will be rejected as erroneous. Gap characters, '-', and translated STOP codon characters '*' are legal.
experiment            Yes   Yes  Yes  No    Yes  No     The Staden package stores single sequencing experiment reads in a format derived from EMBL. All EMBL tags are allowed, plus many extras. Unusually, the extra tags are allowed to continue beyond the '//' line which only marks the end of the sequence. The "EX" experiment line is used to create a sequence description. Accuracy values are stored, or at least the largest value for each sequence position. To date no EMBOSS program is using these values.
abi                   Yes   Yes  Yes  No    Yes  No     ABI trace file format. This is the format of file produced by ABI sequencing machines. It contains the 'trace data', i.e. the probabilities of the 4 bases along the sequencing run, together with the sequence, as deduced from that data. The sequence information is what is normally read in and used by EMBOSS programs, although the trace data is available and may be utilised by some specialised EMBOSS programs.
                                                        The code for this is heavily based on David Mathog's fortran library with a description of ABI trace file format (abi.txt):
                                                        ftp://saf.bio.caltech.edu/pub/software/molbio/abitools.zip
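
The NCBI-style title line listed for the fasta and ncbi formats in the table above can be illustrated with a short Python sketch. The function name and the example identifiers are hypothetical. Only the pipe-separated form and the plain ">name description" form are handled; the ">name accession description" form is left out because this document does not say how an accession is told apart from the first word of the description.

```python
def parse_ncbi_style_title(line: str) -> dict[str, str]:
    """Parse '>database|accession|id description' or '>name description'."""
    if not line.startswith(">"):
        raise ValueError("a FASTA title line must start with '>'")
    identifier, _, description = line[1:].strip().partition(" ")
    fields = identifier.split("|")
    if len(fields) == 3:
        database, accession, name = fields
    else:
        # Not the three-field form: treat the whole first word as the name.
        database, accession, name = "", "", identifier
    return {
        "database": database,
        "accession": accession,
        "id": name,
        "description": description.strip(),
    }


print(parse_ncbi_style_title(">mydb|ACC00001|myseq an example description"))
print(parse_ncbi_style_title(">xyz some other comment"))
```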

Special Format        Description
asis                  This is not so much a sequence format as a quick way of entering a sequence on the command line, but it is included here for completeness. Where a filename would normally be given, in asis format there is the sequence itself. An example would be:
                      asis::atacgcagttatctgaccat
                      In 'asis' format the name is the sequence so no file needs to be opened. This is a special case. This syntax can be very useful for generating command lines.
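
The cleaning rules described for the 'raw' input format above (drop whitespace and digits, keep letters plus '-' and '*', reject anything else) can be sketched in Python. This is an illustration of the stated rules under a hypothetical function name, not the EMBOSS code; treating only ASCII letters as "alphabetic" is an assumption.

```python
def read_raw(text: str) -> str:
    """Apply the 'raw' format rules to text and return the cleaned sequence."""
    kept = []
    for ch in text:
        if ch.isspace() or ch.isdigit():
            continue  # whitespace and digits are removed and ignored
        if (ch.isascii() and ch.isalpha()) or ch in "-*":
            kept.append(ch)  # letters, gaps and STOP characters are legal
        else:
            raise ValueError(f"character {ch!r} is not allowed in raw format")
    return "".join(kept)


print(read_raw("  1 acgtacgt\n  9 tt-a*\n"))  # acgtacgttt-a*
```

By contrast, 'plain' format would keep every character of the same input, digits and spaces included.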

Output Sequence Formats

To date, the following sequence formats are available as output.

Some sequence formats can hold multiple sequences in one file; these are marked as multiple in the following table.

Other formats, such as GCG, plain and staden formats, can only hold one sequence per file; these are marked as single.

Output Format              Single  Save  Nuc  Pro  Feat  Gap  Multi  Description
gcg, gcg8                  No      No    Yes  Yes  No    Yes  No     Wisconsin Package GCG 9.x and 10.x format with the sequence type on the first line of the file. GCG 8.x format where anything up to the first line containing ".." is considered as heading, and the remainder is sequence data.
embl, em, emblnew          No      No    Yes  No   Yes   Yes  No     EMBL entry format with available fields filled in and others with no information omitted. The EMBOSS command line allows missing data such as accession numbers to be provided if they are not obtainable from the input sequence.
swiss, sw, swissprot,      No      No    No   Yes  Yes   Yes  No     Swissprot entry format with available fields filled in and others with no information omitted. The EMBOSS command line allows missing data such as accession numbers to be provided if they are not obtainable from the input sequence.
swissnew, swnew,
swissprotnew
fasta, pearson             No      No    Yes  Yes  No    Yes  No     Standard Pearson FASTA format, but with the accession number included after the identifier if available.
ncbi                       No      No    Yes  Yes  No    Yes  No     NCBI style FASTA format with the database name, entry name and accession number separated by pipe ("|") characters.
gifasta                    No      No    Yes  Yes  No    Yes  No     NCBI fasta format with NCBI-style IDs using GI number
nbrf, pir                  No      No    Yes  Yes  Yes   Yes  No     NBRF/PIR entry format, as used in the PIR database sequence files.
genbank, gb, ddbj, refseq  No      No    Yes  No   No    Yes  No     GENBANK entry format with available fields filled in and others with no information omitted. The EMBOSS command line allows missing data such as accession numbers to be provided if they are not obtainable from the input sequence.
gff2                       No      No    Yes  Yes  Yes   Yes  No     GFF format. Normally used as a pure feature format, but can hold the sequence as part of the structured header.
gff3, gff                  No      No    Yes  Yes  Yes   Yes  No     GFF3 feature file with sequence in FASTA format after the features
ig                         No      No    Yes  Yes  No    Yes  No     Intelligenetics sequence format, as used by the Intelligenetics package
codata                     No      No    Yes  Yes  No    Yes  No     Codata entry format
strider                    No      No    Yes  Yes  No    Yes  No     DNA strider output format
acedb                      No      No    Yes  Yes  No    Yes  No     ACEDB sequence format
experiment                 No      No    Yes  Yes  No    Yes  No     Staden experiment file
staden                     No      No    Yes  Yes  No    Yes  No     Old staden package sequence format. This format is actually obsolete; the latest version of the Staden package does not support it anymore. Staden format is just the sequence in simple text with, optionally, comments at any position in the sequence. When EMBOSS reads in "staden" format, it recognizes only a comment at the top of the sequence but considers comments inside the sequence as part of the sequence. Some alternative nucleotide ambiguity codes are used and must be converted.
text, plain, raw           No      No    Yes  Yes  No    Yes  No     Plain sequence, no annotation or heading.
fitch                      No      No    Yes  Yes  No    Yes  No     Fitch program format
msf                        No      Yes   Yes  Yes  No    Yes  No     Wisconsin Package GCG MSF (multiple sequence file) file format
clustal, aln               No      Yes   Yes  Yes  No    Yes  No     ClustalW multiple alignment format
selex                      No      Yes   Yes  Yes  No    Yes  No     Selex format
phylip                     No      Yes   Yes  Yes  No    Yes  Yes    Phylip interleaved format
phylipnon, phylip3         No      Yes   Yes  Yes  No    Yes  No     PHYLIP non-interleaved format that was used in Phylip version 3.2. Also called phylip3 for back compatibility with earlier EMBOSS versions.
asn1                       No      No    Yes  Yes  No    Yes  No     A subset of NCBI ASN.1 containing entry name, accession number, description and sequence, similar to the current ASN.1 output of readseq
hennig86                   No      Yes   Yes  Yes  No    Yes  No     Hennig86 output format
mega                       No      Yes   Yes  Yes  No    Yes  No     Mega interleaved output format
meganon                    No      Yes   Yes  Yes  No    Yes  No     Mega non-interleaved output format
nexus, paup                No      Yes   Yes  Yes  No    Yes  No     Nexus/paup interleaved format
nexusnon, paupnon          No      Yes   Yes  Yes  No    Yes  No     Nexus/paup non-interleaved format
jackknifer                 No      Yes   Yes  Yes  No    Yes  No     Jackknifer output interleaved format
jackknifernon              No      Yes   Yes  Yes  No    Yes  No     Jackknifer output non-interleaved format
treecon                    No      Yes   Yes  Yes  No    Yes  No     Treecon output format
mase                       No      No    Yes  Yes  No    Yes  No     Mase program format
dasdna                     No      No    Yes  No   No    Yes  No     DASDNA DAS nucleotide-only sequence
das                        No      No    Yes  Yes  No    Yes  No     DASSEQUENCE DAS any sequence
fastq-sanger, fastq        No      No    Yes  No   No    No   No     FASTQ short read format with phred quality
fastq-illumina             No      No    Yes  No   No    No   No     FASTQ Illumina 1.3 short read format
fastq-solexa               No      No    Yes  No   No    No   No     FASTQ Solexa/Illumina 1.0 short read format
debug                      No      No    Yes  Yes  No    Yes  No     EMBOSS sequence object report for debugging showing all available fields. Not all fields will contain data; this depends very much on the input format used.

Creating a sequence

When typing a sequence into a sequence editor, such as mse, the sequence editor should save the sequence to a file in a recognised format.

If you are creating a sequence by typing it into a text editor, then the best format is probably fasta format. Simply start the entry with a title line. This title line starts with a > character followed by the ID name of the sequence then any other comments. Subsequent lines contain the sequence. Many sequence entries can follow each other in a single file.
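
If you would rather generate such an entry than type it, the layout just described can be sketched in Python. The function name is hypothetical, and the 60-character line width is an assumption taken from the example entries earlier in this document rather than a requirement of the format.

```python
def format_fasta(id_name: str, sequence: str, comment: str = "", width: int = 60) -> str:
    """Return one FASTA entry: a '>' title line followed by wrapped sequence lines."""
    title = f">{id_name} {comment}".rstrip()
    # Break the sequence into fixed-width lines for readability.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([title, *lines]) + "\n"


entry = format_fasta("xyz", "ttcctctttctcgactccatcttcgcgg", "some other comment", width=10)
print(entry, end="")
```

Several entries produced this way can simply be written one after another into the same file.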

If you are truly masochistic, you will have typed your sequence into a word-processor. Don't do it again! If you click on the 'File' button and then on 'Save As..' you should be able to save your sequence as 'Text'. If you are lucky, you now have a sequence in 'plain' format.


Changing the format

To convert the sequences in the file 'myfile.seq' into the format 'embl' in the new file 'myfile2.seq', run either:

seqret myfile.seq embl::myfile2.seq
or
seqret myfile.seq myfile2.seq -osf embl
('-osf' is an abbreviation for '-osformat')

These two commands are exactly equivalent.
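
The 'format::name' notation used in the first command (and in 'asis::...' above) can be illustrated with a small Python sketch that separates the optional format prefix from the rest. The function name is hypothetical, and the sketch covers only the '::' prefix shown in this document, not any other part of the EMBOSS sequence address syntax.

```python
from typing import Optional


def split_format_prefix(spec: str) -> tuple[Optional[str], str]:
    """Split 'format::rest' into (format, rest); the format is None if there is no prefix."""
    fmt, separator, rest = spec.partition("::")
    if separator:
        return fmt, rest
    return None, spec


print(split_format_prefix("embl::myfile2.seq"))  # ('embl', 'myfile2.seq')
print(split_format_prefix("myfile.seq"))         # (None, 'myfile.seq')
```

Seen this way, 'embl::myfile2.seq' carries the same two pieces of information as 'myfile2.seq -osf embl'.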

Input sequence command-line qualifiers

There are other command-line qualifiers that change the behaviour of the sequence input.

  -sbegin	integer		first base used
  -send		integer		last base used, default=seq length
  -sreverse	boolean		reverse (if DNA)
  -sask		boolean		ask for begin/end/reverse
  -snucleotide	boolean		sequence is nucleotide
  -sprotein	boolean		sequence is protein
  -slower	boolean		make lower case
  -supper	boolean		make upper case
  -sformat	string		input sequence format
  -sopenfile	string		input filename
  -sdbname	string		database name
  -sid		string		entryname
  -ufo		string		UFO features
  -fformat	string		features format
  -fopenfile	string		features file name
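
To make the effect of -sbegin, -send, -sreverse, -slower and -supper concrete, here is a Python sketch with a hypothetical function name. It rests on several assumptions that are not spelled out in the list above: positions are 1-based and inclusive, 'reverse' means reverse-complement for DNA, the reversal is applied to the selected range, and only a, c, g and t are complemented.

```python
# Assumption: only the four unambiguous bases are complemented in this sketch.
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def apply_input_qualifiers(seq, sbegin=1, send=None, sreverse=False,
                           slower=False, supper=False):
    """Mimic the described effect of the basic input qualifiers on a sequence string."""
    if send is None:
        send = len(seq)  # default: last base of the sequence
    region = seq[sbegin - 1:send]  # 1-based, inclusive range
    if sreverse:
        region = region.translate(COMPLEMENT)[::-1]
    if slower:
        region = region.lower()
    if supper:
        region = region.upper()
    return region


print(apply_input_qualifiers("acgtacgtaa", sbegin=2, send=5))                 # cgta
print(apply_input_qualifiers("acgtacgtaa", sbegin=2, send=5, sreverse=True))  # tacg
```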

Output sequence command-line qualifiers

There are other command-line qualifiers that change the behaviour of the sequence output.

  -osformat           string     output sequence file format
  -osextension        string     file name extension
  -osname             string     base file name
  -osdirectory        string     output sequence file directory
  -osdbname           string     database name to add
  -ossingle           boolean    create a separate output file for each entry
  -oufo               string     feature file to create
  -offormat           string     features format
  -ofname             string     features file name
  -ofdirectory        string     features output directory

Future directions

More formats, both for input and for output, can be easily added, so suggestions are always welcome.