Before reading the rest of this document, please note:
Microsoft WORD format is not a sequence format.
Sequences can be read and written in a variety of formats. These can be very confusing for users, but EMBOSS aims to make life easier by automatically recognising the sequence format on input.
That means that if you are converting from using another sequencing package to EMBOSS and you have your existing sequences in a format that is specific for that package, for example GCG format, you will have no problem reading them in.
If you don't hold your sequence in a recognised standard format, you will not be able to analyse your sequence easily.
When we talk about 'sequence format' we are NOT talking about any sort of program-specific format like a word processor format or text formatting language, so we are not talking about things like: 'NOTEPAD', 'WORD', 'WORDPAD', 'PostScript', 'PDF', 'RTF', 'TeX', 'HTML'.
If you have somehow managed to type a sequence into a word-processor (!) you should:
Now, repeat after me:
Microsoft WORD format is not a sequence format
EMBOSS programs will not read in anything which is held in Microsoft WORD files.
Sequence formats are ASCII TEXT.
They are the required arrangement of characters, symbols and keywords that specify what things such as the sequence, ID name, comments, etc. look like in the sequence entry and where in the entry the program should look to find them.
There are generally no hidden, unprintable 'control' characters in any sequence format (there are none in those that EMBOSS supports). All standard sequence formats can be printed out or viewed simply by displaying their file.
There are at least a couple of dozen sequence formats in existence at the moment. Some are much more common than others.
Formats were designed so as to be able to hold the sequence data and other information about the sequence.
Nearly every sequence analysis package written since programs were first used to read and write sequences has invented its own format. Except for EMBOSS.
Nearly every collection of sequences that dares call itself a database has stored its data in its own format.
A sequence does not require any sort of identification, but it certainly helps!
Most sequence formats include at least one form of ID name, usually placed somewhere at the top of the sequence format.
The simple format fasta has the ID name as the first word on its title line. For example the ID name 'xyz':
    >xyz some other comment
    ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
    agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
    cccagatcaaggctcatgtagcctcactggagggcatt
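As a rough illustration of the rule above (the ID name is the first word on the title line), here is a minimal Python sketch. The function name `split_title_line` is hypothetical and is not part of EMBOSS; it only shows how the ID name and the remaining comment might be separated.

```python
def split_title_line(title):
    """Split a FASTA title line into (id_name, comment)."""
    if not title.startswith(">"):
        raise ValueError("a FASTA title line must start with '>'")
    # The ID name is the first whitespace-delimited word after '>';
    # everything after it is treated as free-text comment.
    parts = title[1:].strip().split(None, 1)
    id_name = parts[0] if parts else ""
    comment = parts[1] if len(parts) > 1 else ""
    return id_name, comment


print(split_title_line(">xyz some other comment"))
```

For the title line used in this document, the sketch separates 'xyz' from 'some other comment'.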
An entry in a database must have some way of being uniquely identified in that database. Most sequence databases have two such identifiers for each sequence - an ID name and an Accession number.
Why are there two such identifiers? The ID name was originally intended to be a human-readable name that had some indication of the function of its sequence. In EMBL and GenBank the first two (or three) letters indicated the species and the rest indicated the function, for example 'hsfau' is the 'Homo sapiens FAU pseudogene'. This naming scheme started to be a problem when the number of entries added each day was so vast that people could not make up the ID names fast enough. Instead, the Accession numbers were used as the ID name. Therefore you will now find ID names like 'AF061303', the same as the Accession number for that sequence in EMBL.
ID names are not guaranteed to remain the same between different versions of a database (although in practice they usually do).
Accession numbers are unique alphanumeric identifiers that are guaranteed to remain with that sequence through the rest of the life of the database. If two sequences are merged into one, then the new sequence will get a new Accession number and the Accession numbers of the merged sequences will be retained as 'secondary' Accession numbers.
EMBL, GenBank and SwissProt share an Accession numbering scheme - an Accession number uniquely identifies a sequence within these three databases.
Most formats allow you to hold other description, annotation and comments, for example fasta format holds comments in the title line:
    >xyz some other comment
    ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
    agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
    cccagatcaaggctcatgtagcctcactggagggcatt
Other formats have specific fields for holding information such as references, keywords, associated entries in other databases and feature tables
Nucleotide (DNA or RNA) sequences are usually stored in the IUBMB standard codes.
Similarly, protein sequences are usually stored in the IUPAC standard one-letter codes.
For example, fasta format holds the sequence as anything after the '>' line until the next entry starts:
    >xyz some other comment
    ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
    agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
    cccagatcaaggctcatgtagcctcactggagggcatt
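The layout described above (a '>' title line, then sequence lines until the next '>' line) can be sketched in a few lines of Python. This is an illustrative reader only, not the EMBOSS parser; the names `read_fasta` and `_make_entry` are hypothetical.

```python
def _make_entry(title, chunks):
    """Build (id_name, comment, sequence) from a title and sequence lines."""
    parts = title.split(None, 1)
    id_name = parts[0] if parts else ""
    comment = parts[1] if len(parts) > 1 else ""
    return id_name, comment, "".join(chunks)


def read_fasta(text):
    """Yield (id_name, comment, sequence) for each entry in FASTA text."""
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines carry no information here
        if line.startswith(">"):
            # A new title line closes the previous entry, if any.
            if title is not None:
                yield _make_entry(title, chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)  # sequence line of the current entry
    if title is not None:
        yield _make_entry(title, chunks)


example = ">a first entry\nacgt\nac\n>b\ntt\n"
for entry in read_fasta(example):
    print(entry)
```

Lines that appear before the first title line are ignored in this sketch; a stricter reader might reject them instead.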
There are exceptions to these codes; for example, staden format uses non-standard ambiguity codes.
Some of the most widespread sequence formats apart from fasta are those used by the major sequence databases.
Files can hold sequences in standard recognised formats.
Files can also hold sequences in non-standard unrecognisable ways. Do not expect EMBOSS to be able to read your sequences held in a word-processor format file. EMBOSS is not a word-processor!
Some sequence formats can hold multiple sequences in one file. The details of how many sequences are held in one file differ between formats, but they either allow many sequences to be concatenated one after the other, or they hold the sequences together in some sort of aligned set of sequences.
Other formats, such as gcg, plain and staden formats can only hold one sequence per file. An attempt to concatenate several sequences in one file leaves the results as a mess that makes it impossible to decide where the sequences start and end or what is annotation and what is sequence.
These single formats therefore cause problems when there are multiple sequences to write out because a single file containing multiple sequences in that format is invalid. When these formats are specified for output, an EMBOSS program will allow you to write many sequences to one file, but EMBOSS programs will not be able to reliably read in the resulting mess.
If you really wish to write multiple sequences out in formats that cannot cope with multiple sequences, you are advised to add the global qualifier -ossingle on the command line. This will force the EMBOSS program to ignore the given output file name and generate its own file names. One sequence will be written to each such file. These file names are made from the sequence ID name, with the name of the format as the extension (e.g. hsfau.gcg).
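The file naming rule just described can be illustrated with a small Python sketch. It assumes the name is simply the ID name followed by a dot and the format name, as in the 'hsfau.gcg' example; any case folding or name-clash handling that EMBOSS itself applies is not modelled, and `single_file_names` is a hypothetical helper.

```python
def single_file_names(id_names, fmt):
    """Return one output file name per sequence ID name."""
    # Assumption: '<ID name>.<format>' with no further normalisation.
    return [f"{id_name}.{fmt}" for id_name in id_names]


print(single_file_names(["hsfau", "xyz"], "gcg"))
```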
This is not ideal. Preferably, you should stay away from formats that can't cope with multiple sequences in a file.
To date, the following sequence formats are accepted as input.
By default (i.e. if no format is explicitly specified), EMBOSS tries each format in turn until one succeeds.
Input Format | Auto | Nuc | Pro | Feat | Gap | Multi | Description |
---|---|---|---|---|---|---|---|
gcg gcg8 | Yes | Yes | Yes | No | Yes | No | GCG 9.x and 10.x format with the format and sequence type identified on the first line of the file. GCG 8.x format where anything up to the first line containing ".." is considered as heading, and the remainder is sequence data. |
embl em | Yes | Yes | No | Yes | Yes | No | EMBL entry format, including all the fields in the latest release format. The Staden package and others use EMBL or similar formats for sequence data. |
swiss sw swissprot | Yes | No | Yes | Yes | Yes | No | SWISSPROT entry format, including all the fields in the latest release format. |
nbrf pir | Yes | Yes | Yes | Yes | Yes | No | NBRF (PIR) format, as used in the PIR database sequence files. This format was used for some years as an interchange format with the reference data followed by the sequence data. This unofficial PIR format is what EMBOSS supports. If there is enough interest, we can also use NBRF database format with separate files for sequence (the main EMBOSS input/output) and for features. Documentation of this format is hard to find, but we do have a copy from PIR. The sequence files include the ID and description but no citation or feature information. |
pdb | Yes | No | Yes | No | No | No | PDB protein databank format ATOM lines |
pdbseq | Yes | No | Yes | No | No | No | PDB protein databank format SEQRES lines |
pdbnuc | No | Yes | No | No | No | No | PDB protein databank format nucleotide ATOM lines |
pdbnucseq | No | Yes | No | No | No | No | PDB protein databank format nucleotide SEQRES lines |
fasta ncbi | Yes | Yes | Yes | No | Yes | No | FASTA format with optional accession number and database name in NCBI style included as part of the sequence identifier, e.g. ">database\|accession\|id description" or ">name description" or ">name accession description". |
gifasta | No | Yes | Yes | No | Yes | No | FASTA format including NCBI-style GIs (alias) |
pearson | Yes | Yes | Yes | No | Yes | No | FASTA format with no further processing of the "ID", e.g. ">name description". Used where fasta or ncbi format interprets the ID in an unwanted way; this format skips the further ID parsing stage of reading these files. |
fastq | Yes | Yes | No | No | No | No | FASTQ short read format ignoring quality scores |
fastq-sanger | No | Yes | No | No | No | No | FASTQ short read format with phred quality |
fastq-illumina | No | Yes | No | No | No | No | FASTQ Illumina 1.3 short read format |
fastq-solexa | No | Yes | No | No | No | No | FASTQ Solexa/Illumina 1.0 short read format |
genbank gb ddbj | Yes | Yes | No | Yes | Yes | No | GENBANK entry format, including the feature table. |
refseqp | No | No | Yes | Yes | Yes | No | Refseq protein entry format |
genpept | No | No | Yes | Yes | Yes | No | Refseq protein entry format (alias) |
codata | Yes | Yes | Yes | Yes | Yes | No | Codata entry format |
strider | Yes | Yes | Yes | No | Yes | No | DNA strider output format |
clustal aln | Yes | Yes | Yes | No | Yes | No | ClustalW ALN (multiple alignment) format. |
phylip | Yes | Yes | Yes | No | Yes | Yes | Phylip interleaved and non-interleaved formats |
phylipnon | No | Yes | Yes | No | Yes | Yes | Phylip non-interleaved format |
acedb | Yes | Yes | Yes | No | Yes | No | ACEDB sequence format |
dbid | No | Yes | Yes | No | Yes | No | FASTA format variant with the database name first, then the ID name, then an optional accession number, e.g. ">database name description" or ">database name accession description". |
msf | Yes | Yes | Yes | No | Yes | No | Wisconsin Package GCG MSF (multiple sequence file) file format |
hennig86 | Yes | Yes | Yes | No | Yes | No | Hennig86 output format |
jackknifer | Yes | Yes | Yes | No | Yes | No | Jackknifer interleaved and non-interleaved formats |
nexus paup | Yes | Yes | Yes | No | Yes | No | Nexus/paup interleaved format |
treecon | Yes | Yes | Yes | No | Yes | No | Treecon output format |
mega | Yes | Yes | Yes | No | Yes | No | Mega interleaved and non-interleaved formats |
igstrict | Yes | Yes | Yes | No | Yes | No | Intelligenetics sequence format strict parser |
ig | No | Yes | Yes | No | Yes | No | Intelligenetics sequence format |
staden | No | Yes | Yes | No | Yes | No | This format is obsolete; the latest version of the Staden package no longer supports it (see "experiment" format for the new Staden package format). Staden format was just the sequence in simple text with, optionally, comments. |
text plain | No | Yes | Yes | No | Yes | No | Plain text. This is the format with no format. The whole of the file is read in as a sequence. No attempt is made to parse the file contents in any way. Anything is acceptable in this format. This means that any character will be included in the sequence, even digits and punctuation. Use this format only when you are sure that the input sequence file is correct and contains only what you want to be considered as your 'sequence'. |
gff2 | Yes | Yes | Yes | Yes | Yes | No | GFF feature file with sequence in the header Normally used as a pure feature format, but can hold the sequence as part of the structured header. |
gff3 gff | Yes | Yes | Yes | Yes | Yes | No | GFF3 feature file with sequence |
stockholm pfam | Yes | Yes | Yes | No | Yes | No | Stockholm (pfam) format |
selex | No | Yes | Yes | No | Yes | No | SELEX format is used by Sean Eddy's HMMER package. It can store RNA secondary structure as part of the sequence annotation. |
fitch | Yes | Yes | Yes | No | Yes | No | Fitch program format |
mase | No | Yes | Yes | No | Yes | No | Mase program format |
raw | Yes | Yes | Yes | No | No | No | Like text/plain format except that it removes any whitespace or digits, accepts only alphabetic characters and rejects anything else. This means that it is safer to use this format than plain format. If you have digits and spaces or TAB characters, these are removed and ignored. If you have other non-alphabetic characters (for example, punctuation characters), then the sequence will be rejected as erroneous. Gap characters, '-', and translated STOP codon characters '*' are legal. |
experiment | Yes | Yes | Yes | No | Yes | No | The Staden package stores single sequencing experiment reads in a format derived from EMBL. All EMBL tags are allowed, plus many extras. Unusually, the extra tags are allowed to continue beyond the '//' line which only marks the end of the sequence. The "EX" experiment line is used to create a sequence description. Accuracy values are stored, or at least the largest value for each sequence position. To date no EMBOSS program is using these values. |
abi | Yes | Yes | Yes | No | Yes | No | ABI trace file format. This is the format of file produced by ABI sequencing machines. It contains the 'trace data', i.e. the probabilities of the 4 bases along the sequencing run, together with the sequence, as deduced from that data. The sequence information is what is normally read in and used by EMBOSS programs, although the trace data is available and may be utilised by some specialised EMBOSS programs. The code for this is heavily based on David Mathog's Fortran library with a description of ABI trace file format (abi.txt): ftp://saf.bio.caltech.edu/pub/software/molbio/abitools.zip |
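The behaviour described for the 'raw' format in the table above (whitespace and digits dropped, letters plus '-' and '*' kept, anything else rejected) can be sketched as follows. This is an illustration of the stated rules only, not the EMBOSS implementation; `clean_raw` is a hypothetical name.

```python
import string


def clean_raw(text):
    """Return the sequence in `text` following the 'raw' format rules."""
    kept = []
    for ch in text:
        if ch in string.whitespace or ch in string.digits:
            continue  # spaces, TABs, newlines and digits are ignored
        if ch in string.ascii_letters or ch in "-*":
            kept.append(ch)  # residues, gap and STOP characters are legal
        else:
            # Punctuation and other characters make the input erroneous.
            raise ValueError(f"illegal character {ch!r} in raw sequence")
    return "".join(kept)


print(clean_raw("1 acgt acgt\n11 tt-a*\n"))
```

By contrast, the 'text'/'plain' format described in the same table would keep every character, including the digits.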

Special Format | Description |
---|---|
asis | This is not so much a sequence format as a quick way of entering a sequence on the command line, but it is included here for completeness. Where a filename would normally be given, in asis format there is the sequence itself. An example would be asis::atacgcagttatctgaccat (in 'asis' format the name is the sequence, so no file needs to be opened). This is a special case. This syntax can be very useful for generating command lines. |
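The default behaviour mentioned before the input table, trying each format in turn until one succeeds, follows a simple pattern that can be sketched in Python. The two parsers and their order below are hypothetical stand-ins chosen for brevity; they do not reflect the checks or the order that EMBOSS actually uses.

```python
def parse_fasta(text):
    """Toy parser: accept input whose first line starts with '>'."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not fasta")
    return "".join(line.strip() for line in lines[1:])


def parse_letters_only(text):
    """Toy parser: accept input that is only letters and whitespace."""
    seq = "".join(text.split())
    if not seq or not seq.isalpha():
        raise ValueError("not a plain run of letters")
    return seq


# Hypothetical list of candidate formats, tried from first to last.
PARSERS = [("fasta", parse_fasta), ("letters", parse_letters_only)]


def read_auto(text, parsers=PARSERS):
    """Return (format_name, sequence) from the first parser that succeeds."""
    for name, parser in parsers:
        try:
            return name, parser(text)
        except ValueError:
            continue  # this format did not match, try the next one
    raise ValueError("no known format matched the input")


print(read_auto(">x demo\nacgt\n"))
print(read_auto("acgt\nac\n"))
```

One consequence of this design is that the order of the list matters: a permissive format placed early would claim input that a stricter format further down could have read more faithfully.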
To date, the following sequence formats are available as output.
Some sequence formats can hold multiple sequences in one file; these are marked as multiple in the following table.
Other formats, such as GCG, plain and staden formats, can only hold one sequence per file; these are marked as single.
Output Format | Single | Save | Nuc | Pro | Feat | Gap | Multi | Description |
---|---|---|---|---|---|---|---|---|
gcg gcg8 | No | No | Yes | Yes | No | Yes | No | Wisconsin Package GCG 9.x and 10.x format with the sequence type on the first line of the file. GCG 8.x format where anything up to the first line containing ".." is considered as heading, and the remainder is sequence data. |
embl em emblnew | No | No | Yes | No | Yes | Yes | No | EMBL entry format with available fields filled in and others with no information omitted. The EMBOSS command line allows missing data such as accession numbers to be provided if they are not obtainable from the input sequence. |
swiss sw swissprot swissnew swnew swissprotnew | No | No | No | Yes | Yes | Yes | No | Swissprot entry format with available fields filled in and others with no information omitted. The EMBOSS command line allows missing data such as accession numbers to be provided if they are not obtainable from the input sequence. |
fasta pearson | No | No | Yes | Yes | No | Yes | No | Standard Pearson FASTA format, but with the accession number included after the identifier if available. |
ncbi | No | No | Yes | Yes | No | Yes | No | NCBI style FASTA format with the database name, entry name and accession number separated by pipe ("|") characters. |
gifasta | No | No | Yes | Yes | No | Yes | No | NCBI fasta format with NCBI-style IDs using GI number |
nbrf pir | No | No | Yes | Yes | Yes | Yes | No | NBRF/PIR entry format, as used in the PIR database sequence files. |
genbank gb ddbj refseq | No | No | Yes | No | No | Yes | No | GENBANK entry format with available fields filled in and others with no information omitted. The EMBOSS command line allows missing data such as accession numbers to be provided if they are not obtainable from the input sequence. |
gff2 | No | No | Yes | Yes | Yes | Yes | No | GFF format. Normally used as a pure feature format, but can hold the sequence as part of the structured header. |
gff3 gff | No | No | Yes | Yes | Yes | Yes | No | GFF3 feature file with sequence in FASTA format after |
ig | No | No | Yes | Yes | No | Yes | No | Intelligenetics sequence format, as used by the Intelligenetics package |
codata | No | No | Yes | Yes | No | Yes | No | Codata entry format |
strider | No | No | Yes | Yes | No | Yes | No | DNA strider output format |
acedb | No | No | Yes | Yes | No | Yes | No | ACEDB sequence format |
experiment | No | No | Yes | Yes | No | Yes | No | Staden experiment file |
staden | No | No | Yes | Yes | No | Yes | No | Old Staden package sequence format. This format is obsolete; the latest version of the Staden package no longer supports it. Staden format is just the sequence in simple text with, optionally, comments. |
text plain raw | No | No | Yes | Yes | No | Yes | No | Plain sequence, no annotation or heading. |
fitch | No | No | Yes | Yes | No | Yes | No | Fitch program format |
msf | No | Yes | Yes | Yes | No | Yes | No | Wisconsin Package GCG MSF (multiple sequence file) file format |
clustal aln | No | Yes | Yes | Yes | No | Yes | No | Clustalw multiple alignment format |
selex | No | Yes | Yes | Yes | No | Yes | No | Selex format |
phylip | No | Yes | Yes | Yes | No | Yes | Yes | Phylip interleaved format |
phylipnon phylip3 | No | Yes | Yes | Yes | No | Yes | No | PHYLIP non-interleaved format that was used in Phylip version 3.2. Also called phylip3 for back compatibility with earlier EMBOSS versions. |
asn1 | No | No | Yes | Yes | No | Yes | No | A subset of NCBI ASN.1 containing entry name, accession number, description and sequence, similar to the current ASN.1 output of readseq |
hennig86 | No | Yes | Yes | Yes | No | Yes | No | Hennig86 output format |
mega | No | Yes | Yes | Yes | No | Yes | No | Mega interleaved output format |
meganon | No | Yes | Yes | Yes | No | Yes | No | Mega non-interleaved output format |
nexus paup | No | Yes | Yes | Yes | No | Yes | No | Nexus/paup interleaved format |
nexusnon paupnon | No | Yes | Yes | Yes | No | Yes | No | Nexus/paup non-interleaved format |
jackknifer | No | Yes | Yes | Yes | No | Yes | No | Jackknifer output interleaved format |
jackknifernon | No | Yes | Yes | Yes | No | Yes | No | Jackknifer output non-interleaved format |
treecon | No | Yes | Yes | Yes | No | Yes | No | Treecon output format |
mase | No | No | Yes | Yes | No | Yes | No | Mase program format |
dasdna | No | No | Yes | No | No | Yes | No | DASDNA DAS nucleotide-only sequence |
das | No | No | Yes | Yes | No | Yes | No | DASSEQUENCE DAS any sequence |
fastq-sanger fastq | No | No | Yes | No | No | No | No | FASTQ short read format with phred quality |
fastq-illumina | No | No | Yes | No | No | No | No | FASTQ Illumina 1.3 short read format |
fastq-solexa | No | No | Yes | No | No | No | No | FASTQ Solexa/Illumina 1.0 short read format |
debug | No | No | Yes | Yes | No | Yes | No | EMBOSS sequence object report for debugging showing all available fields. Not all fields will contain data - this depends very much on the input format used. |
When typing a sequence into a sequence editor, such as mse, the sequence editor should save the sequence to a file in a recognised format.
If you are creating a sequence by typing it into a text editor, then the best format is probably fasta format. Simply start the entry with a title line. This title line starts with a > character followed by the ID name of the sequence then any other comments. Subsequent lines contain the sequence. Many sequence entries can follow each other in a single file.
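For readers who would rather generate such an entry than type it, the layout just described can be produced with a short Python sketch. The 60-character line width matches the examples in this document but is otherwise an assumption, and `format_fasta` is a hypothetical helper.

```python
def format_fasta(id_name, sequence, comment="", width=60):
    """Return one FASTA entry as text: a title line, then sequence lines."""
    # Title line: '>' then the ID name, then any other comments.
    title = f">{id_name} {comment}".rstrip()
    # Subsequent lines hold the sequence, split into fixed-width pieces.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([title] + lines) + "\n"


# Several entries can simply follow each other in one file.
text = format_fasta("xyz", "acgtacgt", "some other comment", width=4)
text += format_fasta("abc", "ttga", width=4)
print(text, end="")
```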
If you are truly masochistic, you will have typed your sequence into a word-processor. Don't do it again! If you click on the 'File' button and then on 'Save As..' you should be able to save your sequence as 'Text'. If you are lucky, you now have a sequence in 'plain' format.
To convert the sequences in the file 'myfile.seq' into the format 'embl' in the new file 'myfile2.seq', run either:
seqret myfile.seq embl::myfile2.seq
or
seqret myfile.seq myfile2.seq -osf embl
('-osf' is an abbreviation for '-osformat')
These two commands are exactly equivalent.
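The 'format::name' form used in the first command (and in the 'asis::' example earlier) can be taken apart with a few lines of Python. This sketch handles only that one form as it appears in this document; it is not a full parser for EMBOSS sequence specifications, and `split_format_prefix` is a hypothetical name.

```python
def split_format_prefix(spec):
    """Split 'format::name' into (format, name); format is None if absent."""
    fmt, sep, rest = spec.partition("::")
    if sep:
        return fmt, rest
    return None, spec  # no explicit format was given


print(split_format_prefix("embl::myfile2.seq"))
print(split_format_prefix("myfile.seq"))
print(split_format_prefix("asis::atacgcagttatctgaccat"))
```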
There are other command-line qualifiers that change the behaviour of the sequence input.
    -sbegin       integer  first base used
    -send         integer  last base used, default=seq length
    -sreverse     boolean  reverse (if DNA)
    -sask         boolean  ask for begin/end/reverse
    -snucleotide  boolean  sequence is nucleotide
    -sprotein     boolean  sequence is protein
    -slower       boolean  make lower case
    -supper       boolean  make upper case
    -sformat      string   input sequence format
    -sopenfile    string   input filename
    -sdbname      string   database name
    -sid          string   entryname
    -ufo          string   UFO features
    -fformat      string   features format
    -fopenfile    string   features file name
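To make the effect of the first three qualifiers concrete, here is a small Python sketch of selecting a region and optionally reversing it. It rests on two assumptions that are not stated in the list above: positions are 1-based and inclusive, and 'reverse' for DNA means the reverse complement. `select_region` is a hypothetical function, not an EMBOSS call.

```python
# Complement table for unambiguous DNA bases only (ambiguity codes omitted).
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def select_region(seq, sbegin=1, send=None, sreverse=False):
    """Return seq[sbegin..send] (1-based, inclusive), optionally reversed."""
    if send is None:
        send = len(seq)  # default: the last base is the sequence length
    if not 1 <= sbegin <= send <= len(seq):
        raise ValueError("sbegin/send outside the sequence")
    region = seq[sbegin - 1:send]
    if sreverse:
        # Assumption: reversing DNA means taking the reverse complement.
        region = region.translate(COMPLEMENT)[::-1]
    return region


print(select_region("aaccggtt", sbegin=3, send=6))
print(select_region("aacg", sreverse=True))
```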
There are other command-line qualifiers that change the behaviour of the sequence output.
    -osformat     string  output sequence file format
    -osextension  string  file name extension
    -osname       string  base file name
    -osdirectory  bool    output sequence file directory
    -osdbname     string  database name to add
    -ossingle     bool    create a separate output file for each entry
    -oufo         string  feature file to create
    -offormat     string  features format
    -ofname       string  features file name
    -ofdirectory  string  features output directory
More formats, both for input and for output, can be easily added, so suggestions are always welcome.