Sequences can be in databases or in files. Less common sources of sequences like programs and URLs (web addresses) can also be specified in USAs.
USAs can specify a single sequence or many sequences.
In general, a USA specifies a format, a file or database, and an entry within that file or database.
Of these, only the 'file or database' part is necessary. If the format is omitted, then EMBOSS expects Fasta format, but if this fails it will check many other formats. If the 'entry' part is omitted, then all of the entries in the file or database are read in.
The most common ways of specifying a sequence are to type the name of the file that the sequence is in, or to type 'db:entry', where 'db' is the name of a database and 'entry' is either the sequence's ID name in that database or its Accession number in the database.
For example:
If the format is omitted, then EMBOSS expects Fasta format, but if this fails it will check the other formats for you until it can read in the sequence. If the format is not recognised, it will fail with an error message.
It is not necessary to specify the format of entries in a sequence database, because the EMBOSS configuration files that specify where and what the databases are also specify their formats. EMBOSS therefore knows the formats of all its databases already.
Sequences in 'plain' format (no annotation, title or comments, just the sequence) are sometimes not recognised by EMBOSS. This is one of the few cases where use of the format in a USA is required. At any other time, specifying the format will merely speed up the program as the programs will not have to try all possible formats until the format is recognised.
There is no standard system of naming databases in EMBOSS. This is because total control over the database setup has been given to your local EMBOSS administrator (the person who set up EMBOSS at your site).
Dot characters, '.', are not legal in database names. When EMBOSS finds a dot character, it assumes that a file name is being referred to.
You can easily find out what the local database names are by running showdb. This will give a table of the database names, whether they are protein or nucleic, and the types of access that are possible (more on this below).
It is very likely that one or more of the following major databases will have been set up:
Abbreviations of these names may also be available, for example 'em' for 'embl'.
For example: 'embl:hsfau' or 'swissprot:100K_RAT'.
EMBOSS will try searching for your specified sequence by both the accession number field and the ID name field. There is no need to specify whether you have given the accession number or the ID name.
The case of the database name and entry is not significant; they can be in either upper or lower case. For example: 'EM:AF061303' is the same as 'em:af061303'.
You cannot specify a sequence in EMBOSS by giving just the ID name or Accession number - you must specify the database name in the 'database:entry' syntax. You cannot therefore just specify 'X65923' and expect EMBOSS to know what this is - it will actually assume that 'X65923' is the name of a database or a file and will fail when this fails to work.
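The naming rules above (no dots in database names, case-insensitive database names and entries, and the 'source:entry' layout) can be illustrated with a short Python sketch. The function name `split_simple_usa` is hypothetical and this is not EMBOSS's own parser. It ignores 'format::' prefixes, search fields, list files and subsequence ranges, and it simply assumes that a dot-free name is a database, whereas EMBOSS would look the name up in its configuration.

```python
def split_simple_usa(usa):
    """Split 'source:entry' into (kind, source, entry).

    Illustrative only: a simplified reading of the rules in the text,
    not the parser that EMBOSS itself uses.
    """
    source, sep, entry = usa.partition(":")
    if "." in source:
        # Dots are not legal in database names, so this is taken as a file.
        # File names keep their case, as always on UNIX.
        kind = "file"
    else:
        # Assumption: a dot-free name is treated as a database here.
        # Database names are not case-sensitive, so normalise them.
        kind = "database"
        source = source.lower()
    # A missing entry part means "all entries", the same as '*'.
    if not (sep and entry):
        entry = "*"
    if kind == "database":
        # Entry names in databases are not case-sensitive either.
        entry = entry.lower()
    return kind, source, entry


print(split_simple_usa("EM:AF061303"))
print(split_simple_usa("myfile.fasta:xyz_123"))
```

Note that a bare 'X65923' comes back as a database name with all entries selected, which mirrors the failure described above: the text is never treated as an entry on its own.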
Why are there two such identifiers? The ID name was originally intended to be a human-readable name that had some indication of the function of its sequence. In EMBL and GenBank the first two (or three) letters indicated the species and the rest indicated the function, for example 'hsfau' is the 'Homo Sapiens FAU pseudogene'. This naming scheme started to be a problem when the number of entries added each day was so vast that people could not make up the ID names fast enough. Instead, the Accession numbers started to be also assigned as the ID name. Therefore you will now find ID names like 'AF061303', the same as the Accession number for that sequence in EMBL.
ID names are not guaranteed to remain the same between different versions of a database (although in practice they usually do).
Accession numbers are unique alphanumeric identifiers that are guaranteed to remain with that sequence through the rest of the life of the database. If two sequences are merged into one, then the new sequence will get a new Accession number and the Accession numbers of the merged sequences will be retained as 'secondary' Accession numbers.
EMBL, GenBank and SwissProt share an Accession numbering scheme - an Accession number uniquely identifies a sequence within these three databases.
It is also common to need to refer to a set of wildcarded entry-names in a database, for example 'swissprot:*_human' refers to all the human entries in SwissProt. (Strictly, it is all the entries in SwissProt whose names end in '_human'.) A single character can be specified as wild-carded by using a '?' character.
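As a rough illustration of how '*' and '?' select entry names, the following Python sketch uses the standard `fnmatch` module. The helper name `match_entries` and the entry name 'FAU_HUMAN' are made up for the example. `fnmatch` also understands '[...]' character sets, which are not described here for USAs, so this is an approximation rather than a statement of how EMBOSS matches names.

```python
from fnmatch import fnmatchcase


def match_entries(pattern, entry_names):
    """Return the entry names selected by a wildcarded entry pattern.

    Both sides are lower-cased first because entry names are not
    case-sensitive in a USA.
    """
    pattern = pattern.lower()
    return [name for name in entry_names if fnmatchcase(name.lower(), pattern)]


# Example entry names; 'FAU_HUMAN' is invented for illustration.
names = ["OPSD_HUMAN", "OPSD_XENLA", "100K_RAT", "FAU_HUMAN"]

print(match_entries("*_human", names))     # names ending in '_human'
print(match_entries("opsd_?enla", names))  # '?' stands for one character
```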
Because EMBOSS is very flexible in the ways in which databases can be set up, not all databases will be searchable by all types of sequence specifications.
Databases that are set up to access (for example) a web site to return a single entry will probably not be set up to return either wildcarded entry name specifications or complete databases (it takes a long time to transfer large databases across the Internet!).
The program showdb will give a list of the available databases, together with the ways in which they can be accessed. It will show these under the three columns headed 'ID', 'Query' and 'All'.
Ideally all of the databases available on your site will be available in all three ways, but this is not the best of all possible worlds, so you might like to check how you can access the databases by running showdb and having a look.
To stop the UNIX shell from interpreting the wildcard characters '*' and '?' itself, these characters need to be hidden in quotes or preceded by a backslash on the UNIX command line. For example:
% seqret "embl:*"
or
% seqret embl:\*
Quoting of wildcard characters is only required on the command-line. It is not required when giving a reply to a prompt from a program or when filling in a field on a GUI's form. For example:
% seqret
Reads and writes (returns) sequences
Input sequence(s): embl:*
etc.
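When a command line is assembled by a script rather than typed by hand, the quoting can be done automatically. The Python sketch below uses the standard `shlex.quote` function; the helper name `seqret_command` is hypothetical, and the command is only built as a string, not run. `shlex.quote` uses single quotes rather than the double quotes shown above, but both hide the wildcard from the shell.

```python
import shlex


def seqret_command(usa):
    """Build a shell command line for seqret with the USA safely quoted."""
    # shlex.quote leaves plain text alone and wraps anything containing
    # shell-special characters (such as '*' or '?') in single quotes.
    return "seqret " + shlex.quote(usa)


print(seqret_command("embl:*"))      # seqret 'embl:*'
print(seqret_command("embl:hsfau"))  # seqret embl:hsfau
```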
The case of the filename is significant, as always in UNIX. 'FRED.SEQ' is not the same filename as 'fred.seq'.
If just the name of the file containing multiple sequences is specified, then all the sequences in that file will be read in. This is the equivalent of specifying 'filename:*'. For example 'myclones.seq' is the same thing as 'myclones.seq:*'.
For example: 'myfile.fasta:xyz_123' - the sequence in the file 'myfile.fasta' whose ID name is 'xyz_123'.
You cannot specify a sequence in EMBOSS by giving just the ID name - you must specify the file name in the 'filename:ID' syntax.
To help GCG users, an additional syntax is allowed where the entry name is enclosed in curly brackets: 'file{entry}'. (This needs to be escaped on the command line as 'file\{entry\}'.) This allows MSF files to be specified as "pileup.msf{*}", although a simple "pileup.msf" would work just as well.
To specify wildcarded sequence names, use the wildcard characters '*' and '?', just as for database entries. For example, 'myfile.fasta:IXI*' will read in all of the sequences in the file 'myfile.fasta' whose ID name starts with 'IXI'. The wildcard characters need to be hidden in quotes or preceded by a backslash on the UNIX command line.
Instead of containing the sequences themselves, a List File contains "references" to sequences using any valid USA - so, for example, you might include database entries, the names of files containing sequences, or even the names of other list files. For example, here's a valid list file, called seq.list:
opsd_abyko.fasta
sw:opsd_xenla
sw:opsd_c*
@another_list
This looks a bit odd, but it's really very straightforward; the file contains:
Notice the '@' in front of the last entry. This is the way you tell EMBOSS that this file is a List File, not a regular sequence file. Alternatively, you can use the specifier 'list:'. These two specifiers are synonymous; the '@' version is derived from the syntax of the DEC-VMS command interpreter and has come to indicate a List File in many subsequent computer systems.
Blank lines and lines starting with a '#' character are ignored in List Files.
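The way a List File is read can be sketched in Python as follows. This is a simplified illustration, not EMBOSS's implementation: the function name `expand_list_file` is hypothetical, nested list file names are assumed to be resolved relative to the current directory, leading spaces before a '#' are tolerated, and the guard against a list file that refers back to itself is an addition of this sketch.

```python
import os


def expand_list_file(path, _seen=None):
    """Return the USAs named in a List File, following nested list files."""
    seen = set() if _seen is None else _seen
    real = os.path.realpath(path)
    if real in seen:
        # Guard against list files that refer to each other in a loop.
        return []
    seen.add(real)

    usas = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            # Blank lines and '#' comment lines are ignored.
            if not line or line.startswith("#"):
                continue
            if line.startswith("@"):
                nested = line[1:]
            elif line.lower().startswith("list:"):
                nested = line[len("list:"):]
            else:
                # Any other line is kept as an ordinary USA.
                usas.append(line)
                continue
            # '@name' and 'list:name' both refer to another List File.
            usas.extend(expand_list_file(nested, seen))
    return usas
```

Calling `expand_list_file("seq.list")` on the example above would return the first three lines unchanged, followed by whatever USAs 'another_list' contains.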
The syntax is 'asis::sequence', for example: 'asis::atgctagcttagctgac' for the sequence 'atgctagcttagctgac'.
N.B. 'asis' only specifies one sequence at a time. The sequence has no ID name or title.
'getz -e [embl-id:paamir] |'
The pipe character "|" causes EMBOSS to fire up getz (the SRS sequence retrieval program) to extract entry PAAMIR from EMBL in EMBL format. Any application or script which writes one or more sequences to stdout can be used in this way.
Most sequence specifications will use these identifiers, so these are the default ways of indicating sequence entries. There are, however, other ways in which people might like to specify sequences.
There are many defined data fields in sequence database entries.
A typical sequence entry in EMBL format is:
ID   HSFAU      standard; DNA; UNC; 518 BP.
AC   X65923;
SV   X65923.1
DE   H.sapiens fau mRNA
KW   fau gene.
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
SQ   Sequence 518 BP; 125 A; 139 C; 148 G; 106 T; 0 other;
It is usual to refer to sequences by their unique ID name or accession number or sequence version number (the 'ID', 'AC' and 'SV' lines, above). It is also useful to be able to find sequences that contain words occurring in their short description field (the 'DE' line), their Keyword field (the 'KW' line) or the Organism fields (the 'OS' and 'OC' lines).
Some of these data fields can be used to specify entries in EMBOSS.
Some search field values are unique, like the ID name or Accession numbers. Some are not unique, like the words in the description, or organism name. Searching with a non-unique search field value, for example the organism name will probably find more than one match. In this case you will get more than one sequence entry returned. This is a similar case to specifying a wildcarded ID name and getting many matching entries returned.
You explicitly specify which field type you are searching by using one of the following Search Field names, together with the data to search for.
Name | Searches for |
---|---|
acc | Accession number |
des | Description |
id | ID name |
key | Keyword |
org | Organism Name |
sv | Sequence Version/GI Number |
You can specify the type of field in the database you wish to search by adding a field name to the database name. (e.g. embl-des:fau).
Because 'myfile.seq-id' could be a valid file name, the notation is a little different when specifying a Search Field in a file: you use a ':' instead of a '-', for example 'myclones.seq:des:fau'.
You cannot search by multiple search fields. You can only specify one Search Field at a time.
The 'id' and 'acc' Search Fields can normally be omitted. If no search field is specified, (for example 'embl:hsfau'), then the default is to search for a match in both the 'id' and 'acc' fields.
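The two notations for Search Fields ('-' after a database name, ':' after a file name) can be summarised in a small Python sketch. The helper name `search_usa` and its `is_file` parameter are hypothetical; the field names are those in the table above.

```python
SEARCH_FIELDS = {"acc", "des", "id", "key", "org", "sv"}


def search_usa(source, field, word, is_file=False):
    """Build a USA that searches one field of a database or a file."""
    if field not in SEARCH_FIELDS:
        raise ValueError(f"unknown search field: {field!r}")
    # Databases use 'db-field:word'; files use 'file:field:word',
    # because 'file-field' could itself be a valid file name.
    separator = ":" if is_file else "-"
    return f"{source}{separator}{field}:{word}"


print(search_usa("embl", "des", "fau"))                        # embl-des:fau
print(search_usa("myclones.seq", "des", "fau", is_file=True))  # myclones.seq:des:fau
```

Only one field is accepted per call, which reflects the rule that a USA can name only one Search Field at a time.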
Missing description, keyword, organism, or sequence version fields cause queries to fail if they are used on inappropriate data. In other words, if the file or database you are searching doesn't contain the field you are searching for, you will get an error message like: "Error: Unable to read sequence 'xxx.seq:org:homo'"
If the 'acc' and 'id' search fields are omitted, then EMBOSS will search by testing for both accession and id fields. Specifying the 'acc' and 'id' search fields will make accessing the sequences slightly faster, but they are not required.
It is not necessary for you to use this syntax for 'id' and 'acc' specification, but EMBOSS programs report USAs in this style, so do not get alarmed when you see it.
The KEY field contains the words and phrases that classify the entry by form and function, as specified by the database curators.
The DES field is the brief one-line description of the sequence entry. (This is the title line in simple sequence formats, such as FASTA format).
Searches in these fields are by word. For example 'embl-des:fau'. If you wish to search for part of a word, use an asterisk to indicate a wildcard. For example: 'embl-des:h*emoglobin'
The definition of a 'word' in KEY and ORG searches is anything that matches the complete region between the semicolons ';' delimiting the sections of these fields. This includes spaces.
So 'embl-key:"fau gene"' would match the entry 'HSFAU' displayed above, as would 'embl-key:fau*', but 'embl-key:fau' would not match it. Similarly, 'embl-org:"homo sapiens (human)"' and 'embl-org:*human*' and 'embl-org:hominidae' would match this entry, but 'embl-org:human' would not match it, as the 'word' that contains "human" is "Homo sapiens (human)". The search 'embl-org:homo' would match, as the word "Homo" occurs in its own field at the end of the second 'OC' line.
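The behaviour of KEY and ORG searches on the 'HSFAU' entry can be reproduced with a small Python sketch. It is only a model of the rule described above, not the indexing code that EMBOSS or SRS use. The function names are hypothetical, and dropping the full stop that ends a line is an assumption made so that 'fau gene.' is indexed as 'fau gene'.

```python
from fnmatch import fnmatchcase


def field_words(lines):
    """Split KW, OS or OC line contents into 'words' at the semicolons."""
    words = []
    for line in lines:
        for part in line.split(";"):
            # Assumption: the full stop that ends a line is not part of the word.
            part = part.strip().rstrip(".").strip().lower()
            if part:
                words.append(part)
    return words


def field_matches(query, lines):
    """True if the (possibly wildcarded) query equals one complete 'word'."""
    query = query.lower()
    return any(fnmatchcase(word, query) for word in field_words(lines))


# Field contents taken from the HSFAU entry shown earlier.
KEYWORDS = ["fau gene."]
ORGANISM = [
    "Homo sapiens (human)",
    "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;",
    "Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.",
]

print(field_matches("fau gene", KEYWORDS))  # the complete key-phrase
print(field_matches("fau", KEYWORDS))       # only part of the phrase
print(field_matches("*human*", ORGANISM))   # wildcarded on both sides
```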
The definition of a 'word' is much more intuitive in DES searches - a 'word' is bounded by spaces and other non-alphanumeric characters. Words start with a letter or number, and end with a letter or number. SRS typically does the same, but allows a single quote at the end. This catches words such as 3' and 5' but is a problem with some quoted text.
So in the entry 'HSFAU' displayed above, 'embl-des:fau' and 'embl-des:sapiens' match. ("H.sapiens" is not a word - it is split into the words 'H' and 'sapiens' because the dot '.' is not an alphanumeric character.)
Phrases don't work for the DES field; it is word based, so the search 'embl-des:"fau mRNA"' will fail.
The searches are case-insensitive. 'Human' is the same as 'human'.
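The word-based behaviour of DES searches can be sketched in Python in the same spirit. The function names are hypothetical and the splitting rule is a simplified reading of the description above (runs of letters and digits are words, everything else is a separator), not the actual indexing code.

```python
import re


def des_words(description):
    """Split a description line into lower-cased alphanumeric words."""
    return [word.lower() for word in re.findall(r"[A-Za-z0-9]+", description)]


def des_matches(query, description):
    """True if the query is exactly one of the words in the description."""
    return query.lower() in des_words(description)


description = "H.sapiens fau mRNA"  # the DE line of the HSFAU entry

print(des_words(description))                # ['h', 'sapiens', 'fau', 'mrna']
print(des_matches("Sapiens", description))   # True: case is ignored
print(des_matches("fau mRNA", description))  # False: a phrase is not a word
```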
The Organism name, Keyword and Description line fields in a sequence entry contain words which are almost certainly not unique. Searching for an organism word like 'homo' will find many matches, all of which will be returned.
Both Sequence Version identifiers and GI Numbers (see below) share the 'sv' field in USAs.
A sequence may be unambiguously identified by the Sequence Version, for example: 'embl-sv:X65923.1'.
Be careful, that may be a false sense of security! In February 1999, everything in DDBJ/EMBL/GenBank got version 1, even if it was the 1st or 10th version for a given sequence.
For example, AC000003 shows version 1:
ID   HSAC00003  standard; DNA; HUM; 122228 BP.
XX
AC   AC000003;
XX
SV   AC000003.1
XX
DT   01-OCT-1996 (Rel. 49, Created)
DT   07-MAR-2000 (Rel. 63, Last updated, Version 6)
but is really the third sequence version (3rd gi) for that record.
See: http://www.ncbi.nlm.nih.gov:80/entrez/sutils/girevhist.cgi?val=AC000003
...and of course, the Version on the DT line has nothing to do with the sequence version (the SV or VERSION lines) -- that would be too simple!
But, if after Feb 1999 the author had updated the sequence of AC000003, then that new one would be version 2 (AC000003.2) and it is a *lot* easier for a human to track sequence version changes when you see the incremental increase -- but just because you are looking at SV X00001.1 it doesn't mean you have the first version the databases (DDBJ/EMBL/GenBank) have ever seen.
For example:
VERSION     AF181452.1  GI:6017929
            ^^^^^^^^^^  ^^^^^^^^^^
            Compound    NCBI GI
            Accession   Identifier
            Number
The NCBI GI identifier of the VERSION line serves as a method for identifying the sequence data that has existed for a database entry over time. GI identifiers are numeric values of one or more digits. Since they are integer keys, they are less human-friendly than the Accession.Version system described above. If the sequence changes, a new integer GI will be assigned.
Why are both these methods for identifying the version of the sequence associated with a database entry in use? For two reasons:
Both Sequence Version identifiers (see above) and GI Numbers share the 'sv' field in USAs.
A sequence may be unambiguously identified by the GI Number, for example: 'genbank-sv:6017929'.
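Since both kinds of identifier share the 'sv' field, a script that handles such values may want to tell them apart. The Python sketch below is only a heuristic based on the descriptions above (a GI Number is all digits; a Sequence Version is 'Accession.Version'). The function name `classify_sv` is hypothetical and nothing here is taken from EMBOSS itself.

```python
def classify_sv(value):
    """Classify an 'sv' search value as a GI Number or a Sequence Version."""
    if value.isdigit():
        # GI identifiers are numeric values of one or more digits.
        return ("gi", int(value))
    accession, dot, version = value.rpartition(".")
    if dot and accession and version.isdigit():
        # Compound form: accession number, a dot, then the version number.
        return ("sequence version", accession, int(version))
    raise ValueError(f"not a GI Number or Sequence Version: {value!r}")


print(classify_sv("6017929"))   # ('gi', 6017929)
print(classify_sv("X65923.1"))  # ('sequence version', 'X65923', 1)
```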
If the 'start' or 'end' position is given as a negative number, then the position is counted from the end of the sequence. For example: 'myfile.fasta[-10:-1]' is the last 10 bases.
If '[start:end:r]' is given at the end of the USA, then nucleotide sequences are reverse-complemented. For example: 'myfile.fasta[1:-1:r]' is the whole sequence reverse-complemented.
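The effect of the subsequence specifier can be sketched in Python. Positions are counted from 1 and both ends are included, and negative positions are counted back from the end of the sequence. The function name `subsequence` is hypothetical, ambiguity codes are left uncomplemented for brevity, and taking the range on the forward strand before reverse-complementing is an assumption of this sketch.

```python
# Complement table for unambiguous bases only (ambiguity codes are left as-is).
_COMPLEMENT = str.maketrans("acgtuACGTU", "tgcaaTGCAA")


def subsequence(seq, start=1, end=-1, reverse=False):
    """Return seq[start:end] using 1-based, inclusive USA-style positions."""
    length = len(seq)
    # Negative positions count back from the end: -1 is the last base.
    if start < 0:
        start = length + start + 1
    if end < 0:
        end = length + end + 1
    piece = seq[start - 1:end]
    if reverse:
        # Assumption: the range is taken first, then reverse-complemented.
        piece = piece.translate(_COMPLEMENT)[::-1]
    return piece


seq = "atgctagcttagctgac"
print(subsequence(seq, -10, -1))                # the last 10 bases
print(subsequence(seq, 1, -1, reverse=True))    # whole sequence, reverse-complemented
```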
The full syntax of the possible USAs is:
Mandatory parts of the USAs are given in bold text.
'asis' :: Sequence [start : end : reverse]
or
Format :: '@' ListFile [start : end : reverse]
or
Format :: 'list' : ListFile [start : end : reverse]
or
Format :: Database : Entry [start : end : reverse]
or
Format :: Database - SearchField : Word [start : end : reverse]
or
Format :: File : Entry [start : end : reverse]
or
Format :: File : SearchField : Word [start : end : reverse]
or
Format :: Program Program-parameters '|' [start : end : reverse]
It can be in upper or lower case.
If the '[start : end : reverse]' USA subsequence specifier is used or '-sbegin' or '-send' or other command-line qualifiers which affect the input sequence are given, then these affect all USAs given in the List File (unless these USAs have their own '[start : end : reverse]' USA subsequence specifier).
Database names may have Search Field names appended to them (for example 'embl-des', 'embl-id').
The Entry may be wildcarded, so 'hs*' will match all ID names starting with 'hs'.
'*' indicates that all entries in the Database or File will be read.
There may be restrictions on certain databases preventing access to a single entry, wildcarded entries or reading in all entries. This is a consequence of the way some databases are accessed.
A Database or File location must be given as part of a USA that has an Entry; you cannot give an Entry name on its own. (i.e. you cannot give just an Accession number or ID name and expect EMBOSS to deduce which database they might refer to. You cannot give the ID name of a sequence in a file and expect EMBOSS to deduce that this is the ID name of a sequence in a file and that all files should be tested to see if they contain it.)
Name | Searches for |
---|---|
acc | Accession number |
des | Description |
id | ID name |
key | Keyword |
org | Organism Name |
sv | Sequence Version/GI Number |
Words may be wildcarded.
Words in ORG and KEY fields may contain spaces because the complete key-phrase or organism classification level is indexed as one 'word'.
Words in the DES field contain only alphanumeric characters and thus end at spaces or other non-alphanumeric characters.
Words in ID and ACC fields are the same thing as 'Entries' above.
Use of this USA subsequence specifier is equivalent to using the '-sbegin' or '-send' or '-sreverse' command-line qualifiers.
Each of the above can have '[start : end]' or '[start : end : r]' appended to them.
The 'file' and 'dbname' forms of USA can have 'format::' in front of them (although a database knows which format it is and so this is redundant and error-prone)
Type | Example | Description |
---|---|---|
filename | xxx.seq | A sequence file "xxx.seq" in any format |
format::filename | fasta::xxx.seq | A sequence file "xxx.seq" in fasta format |
db:IDname | embl:paamir | EMBL entry PAAMIR, using whatever access method is defined locally for the EMBL database |
db:AccessionNumber | embl:X13776 | EMBL entry X13776, using whatever access method is defined locally for the EMBL database and searching by accession number and entry name (X13776 is the accession number in this case) |
db-acc:AccessionNumber | embl-acc:X13776 | EMBL entry X13776, using whatever access method is defined locally for the EMBL database and searching by accession number only |
db-id:IDname | embl-id:paamir | EMBL entry PAAMIR, using whatever access method is defined locally for the EMBL database, and searching by ID only |
db-searchfield:word | embl-des:lectin | EMBL entries containing the word 'lectin' in the Description line |
db-searchfield:wildcard-word | embl-org:*human* | EMBL entries containing the wildcarded word 'human' in the Organism fields |
db:wildcard-ID | embl:paami* | EMBL entries PAAMIB, PAAMIE and so on, usually in alphabetical order, using whatever access method is defined locally for the EMBL database |
db or db:* | embl or EMBL:* | All sequences in the EMBL database |
@listfile | @mylist | Reads file mylist and uses each line as a separate USA. List files can contain references to other lists files or any other standard USA. |
list:listfile | list:mylist | Same as "@mylist" above |
'program parameters |' | 'getz -e [embl-id:paamir] |' | The pipe character "|" causes EMBOSS to fire up getz (the SRS sequence retrieval program) to extract entry PAAMIR from EMBL in EMBL format. Any application or script which writes one or more sequences to stdout can be used in this way. |
asis::sequence | asis::atacgcagttatctgaccat | So far the shortest USA we could invent. In 'asis' format the name is the sequence so no file needs to be opened. This is a special case. It was intended as a joke, but could be quite useful for generating command lines. |
You can specify the format to read in by 'format::stdin', for example: 'gcg::stdin'.
If they are at the start of the command-line before any sequence input parameters, then they refer to all parameters. If they occur after a sequence input parameter, then they refer to that parameter. If they have a number after them, e.g. '-sbegin3', then they refer to the parameter whose ordinal number in the list of parameters is that number.
Use of '-sformat' is exactly equivalent to using 'format::' in a USA.
An input database entry could be specified using one or more of '-sdbname' and '-sid'.
An input file entry could be specified using one or more of '-sopenfile' and '-sid'.
EMBOSS guesses whether a sequence is Nucleic or Protein by the proportion of characters in the sequence. It is usually correct, but might make a mistake with an unusually ambiguous Nucleic sequence.
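The idea of guessing the sequence type from the proportion of characters can be illustrated with a Python sketch. This is not the test that EMBOSS applies: the function name `guess_is_nucleotide`, the set of characters counted as nucleotide codes and the 0.85 cutoff are all assumptions chosen for illustration.

```python
def guess_is_nucleotide(seq, threshold=0.85):
    """Guess whether a sequence is nucleotide from its letter composition.

    The cutoff and the counted letters are illustrative assumptions,
    not the values used by EMBOSS.
    """
    letters = [char for char in seq.lower() if char.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    # Count the common nucleotide codes, including 'n' for an unknown base.
    nucleic = sum(char in "acgtun" for char in letters)
    return nucleic / len(letters) >= threshold


print(guess_is_nucleotide("atgctagcttagctgac"))       # True
print(guess_is_nucleotide("MKTAYIAKQRQISFVKSHFSRQ"))  # False
```

A nucleotide sequence made up largely of other ambiguity codes would fall below the cutoff and be misread as protein, which is the kind of mistake described above.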
You can force EMBOSS to accept that the sequence is Nucleic or Protein using '-snucleotide' or '-sprotein'.
The command-line qualifiers can be abbreviated as long as the abbreviation is unique, e.g. '-sb'.
-sbegin       integer   first base used
-send         integer   last base used, def=seq length
-sreverse     bool      reverse (if DNA)
-sask         bool      ask for begin/end/reverse
-snucleotide  bool      sequence is nucleotide
-sprotein     bool      sequence is protein
-slower       bool      make lower case
-supper       bool      make upper case
-sformat      string    input sequence format
-sopenfile    string    input filename
-sdbname      string    database name
-sid          string    entryname
filename
or
format::filename
The latter causes the sequence to be written in the specified format.
N.B. UNIX filenames CAN contain space characters and punctuation characters, but you will rapidly get into trouble if you specify names with these in. Stick to alphanumeric characters, '-' and '.'.
You can specify the format to output to the screen in by 'format::stdout', for example: 'gcg::stdout'.
The command-line qualifier '-ossingle' may be useful - it allows you to write out several sequences, but it writes out each sequence to a separate file. The name of the file is constructed from the ID name of the sequence being written and the extension is the format, so the sequence with the ID name 'IXI_567' being written in 'gcg' format would be written to the file 'IXI_567.gcg'.
The output filename could be specified using one or more of '-osextension' and '-osname'.
'-osdbname' specifies a database name that, together with a ':', is prepended to the entry names in the output.
-osformat     string    output seq format
-osextension  string    file name extension
-osname       string    base file name
-osdbname     string    database name to add
-ossingle     bool      separate file for each entry