Uniform Sequence Address

The Uniform Sequence Address, or USA, is a standard way of specifying a sequence to be read into a program in EMBOSS.

Sequences can be in databases or in files. Less common sources of sequences like programs and URLs (web addresses) can also be specified in USAs.

USAs can specify a single sequence or many sequences.

In general, a USA specifies:

What sequence format to expect
What file or database to open
What entry to look for

Of these, only the 'file or database' part is necessary. If the format is omitted, EMBOSS expects Fasta format, but if this fails it will check many other formats. If the 'entry' part is omitted, then all of the entries in the file or database are read in.

The most common ways of specifying a sequence are to type the name of the file that the sequence is in, or to type 'db:entry', where 'db' is the name of a database and 'entry' is either the sequence's ID name in that database or its Accession number in the database.

For example:

database:accession

embl:X65923

database:id

swissprot:100k_rat

file name

myfile.seq

Specifying the Format

The specified format can be any one of the available EMBOSS formats.

If the format is omitted, then EMBOSS expects Fasta format, but if this fails it will check the other formats for you until it can read in the sequence. If the format is not recognised, it will fail with an error message.

It is not necessary to specify the format of entries in a sequence database, because the EMBOSS configuration files that specify where and what the databases are also specify their format. EMBOSS therefore knows the formats of all its databases already.

Sequences in 'plain' format (no annotation, title or comments, just the sequence) are sometimes not recognised by EMBOSS. This is one of the few cases where use of the format in a USA is required. At any other time, specifying the format will merely speed up the program as the programs will not have to try all possible formats until the format is recognised.

Databases

The name of any of the available databases in your installation of EMBOSS can be used.

There is no standard system of naming databases in EMBOSS. This is because total control over the database setup has been given to your local EMBOSS administrator (the person who set up EMBOSS at your site).

Dot characters, '.', are not legal in database names. When EMBOSS finds a dot character, it assumes that a file name is being referred to.

You can easily find out what the local database names are by running showdb. This will give a table of the database names, whether they are protein or nucleic, and the types of access that are possible (more on this below).

It is very likely that one or more of the following major databases will have been set up:

embl - nucleic sequences from the EBI
genbank - nucleic sequences from the NCBI
swissprot - protein sequences from the EBI/ExPASy
pir - protein sequences from the NBRF

Abbreviations of these names may also be available, for example 'em' for 'embl'.

Specifying a Database Entry

The simplest way to specify a database entry is to type 'db:entry', where 'db' is the name of a database and 'entry' is either the sequence's ID name in that database or its Accession number in the database.

For example: 'embl:hsfau' or 'swissprot:100K_RAT'.

EMBOSS will try searching for your specified sequence by both the accession number field and the ID name field. There is no need to specify whether you have given the accession number or the ID name.

The case of the database name and of the entry is not significant; they can be in either upper or lower case. For example: 'EM:AF061303' is the same as 'em:af061303'.

You cannot specify a sequence in EMBOSS by giving just the ID name or Accession number - you must specify the database name in the 'database:entry' syntax. You cannot therefore just specify 'X65923' and expect EMBOSS to know what this is - it will actually assume that 'X65923' is the name of a database or a file and will fail when this fails to work.
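
The rules above (a dot means a file name, database names are case-insensitive, and a bare token is never treated as an entry) can be sketched in Python. This is only an illustration and not EMBOSS code: the function name 'resolve_source' and the 'known_databases' set are assumptions, standing in for whatever database names showdb reports at your site. It does not handle a 'format::' prefix.

```python
def resolve_source(usa, known_databases):
    """Guess whether the part before ':' names a database or a file.

    Illustrative only: 'known_databases' is a hypothetical set of local
    database names (for example, those listed by showdb).
    """
    source, _, entry = usa.partition(":")
    known = {name.lower() for name in known_databases}
    if "." in source:
        kind = "file"        # dots are not legal in database names
    elif source.lower() in known:
        kind = "database"    # database names are case-insensitive
    else:
        kind = "file"        # not a known database, so treated as a file name
    return kind, source, entry or None


dbs = {"embl", "em", "swissprot"}
print(resolve_source("EM:AF061303", dbs))  # ('database', 'EM', 'AF061303')
print(resolve_source("myfile.seq", dbs))   # ('file', 'myfile.seq', None)
print(resolve_source("X65923", dbs))       # ('file', 'X65923', None)
```

The last call shows why a bare accession number fails: with no database name in front of it, 'X65923' can only be taken as the name of a source, never as an entry.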

Ids and Accessions

An entry in a database must have some way of being uniquely identified in that database. Most sequence databases have two such identifiers for each sequence - an ID name and an Accession number.

Why are there two such identifiers? The ID name was originally intended to be a human-readable name that had some indication of the function of its sequence. In EMBL and GenBank the first two (or three) letters indicated the species and the rest indicated the function, for example 'hsfau' is the 'Homo Sapiens FAU pseudogene'. This naming scheme started to be a problem when the number of entries added each day was so vast that people could not make up the ID names fast enough. Instead, Accession numbers also started to be assigned as the ID name. Therefore you will now find ID names like 'AF061303', the same as the Accession number for that sequence in EMBL.

ID names are not guaranteed to remain the same between different versions of a database (although in practice they usually do).

Accession numbers are unique alphanumeric identifiers that are guaranteed to remain with that sequence through the rest of the life of the database. If two sequences are merged into one, then the new sequence will get a new Accession number and the Accession numbers of the merged sequences will be retained as 'secondary' Accession numbers.

EMBL, GenBank and SwissProt share an Accession numbering scheme - an Accession number uniquely identifies a sequence within these three databases.

Specifying a Set of Database Entries

It is common to need to run a program to search all the entries in a database for something. This can be done by just giving the name of the database, for example 'embl' refers to all of the entries in the EMBL database. It is more common, however, to indicate explicitly with an asterisk that all of the entries are required, for example: 'embl:*' also refers to all of the entries in the EMBL database.

It is also common to need to refer to a set of wildcarded entry-names in a database, for example 'swissprot:*_human' refers to all the human entries in SwissProt. (Strictly, it is all the entries in SwissProt whose names end in '_human'.) A single character can be specified as wild-carded by using a '?' character.
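
The effect of the '*' and '?' wildcards can be sketched with Python's standard fnmatch module. This is an illustration of the matching rule only, not the way EMBOSS queries a database index; the function name and the list of entry names are made up for the example. Note that fnmatch also gives a special meaning to '[...]', which is not part of the wildcard syntax described here.

```python
from fnmatch import fnmatchcase


def entry_matches(entry_name, pattern):
    """Case-insensitive match of an entry name against a '*'/'?' pattern."""
    return fnmatchcase(entry_name.lower(), pattern.lower())


# Hypothetical entry names, used only to show the matching behaviour.
names = ["PAX6_HUMAN", "PAX6_MOUSE", "OPSD_HUMAN", "OPSD_XENLA"]

print([n for n in names if entry_matches(n, "*_human")])
# ['PAX6_HUMAN', 'OPSD_HUMAN']
print([n for n in names if entry_matches(n, "pax?_*")])
# ['PAX6_HUMAN', 'PAX6_MOUSE']
```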

Restrictions on accessing databases

Although specifying a complete database and specifying wildcarded entry names in a database both refer to many entries in a database and both use asterisks in the USA specification, they are implemented in the EMBOSS programs in very different ways. Reading all the entries in the database requires the program to start at the beginning of the database and read each entry in turn. Reading wildcarded entries requires an index of entry ID names and accession numbers to be available that can be queried to return the positions in the database of those entries whose names match the wildcarded specification.

Because EMBOSS is very flexible in the ways in which databases can be set up, not all databases will be searchable by all types of sequence specifications.

Databases that are set up to access (for example) a web site to return a single entry will probably not be set up to handle either wildcarded entry name specifications or complete databases (it takes a long time to transfer large databases across the Internet!).

The program showdb will give a list of the available databases, together with the ways in which they can be accessed. It will show these under three columns, headed 'ID', 'Query' and 'All'.

'ID' allows the programs to extract a single explicitly named entry from the database, e.g.: embl:x13776
'Query' indicates that programs can extract a set of matching wildcard entry names, e.g.: swissprot:pax*_human
'All' allows the programs to analyse all the entries in the database sequentially, e.g.: embl:*

Ideally all of the databases available on your site will be available in all three ways, but this is not the best of all possible worlds, so you might like to check how you can access the databases by running showdb and having a look.

Quoting on the UNIX command-line

Be aware that using '*' or '?' on the UNIX command-line causes problems. UNIX tries to interpret the word containing the '*' or '?' as a wildcarded filename to be matched to existing files. When this fails UNIX gives an error message without running the program.

To avoid this, these characters need to be hidden in quotes or preceded by a backslash on the UNIX command line. For example:

% seqret "embl:*"
or
% seqret embl:\*

Quoting of wildcard characters is only required on the command-line. It is not required when giving a reply to a prompt from a program or when filling in a field on a GUI's form. For example:

% seqret
Reads and writes (returns) sequences
Input sequence(s): embl:*
etc.
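
When a command line is built by a script rather than typed by hand, the quoting can be done for you. The following Python sketch uses the standard shlex.quote function; the helper name 'seqret_command' is an assumption made for this example, and the command is only printed, not run.

```python
import shlex


def seqret_command(usa):
    """Return a shell-safe command line string for running seqret on a USA."""
    return "seqret " + shlex.quote(usa)


print(seqret_command("embl:*"))       # seqret 'embl:*'
print(seqret_command("embl:X65923"))  # seqret embl:X65923
```

Single quotes hide the asterisk from the shell in the same way as the double quotes shown above. A USA with no special characters is left unquoted.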

Specifying a Sequence File

The name of any of your files containing sequences can be used. The sequence must be in one of the formats that EMBOSS recognises.

The case of the filename is significant, as always in UNIX. 'FRED.SEQ' is not the same filename as 'fred.seq'.

Multiple Sequence Files

One or more sequences can usually be held in the same file. There are some restrictions: some formats, such as 'gcg', 'plain', 'raw' and 'staden', have no indication of where one sequence ends and the next sequence starts. With all other sequence formats, there is no problem with having several sequences concatenated in the same file.

If just the name of the file containing multiple sequences is specified, then all the sequences in that file will be read in. This is the equivalent of specifying 'filename:*'. For example 'myclones.seq' is the same thing as 'myclones.seq:*'.

Specifying One or More File Entries

The simplest way to specify a single specific sequence in a file containing multiple sequences, is to type 'filename:entry', where 'filename' is the name of a file and 'entry' is the sequence's ID name or Accession number in that file.

For example: 'myfile.fasta:xyz_123' - the sequence in the file 'myfile.fasta' whose ID name is 'xyz_123'.

You cannot specify a sequence in EMBOSS by giving just the ID name - you must specify the file name in the 'filename:ID' syntax.

To help GCG users, an additional syntax is allowed where the entry name is enclosed in curly brackets: 'file{entry}'. (This needs to be escaped on the command line as 'file\{entry\}'.) This allows MSF files to be specified as "pileup.msf{*}" although a simple "pileup.msf" would work just as well.

To specify wildcarded sequence names, use the wildcard characters '*' and '?', just as for database entries. For example, 'myfile.fasta:IXI*' will read in all of the sequences in the file 'myfile.fasta' whose ID name starts with 'IXI'. The wildcard characters need to be hidden in quotes or preceded by a backslash on the UNIX command line.

Specifying a Set of Files

You can specify a wildcarded set of file names to be read in by using the wild card characters '*' and '?', just as for database entries. For example, 'myfile.*' will read in all sequences in the files whose names start with 'myfile.'. The wildcard characters need to be hidden in quotes or preceded by a backslash on the UNIX command line.

Specifying a List File

You may know List Files by their name in the Staden Package: 'fofn' or 'File of File Names'.

Instead of containing the sequences themselves, a List File contains "references" to sequences using any valid USA - so, for example, you might include database entries, the names of files containing sequences, or even the names of other list files. For example, here's a valid list file, called seq.list:

opsd_abyko.fasta
sw:opsd_xenla
sw:opsd_c*
@another_list

This looks a bit odd, but it's really very straightforward; the file contains:

opsd_abyko.fasta - the name of a sequence file.
sw:opsd_xenla - a specific sequence in the SwissProt database
sw:opsd_c* - all the sequences in SwissProt whose ID names start with 'opsd_c'
@another_list - the name of a second, nested, list file

Notice the '@' in front of the last entry. This is the way you tell EMBOSS that this file is a List File, not a regular sequence file. Alternatively, you can use the specifier 'list:'. These two format specifiers are synonymous; the '@' version is derived from the syntax of the DEC-VMS command interpreter and has come to indicate a List File in many subsequent computer systems.

Blank lines and lines starting with a '#' character are ignored in List Files.
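
The List File rules can be sketched as a short Python function. This is an illustration, not the EMBOSS reader: the function name is an assumption, it only classifies lines and does not open nested lists, and stripping surrounding white space from each line is also an assumption.

```python
def parse_list_lines(lines):
    """Classify each line of a List File as a nested list or a plain USA."""
    items = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and '#' comment lines are ignored
        if line.startswith("@"):
            items.append(("list", line[1:]))
        elif line.startswith("list:"):
            # accept the 'list:' prefix, with one or two colons
            items.append(("list", line[len("list:"):].lstrip(":")))
        else:
            items.append(("usa", line))
    return items


seq_list = """\
# opsin sequences
opsd_abyko.fasta
sw:opsd_xenla

sw:opsd_c*
@another_list
"""
for item in parse_list_lines(seq_list.splitlines()):
    print(item)
```

For the example file this prints three ('usa', ...) items followed by ('list', 'another_list').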

Specifying a sequence "As Is"

The simplest USA format is 'asis' format. This is used to specify a sequence immediately without it having to be in a file or database.

The syntax is 'asis::sequence', for example: 'asis::atgctagcttagctgac' for the sequence 'atgctagcttagctgac'.

N.B. 'asis' only specifies one sequence at a time. The sequence has no ID name or title.

Programs

An unusual way of getting a sequence is to run a program to extract it from some other system. This is done by specifying the program's name and its sequence access details. These must be followed by a '|' character. For example:

'getz -e [embl-id:paamir] |'

The pipe character "|" causes EMBOSS to fire up getz (the SRS sequence retrieval program) to extract entry PAAMIR from EMBL in EMBL format. Any application or script which writes one or more sequences to stdout can be used in this way.

Specifying Search Fields

So far we have just been specifying individual sequences in files or databases by using their ID name or their Accession numbers.

Most sequence specifications will use these identifiers, so these are the default ways of indicating sequence entries. There are, however, other ways in which people might like to specify sequences.

There are many defined data fields in sequence database entries.
A typical sequence entry in EMBL format is:

ID   HSFAU      standard; DNA; UNC; 518 BP.
AC   X65923;
SV   X65923.1
DE   H.sapiens fau mRNA
KW   fau gene.
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
SQ   Sequence 518 BP; 125 A; 139 C; 148 G; 106 T; 0 other;

It is usual to refer to sequences by their unique ID name or accession number or sequence version number (the 'ID', 'AC' and 'SV' lines, above). It is also useful to be able to find sequences that contain words occurring in their short description field (the 'DE' line), their Keyword field (the 'KW' line) or the Organism fields (the 'OS' and 'OC' lines).

Some of these data fields can be used to specify entries in EMBOSS.

Some search field values are unique, like the ID name or Accession numbers. Some are not unique, like the words in the description, or organism name. Searching with a non-unique search field value, for example the organism name will probably find more than one match. In this case you will get more than one sequence entry returned. This is a similar case to specifying a wildcarded ID name and getting many matching entries returned.

You explicitly specify which field type you are searching by using one of the following Search Field names, together with the data to search for.

Name	Searches for
acc	Accession number
des	Description
id	ID name
key	Keyword
org	Organism Name
sv	Sequence Version/GI Number

You can specify the type of field in the database you wish to search by adding a field name to the database name, for example 'embl-des:fau'.

'myfile.seq-id' could be a valid file name, so the notation is a little different when specifying a Search Field in a file: you use a ':' instead of a '-', for example 'myclones.seq:des:fau'.

You cannot search by multiple search fields. You can only specify one Search Field at a time.

The 'id' and 'acc' Search Fields can normally be omitted. If no search field is specified, (for example 'embl:hsfau'), then the default is to search for a match in both the 'id' and 'acc' fields.

Missing description, keyword, organism, or sequence version fields cause queries to fail if they are used on inappropriate data. In other words, if the file or database you are searching doesn't contain the field you are searching for, you will get an error message like: "Error: Unable to read sequence 'xxx.seq:org:homo'"
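
The two notations, '-' for databases and ':' for files, can be captured in a small Python helper. This is a sketch for building USA strings in scripts; the function name and the 'is_file' flag are assumptions made for the example.

```python
SEARCH_FIELDS = {"acc", "des", "id", "key", "org", "sv"}


def field_usa(source, field, word, is_file=False):
    """Build a USA that searches one Search Field of a database or file."""
    field = field.lower()
    if field not in SEARCH_FIELDS:
        raise ValueError(f"unknown search field: {field}")
    if is_file:
        return f"{source}:{field}:{word}"  # files use ':' before the field
    return f"{source}-{field}:{word}"      # databases use '-' before the field


print(field_usa("embl", "des", "fau"))                        # embl-des:fau
print(field_usa("myclones.seq", "des", "fau", is_file=True))  # myclones.seq:des:fau
```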

ACC and ID

Using 'database-acc:number' or 'file:acc:number' is a way of telling EMBOSS that it need not try to search for the entry by testing both the ID name field and the Accession number field; it only needs to find the entry by Accession number.

If the 'acc' and 'id' search fields are omitted, then EMBOSS will search by testing for both accession and id fields. Specifying the 'acc' and 'id' search fields will make accessing the sequences slightly faster, but they are not required.

It is not necessary for you to use this syntax for 'id' and 'acc' specification, but EMBOSS programs report USAs in this style, so do not get alarmed when you see it.

ORG, KEY and DES

The ORG fields contain the full organism classification names.

The KEY field contains the words and phrases that classify the entry by form and function, as specified by the database curators.

The DES field is the brief one-line description of the sequence entry. (This is the title line in simple sequence formats, such as FASTA format).

Searches in these fields are by word. For example 'embl-des:fau'. If you wish to search for part of a word, use an asterisk to indicate a wildcard. For example: 'embl-des:h*emoglobin'

The definition of a 'word' in KEY and ORG searches is anything that matches the complete region between the semicolons ';' delimiting the sections of these fields. This includes spaces.

So 'embl-key:"fau gene"' would match the entry 'HSFAU' displayed above, as would 'embl-key:fau*', but 'embl-key:fau' would not match it. Similarly, 'embl-org:"homo sapiens (human)"' and 'embl-org:*human*' and 'embl-org:hominidae' would match this entry, but 'embl-org:human' would not match it, as the 'word' that contains "human" is "Homo sapiens (human)". The search 'embl-org:homo' would match, as the word "Homo" occurs in its own field at the end of the second 'OC' line.
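
The KEY and ORG examples for the 'HSFAU' entry can be reproduced with a Python sketch. It is an illustration of the 'word' rule only, not the EMBOSS indexing code. The function names are made up, the OS and OC lines are joined into one string by hand, and dropping the final '.' of a field is an assumption.

```python
from fnmatch import fnmatchcase


def field_words(text):
    """Split a KEY or ORG field into lower-case 'words' at each ';'."""
    words = []
    for part in text.split(";"):
        word = part.strip().rstrip(".").strip()  # assumption: drop the final '.'
        if word:
            words.append(word.lower())
    return words


def field_matches(query, text):
    """True if any whole 'word' matches the (possibly wildcarded) query."""
    pattern = query.lower()
    return any(fnmatchcase(word, pattern) for word in field_words(text))


KW = "fau gene."
ORG = ("Homo sapiens (human); "
       "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; "
       "Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.")

for query in ("fau gene", "fau*", "fau"):
    print(query, field_matches(query, KW))
for query in ("*human*", "hominidae", "homo", "human"):
    print(query, field_matches(query, ORG))
```

Only 'fau' and 'human' report False, because neither is a complete 'word' between semicolons.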

The definition of a 'word' is much more intuitive in DES searches - a 'word' is bounded by spaces and other non-alphanumeric characters. Words start with a letter or number, and end with a letter or number. SRS typically does the same, but allows a single quote at the end. This catches words such as 3' and 5' but is a problem with some quoted text.

So in the entry 'HSFAU' displayed above, 'embl-des:fau' and 'embl-des:sapiens' match. ("H.sapiens" is not a word - it is split into the words 'H' and 'sapiens' because the dot '.' is not an alphanumeric character.)

Phrases don't work for the DES field; it is word based, so the search 'embl-des:"fau mRNA"' will fail.

The searches are case-insensitive. 'Human' is the same as 'human'.
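
The DES word rule can be sketched in Python in the same spirit. This is an illustration under the stated definition of a 'word' (a run of letters and digits), not EMBOSS code, and the function names are assumptions.

```python
import re
from fnmatch import fnmatchcase


def des_words(description):
    """Split a description into lower-case alphanumeric words."""
    return [w.lower() for w in re.findall(r"[A-Za-z0-9]+", description)]


def des_matches(query, description):
    """True if any single word matches the (possibly wildcarded) query."""
    pattern = query.lower()
    return any(fnmatchcase(word, pattern) for word in des_words(description))


de = "H.sapiens fau mRNA"
print(des_words(de))                # ['h', 'sapiens', 'fau', 'mrna']
print(des_matches("fau", de))       # True
print(des_matches("Sapiens", de))   # True
print(des_matches("fau mRNA", de))  # False
```

The phrase search fails because no single word contains a space.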

The Organism name, Keyword and Description line fields in a sequence entry contain words which are almost certainly not unique. Searching for an organism word like 'homo' will find many matches, all of which will be returned.

SV

Sequence Versions are formed from the accession number, followed by a '.' and then the version number of the sequence (e.g. 'X65923.1'). This makes it possible to find the current version of any sequence and to guess the SV of all previous versions.

Both Sequence Version identifiers and GI Numbers (see below) share the 'sv' field in USAs.

A sequence may be unambiguously identified by the Sequence Version, for example: 'embl-sv:X65923.1'.

Be careful: that may be a false sense of security! In February 1999, everything in DDBJ/EMBL/GenBank got version 1, even if it was the 1st or 10th version for a given sequence.

For example, AC000003 shows version 1:

ID   HSAC00003  standard; DNA; HUM; 122228 BP.
XX
AC   AC000003;
XX
SV   AC000003.1
XX
DT   01-OCT-1996 (Rel. 49, Created)
DT   07-MAR-2000 (Rel. 63, Last updated, Version 6)

but is really the third sequence version (3rd gi) for that record.
See: http://www.ncbi.nlm.nih.gov:80/entrez/sutils/girevhist.cgi?val=AC000003

...and of course, the Version on the DT line has nothing to do with the sequence version (SV or VERSION lines) -- that would be too simple!

But, if after Feb 1999 the author had updated the sequence of AC000003, then that new one would be version 2 (AC000003.2) and it is a *lot* easier for a human to track sequence version changes when you see the incremental increase -- but just because you are looking at SV X00001.1 it doesn't mean you have the first version the databases (DDBJ/EMBL/GenBank) have ever seen.

GI Number

GI numbers are assigned to entries in GenBank and other sequence databases originating from the NCBI. They are an integer key for identifying the entry version.

For example:

VERSION     AF181452.1  GI:6017929
            ^^^^^^^^^^  ^^^^^^^^^^
            Compound    NCBI GI
            Accession   Identifier
            Number

The NCBI GI identifier of the VERSION line serves as a method for identifying the sequence data that has existed for a database entry over time. GI identifiers are numeric values of one or more digits. Since they are integer keys, they are less human-friendly than the Accession.Version system described above. If the sequence changes, a new integer GI will be assigned.

Why are both these methods for identifying the version of the sequence associated with a database entry in use? For two reasons:

Some data sources processed by NCBI for incorporation into its Entrez sequence retrieval system do not version their own sequences.
GIs provide a uniform, integer identifier system for every sequence NCBI has processed. Some products and systems derived from (or reliant upon) NCBI products and services prefer to use these integer identifiers because they can all be processed in the same manner.

Both Sequence Version identifiers (see above) and GI Numbers share the 'sv' field in USAs.

A sequence may be unambiguously identified by the GI Number, for example: 'genbank-sv:6017929'.
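
Because the 'sv' field carries two kinds of value, a script may want to tell them apart. The following Python sketch does so by shape alone: all digits for a GI Number, 'Accession.Version' otherwise. The function name is an assumption, and the text does not say how EMBOSS itself distinguishes the two.

```python
def classify_sv(value):
    """Classify an 'sv' search value as a GI number or Accession.Version."""
    if value.isdecimal():
        return ("gi", int(value))
    accession, dot, version = value.rpartition(".")
    if dot and accession and version.isdecimal():
        return ("seqversion", accession.upper(), int(version))
    raise ValueError(f"not a GI number or Accession.Version: {value!r}")


print(classify_sv("6017929"))   # ('gi', 6017929)
print(classify_sv("X65923.1"))  # ('seqversion', 'X65923', 1)
```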

Start, End, Reverse

Any USA can take a specification of the start and end of the sequence formed from '[start:end]' at the end of the USA, for example 'myfile.fasta[20:45]' - the sequences in the file 'myfile.fasta' starting at 20 and ending at position 45.

If the 'start' or 'end' position is given as a negative number, then the position is counted from the end of the sequence. For example: 'myfile.fasta[-10:-1]' is the last 10 bases.

If '[start:end:r]' is given at the end of the USA, then nucleotide sequences are reverse-complemented. For example: 'myfile.fasta[1:-1:r]' is the whole sequence reverse-complemented.
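
The position arithmetic can be sketched in Python. This is an illustration of the '[start:end:r]' rules, not EMBOSS code. The function name is an assumption, only the bases a, c, g and t are complemented (other characters are left unchanged), and a position of 0 is not handled.

```python
_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def subsequence(seq, start=1, end=-1, reverse=False):
    """Apply a '[start:end:r]' style range to a sequence string.

    Positions are 1-based and inclusive; negative positions count back
    from the end, so -1 is the last residue.
    """
    n = len(seq)
    first = start if start > 0 else n + start + 1
    last = end if end > 0 else n + end + 1
    piece = seq[max(first, 1) - 1:last]
    if reverse:
        piece = piece.translate(_COMPLEMENT)[::-1]  # reverse-complement
    return piece


seq = "AACCGGTTAACCGGTT"
print(subsequence(seq, 3, 6))                  # CCGG
print(subsequence(seq, -10, -1))               # TTAACCGGTT (the last 10 bases)
print(subsequence("AACG", 1, -1, reverse=True))  # CGTT
```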

The Full USA syntax

You can use upper or lower case to specify a format, database, accession number, ID name, DES word, ORG word, KEY word or Sequence Version. You must specify a file name in the correct case, of course.

The full syntax of the possible USAs is:
Mandatory parts of the USAs are given in bold text.


'asis' :: Sequence [start : end : reverse]

or

Format :: '@' ListFile [start : end : reverse]

or

Format :: 'list' : ListFile [start : end : reverse]

or

Format :: Database : Entry [start : end : reverse]

or

Format :: Database - SearchField : Word [start : end : reverse]

or

Format :: File : Entry [start : end : reverse]

or

Format :: File : SearchField : Word [start : end : reverse]

or

Format :: Program Program-parameters '|' [start : end : reverse]

Sequence

An explicit sequence, for example: 'atgctgacgatgcg' or 'TPRPGKNTEARLNCF'.

It can be in upper or lower case.

Format

This is one of the valid sequence formats. The sequence format may usually be omitted when reading in a sequence; EMBOSS will try all known sequence formats until it can read the sequence.

ListFile

A file of USAs. One USA per line. Either '@' or 'list:' are required before the List File name to indicate that it is a List File. List Files may be nested.

If the '[start : end : reverse]' USA subsequence specifier is used or '-sbegin' or '-send' or other command-line qualifiers which affect the input sequence are given, then these affect all USAs given in the List File (unless these USAs have their own '[start : end : reverse]' USA subsequence specifier).

Database

This must be a valid database name. If the name is not a valid database, a file with the same name is looked for.

Database names may have Search Field names appended to them (for example 'embl-des', 'embl-id').

File

This is a filename. The filename may be wildcarded.

Entry

This specifies the ID name or Accession number of one or more sequences in a database or file to read in. If it is omitted, then all the entries in the Database or File will be read.

The Entry may be wildcarded, so 'hs*' will match all ID names starting with 'hs'.

'*' indicates that all entries in the Database or File will be read.

There may be restrictions on certain databases preventing access to a single entry, wildcarded entries or reading in all entries. This is a consequence of the way some databases are accessed.

A Database or File location must be given as part of a USA that has an Entry; you cannot give an Entry name on its own. (i.e. you cannot give just an Accession number or ID name and expect EMBOSS to deduce which database they might refer to. You cannot give the ID name of a sequence in a file and expect EMBOSS to deduce that this is the ID name of a sequence in a file and that all files should be tested to see if they contain it.)

SearchField

This is the name of one of the available search fields:

Name	Searches for
acc	Accession number
des	Description
id	ID name
key	Keyword
org	Organism Name
sv	Sequence Version/GI Number

Word

This is the 'word' to search for in the Search Field.

Words may be wildcarded.

Words in ORG and KEY fields may contain spaces because the complete key-phrase or organism classification level is indexed as one 'word'.

Words in the DES field contain only alphanumeric characters and thus end at spaces or other non-alphanumeric characters.

Words in ID and ACC fields are the same thing as 'Entries' above.

Program and Program-parameters

This specifies any name of a program on the current path together with any parameters it might take in order to specify one or more entries.

[start : end : reverse]

Any USA may optionally take this subsequence specifier after the main body of the USA, either in the form '[start : end]' or in the form '[start : end : r]', where 'start' and 'end' are the required start and end positions. Negative positions count from the end of the sequence.

Use of this USA subsequence specifier is equivalent to using the '-sbegin' or '-send' or '-sreverse' command-line qualifiers.

Example USAs

The following are valid USAs for sequences.

file
file:entry
file:searchfield:word
dbname
dbname:entry
dbname-searchfield:word
@listfile
list::listfile
asis::sequence

Each of the above can have '[start : end]' or '[start : end : r]' appended to them.

The 'file' and 'dbname' forms of USA can have 'format::' in front of them (although a database knows which format it is in, so this is redundant and error-prone).
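
Separating an optional 'format::' prefix from the rest of a USA can be sketched in Python as follows. This is a simplified illustration with an assumed function name; it treats anything before the first '::' as the format, so the special prefixes 'asis' and 'list' would be returned in the same way.

```python
def split_format(usa):
    """Split an optional 'format::' prefix from the rest of a USA."""
    prefix, sep, rest = usa.partition("::")
    if sep:
        return prefix.lower(), rest  # format names are not case-sensitive
    return None, usa


print(split_format("fasta::xxx.seq"))  # ('fasta', 'xxx.seq')
print(split_format("xxx.seq"))         # (None, 'xxx.seq')
print(split_format("gcg::stdin"))      # ('gcg', 'stdin')
```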

Type	Example	Description
filename	xxx.seq	A sequence file "xxx.seq" in any format
format::filename	fasta::xxx.seq	A sequence file "xxx.seq" in fasta format
db:IDname	embl:paamir	EMBL entry PAAMIR, using whatever access method is defined locally for the EMBL database
db:AccessionNumber	embl:X13776	EMBL entry X13776, using whatever access method is defined locally for the EMBL database and searching by accession number and entry name (X13776 is the accession number in this case)
db-acc:AccessionNumber	embl-acc:X13776	EMBL entry X13776, using whatever access method is defined locally for the EMBL database and searching by accession number only
db-id:IDname	embl-id:paamir	EMBL entry PAAMIR, using whatever access method is defined locally for the EMBL database, and searching by ID only
db-searchfield:word	embl-des:lectin	EMBL entries containing the word 'lectin' in the Description line
db-searchfield:wildcard-word	embl-org:*human*	EMBL entries containing the wildcarded word 'human' in the Organism fields
db:wildcard-ID	embl:paami*	EMBL entries PAAMIB, PAAMIE and so on, usually in alphabetical order, using whatever access method is defined locally for the EMBL database
db or db:*	embl or EMBL:*	All sequences in the EMBL database
@listfile	@mylist	Reads file mylist and uses each line as a separate USA. List files can contain references to other lists files or any other standard USA.
list:listfile	list:mylist	Same as "@mylist" above
'program parameters |'	'getz -e [embl-id:paamir] |'	The pipe character "|" causes EMBOSS to fire up getz (the SRS sequence retrieval program) to extract entry PAAMIR from EMBL in EMBL format. Any application or script which writes one or more sequences to stdout can be used in this way.
asis::sequence	asis::atacgcagttatctgaccat	So far the shortest USA we could invent. In 'asis' format the name is the sequence so no file needs to be opened. This is a special case. It was intended as a joke, but could be quite useful for generating command lines.

Sequence Input

The USA of an input sequence specification can be any valid USA.

The file 'stdin'

There is a 'magic' filename that you can give whenever an input filename is requested. It is 'stdin'. If you enter this name, then the resulting sequence will not come from a file called 'stdin', it will be read from the keyboard. This is only useful when you wish to type the sequence immediately, or are 'piping' the results from a previous program into the current program.

You can specify the format to read in by 'format::stdin', for example: 'gcg::stdin'.

Sequence Input Command-line qualifiers

The following command-line specifiers can refer to one input sequence parameter or to all.

If they are at the start of the command-line before any sequence input parameters, then they refer to all parameters. If they occur after a sequence input parameter, then they refer to that parameter. If they have a number after them, e.g. '-sbegin3', then they refer to the parameter whose ordinal number in the list of parameters is that number.

Use of '-sformat' is exactly equivalent to using 'format::' in a USA.

An input database entry could be specified using one or more of '-sdbname' and '-sid'.

An input file entry could be specified using one or more of '-sopenfile' and '-sid'.

EMBOSS guesses whether a sequence is Nucleic or Protein by the proportion of characters in the sequence. It is usually correct, but might make a mistake with an unusually ambiguous Nucleic sequence.

You can force EMBOSS to accept that the sequence is Nucleic or Protein using '-snucleotide' or '-sprotein'.

The command-line qualifiers can be abbreviated as long as the abbreviation is unique, e.g. '-sb'.

  -sbegin             integer    first base used
  -send               integer    last base used, def=seq length
  -sreverse           bool       reverse (if DNA)
  -sask               bool       ask for begin/end/reverse
  -snucleotide        bool       sequence is nucleotide
  -sprotein           bool       sequence is protein
  -slower             bool       make lower case
  -supper             bool       make upper case
  -sformat            string     input sequence format
  -sopenfile          string     input filename
  -sdbname            string     database name
  -sid                string     entryname

Sequence Output

The USA of an output sequence specification is simple. As programs can only write to files, the specification is either:

filename

format::filename

The latter causes the sequence to be written in the specified format.

N.B. UNIX filenames CAN contain space characters and punctuation characters, but you will rapidly get into trouble if you specify names containing them. Stick to alphanumeric characters, '-' and '.'.

The file 'stdout'

There is a 'magic' filename that you can give whenever an output filename is requested. It is 'stdout'. If you enter this name, then the resulting sequence will not go to a file called 'stdout', it will be printed on the screen. This is useful when you wish to know the results immediately, or are testing various ways of running a program and wish to quickly see the results.

You can specify the format in which to write to the screen with 'format::stdout', for example: 'gcg::stdout'.

Multiple sequence output

One or more sequences can usually be written to the same file. There are some restrictions: some formats, such as 'gcg', 'plain', 'raw' and 'staden', have no indication of where one sequence ends and the next sequence starts. They cannot, therefore, be used when writing out several sequences to the same file. (Strictly speaking, you can specify this, but you just get a single concatenated resulting sequence.)

The command-line qualifier '-ossingle' may be useful - it allows you to write out several sequences, but it writes out each sequence to a separate file. The name of the file is constructed from the ID name of the sequence being written and the extension is the format, so the sequence with the ID name 'IXI_567' being written in 'gcg' format would be written to the file 'IXI_567.gcg'.

Sequence Output Command-line qualifiers

Use of '-osformat' is exactly equivalent to using 'format::' in a USA.

The output filename could be specified using one or more of '-osextension' and '-osname'.

'-osdbname' specifies a database name and ':' to be prepended to the entry names in the output.

  -osformat           string     output seq format
  -osextension        string     file name extension
  -osname             string     base file name
  -osdbname           string     database name to add
  -ossingle           bool       separate file for each entry
