Sequence Databases

 


The emboss.default file

The Uniform Sequence Address syntax includes a database name which the AJAX library uses to retrieve single sequences, to search for a subset of sequences, or to process an entire database.

Sequence databases can be in a variety of formats and can be accessed by a variety of methods. These are defined in control files, principally the site-wide file 'emboss.default', which must be saved in the emboss/ directory under the main distribution.

In addition to reading database definitions from the 'emboss.default' file, EMBOSS also reads definitions from the file '.embossrc' in your personal home directory. You can test database definitions in your own '~/.embossrc' file before adding them to the site-wide 'emboss.default' file.
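
As an illustrative sketch only (the database name 'mytest', the directory and the file pattern are hypothetical and not part of the distribution), a personal '~/.embossrc' might hold a single trial definition such as:

DB mytest [
  type: P
  format: fasta
  methodall: direct
  dir: /home/username/testdb
  file: *.fasta
  comment: "Personal test database, not yet in emboss.default"
]

Once a definition like this behaves as expected, the same lines can be moved into the site-wide file.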

The first thing each EMBOSS program does when it starts running is to read the 'emboss.default' file (and then the '~/.embossrc' file, if it exists). This means that any changes to these definition files take effect as soon as they are made.

The EMBOSS distribution includes a sample set of small databases called tsw, tembl, and so on, defined in the 'emboss.default' file. (This is distributed as the file 'emboss.default.template', which has to be renamed 'emboss.default' before EMBOSS can use it.)

The example file

Part of the 'emboss.default.template' file is shown below:


#SET emboss_tempdata path_to_directory_$EMBOSS/test

# Logfile - set this to a file that any user can append to
# and EMBOSS applications will automatically write log information

#SET emboss_logfile /packages/emboss/emboss/log
 
# swissprot (Puffer fish entries)
# =========

DB tsw [ 
	type: P 
	dir: $emboss_tempdata/swiss
	method: emblcd 
	format: swiss 
	release: 36
	fields: "sv des org key"
	comment: "Swissprot native format indexed by dbiflat" 
]

# swnew (Puffer fish entries)
# =====

DB tswnew [ 
	type: P 
	dir: $emboss_tempdata/swnew
	method: emblcd 
	format: swiss 
	release: 37
	fields: "sv des org key"
	comment: "Swissnew native format indexed by dbiflat" 
]

# wormpep (cosmid ZK637)
# =======

DB twp [ 
	type: P 
	dir: $emboss_tempdata/wormpep
	method: emblcd 
	format: fasta 
	release: 16
	fields: "des"
	comment: "Wormpep Fasta format file indexed by dbifasta" 
]

# embl (worm cosmid ZK637 and a few other entries)
# ====

DB tembl [ 
	type: N 
	dir: $emboss_tempdata/embl
	method: emblcd 
	format: embl 
	release: 57
	fields: "sv des org key"
	comment: "EMBL native format indexed by dbiflat" 
]

# pir (cytochrome C plus first entries in other divisions)
# ===

DB tpir [ 
	type: P 
	dir: $emboss_tempdata/pir
	method: gcg
	file: pir*.seq
	format: nbrf
	fields: "des org key"
	comment: "PIR in 4 files in GCG format indexed by dbigcg" 
]

# genbank (Remote access to a SRS server)
# =======

DB tgb [ 
	type: N 
	method: srswww 
	format: genbank
	url: "http://www.cbr.nrc.ca/srs6bin/cgi-bin/wgetz"
	dbalias: genbank
	fields: "sv des org key"
	comment: "Genbank from a remote SRS server" 
]

# genbank (the first few entries from several sub-section files)
# =======

DB tgenbank [ 
	type: N 
	dir: $emboss_tempdata/genbank
	method: emblcd 
	format: genbank 
	release: 01
	fields: "sv des org key"
	comment: "GenBank native format indexed by dbiflat" 
]

Sites should leave these in the file for testing purposes, and add as many of their own databases as they need.

Databases are usually local, and need to have the full database files plus some indexing or query method to extract single entries or to query by ID or (in most cases) accession number.

'emboss.default' file Syntax

The syntax for specifying a database in the 'emboss.default' file is as follows:

Blank lines and Comments

Blank lines are ignored.

Comments start with a '#' character in the first position of a line.

For example:
# this is a comment

Variable Definitions

Variables may be set with the keyword 'SETENV' (usually shortened to 'SET' or 'ENV'; these are equivalent), followed by the variable name, then the value to set it to.

For example:
SET dbdir /data/sequencedbs

This variable may now be used in the rest of the file 'emboss.default' by preceding it with a '$'.

For example:
file: $dbdir

The name of the variable is case-insensitive when used within the file 'emboss.default'.
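
A minimal sketch of a variable being defined once and then reused inside a database definition follows. The database name 'mygenomes' and both paths are assumptions chosen for illustration:

SET dbdir /data/sequencedbs

DB mygenomes [
  type: N
  format: embl
  method: emblcd
  dir: $dbdir/genomes
  indexdir: $dbdir/genomes/index
]

Changing the single 'SET' line is then enough to relocate every definition that refers to '$dbdir'.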

Global Variables

Because the file 'emboss.default' is the first thing that an EMBOSS program reads in when it starts, it can contain not only the database definitions, but also useful global settings that influence the behaviour of all EMBOSS programs.

When variables are set as these Global variables, they must be given UPPERCASE names.

The Global variables can be set in the UNIX session as well as in the file 'emboss.default', by defining an 'environment variable' with the command 'setenv NAME value', where 'NAME' is the name of the variable and 'value' is the value you wish to set it to.

Some of the EMBOSS Global Variables are Boolean - they can only be turned on by setting them to '1', or "Y". (They are off by default.) Others set the location of various files or directories or specify the default value of things.

There should be no need to set any of these to change the default behaviour of EMBOSS, but you may wish to set some to customise your copy of EMBOSS.

WARNING Some of these will make EMBOSS unusable! For example:
SET EMBOSS_HELP 1
will make all EMBOSS programs only display their help when they are run.
We don't know what use people will make of features like this, but we are sure that if we didn't allow it, someone would request it :-)

Global Variable    Description
EMBOSS_ACDROOT The root ACD directory. EMBOSS should find this automatically.
EMBOSS_ACDPROMPTS The number of prompts for a value before failure. The default is 2.
EMBOSS_DATA The data directory. EMBOSS should find this automatically.
EMBOSS_PROXY Sets the default proxy server. SET EMBOSS_PROXY "proxy.mydomain.com:8888" applies to all HTTP access. If a database uses an internal server, you can turn off the proxy routing with the database attribute 'proxy: ":"' (see below) to allow it to go directly to the internal server.
EMBOSS_LOGFILE Specify the log-file path and name. If this is not specified use of the programs will not be logged.
EMBOSS_FORMAT Specify the expected input sequence format. The default is to test all formats in turn (except for 'plain') until one is read successfully. If this variable is set to a value then ONLY the specified sequence format will be expected and no tests for any other formats will be done, although you can always still specify the format in the USA as: 'fasta::filename'
EMBOSS_OUTFORMAT Change the default output sequence format to be other than 'fasta'
EMBOSS_OUTFEATFORMAT Change the default output feature format to be other than 'GFF'
EMBOSS_OUTDIRECTORY All EMBOSS output files now have a default output directory (required by some web services implementations that run in the 'wrong' default directory). If this variable is set to the name of a directory then it becomes the default output directory for outfile, align, report, graph, sequence and feature output. The output directory can also be set from the command line (or as an ACD attribute) using the associated qualifier: -odirectory (outfile), -rdirectory (report), -adirectory (align), -gdirectory (graph and graphxy), -osdirectory (sequence) or -ofdirectory (featout).
EMBOSS_GRAPHICS Set name of the default output graphics device.
EMBOSS_AUTO If this is set TRUE, all programs act as if they have '-auto' set on the command-line. They will not display their one-line description, they use default qualifier values and, if required values are missing, they fail instead of prompting.
EMBOSS_DEBUG If this is set TRUE, all programs act as if they have '-debug' set on the command-line. They create a 'programname.dbg' file of debugging information.
EMBOSS_STDOUT If this is set TRUE, all programs act as if they have '-stdout' set on the command-line. They write all output to 'stdout' (the screen) instead of prompting for output file names.
EMBOSS_FILTER If this is set TRUE, all programs act as if they have '-filter' set on the command-line. They act as if '-stdout' and '-auto' are set and they read input files from 'stdin' (the keyboard).
EMBOSS_WARNING If this is set TRUE, all programs act as if they have '-warning' set on the command-line. They report warnings.
EMBOSS_ERROR If this is set TRUE, all programs act as if they have '-error' set on the command-line. They report errors.
EMBOSS_FATAL If this is set TRUE, all programs act as if they have '-fatal' set on the command-line. They report fatal errors.
EMBOSS_DIE If this is set TRUE, all programs act as if they have '-die' set on the command-line. They report deaths.
EMBOSS_HELP If this is set TRUE, all programs act as if they have '-help' set on the command-line. They will only display some help on the program.
EMBOSS_ACDPRETTY If this is set TRUE, all programs act as if they have '-acdpretty' set on the command-line. They will only print out a file 'programname.acdpretty' with the ACD information for the program.
EMBOSS_ACDLOG If this is set TRUE, all programs act as if they have '-acdlog' set on the command-line. They will only print out a file 'programname.acdlog' with the ACD processing log for the program.
EMBOSS_ACDTABLE If this is set TRUE, all programs act as if they have '-acdtable' set on the command-line. They will only print out a file 'programname.acdtable' with the HTML table of options for the program.
EMBOSS_NAMDEBUG (for programmers only) If this is set TRUE, it turns on logging of the processing of the emboss.default and .embossrc files in ajnam.c (this happens before the -debug command line switch is processed).
EMBOSS_NAMVALID (for programmers only) If this is set TRUE, it turns on additional validation of DBNAME definitions. As this validation is done by the application 'showdb', there is normally no need to set this variable. The validation adds an overhead to every database definition.
EMBOSS_DOCROOT If this is set to the name of a directory, then tfm will look in that directory for the applications' documentation.
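
For illustration, a site might place a few of these settings near the top of 'emboss.default'. The values shown are assumptions (the log file path in particular is hypothetical), not recommended defaults:

# log every program run to a file that all users can append to
SET EMBOSS_LOGFILE /var/log/emboss/emboss.log
# write sequences in EMBL format rather than the default 'fasta'
SET EMBOSS_OUTFORMAT embl
# allow three prompts for a value before failing (the default is 2)
SET EMBOSS_ACDPROMPTS 3
# route all HTTP access through a proxy
SET EMBOSS_PROXY "proxy.mydomain.com:8888"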

INCLUDE

This command allows you to include a subsidiary file as part of the text of the main 'emboss.default' file at the position of the 'INCLUDE' command.

For example, to include the contents of the file "project_databases.def":
INCLUDE "project_databases.def"

Database

An EMBOSS sequence database is a collection of sequence entries available in a known sequence format. An EMBOSS sequence database is not a restriction enzyme database or a set of protein 3D structure files.

EMBOSS should be able to extract entries from the database in one of several ways. Different methods may be used depending on whether a single entry, a wild-card specified set of entries, or all of the database entries are required.

The database may be in a remote server on the Internet, or it may be in a set of files on the local machine. If it is held locally, it may be a simple multiple sequence file, or it may be an indexed file.

It may be an SRS server, either local or on the Internet. This server may hold sequence databases whose format is unknown to EMBOSS, in which case you can specify that the format must first be converted to 'fasta' format when serving the files to EMBOSS.

If the database is held as indexed, local files, it may be a native database-format (EMBL, Swissprot, etc.) file as distributed by the EBI, or NCBI. It may be a file formatted for GCG. It may be a set of Blast database indexes.

If the database is on an SRS server or uses local indexed files, then queries may be made not only by ID name and Accession number but also (depending on the way it has been indexed) by Description line words, Sequence Versions (or GI numbers), Keywords or Organism names.

It may be something like a Sybase or Oracle relational database, accessed by using a locally written program.
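
As a hedged sketch of that last case, a locally written retrieval program could be attached with the 'app' access method and the 'app:' attribute (both described under Attributes and Access methods). The database name 'localrdb' and the program '/usr/local/bin/fetchseq' are hypothetical; the program is assumed to print one entry in fasta format for the ID or Accession number substituted for '%s':

DB localrdb [
  type: P
  format: fasta
  methodentry: app
  app: "/usr/local/bin/fetchseq %s"
  comment: "Relational database read through a local script"
]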

Database Specification Syntax

A database is specified by 'DBNAME' (usually shortened to 'DB'), then the database name, followed by the 'key: value' attributes that specify that database inside a pair of square brackets.

For example the database 'genome' (without the attributes) is:

DB genome [
	key: value	
	key: value
	key: value
	key: value
	key: value
]

Attributes

The 'key: value' pairs in a 'DB' structure can be specified either on separate lines or separated by spaces on the same line.

If the 'value' part of the attribute contains spaces then it should be quoted, to prevent it being prematurely terminated at the first space.

e.g. key: "value with many words in"
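
As a sketch of the single-line form, the 'tsw' definition from the example file could equally be written as below. Only the 'fields:' and 'comment:' values contain spaces, so only they need quotes:

DB tsw [ type: P dir: $emboss_tempdata/swiss method: emblcd format: swiss release: 36 fields: "sv des org key" comment: "Swissprot native format indexed by dbiflat" ]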

Each database must have attributes that specify what it is and how to access it.

This information is given as a set of pairs of 'key:' and 'value' attributes. These attributes are held in the 'DB' definition structure (see above).

The minimum set of attribute keys is 'method:' and 'format:' - these two are mandatory. It is also normal (but not mandatory) to specify the 'type:' attribute.

Some forms of 'method:' require subsidiary attributes giving further information on how to access the data. The following table is the set of available attributes.

Key    Value    Description
format
formatentry
formatquery
formatall
a valid sequence format name
The 'format:' attribute tells EMBOSS what sequence format to expect when reading entries from the database.
This attribute is mandatory.
If you need to specify different formats for any of the different access methods (see below), then you may use the variants of 'format:' with the suffix 'entry', 'query' or 'all'.

e.g. format: ncbi

type
'N' or 'P'
This specifies whether the database is nucleic or protein.
Although it is not strictly required, it is normal to specify the type of the database as this will normally be known. If it is necessary to not specify the type then this will be determined by the EMBOSS applications when they read sequences in. (You will get error messages when you run 'showdb' as this doesn't read in sequences.)
The value 'N' specifies a nucleic database, 'P' specifies a protein database.

e.g. type: N

fields
One or more of: sv, des, org, key
This specifies which search fields have been indexed and are available for searching.
It is assumed that Accession number and ID name are always available when a database is set up. The way you have set up the database may also allow access by one or more of these values:
sv - Sequence Version or GI Number
des - Description line word
org - Organism's taxonomic classification
key - Keywords
The access methods 'srs', 'srsfasta' and 'srswww' allow access to these search fields. The methods 'emblcd' and 'gcg' may or may not have some or all of these fields indexed, depending on the parameters given to the programs 'dbiflat' and 'dbigcg'. The programs 'dbiblast' and 'dbifasta' only allow you to select any of 'sv', 'des' and 'acc' (the default).
See the USA documentation for details of using these.

e.g. fields: "sv des org key"

directory
any valid directory path
This specifies the directory to look in to find files that have been specified with the 'file:' attribute.
It also specifies which directory to look in to find indexes and files produced by the dbi* programs.
It is only required with the access methods 'direct', 'gcg', 'emblcd' and 'blast' (see 'Access methods' below).
It is common to use variables (defined using 'SET') to specify part or all of the path.
The attribute key 'directory:' is commonly abbreviated to 'dir:'.

e.g. directory: $dbdir/genomes

filename
A file name (may be wildcarded)
This specifies the sequence file(s) to read in when accessing the database.
It is only required with the access method 'direct' (see 'Access methods' below).
It may also be used with the access methods 'gcg', 'emblcd' and 'blast' to indicate which files should be included back in after using the 'exclude:' attribute to specify which indexed files should be ignored. (See 'exclude:' below.)
The files may be wild-carded using '*'.
The attribute key 'filename:' is commonly abbreviated to 'file:'.

e.g. file: pir*.seq

exclude
A file name (may be wildcarded)
This is used to exclude a subset of files from consideration.
To exclude certain files, specify "exclude: *file*".
This is used in conjunction with 'file:' to specify a subset of files in a directory.
'exclude:' is checked first, then the rest of the files are included with 'file:'.

The files searched are therefore:
- the files in the directory specified by 'dir:'
- but not the 'exclude:' files (if any)
- but include back the 'file:' files (if any)

e.g. exclude: mouse.*

If you have indexed all of the files in the EMBL database, then you can specify subsets using the same set of files and indices as:

DB embl [
  type: N
  format: embl
  method: emblcd
  dir: /data/embl
  comment: "All of EMBL"
]

DB emblminus [
  type: N
  format: embl
  method: emblcd
  dir: /data/embl
  exclude: est*.dat
  comment: "EMBL without the ESTs"
]

DB emblhumest [
  type: N
  format: embl
  method: emblcd
  dir: /data/embl
  exclude: *.dat
  file: est_hum*.dat
  comment: "EMBL human ESTs"
]

DB human [
  type: N
  format: embl
  method: emblcd
  dir: /data/embl
  exclude: *.dat
  file: hum*.dat
  comment: "EMBL human"
]

indexdirectory
any valid directory path
This specifies the directory to look in to find the index files (produced by the dbi* programs) if this is different to the directory specified by 'directory:'.
It is sensible to hold the indices in a different directory to the one holding the sequence database files when you have many sequence databases in the same directory, because the indices for every database all have the same names (acnum.hit, acnum.trg, division.lkp, etc.) and these would be over-written if you indexed several databases in the same directory. In this case, you should create the indices in a different directory (for example, a subdirectory) for each database. That way the index files will not become confused. These index directories can now be specified using the attribute 'indexdirectory:', while the directory containing the sequence data files can still be specified using 'dir:'.
It is only used with the access methods 'gcg', 'emblcd' and 'blast' (see 'Access methods' below).
It is common to use variables (defined using 'SET') to specify part or all of the path.
The attribute key 'indexdirectory:' is commonly abbreviated to 'indexdir:'.

e.g. indexdir: $dbdir/genomes/embl

url
Any valid URL
This specifies the URL (WWW address) to use when getting sequences from remote Web sites.
It is only required with the access methods 'srswww' and 'url' (see 'Access methods' below).
In method 'srswww' the SRS commands '-e+-ascii' are appended to the given URL (these extract the complete entry from SRS with no HTML formatting). The database name (or the name specified in a 'dbalias' attribute) and the entry name or Accession number (or Sequence version, GI number, Description, Organism, or Keyword) are then appended to create a functional SRS query line.

e.g. url: "http://www.cbr.nrc.ca/srs6bin/cgi-bin/wgetz"

In method 'url' the URL is expected to contain one or more instances of the character pair '%s' - each of these pairs is replaced by the value of the ID name when this database is accessed. Any HTML formatting will be stripped from the resulting web page.

e.g. url: "http://www.ebi.ac.uk/htbin/emblfetch?%s"

e.g. url: "http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=s&form=6&dopt=g&html=no&uid=%s"

The URL must begin with "http://" and have a lower case host address.

proxy
host:port
In the access methods 'srswww' and 'url', you can specify a proxy host and port to use when accessing the URL.

e.g. proxy: "proxy.mydomain.com:8888"

If there is a global variable EMBOSS_PROXY defined in the 'emboss.default' file (see Global Variables, above), then the attribute

proxy: ":"

will turn off proxy access for this database. This is useful if the database is on an internal server.

app
appentry
appquery
appall
Any script or program name
This specifies the name of an external (i.e. non-EMBOSS) program or script (application) that should be run to extract the sequence from the database.

This application can be in the user's path or have an explicit path provided.

The database and entry name will be appended to the application command as 'application dbname:entry'. Both ID and Accession number can be used to specify the entry.

Alternatively, if the app: attribute value contains the character pair '%s', it is replaced by the value of the ID name or Accession number when this database is accessed.

This attribute is only required with the access method 'app' (see 'Access methods' below).
If you need to specify different applications for any of the different access methods (see below), then you may use the variants of 'app:' with the suffix 'entry', 'query' or 'all'.

e.g. app: efetch

e.g. app: "getz [embl:%s]"

dbalias
The name of a database in SRS
This is used to specify the name of a database at an SRS site where the name differs from the name that EMBOSS is using.
It is only required with the access methods 'srswww', 'srsfasta' and 'srs' (see 'Access methods' below).

e.g. dbalias: emblnew

comment
Any text
This is a simple comment, to describe the database. It is displayed in 'showdb'.

e.g. comment: "This is my subset of refseq"

release
Any text
This is the release number or date. It is displayed in 'showdb'.
Note that unless you are zealous in updating 'release:' values, this will rapidly become out of sync with the actual data.
(I wouldn't use this attribute. - GWW)

The dbi* indexing programs ask for the 'database name', 'release number' and 'index date'. These are stored in the division.lkp file (one of the index files). This information is NOT available to EMBOSS programs. This information is not reported by showdb. They are part of the 'emblcd/staden' index file format, but EMBOSS does not use them. If other software uses the index files (the ACEDB efetch program, or maybe the Staden package) they may be used there.

e.g. release: "89.0 (Oct 2001)"

method
methodall
methodentry
methodquery
One of: srs, srsfasta, srswww, url, app, external, direct, emblcd, gcg, blast
This specifies the method used to access the database. (See next section).
This field is mandatory - there must be at least one form of the 'method' key specified. More than one different type of method key can be specified.

If 'method:' is specified, then this is the default method covering all forms of access ('query', 'entry' or 'all'). Specific methods for the 'query', 'entry' or 'all' forms of access (i.e. 'methodquery:', 'methodentry:' or 'methodall:') should be specified explicitly if you wish to have several ways of accessing the data.

e.g. method: emblcd

Access methods

There are 3 types of database access: 'entry' (retrieving a single entry), 'query' (searching for a set of entries) and 'all' (reading every entry in the database).

There are many available methods for accessing databases. Some of these support only a subset of the access types. For example, a web server is suitable for getting single database entries. It may allow you to do queries returning more than one entry. It will probably not allow you to pull across a complete database, and you would not want to anyway, as this would take a long time over the network.

Some types of access may therefore be unavailable for a given method. For example, a flat file database with no index is only useful for reading all entries ('all'), while a remote database may only provide single entries ('entry'). In this case you would wish to access the remote database for single entries and the local one for reading sequentially through all of the data.

You can specify an EMBOSS database that accesses many different data sources, depending on which type (entry, query, all) of access is required.

EMBOSS databases can be defined as all using the same access method, with the attribute method, or using up to 3 different methods with special suffixes methodentry, methodquery or methodall.

For example:


#EMBL with SRS index files and directly reading for all entries
#methodquery defines the method for both query and entry access
#methodall provides the method for reading all entries.

DB srsembl [ 
# the sequences are nucleic
	type: N 
# the sequence entries are in 'embl' format
	format: embl
# you can specify any description comment
	comment: 'EMBL using getz'
# you can specify which release this is
	release: "61"

# use the 'srs' method for both query and entry access
	methodquery: srs
# the database is called 'embl' in the local SRS server
	dbalias: embl 

# to sequentially read the whole database, use the 'direct' method
	methodall: direct 
# the database files are in this directory
	dir: /nfs/data/embl/  
# read these files
	file: *.dat
]
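
A similar sketch covers the remote case: single entries come from a web server while the whole database is read from local files. The URL is the emblfetch address used as an example under the 'url:' attribute; the database name 'webembl' and the local directory are assumptions:

DB webembl [
  type: N
  format: embl
  methodentry: url
  url: "http://www.ebi.ac.uk/htbin/emblfetch?%s"
  methodall: direct
  dir: /data/embl
  file: *.dat
  comment: "Remote single entries, local files for all entries"
]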

In addition, each access method needs to know something about the database. What is needed will be different for each method, although there is, of course, much overlap between them. This information is specified by using the 'key: value' attributes (see above).

The database definition attributes used depend on the access method and also on the query level.

For example, EMBL entries could be read by:

EMBOSS Database Access Methods
Method    Scope    Comments

EMBLCD    *    Uses an EMBLCD index from the programs 'dbiflat' (flatfiles - database native format files) or 'dbifasta' (fasta format files). This can cope with all levels of access. Queries use the index files, reading all entries uses the list of files in the division.lkp file and opens each in turn.

Supports queries by id, acc, sv, key, org and des. (NB not by 'key' and 'org' if the database was indexed by 'dbifasta' because there is no way to find these in the Fasta format description line.)

The directory containing the sequence files and indices to be read must be specified using the 'directory:' attribute.

If the indices are in a directory other than the one containing the sequence files, then the index directory can be explicitly set using the 'indexdirectory:' attribute.

The available fields should be specified using the 'fields:' attribute if more than just the default ID name and Accession number fields have been indexed.

A wildcard search for unique fields (id or sv), or any search for acc, des, org or key is type 'query' and returns a list of entries. A search for a single id or sv is of type 'entry' and will find the first match in the index and assume no other matches. The ID has to be unique in an EMBLCD database.

For example:

DB mydb [
  type: N
  method: emblcd
  format: embl
  fields: "sv des org key"
  directory: /data/embl
]

SRS    *    This calls 'getz' locally, using the "-e" switch to return whole entries in original format. It is expected that 'getz' is on the path.

Supports queries by id, acc, sv, key, org and des.

If the SRS server has a different name for this database than the one that EMBOSS will use, then you must specify it using the 'dbalias:' attribute.

EMBOSS expects the SRS local access program to be called 'getz', but you can explicitly override this using the 'app:' attribute. This can be used to call 'getz' using its explicit path, rather than relying on 'getz' being on the path.

Database definitions using "method: srs" should also specify "methodall: direct" plus "directory:" and "file:" for reading all entries directly. This is much faster than using getz to read and format all entries (unless the database is very small).

For example:

DB mydb [
  type: N
  format: embl
  method: srs
  dbalias: embl
  fields: "sv des org key"

# define 'all' access method
  methodall: direct
  directory: /data/embl
  file: *.seq   
]

SRSFASTA    *    As for SRS, but uses "getz -d -sf fasta" to read the sequence in fasta format. This is for databases like dbEST.reports where EMBOSS does not understand the entry format but SRS can convert it to FASTA. As the database format is not understood by EMBOSS, a search of the entire database would be forced to use getz to convert each entry, which will be rather slow.

Supports queries by id, acc, sv, key, org and des.

If the SRS server has a different name for this database than the one that EMBOSS will use, then you must specify it using the 'dbalias:' attribute.

EMBOSS expects the SRS local access program to be called 'getz', but you can explicitly override this using the 'app:' attribute. This can be used to call 'getz' using its explicit path, rather than relying on 'getz' being on the path.

Database definitions need to specify "methodall: direct" plus "directory:" and "file:" to read all entries directly. This is much faster than using getz to read and format all entries.

For example:

DB mydb [
  type: N
  format: fasta
  method: srsfasta
  dbalias: embl
  fields: "sv des org key"

# define 'all' access method
  methodall: direct
  directory: /data/embl
  file: *.seq 
]   

SRSWWW    single entry    Uses a defined SRS WWW server to read a single entry. Could be useful, for example, to get the GenBank version of an EMBL entry. Wildcard entry names are not allowed because of the way SRSWWW splits the output into blocks.

Supports queries by id, acc, sv, key, org and des.

If the SRS server has a different name for this database than the one that EMBOSS will use, then you must specify it using the 'dbalias:' attribute.

The remote SRS web server must be specified using the 'url:' attribute.

Database definitions should define this as "methodentry" or "methodquery" to avoid returning the entire database. Failure to do so could lead to a request to return the entire database. Although an SRS web server can cope with this, EMBOSS will then have the entire web page in memory and will strip out HTML tags before trying to read the first entry.

For example:

DB mydb [
  type: N
  format: embl
  methodquery: srswww
  dbalias: embl
  fields: "sv des org key"
  url: "http://srs.redbrick.ac.uk/srs6bin/cgi-bin/wgetz"

# define 'all' access method
  methodall: direct
  directory: /data/embl
  file: *.seq 
]

BLAST    *    Uses an EMBLCD index from the program 'dbiblast'. This can cope with all levels of access. Queries use the index files, reading all entries uses the list of files in the division.lkp file and opens each in turn. The blast database can be DNA or protein, produced by formatdb, pressdb or setdb, with or without the original FASTA format file.

N.B. dbiblast can't use the new style of Blast indices. You must create the old style of Blast indices by adding -A F to the formatdb command line.

Supports queries by id, acc, sv, and des. (Not by 'key' and 'org' because there is no way to find these in the BLAST database description line).

The directory containing the BLAST index files (*.nin, *.nhr, *.nsq, *.pin, *.phr, *.psq, etc.) and the index files produced by dbiblast must be specified using the 'directory:' attribute.

If the dbiblast indices are in a directory other than the one containing the BLAST index files, then the dbiblast index directory can be explicitly set using the 'indexdirectory:' attribute.

The available fields should be specified using the 'fields:' attribute if more than just the default ID name and Accession number fields have been indexed.

A wildcard search for unique fields (id or sv), or any search for acc, des, org or key is type 'query' and returns a list of entries. A search for a single id or sv is of type 'entry' and will find the first match in the index and assume no other matches. The ID has to be unique in an EMBLCD database.

For example:

DB mydb [
  type: N
  format: embl
  method: blast
  fields: "sv des"
  directory: /data/embl
]

GCG    *    Uses an EMBLCD index from the program 'dbigcg' to access a database reformatted for GCG 8, 9 or 10 by GCG programs such as embltogcg. As only the .ref and .seq files are used, the EBI's "GCG" distribution of the database can be used with 'dbigcg' without the need to run "embltogcg". This can cope with all levels of access. Queries use the index files, reading all entries uses the list of files in the division.lkp file and opens each in turn.

Supports queries by id, acc, sv, key, org and des.

The directory containing the sequence files and indices to be read must be specified using the 'directory:' attribute.

If the indices are in a directory other than the one containing the sequence files, then the index directory can be explicitly set using the 'indexdirectory:' attribute.

The available fields should be specified using the 'fields:' attribute if more than just the default ID name and Accession number fields have been indexed.

A wildcard search on a unique field (id or sv), or any search on acc, des, org or key, is of type 'query' and returns a list of entries. A search for a single id or sv is of type 'entry': it finds the first match in the index and assumes there are no other matches. The ID therefore has to be unique in an EMBLCD database.

For example:

DB mydb [
  type: N
  format: embl
  method: gcg
  fields: "sv des org key"
  directory: /data/gcg/gcgembl
]   
DIRECT all Opens the database file(s) and returns each entry sequentially.

This method assumes there is no indexing done on the data, so it can only process 'all' entries - you should explicitly set up other methods for "entry" and "query" access to the same database if these are required.

The directory containing the sequence files to be read must be specified using the 'directory:' attribute.

The files to be read must be specified using the 'file:' attribute.

You may use the 'exclude:' attribute to exclude some selected files from consideration.

For EMBL, the 'file:' attribute can be set to "*.dat" to avoid having to add the explicit filenames (est18, hum3, htg2, and so on) for each new release.

For example:

DB mydb [
  type: N
  format: embl
  methodall: direct
  directory: /data/embl
  file: *.seq
]   
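To make the interplay of 'file:' and 'exclude:' concrete, the sketch below selects filenames in Python. It is not how EMBOSS itself is implemented: it assumes that both attributes behave like shell-style wildcard patterns, and the function name select_files and the file list are invented for the example.

```python
from fnmatch import fnmatchcase


def select_files(filenames, file_pattern, exclude_pattern=None):
    """Keep names matching file_pattern, minus those matching exclude_pattern."""
    selected = []
    for name in sorted(filenames):
        if not fnmatchcase(name, file_pattern):
            continue                      # not named by 'file:'
        if exclude_pattern and fnmatchcase(name, exclude_pattern):
            continue                      # dropped by 'exclude:'
        selected.append(name)
    return selected


# Hypothetical contents of the database directory
release = ["est18.dat", "hum3.dat", "htg2.dat", "README"]
print(select_files(release, "*.dat", exclude_pattern="est*.dat"))
```

With a wildcard in 'file:', a new release that adds or renames division files needs no change to the database definition.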
URL single entry Uses any other Web server (for example the EBI's emblfetch or swissfetch queries) to return an entry. I expect problems with the HTML produced by these servers, but I hope EMBOSS sequence reading routines can cope with most results.

The remote web server's URL must be specified using the 'url:' attribute.

This URL is expected to contain one or more instances of the character pair '%s'. Each of these pairs is replaced by the value of the ID name when this database is accessed. Any HTML formatting will be stripped from the resulting web page.

For example:

DB mydb [
  type: N
  format: embl
  methodentry: url
  url: "http://server.commercial.com/cgi-bin/getseq?%s&format=embl"
]   
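The '%s' substitution can be pictured with a few lines of Python. This is a sketch of the idea only, using the URL from the example definition above; the function name expand_url and the entry name ABC123 are hypothetical, and no escaping of special characters is attempted.

```python
def expand_url(url_template, entry_id):
    """Replace every '%s' pair in the template with the requested ID name."""
    return url_template.replace("%s", entry_id)


template = "http://server.commercial.com/cgi-bin/getseq?%s&format=embl"
print(expand_url(template, "ABC123"))
# prints http://server.commercial.com/cgi-bin/getseq?ABC123&format=embl
```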
APP, EXTERNAL * Run an external application or a simple script which returns one, several or all entries. The application can be in the user's path or have an explicit path provided.

The database definition must have "app:" defined to specify the application command. This can of course be a site-written script.

The database and entry name will be appended to the application command as 'application dbname:entry'. Both ID and Accession number can be used to specify the entry.

Alternatively, if the app: attribute value contains the character pair '%s', it is replaced by the value of the ID name or Accession number when this database is accessed.

You can also use GCG's typedata as an external application, to save reindexing a GCG database.

This could be a good way to search a set of databases, for example to get the first entry from SwissNew...SwissProt...TrEmbl...TrEmblNew with the ID or accnumber or PID as the "entryname". (See 'EMBOSS database farms', below)

'EXTERNAL' is the same thing as 'APP', but it is obsolete and its use is discouraged.

For example:

DB mydb [
  type: N
  format: embl
  method: app
  app: "/usr/local/bin/accessdb -db embl -query %s"
]   
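The two ways of passing the entry to the application (appending 'dbname:entry', or filling in '%s') can be sketched as follows. This is an illustration in Python rather than the AJAX library's own logic; the function name build_app_command and the entry name ABC123 are assumptions, as is the choice to replace every '%s' rather than only the first.

```python
def build_app_command(app, dbname, entry):
    """Return the command line that would be run for one entry."""
    if "%s" in app:
        # Assumption: every '%s' pair is replaced by the ID or Accession number
        return app.replace("%s", entry)
    # Otherwise 'dbname:entry' is appended as a single extra argument
    return f"{app} {dbname}:{entry}"


print(build_app_command("/usr/local/bin/accessdb -db embl -query %s", "mydb", "ABC123"))
print(build_app_command("/path/to/this/emboss_farm.script", "mydb", "ABC123"))
```

The second form is the one the farm script later in this document relies on, since it extracts the entry name from the text after the colon.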

CORBA may possibly be implemented as an access method in the future. It will talk to an external CORBA client, probably in Java, which will talk to a CORBA server somewhere. I see no way to do this directly in GNU library licensed, free, ANSI C code, but an external client will be OK. The database definition will include the IOR information and anything else the CORBA client needs to know. We plan to use one client for each IDL if no single standard appears.

There are many commented-out examples of database specifications in the 'emboss.default' file. If in doubt, contact the EMBOSS mailing lists.

Testing your database definitions

When you have finished defining your database, run showdb and you should see it appearing in the listing of databases. If it shows the line "Warning: Bad database definition" or if it doesn't show the database then something is seriously wrong with your definition. Go back to it and check things.

If showdb displays your database, check that every access method you need is marked 'OK'. If one is not 'OK', you may need to add another access method to the definition.

Just because showdb says that it can find your database definition does NOT mean that the database is working correctly. showdb does not attempt to extract any entries from your database.

You must now try to extract one or more known entries from the database using seqret. If you get errors, you should check that the database is set up correctly and defined correctly.
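If you want to repeat this check after every database update, it can be scripted. The Python sketch below wraps the same seqret call that the farm script in this document uses ('seqret database:entry fasta::stdout'); the function names, the database name 'mydb' and the entry 'ABC123' are placeholders, and treating "output starts with '>'" as success is an assumption of the sketch.

```python
import subprocess


def seqret_command(database, entry, seqret_path="seqret"):
    """Build the seqret call: read 'database:entry', write FASTA to stdout."""
    return [seqret_path, f"{database}:{entry}", "fasta::stdout"]


def entry_is_retrievable(database, entry, seqret_path="seqret"):
    """Return True if seqret exits cleanly and prints a FASTA record."""
    try:
        result = subprocess.run(
            seqret_command(database, entry, seqret_path),
            stdin=subprocess.DEVNULL,   # never wait for interactive prompts
            capture_output=True,
            text=True,
            timeout=60,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False                    # seqret missing, not runnable, or hung
    return result.returncode == 0 and result.stdout.startswith(">")


if __name__ == "__main__":
    # 'mydb' and 'ABC123' stand in for your database and a known entry
    print(entry_is_retrievable("mydb", "ABC123"))
```

A False result only tells you that retrieval failed; run seqret by hand to see the actual error message.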

Things to check:

EMBOSS database farms

Currently there is no simple way of combining several data sources into a single, composite database definition.

The closest you can come is to define a database that calls an application that can return sequences from any one of a set of previously-defined EMBOSS databases.

The following script was written by Simon Andrews. You may prefer to write your own solutions.

The script is reproduced below so that you can save a copy.

Simon says:

"To use this simply copy and paste the text of the script to a file on your system, then make sure that this file is readable and executable by everyone (chmod 755 filename). The comments in the script tell you what changes you need to make to the script itself, and the format of the entry you need to create in emboss.default.

It will work with seqret (and will output any format you like), and can also be used as part of a USA for any of the standard EMBOSS programs.

The script requires a unix-like OS, but could trivially be adapted to run under Win32 if anyone is running EMBOSS under windows."


#!/usr/bin/perl -w
#
# change the above line to match the location of perl on your system
#


use strict;

# EMBOSS farm file script
#
# Written by Simon Andrews
# simon.andrews@bbsrc.ac.uk
# Dec 2001
#
# This script allows you to set up a farm
# of EMBOSS databases which can be queried
# by a single instance of seqret.  The
# program must be accompanied by an entry
# in emboss.default which looks like this:
#
# DB name_of_database [
#       type: N (or P if we're dealing with proteins)
#       method: app
#       format: fasta
#       app: "/path/to/this/emboss_farm.script"
#       comment: "Whatever text you'd like to see in showdb" 
# ]
#

# First we need to set a few preferences
#
# What is the full path to seqret?
# If you are sure that seqret will always
# be somewhere in your path, then you can
# just leave this as 'seqret'.

my $seqret_path = 'seqret';


# Now we need to know the names of the
# databases you'd like included in the
# search.  These must be databases which
# have already been indexed, and installed
# correctly into emboss.default.  Simply
# enter the database names between the
# brackets, separated by spaces.

my @databases = qw(dbase1 dbase2 dbase3);


##### End of bits which need to be edited #########

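my ($reference) = @ARGV;

# EMBOSS calls this script with one argument of the form 'dbname:entry'
die "\n*** FARM ERROR *** No database:entry argument supplied\n\n"
  unless defined $reference;

if ($reference =~ /:(.+)$/){
  $reference = $1;
}

else {
  die "\n*** FARM ERROR *** Couldn't get accession after : from $reference\n\n";
}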


foreach my $database (@databases){

  my $sequence = `$seqret_path $database:$reference fasta::stdout 2>/dev/null`;

  if ($sequence){
        print $sequence;
        exit;
  }

}

warn "\n*** FARM ERROR *** Couldn't find $reference in any of '@databases'\n\n";


Last edited: 26 February 2003 - Gary Williams