Sequence Databases
Sequence databases can be in a variety of formats and can be accessed by a variety of methods. These are defined in a set of control files. The site-wide file is 'emboss.default', which must be saved in the emboss/ directory under the main distribution.
In addition to reading database definitions from the 'emboss.default' file, EMBOSS also reads definitions from the file '.embossrc' in your personal home directory. You can test database definitions in your own '~/.embossrc' file before adding them to the site-wide 'emboss.default' file.
The first thing each EMBOSS program does when it starts running is to read the 'emboss.default' file (and then the '~/.embossrc' file, if it exists). This means that any changes to these definition files take effect as soon as they are made.
The EMBOSS distribution includes a sample set of small databases called tsw, tembl, and so on, defined in the 'emboss.default' file. (This is distributed as the file 'emboss.default.template', which has to be renamed 'emboss.default' before EMBOSS can use it.)

    #SET emboss_tempdata path_to_directory_$EMBOSS/test

    # Logfile - set this to a file that any user can append to
    # and EMBOSS applications will automatically write log information
    #SET emboss_logfile /packages/emboss/emboss/log

    # swissprot (Puffer fish entries)
    # =========
    DB tsw [ type: P dir: $emboss_tempdata/swiss
       method: emblcd format: swiss release: 36
       fields: "sv des org key"
       comment: "Swissprot native format indexed by dbiflat" ]

    # swnew (Puffer fish entries)
    # =====
    DB tswnew [ type: P dir: $emboss_tempdata/swnew
       method: emblcd format: swiss release: 37
       fields: "sv des org key"
       comment: "Swissnew native format indexed by dbiflat" ]

    # wormpep (cosmid ZK637)
    # =======
    DB twp [ type: P dir: $emboss_tempdata/wormpep
       method: emblcd format: fasta release: 16
       fields: "des"
       comment: "Wormpep Fasta format file indexed by dbifasta" ]

    # embl (worm cosmid ZK637 and a few other entries)
    # ====
    DB tembl [ type: N dir: $emboss_tempdata/embl
       method: emblcd format: embl release: 57
       fields: "sv des org key"
       comment: "EMBL native format indexed by dbiflat" ]

    # pir (cytochrome C plus first entries in other divisions)
    # ===
    DB tpir [ type: P dir: $emboss_tempdata/pir
       method: gcg file: pir*.seq format: nbrf
       fields: "des org key"
       comment: "PIR in 4 files in GCG format indexed by dbigcg" ]

    # genbank (Remote access to a SRS server)
    # =======
    DB tgb [ type: N method: srswww format: genbank
       url: "http://www.cbr.nrc.ca/srs6bin/cgi-bin/wgetz"
       dbalias: genbank
       fields: "sv des org key"
       comment: "Genbank from a remote SRS server" ]

    # genbank (the first few entries from several sub-section files)
    # =======
    DB tgenbank [ type: N dir: $emboss_tempdata/genbank
       method: emblcd format: genbank release: 01
       fields: "sv des org key"
       comment: "GenBank native format indexed by dbiflat" ]

Sites should leave these in the file for testing purposes, and add as many of their own databases as they need.
Databases are usually local, and need to have the full database files plus some indexing or query method to extract single entries or to query by ID or (in most cases) accession number.
Comments start with a '#' character in the first position of a line.
For example:
# this is a comment
Variables are defined with 'SET', followed by the variable name and its value. For example:
SET dbdir /data/sequencedbs
This variable may now be used in the rest of the file 'emboss.default' by preceding it with a '$'
For example:
file: $dbdir
The name of the variable is case-insensitive when used within the file 'emboss.default'.
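To make the substitution rule concrete, here is a minimal Python sketch of how a '$name' reference could be expanded with case-insensitive name matching. It is an illustration only and is not taken from the EMBOSS source; the function name `expand_variables` and the choice to leave unknown variables untouched are assumptions.

```python
import re

def expand_variables(text, variables):
    """Replace each $name in text with its value, matching names case-insensitively."""
    # Assumption: variable names are letters, digits and underscores.
    lookup = {name.lower(): value for name, value in variables.items()}

    def substitute(match):
        name = match.group(1).lower()
        # Assumption: an undefined variable is left as written rather than guessed.
        return lookup.get(name, match.group(0))

    return re.sub(r"\$([A-Za-z_][A-Za-z0-9_]*)", substitute, text)

# Mirrors the example above: SET dbdir /data/sequencedbs
variables = {"dbdir": "/data/sequencedbs"}
print(expand_variables("dir: $dbdir/genomes", variables))
print(expand_variables("dir: $DBDIR/genomes", variables))
```

Both calls print the same expanded path, because the lookup ignores the case of the variable name.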
When variables are set as these Global variables, they must be given UPPERCASE names.
The Global variables can also be set in the UNIX session as well as in the file 'emboss.default' by defining an 'environment variable' with the command 'setenv NAME value', where 'NAME' is the name of the variable and 'value' is the value you wish to set it to.
Some of the EMBOSS Global Variables are Boolean - they can only be turned on by setting them to '1', or "Y". (They are off by default.) Others set the location of various files or directories or specify the default value of things.
There should be no need to set any of these to change the default behaviour of EMBOSS, but you may wish to set some to customise your copy of EMBOSS.
WARNING Some of these will make EMBOSS unusable! For example:
SET EMBOSS_HELP 1
will make all EMBOSS programs only display their help when they are run.
We don't know what use people will make of features like this, but we
are sure that if we didn't allow it, someone would request it :-)
Global Variable | Description |
---|---|
EMBOSS_ACDROOT | The root ACD directory. EMBOSS should find this automatically. |
EMBOSS_ACDPROMPTS | The number of prompts for a value before failure. The default is 2. |
EMBOSS_DATA | The data directory. EMBOSS should find this automatically. |
EMBOSS_PROXY | Sets the default proxy server. SET EMBOSS_PROXY "proxy.mydomain.com:8888" applies to all HTTP access. If a database uses an internal server, you can turn off the proxy routing with the database attribute 'proxy: ":"' (see below) to allow it to go directly to the internal server. |
EMBOSS_LOGFILE | Specify the log-file path and name. If this is not specified use of the programs will not be logged. |
EMBOSS_FORMAT | Specify the expected input sequence format. The default is to test all formats in turn (except for 'plain') until one is read successfully. If this variable is set to a value then ONLY the specified sequence format will be expected and no tests for any other formats will be done, although you can always still specify the format in the USA as: 'fasta::filename' |
EMBOSS_OUTFORMAT | Change the default output sequence format to be other than 'fasta' |
EMBOSS_OUTFEATFORMAT | Change the default output feature format to be other than 'GFF' |
EMBOSS_OUTDIRECTORY | All EMBOSS output files now have a default output directory (required by some webservices implementations that run in the 'wrong' default directory). If this variable is set to the name of a directory then it becomes the default output directory for outfile, align, report, graph, sequence and feature output. (The output directory can also be set from the command line (or as an ACD attribute) using the associated qualifier: -odirectory (outfile), -rdirectory (report), -adirectory (align), -gdirectory (graph and graphxy), -osdirectory (sequence) or -ofdirectory (featout).) |
EMBOSS_GRAPHICS | Set name of the default output graphics device. |
EMBOSS_AUTO | If this is set TRUE, all programs act as if they have '-auto' set on the command-line. They do not display their one-line description and they use default qualifier values; if required values are missing, they fail. |
EMBOSS_DEBUG | If this is set TRUE, all programs act as if they have '-debug' set on the command-line. They create a 'programname.dbg' file of debugging information. |
EMBOSS_STDOUT | If this is set TRUE, all programs act as if they have '-stdout' set on the command-line. They write all output to 'stdout' (the screen) instead of prompting for output file names. |
EMBOSS_FILTER | If this is set TRUE, all programs act as if they have '-filter' set on the command-line. They act as if '-stdout' and '-auto' are set and they read input files from 'stdin' (the keyboard). |
EMBOSS_WARNING | If this is set TRUE, all programs act as if they have '-warning' set on the command-line. They report warnings. |
EMBOSS_ERROR | If this is set TRUE, all programs act as if they have '-error' set on the command-line. They report errors. |
EMBOSS_FATAL | If this is set TRUE, all programs act as if they have '-fatal' set on the command-line. They report fatal errors. |
EMBOSS_DIE | If this is set TRUE, all programs act as if they have '-die' set on the command-line. They report deaths. |
EMBOSS_HELP | If this is set TRUE, all programs act as if they have '-help' set on the command-line. They will only display some help on the program. |
EMBOSS_ACDPRETTY | If this is set TRUE, all programs act as if they have '-acdpretty' set on the command-line. They will only print out a file 'programname.acdpretty' with the ACD information for the program. |
EMBOSS_ACDLOG | If this is set TRUE, all programs act as if they have '-acdlog' set on the command-line. They will only print out a file 'programname.acdlog' with the ACD processing log for the program. |
EMBOSS_ACDTABLE | If this is set TRUE, all programs act as if they have '-acdtable' set on the command-line. They will only print out a file 'programname.acdtable' with the HTML table of options for the program. |
EMBOSS_NAMDEBUG | (for programmers only) If this is set TRUE, it turns on logging of processing of emboss.default and .embossrc files in ajnam.c (this happens before the -debug command line switch is processed). |
EMBOSS_NAMVALID | (for programmers only) If this is set TRUE, it turns on additional validation of DBNAME definitions - but as these are done by the application 'showdb' there is normally no need to set this variable. The validation adds an overhead to every database definition. |
EMBOSS_DOCROOT | If this is set to the name of a directory, then tfm will look in that directory for the applications' documentation. |
For example, to include the contents of the file "project_databases.def":
INCLUDE "project_databases.def"
EMBOSS should be able to extract entries from the database using one of a number of ways. Depending on whether a single entry, a wild-card specified set of entries or all of the database entries are required, there may be different methods used to extract the entries.
The database may be in a remote server on the Internet, or it may be in a set of files on the local machine. If it is held locally, it may be a simple multiple sequence file, or it may be an indexed file.
It may be an SRS server, either local or on the Internet. This server may hold sequence databases whose format is unknown to EMBOSS, in which case you can specify that the format must first be converted to 'fasta' format when serving the files to EMBOSS.
If the database is held as indexed, local files, it may be a native database-format (EMBL, Swissprot, etc.) file as distributed by the EBI, or NCBI. It may be a file formatted for GCG. It may be a set of Blast database indexes.
If the database is on an SRS server or uses local indexed files, then queries may be made not only by ID name and Accession number but also (depending on the way it has been indexed) by Description line words, Sequence Versions (or GI numbers), Keywords or Organism names.
It may be something like a Sybase or Oracle relational database, accessed by using a locally written program.
For example the database 'genome' (without the attributes) is:
DB genome [ key: value key: value key: value key: value key: value ]
If the 'value' part of the attribute contains spaces then it should be quoted, to prevent it being prematurely terminated at the first space.
e.g. key: "value with many words in"
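The 'DB name [ key: value ... ]' layout and the quoting rule can be illustrated with a small Python sketch that splits a definition into its name and attributes. This is not the EMBOSS parser; the function name `parse_db_definition` is hypothetical, and the sketch assumes the brackets are separated from their neighbours by white space and that comments sit on their own lines.

```python
import shlex

def parse_db_definition(text):
    """Parse 'DB name [ key: value ... ]' into (name, attributes)."""
    # shlex keeps a quoted value together as one token and,
    # with comments=True, discards '#' comments.
    tokens = shlex.split(text, comments=True)
    if len(tokens) < 4 or tokens[0] != "DB" or tokens[2] != "[" or tokens[-1] != "]":
        raise ValueError("expected: DB name [ key: value ... ]")
    name, body = tokens[1], tokens[3:-1]
    if len(body) % 2 != 0:
        raise ValueError("attributes must come in 'key: value' pairs")
    attributes = {}
    for key, value in zip(body[0::2], body[1::2]):
        if not key.endswith(":"):
            raise ValueError(f"attribute key must end with ':' but got {key!r}")
        attributes[key[:-1]] = value
    return name, attributes

definition = '''
DB genome [
    type: N
    method: emblcd
    format: embl
    # the value below is quoted because it contains spaces
    comment: "value with many words in"
]
'''
name, attributes = parse_db_definition(definition)
print(name)
print(attributes["comment"])
```

Without the quotes, the comment value would be split at the first space and the remaining words would be misread as further keys and values, which is the problem the quoting rule prevents.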
Each database must have attributes that specify what it is and how to access it.
This information is given as a set of pairs of 'key:' and 'value' attributes. These attributes are held in the 'DB' definition structure (see above).
The minimum set of attribute keys is 'method:' and 'format:' - these two are mandatory. It is also normal (but not mandatory) to specify the 'type:' attribute.
Some forms of 'method:' require subsidiary attributes giving further information on how to access the data. The following table is the set of available attributes.
Key | Value | Description |
---|---|---|
format
formatentry formatquery formatall |
a valid sequence format name | The 'format:' attribute tells EMBOSS what sequence format to expect when
reading entries from the database.
This attribute is mandatory. If you need to specify different formats for any of the different access methods (see below), then you may use the variants of 'format:' with the suffix 'entry', 'query' or 'all' (i.e. 'formatentry:', 'formatquery:' or 'formatall:'). e.g. format: ncbi |
type | 'N' or 'P' | This specifies whether the database is nucleic or protein.
Although it is not strictly required, it is normal to specify the type of the database as this will normally be known. If it is necessary to not specify the type then this will be determined by the EMBOSS applications when they read sequences in. (You will get error messages when you run 'showdb' as this doesn't read in sequences.) The value 'N' specifies a nucleic database, 'P' specifies a protein database. e.g. type: N |
fields | One or more of: sv, des, org, key | This specifies which search fields have been indexed and are available for searching with.
It is assumed that Accession number and ID name are always available when a database is set up. The way you have set up the database may also allow access by one or more of these values: sv - Sequence Version or GI Number, des - Description line word, org - Organism's taxonomic classification, key - Keywords. The access methods 'srs', 'srsfasta' and 'srswww' allow access to these search fields. The methods 'emblcd' and 'gcg' may or may not have some or all of these fields indexed, depending on the parameters given to the programs 'dbiflat' and 'dbigcg'. The programs 'dbiblast' and 'dbifasta' only allow you to select any of 'sv', 'des' and 'acc' (the default). See the USA documentation for details of using these. e.g. fields: "sv des org key" |
directory | any valid directory path | This specifies the directory to look in to find files that have been
specified with the 'file:' attribute.
It also specifies which directory to look in to find indexes and files produced by the dbi* programs. It is only required with the access methods 'direct', 'gcg', 'emblcd' and 'blast' (see 'Access methods' below). It is common to use variables (defined using 'SET') to specify part or all of the path. The attribute key 'directory:' is commonly abbreviated to 'dir:'. e.g. directory: $dbdir/genomes |
filename | A file name (may be wildcarded) | This specifies the sequence file(s) to read in when accessing the database.
It is only required with the access method 'direct' (see 'Access methods' below). It may also be used with the access methods 'gcg', 'emblcd' and 'blast' to indicate which files should be included back in after using the 'exclude:' attribute to specify which indexed files should be ignored. (See 'exclude:' below.) The files may be wild-carded using '*'. The attribute key 'filename:' is commonly abbreviated to 'file:'. e.g. file: pir*.seq |
exclude | A file name (may be wildcarded) | This is used to exclude a subset of files from consideration.
To exclude certain files, specify "exclude: *file*". This is used in conjunction with 'file:' to specify a subset of files in a directory. 'Exclude:' is checked first, then files are included back with 'file:'. The files searched are therefore:
- the files in the directory specified by 'dir:'
- but not the 'exclude:' files (if any)
- but include back the 'file:' files (if any)

e.g. exclude: mouse.*
If you have indexed all of the files in the EMBL database, then you can specify subsets using the same set of files and indices as:

    DB embl [
        type: N
        format: embl
        method: emblcd
        dir: /data/embl
        comment: "All of EMBL"
    ]

    DB emblminus [
        type: N
        format: embl
        method: emblcd
        dir: /data/embl
        exclude: est*.dat
        comment: "EMBL without the ESTs"
    ]

    DB emblhumest [
        type: N
        format: embl
        method: emblcd
        dir: /data/embl
        exclude: *.dat
        file: est_hum*.dat
        comment: "EMBL human ESTs"
    ]

    DB human [
        type: N
        format: embl
        method: emblcd
        dir: /data/embl
        exclude: *.dat
        file: hum*.dat
        comment: "EMBL human"
    ]

indexdirectory | any valid directory path |
This specifies the directory to look in to find the index files
(produced by the dbi* programs) if this is different to the directory
specified by 'directory:'.
It is sensible to hold the indices in a different directory to the one holding the sequence database files when you have many sequence databases in the same directory, because the indices for every database all have the same names (acnum.hit, acnum.trg, division.lkp, etc.) and these would be over-written if you have indexed several databases in the same directory. In this case, you should create the indices in a different directory (a subdirectory?) for each database. That way the index files will not become confused. These index directories can now be specified using the attribute 'indexdirectory:', while the directory containing the sequence data files can still be specified using 'dir:'. It is only used with the access methods 'gcg', 'emblcd' and 'blast' (see 'Access methods' below). It is common to use variables (defined using 'SET') to specify part or all of the path. The attribute key 'indexdirectory:' is commonly abbreviated to 'indexdir:'. e.g. indexdir: $dbdir/genomes/embl |
url | Any valid URL | This specifies the URL (WWW address) to use when getting sequences from
remote Web sites.
It is only required with the access methods 'srswww' and 'url' (see 'Access methods and formats' below). In method 'srswww' the SRS commands '-e+-ascii' are appended to the given URL (these extract the complete entry from SRS with no HTML formatting). The database (or the name specified in a 'dbalias' attribute) and the entry name or Accession number (or Sequence version, GI number, Description, Organism, or Key-word) are then appended to create a functional SRS query line. e.g. url: "http://www.cbr.nrc.ca/srs6bin/cgi-bin/wgetz" In method 'url' the URL is expected to contain one or more instances of the character pair '%s' - each of these pairs is replaced by the value of the ID name when this database is accessed. Any HTML formatting will be stripped from the resulting web page. e.g. url: "http://www.ebi.ac.uk/htbin/emblfetch?%s" e.g. url: "http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=s&form=6&dopt=g&html=no&uid=%s" The URL must begin with "http://" and have a lower case host address. |
proxy | host:port |
In the access methods 'srswww' and 'url', you can specify a proxy host
and port to use when accessing the URL.
e.g. proxy: "proxy.mydomain.com:8888" If there is a global variable EMBOSS_PROXY defined in the 'emboss.default' file (See Variable Definitions, above), then the attribute proxy: ":" will turn off proxy access for this database. This is useful if the database is on an internal server. |
app
appentry appquery appall |
Any script or program name | This specifies the name of an external (i.e. non-EMBOSS) program or
script (application) that should be run to extract the sequence from the database.
This application can be in the user's path or have an explicit path provided. The database and entry name will be appended to the application command as 'application dbname:entry'. Both ID and Accession number can be used to specify the entry. Alternatively, if the app: attribute value contains the character pair '%s', it is replaced by the value of the ID name or Accession number when this database is accessed.
This attribute is only required with the access method 'app' (see
'Access methods' below).
e.g. app: efetch e.g. app: "getz [embl:%s]" |
dbalias | The name of a database in SRS | This is used to specify the name of a database at a SRS site where the
name differs from the name that EMBOSS is using.
It is only required with the access methods 'srswww', 'srsfasta' and 'srs' (see 'Access methods' below). e.g. dbalias: emblnew |
comment | Any text | This is a simple comment, to describe the database. It is displayed in 'showdb'.
e.g. comment: "This is my subset of refseq" |
release | Any text | This is the release number or date. It is displayed in 'showdb'.
Note that unless you are zealous in updating 'release:' values, this will rapidly become out of synch with the actual data. (I wouldn't use this attribute. - GWW) The dbi* indexing programs ask for the 'database name', 'release number' and 'index date'. These are stored in the division.lkp file (one of the index files). This information is NOT available to EMBOSS programs. This information is not reported by showdb. They are part of the 'emblcd/staden' index file format, but EMBOSS does not use them. If other software uses the index files (the ACEDB efetch program, or maybe the Staden package) they may be used there. e.g. release: "89.0 (Oct 2001)" |
method
methodall methodentry methodquery |
One of: srs, srsfasta, srswww, url, app, external, direct, emblcd, gcg, blast |
This specifies the method used to access the database. (See next section).
This field is mandatory - there must be at least one form of the 'method' key specified. More than one different type of method key can be specified. If 'method:' is specified, then this is the default method covering all forms of access ('query', 'entry' or 'all'). Specific methods for the 'query', 'entry' or 'all' forms of access (i.e. 'methodquery:', 'methodentry:' or 'methodall:') should be specified explicitly if you wish to have several ways of accessing the data. e.g. method: emblcd |
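The file selection rule described for 'exclude:' and 'file:' in the table above (start from the files in 'dir:', drop the 'exclude:' matches, then include back the 'file:' matches) can be sketched in Python as follows. This is an illustration of the stated rule only, not EMBOSS code; the function name `select_files` and the sample file names are hypothetical, and Python's shell-style matching is assumed to behave like the '*' wildcard described here.

```python
from fnmatch import fnmatchcase

def select_files(filenames, exclude=None, include=None):
    """Apply the rule: all files, minus 'exclude:' matches, plus 'file:' matches."""
    selected = []
    for name in filenames:
        excluded = exclude is not None and fnmatchcase(name, exclude)
        included_back = include is not None and fnmatchcase(name, include)
        # A file is searched unless it was excluded and not included back.
        if not excluded or included_back:
            selected.append(name)
    return selected

# Hypothetical contents of the directory named by 'dir:'.
files = ["est_hum1.dat", "est_mus1.dat", "hum1.dat", "mus1.dat"]

# 'emblminus': exclude: est*.dat
print(select_files(files, exclude="est*.dat"))
# 'emblhumest': exclude: *.dat  file: est_hum*.dat
print(select_files(files, exclude="*.dat", include="est_hum*.dat"))
# 'human': exclude: *.dat  file: hum*.dat
print(select_files(files, exclude="*.dat", include="hum*.dat"))
```

The last two calls show why 'exclude: *.dat' is paired with a 'file:' pattern in the examples: everything is excluded first, and only the named subset is included back.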
There are many available methods for accessing databases. Some of these support only a subset of the access levels (single entry, query, or all entries). For example, if you use a web server to get database entries, this is suitable for getting single entries. It may allow you to do queries returning more than one entry. It will probably not allow you to pull across a complete database, and you wouldn't want to anyhow as this would take a long time over the network.
Some access methods may be unavailable. For example, a flat file database with no index is only useful for reading all entries ('all'), while a remote database may only provide single entries ('entry'). In this case you would wish to access the remote database for single entries and the local one for reading sequentially through all of the data.
You can specify an EMBOSS database that accesses many different data sources, depending on which type (entry, query, all) of access is required.
EMBOSS databases can be defined as all using the same access method, with the attribute method, or using up to 3 different methods with special suffixes methodentry, methodquery or methodall.
For example:

    #EMBL with SRS index files and directly reading for all entries
    #methodquery defines the method for both query and entry access
    #methodall provides the method for reading all entries.
    DB srsembl [
        # the sequences are nucleic
        type: N
        # the sequence entries are in 'embl' format
        format: embl
        # you can specify any description comment
        comment: 'EMBL using getz'
        # you can specify which release this is
        release: "61"
        # use the 'srs' method for both query and entry access
        methodquery: srs
        # the database is called 'embl' in the local SRS server
        dbalias: embl
        # to sequentially read the whole database, use the 'direct' method
        methodall: direct
        # the database files are in this directory
        dir: /nfs/data/embl/
        # read these files
        file: *.dat
    ]

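As a rough illustration of how the method attributes combine, the following Python sketch picks the method for a given access level. The fallback order is an assumption drawn from the text above: 'method:' is the default for all levels, and the comment in the 'srsembl' example says that 'methodquery:' covers both query and entry access. The function name `resolve_method` is hypothetical, and EMBOSS itself may resolve methods differently.

```python
def resolve_method(attributes, access):
    """Return the access method to use for 'entry', 'query' or 'all' access."""
    # Assumed fallback order; only the generic 'method' default and the
    # 'methodquery covers entry' behaviour are stated in the text.
    fallbacks = {
        "entry": ["methodentry", "methodquery", "method"],
        "query": ["methodquery", "method"],
        "all": ["methodall", "method"],
    }
    for key in fallbacks[access]:
        if key in attributes:
            return attributes[key]
    return None  # no usable method for this access level

# The method attributes of the 'srsembl' definition above.
srsembl = {"methodquery": "srs", "methodall": "direct"}
for access in ("entry", "query", "all"):
    print(access, resolve_method(srsembl, access))
```

A definition that gives only 'methodall: direct' resolves to no method for 'entry' or 'query' in this sketch, which corresponds to an unindexed flat file that can only be read sequentially.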
In addition, each access method needs to know something about the database. What is needed will be different for each method, although there is, of course, much overlap between them. This information is specified by using the 'key: value' attributes (see above).
The database definition attributes used depend on the access method and also on the query level.
For example, EMBL entries could be read by:
Method | Scope | Comments |
---|---|---|
EMBLCD | * |
Uses an EMBLCD index from the programs 'dbiflat' (flatfiles - database
native format files) or 'dbifasta' (fasta format files). This can cope
with all levels of access. Queries use the index files, reading all
entries uses the list of files in the division.lkp file and opens each
in turn.
Supports queries by id, acc, sv, key, org and des. (NB not by 'key' and 'org' if the database was indexed by 'dbifasta' because there is no way to find these in the Fasta format description line.) The directory containing the sequence files and indices to be read must be specified using the 'directory:' attribute. If the indices are in a directory other than the one containing the sequence files, then the index directory can be explicitly set using the 'indexdirectory:' attribute. The available fields should be specified using the 'fields:' attribute if more than just the default ID name and Accession number fields have been indexed. A wildcard search for unique fields (id or sv), or any search for acc, des, org or key is type 'query' and returns a list of entries. A search for a single id or sv is of type 'entry' and will find the first match in the index and assume no other matches. The ID has to be unique in an EMBLCD database. For example:

    DB mydb [
        type: N
        method: emblcd
        format: embl
        fields: "sv des org key"
        directory: /data/embl
    ]

SRS | * |
This calls 'getz' locally, using the "-e" switch to return whole entries in
original format. It is expected that 'getz' is on the path.
Supports queries by id, acc, sv, key, org and des. If the SRS server has a different name for this database than the one that EMBOSS will use, then you must specify it using the 'dbalias:' attribute. EMBOSS expects the SRS local access program to be called 'getz', but you can explicitly override this using the 'app:' attribute. This can be used to call 'getz' using its explicit path, rather than relying on 'getz' being on the path. Database definitions using "method: srs" should also specify "methodall: direct" plus "directory:" and "file:" for reading all entries directly. This is much faster than using getz to read and format all entries (unless the database is very small). For example:

    DB mydb [
        type: N
        format: embl
        method: srs
        dbalias: embl
        fields: "sv des org key"
        # define 'all' access method
        methodall: direct
        directory: /data/embl
        file: *.seq
    ]

SRSFASTA | * |
As for SRS, but uses "getz -d -sf fasta" to read the sequence in fasta
format. For databases like dbEST.reports where EMBOSS does not
understand the entry format but SRS can convert it to FASTA. As the
database format is not understood by EMBOSS, a search of the entire
database would be forced to use getz to convert each entry, which will
be rather slow.
Supports queries by id, acc, sv, key, org and des. If the SRS server has a different name for this database than the one that EMBOSS will use, then you must specify it using the 'dbalias:' attribute. EMBOSS expects the SRS local access program to be called 'getz', but you can explicitly override this using the 'app:' attribute. This can be used to call 'getz' using its explicit path, rather than relying on 'getz' being on the path. Database definitions need to specify "methodall: direct" plus "directory:" and "file:" to read all entries directly. This is much faster than using getz to read and format all entries. For example:

    DB mydb [
        type: N
        format: fasta
        method: srsfasta
        dbalias: embl
        fields: "sv des org key"
        # define 'all' access method
        methodall: direct
        directory: /data/embl
        file: *.seq
    ]

SRSWWW | single entry |
Uses a defined SRS WWW server to read a single entry. Could be useful,
for example, to get the GenBank version of an EMBL entry. Wildcard
entry names are not allowed because of the way SRSWWW splits the
output into blocks.
Supports queries by id, acc, sv, key, org and des. If the SRS server has a different name for this database than the one that EMBOSS will use, then you must specify it using the 'dbalias:' attribute. The remote SRS web server must be specified using the 'url:' attribute. Database definitions should define this as "methodentry" or "methodquery" to avoid returning the entire database. Failure to do so could lead to a request to return the entire database. Although an SRS web server can cope with this, EMBOSS will then have the entire web page in memory and will strip out HTML tags before trying to read the first entry. For example:

    DB mydb [
        type: N
        format: embl
        methodquery: srswww
        dbalias: embl
        fields: "sv des org key"
        url: "http://srs.redbrick.ac.uk/srs6bin/cgi-bin/wgetz"
        # define 'all' access method
        methodall: direct
        directory: /data/embl
        file: *.seq
    ]

BLAST | * |
Uses an EMBLCD index from the program 'dbiblast'. This can cope with
all levels of access. Queries use the index files, reading all entries
uses the list of files in the division.lkp file and opens each in turn.
The blast database can be DNA or protein, produced by formatdb, pressdb
or setdb, with or without the original FASTA format file.
N.B. dbiblast can't use the new style of Blast indices. You must create the old style of Blast indices by adding -A F to the formatdb command line. Supports queries by id, acc, sv, and des. (Not by 'key' and 'org' because there is no way to find these in the BLAST database description line.) The directory containing the BLAST index files (*.nin, *.nhr, *.nsq, *.pin, *.phr, *.psq, etc.) and the index files produced by dbiblast must be specified using the 'directory:' attribute. If the dbiblast indices are in a directory other than the one containing the BLAST index files, then the dbiblast index directory can be explicitly set using the 'indexdirectory:' attribute. The available fields should be specified using the 'fields:' attribute if more than just the default ID name and Accession number fields have been indexed. A wildcard search for unique fields (id or sv), or any search for acc, des, org or key is type 'query' and returns a list of entries. A search for a single id or sv is of type 'entry' and will find the first match in the index and assume no other matches. The ID has to be unique in an EMBLCD database. For example:

    DB mydb [
        type: N
        format: embl
        method: blast
        fields: "sv des"
        directory: /data/embl
    ]

GCG | * |
Uses an EMBLCD index from the program 'dbigcg' to access a database
reformatted for GCG 8, 9 or 10 by GCG programs such as embltogcg. As
only the .ref and .seq files are used, the EBI's "GCG" distribution of
the database can be used with 'dbigcg' without the need to run
"embltogcg". This can cope with all levels of access. Queries use the
index files, reading all entries uses the list of files in the
division.lkp file and opens each in turn.
Supports queries by id, acc, sv, key, org and des. The directory containing the sequence files and indices to be read must be specified using the 'directory:' attribute. If the indices are in a directory other than the one containing the sequence files, then the index directory can be explicitly set using the 'indexdirectory:' attribute. The available fields should be specified using the 'fields:' attribute if more than just the default ID name and Accession number fields have been indexed. A wildcard search for unique fields (id or sv), or any search for acc, des, org or key is type 'query' and returns a list of entries. A search for a single id or sv is of type 'entry' and will find the first match in the index and assume no other matches. The ID has to be unique in an EMBLCD database. For example:

    DB mydb [
        type: N
        format: embl
        method: gcg
        fields: "sv des org key"
        directory: /data/gcg/gcgembl
    ]

DIRECT | all |
Opens the database file(s) and returns each entry sequentially.
This method assumes there is no indexing done on the data, so it can only process 'all' entries - you should explicitly set up other methods for "entry" and "query" access to the same database if these are required. The directory containing the sequence files to be read must be specified using the 'directory:' attribute. The files to be read must be specified using the 'file:' attribute. You may use the 'exclude:' attribute to exclude some selected files from consideration. EMBL can be defined as "*.dat" to avoid adding the explicit filenames: est18, hum3, htg2, and so on for each new release. For example:

    DB mydb [
        type: N
        format: embl
        methodall: direct
        directory: /data/embl
        file: *.seq
    ]

URL | single entry |
Uses any other Web server (for example the EBI's emblfetch or
swissfetch queries) to return an entry. I expect problems with the
HTML produced by these servers, but I hope EMBOSS sequence reading
routines can cope with most results.
The remote web server's URL must be specified using the 'url:' attribute. This URL is expected to contain one or more instances of the character pair '%s' - each of these pairs is replaced by the value of the ID name when this database is accessed. Any HTML formatting will be stripped from the resulting web page. For example:

    DB mydb [
        type: N
        format: embl
        methodentry: url
        url: "http://server.commercial.com/cgi-bin/getseq?%s&format=embl"
    ]

APP, EXTERNAL | * |
Run an external application or a simple script which returns
one/more/all entries. The application can be in the user's path or
have an explicit path provided.
The database definition must have 'app:' defined to specify the application command. This can of course be a site-written script. The database and entry name will be appended to the application command as 'application dbname:entry'. Both ID and Accession number can be used to specify the entry. Alternatively, if the 'app:' attribute value contains the character pair '%s', it is replaced by the value of the ID name or Accession number when this database is accessed. You can also use GCG's typedata as an external application, to save reindexing a GCG database. This could be a good way to search a set of databases, for example to get the first entry from SwissNew...SwissProt...TrEmbl...TrEmblNew with the ID or accnumber or PID as the "entryname" (see 'EMBOSS database farms', below). 'EXTERNAL' is the same thing as 'APP', but it is obsolete and its use is discouraged. For example:
DB mydb [ type: N format: embl method: app app: "/usr/local/bin/accessdb -db embl -query %s" ] |
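To illustrate the other case, where 'app:' contains no '%s', the sketch below defines a database whose application receives 'dbname:entry' as its final argument. The database name mydb2, the script path /usr/local/bin/fetchseq and the entry name ABC123 are hypothetical.

```
# Hypothetical definition: no '%s' in app:, so 'dbname:entry' is appended
DB mydb2 [
  type: N
  format: fasta
  method: app
  app: "/usr/local/bin/fetchseq"
]
# A request for mydb2:ABC123 would then run:
#   /usr/local/bin/fetchseq mydb2:ABC123
```

The application is then responsible for stripping the 'dbname:' prefix itself and for writing the entry to standard output in the format named by 'format:'.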
CORBA may possibly be implemented as an access method in the future. It will talk to an external CORBA client, probably in Java, which will talk to a CORBA server somewhere. I see no way to do this directly in GNU library licensed, free, ANSI C code, but an external client will be OK. The database definition will include the IOR information and anything else the CORBA client needs to know. We plan to use one client for each IDL if no single standard appears.
There are many commented-out examples of database specifications in the 'emboss.default' file. If in doubt, contact the EMBOSS mailing lists.
If showdb displays your database, check that all of your required access methods have 'OK' in them. If something is not 'OK' then maybe you need to add another access method.
Just because showdb says that it can find your database definition does NOT mean that the database is working correctly. showdb does not attempt to extract any entries from your database.
You must now try to extract one or more known entries from the database using seqret. If you get errors, you should check that the database is set up correctly and defined correctly.
Things to check:
The closest you can come is to define a database that calls an application that can return sequences from any one of a set of previously-defined EMBOSS databases.
The following script was written by Simon Andrews. You may prefer to write your own solutions.
Simon says:
"To use this simply copy and paste the text of the script to a file on your system, then make sure that this file is readable and executable by everyone (chmod 755 filename). The comments in the script tell you what changes you need to make to the script itself, and the format of the entry you need to create in emboss.default.
It will work with seqret (and will output any format you like), and can also be used as part of a USA for any of the standard EMBOSS programs.
The script requires a Unix-like OS, but could trivially be adapted to run under Win32 if anyone is running EMBOSS under Windows."
#!/usr/bin/perl -w
#
# change the above line to match the location of perl on your system
#
use strict;

# EMBOSS farm file script
#
# Written by Simon Andrews
# simon.andrews@bbsrc.ac.uk
# Dec 2001
#
# This script allows you to set up a farm
# of EMBOSS databases which can be queried
# by a single instance of seqret. The
# program must be accompanied by an entry
# in emboss.default which looks like this:
#
# DB name_of_database [
#   type: N (or P if we're dealing with proteins)
#   method: app
#   format: fasta
#   app: "/path/to/this/emboss_farm.script"
#   comment: "Whatever text you'd like to see in showdb"
# ]
#
# First we need to set a few preferences
#
# What is the full path to seqret?
# If you are sure that seqret will always
# be somewhere in your path, then you can
# just leave this as 'seqret'.

my $seqret_path = 'seqret';

# Now we need to know the names of the
# databases you'd like included in the
# search. These must be databases which
# have already been indexed, and installed
# correctly into emboss.default. Simply
# enter the database names between the
# brackets, separated by spaces.

my @databases = qw(dbase1 dbase2 dbase3);

##### End of bits which need to be edited #########

# EMBOSS calls this script with 'dbname:entry' as its argument
my ($reference) = @ARGV;

# Keep only the entry name after the colon
if ($reference =~ /:(.+)$/){
  $reference = $1;
}
else {
  die "\n*** FARM ERROR *** Couldn't get accession after : from $reference\n\n";
}

# Try each database in turn and print the first hit
foreach my $database (@databases){
  my $sequence = `$seqret_path $database:$reference fasta::stdout 2>/dev/null`;
  if ($sequence){
    print $sequence;
    exit;
  }
}

warn "\n*** FARM ERROR *** Couldn't find $reference in any of '@databases'\n\n";
Last edited: 26 February 2003 - Gary Williams