EMBOSS: C2 Sequence Databases
Databases are defined in the control files
"emboss.default" and "$HOME/.embossrc".
The keys to database definition are the query level and the
access method. Database entries can be referred to at 3 levels:
"id" asks for a specific entry, "query" asks for a
wildcard entry name or some other query mechanism that can return 1 or
more entries. "all" reads every entry from the database.
Access methods can handle one or more of these levels. For example,
query methods using URLs can usually only read single entries while
databases defined as flat files without indexing can only be used in
practice to read all entries.
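The relationship between query levels and access methods can be sketched as follows. This is only an illustration: the capability table holds just the three examples mentioned in the text, the function names are hypothetical, and treating a missing entry name as an "all" request is an assumption. The wildcard characters are those listed under Issues.

```python
# Sketch only: this capability table is illustrative, not EMBOSS's own.
METHOD_LEVELS = {
    "URL": {"id"},            # URL methods can usually only read single entries
    "DIRECT": {"all"},        # unindexed flat files: in practice, read everything
    "SRS": {"id", "query"},   # getz works at "id" or "query" level
}


def query_level(entry_name):
    """Classify a request as "id", "query" or "all"."""
    if not entry_name:
        return "all"          # assumption: no entry name means every entry
    if "*" in entry_name or "?" in entry_name:
        return "query"        # wildcard: may return one or more entries
    return "id"               # plain name: one specific entry


def can_serve(method, entry_name):
    """True if the access method supports the level this request needs."""
    return query_level(entry_name) in METHOD_LEVELS[method]
```

With this table, `can_serve("URL", "hs*")` is false because a URL method cannot expand a wildcard, while `can_serve("SRS", "hs*")` is true.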
Current formats:
- DIRECT (flat files in fasta, embl or other simple formats)
- SRS (getz at "id" or "query" level, usually
DIRECT for "all") returning original format entries with
"getz -e"
- SRSFASTA (as SRS but returning FASTA format with "getz -d -sf
fasta" to allow SRS to manage format conversion. Useful for
e.g. dbEST.reports files)
- SRSWWW (uses local or remote SRSWWW server, for example to read
single GenBank entries)
- URL any URL including the entry name, for example the EBI
emblfetch server.
- EXTERNAL any application with the entry name in the command line
to be run via a fork.
- NBRF using the .ref and .seq files to establish methods for
reading data and sequence in separate files (GCG 10.0 is a clear
target format here subject to confirmation that the format is in the
public domain). Index access can use the NBRF index files.
- EMBLCD uses the EMBL CD indices to retrieve entries and handle
wild card queries. This indexing method is already extensively used at
the Sanger Centre.
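As a rough sketch of how the command-based methods above differ, the function below builds the command line or URL for a single-entry fetch without running it. The "getz" options are the ones quoted in the list; the "[db-id:name]" query string, the "{entry}" placeholder, the function name and the example URL and program are assumptions made for illustration.

```python
import shlex


def retrieval_command(method, db, entry, template=None):
    """Return an argv list (or a URL string) for fetching one entry."""
    query = f"[{db}-id:{entry}]"      # assumed SRS-style query string
    if method == "SRS":
        return ["getz", "-e", query]                   # original format entry
    if method == "SRSFASTA":
        return ["getz", "-d", "-sf", "fasta", query]   # SRS converts to FASTA
    if method == "URL":
        return template.replace("{entry}", entry)      # entry name in the URL
    if method == "EXTERNAL":
        # Split the template first so the entry name stays a single argument.
        return [tok.replace("{entry}", entry) for tok in shlex.split(template)]
    raise ValueError(f"no single-entry command for method {method!r}")


# Hypothetical templates, not real servers or programs.
url = retrieval_command("URL", "embl", "hsfau",
                        "http://example.org/fetch?id={entry}")
argv = retrieval_command("EXTERNAL", "embl", "hsfau",
                         "myfetch --db embl {entry}")
```

Returning an argument list rather than a shell string suits the EXTERNAL case, where the application is run via a fork and no shell quoting is needed.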
Planned formats:
- BLAST using the blast2 (formatdb) index files to return single
entries, all entries and wildcards. This may mean generating missing
index files (for non-NCBI format input).
- BLAST1 using the blast1 (setdb/pressdb) index files and possibly a
fasta file. This may mean generating EMBOSS specific index files for
fast searches by ID and accession number.
- GCG subject to confirmation that the format is in the public
domain, or using public domain code such as that in FASTA together
with some other indexing method (SRS or EMBOSS specific).
- CORBA will talk to a CORBA client (probably in Java) running
separately and read from a CORBA server somewhere. First trials will
use the EBI CORBA server and the SRS 6.0 Object Server.
Issues:
- Some care is needed to make query level work so that access
methods can continue when the next sequence is requested. At the ACD
processing stage the first sequence must be returned. Later calls will
request any other sequences from the same "input stream".
- Queries are identified by wildcards in the name, using "*"
or "?" as for SRS (and VMS).
- The definition of "query level" is part of the USA. It
must be possible to identify the query level from the USA when
databases are involved to know which access method to use.
- Can CORBA be easily supported? There have been many problems
trying to use CORBA from C while avoiding proprietary ORBs, but
there appear to be usable free ORBs available now. A Java client
communicating with an EMBOSS application is another method we can try.
Ideally, sites with CORBA servers would provide client code.
- Database subsets are not fully supported yet. We can add these for
many of the database access methods as simple lists of files to
be included. We can make subindices for some methods, for example
the EMBL-CD index files.
- We cannot make subsets of SRS indexed databases
unless we allow expensive SRS queries to make the subsets each time
we fetch an entry.
- We cannot make subsets of blast1 or blast2 databases where all
the data is in one file, but we can allow sites to make subset
databases in blast1 or blast2 format and use them directly.
- Database supersets are not supported yet. There are interesting
possibilities, for example:
- EMBL and EMBLNEW with entries in EMBL flagged to be ignored and
entries in EMBLNEW replacing them. We could keep a list of ignored
entries to be updated with database updates, or make a small change to
the original entries without rewriting the entire file.
- SPTREMBL and SWISSPROT and SWISSNEW could be managed in the same way.
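The first two issues above (continuing a query after the first sequence has been returned, and wildcard matching with "*" and "?") can be illustrated with a small sketch. It assumes a hypothetical in-memory list of entries standing in for a database, and it uses a Python generator to stand in for the "input stream" state; it is not how EMBOSS itself stores that state.

```python
from fnmatch import fnmatchcase


def open_stream(entries, pattern):
    """Yield (name, sequence) pairs whose name matches the wildcard pattern.

    A generator keeps its position between calls, so the first match can be
    returned at the ACD processing stage and the rest requested later.
    Note that fnmatchcase also gives "[...]" a special meaning, which goes
    beyond the "*" and "?" described in the text.
    """
    for name, sequence in entries:
        if fnmatchcase(name, pattern):
            yield name, sequence


# Hypothetical entries, for illustration only.
DATABASE = [("HSFAU", "ttcctc"), ("HSFOS", "ccgcat"), ("MMFAU", "ggatcc")]

stream = open_stream(DATABASE, "HS*")
first = next(stream)    # the sequence returned during ACD processing
rest = list(stream)     # later calls continue from the same input stream
```

The point of the design is that the access method never restarts the query: whatever it needs to resume (file position, index cursor, remaining hits) travels with the stream.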
Other points:
- The USA should specify "dbname-id:name" or "dbname-acc:name";
otherwise both "id" and "acc" are checked, if available.
- Other search options could be useful, e.g. "des" could search
descriptions in FASTA format too.
At present, "des" is checked for but not
yet used. Some searches could become very complicated.
- When a database (or its index) is being updated, a test for a
"lock" file would be helpful. By default, this should be in the
database index directory.
- If existing database formats are inefficient, an EMBOSS index
format could be a useful addition. Areas where this could help
include:
- Including "id" and "acc" in a single index
- Removing deleted entries, for example a list of valid entries
to check against a full parse of the database.
- Indexing a database and update (EMBL and EMBLNEW)
together.
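The first point above, on naming the search field in the USA, might be handled along the following lines. This is a minimal sketch: the function name and return shape are hypothetical, only the "id" and "acc" fields are recognised, and a database name that itself contains a hyphen is assumed to be allowed.

```python
SEARCH_FIELDS = ("id", "acc")


def parse_db_usa(usa):
    """Split "dbname[-field]:name" into (dbname, fields_to_search, name)."""
    dbpart, _, name = usa.partition(":")
    dbname, dash, field = dbpart.rpartition("-")
    if dash and field in SEARCH_FIELDS:
        fields = (field,)          # explicit "dbname-id" or "dbname-acc"
    else:
        dbname = dbpart            # no recognised field suffix
        fields = SEARCH_FIELDS     # fall back to checking both id and acc
    return dbname, fields, name
```

For example, `parse_db_usa("embl-acc:x65923")` restricts the search to accession numbers, while `parse_db_usa("embl:hsfau")` leaves both fields to be checked.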