EMBOSS: C2 Sequence Databases

Databases are defined in the control files "emboss.default" and "$HOME/.embossrc".

The keys to database definition are the query level and the access method. Database entries can be referred to at three levels: "id" asks for a specific entry; "query" asks for a wildcard entry name, or uses some other query mechanism, and can return one or more entries; "all" reads every entry from the database. Access methods can handle one or more of these levels. For example, query methods using URLs can usually read only single entries, while databases defined as flat files without indexing can in practice only be used to read all entries.
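As a rough illustration of the three levels, the sketch below classifies a requested entry name. It relies on the wildcard rule noted under Issues ("*" or "?" in the name). Treating an empty name or a bare "*" as "all" is an assumption made for this example, and the names QueryLevel and query_level are hypothetical, not part of EMBOSS.

```python
from enum import Enum


class QueryLevel(Enum):
    ID = "id"        # one specific entry
    QUERY = "query"  # wildcard name, may match one or more entries
    ALL = "all"      # every entry in the database


def query_level(entry_name):
    """Classify an entry name into one of the three query levels.

    Assumption for this sketch: no name, or a bare "*", means "all".
    """
    if not entry_name or entry_name == "*":
        return QueryLevel.ALL
    if "*" in entry_name or "?" in entry_name:
        return QueryLevel.QUERY
    return QueryLevel.ID
```

With this rule, a name such as "hsfau" would be treated as "id" and "hsf*" as "query".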

Current formats:

DIRECT (flat files in fasta, embl or other simple formats)
SRS (getz at "id" or "query" level, usually DIRECT for "all") returning original format entries with "getz -e"
SRSFASTA (as SRS but returning FASTA format with "getz -d -sf fasta" to allow SRS to manage format conversion. Useful for e.g. dbEST.reports files)
SRSWWW (uses local or remote SRSWWW server, for example to read single GenBank entries)
URL any URL including the entry name, for example the EBI emblfetch server.
EXTERNAL any application with the entry name in the command line to be run via a fork.
NBRF using the .ref and .seq files to establish methods for reading data and sequence in separate files (GCG 10.0 is a clear target format here subject to confirmation that the format is in the public domain). Index access can use the NBRF index files.
EMBLCD uses the EMBL CD indices to retrieve entries and handle wild card queries. This indexing method is already extensively used at the Sanger Centre.
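One way to read the list above is as a table of which levels each access method can serve. The sketch below encodes that reading and picks the first suitable method for a request. The table is an interpretation of the descriptions, not a definitive specification: SRSWWW and EXTERNAL are assumed to serve single entries only, NBRF is left out because its levels are not stated, and SUPPORTED_LEVELS and choose_method are hypothetical names.

```python
# Levels each access method is assumed to support, read off the list above.
SUPPORTED_LEVELS = {
    "DIRECT": {"all"},            # flat files, read every entry
    "SRS": {"id", "query"},       # getz at "id" or "query" level
    "SRSFASTA": {"id", "query"},  # as SRS, returning FASTA format
    "SRSWWW": {"id"},             # assumed: single entries only
    "URL": {"id"},                # URL including the entry name
    "EXTERNAL": {"id"},           # assumed: entry name on the command line
    "EMBLCD": {"id", "query"},    # index lookup and wildcard queries
}


def choose_method(methods, level):
    """Return the first method in `methods` that supports `level`, else None."""
    for method in methods:
        if level in SUPPORTED_LEVELS.get(method, set()):
            return method
    return None
```

A database defined with SRS for lookups and DIRECT for full reads would then resolve "all" to DIRECT and "id" or "query" to SRS.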

Planned formats

BLAST using the blast2 (formatdb) index files to return single entries, all entries and wildcards. This may mean generating missing index files (for non-NCBI format input).
BLAST1 using the blast1 (setdb/pressdb) index files and possibly a fasta file. This may mean generating EMBOSS specific index files for fast searches by ID and accession number.
GCG subject to confirmation that the format is in the public domain, or using public domain code such as that in FASTA together with some other indexing method (SRS or EMBOSS specific).
CORBA will talk to a CORBA client (probably in Java) running separately and read from a CORBA server somewhere. First trials will use the EBI CORBA server and the SRS 6.0 Object Server.

Issues:

Some care is needed to make query level work so that access methods can continue when the next sequence is requested. At the ACD processing stage the first sequence must be returned. Later calls will request any other sequences from the same "input stream".
Queries are identified by wildcards in the name, using "*" or "?" as for SRS (and VMS).
The definition of "query level" is part of the USA (Uniform Sequence Address). When databases are involved, it must be possible to identify the query level from the USA in order to know which access method to use.
Can CORBA be easily supported? There have been many problems trying to use CORBA from C while avoiding proprietary ORBs, but there appear to be usable free ORBs available now. A Java client communicating with an EMBOSS application is another method we can try. Ideally, sites with CORBA servers would provide client code.
Database subsets are not fully supported yet. We can add these for many of the database access methods as simple lists of files to be included. We can make subindices for some methods, for example the EMBL-CD index files.
- We cannot make subsets of SRS indexed databases unless we allow expensive SRS queries to make the subsets each time we fetch an entry.
- We cannot make subsets of blast1 or blast2 databases where all the data is in one file, but we can allow sites to make subset databases in blast1 or blast2 format and use them directly.
Database supersets are not supported yet. There are interesting possibilities, for example:
- EMBL and EMBLNEW with entries in EMBL flagged to be ignored and entries in EMBLNEW replacing them. We could keep a list of ignored entries to be updated with database updates, or make a small change to the original entries without rewriting the entire file.
- SPTREMBL and SWISSPROT and SWISSNEW could be managed in the same way.
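The EMBL plus EMBLNEW superset described above could behave roughly as in the following sketch. Plain dictionaries keyed by entry name stand in for the two databases, and the ignore list is a set of names. All of these are illustrative assumptions rather than an EMBOSS data structure, and superset_entries is a hypothetical name.

```python
def superset_entries(main, update, ignored=frozenset()):
    """Yield (name, record) pairs for a main database combined with its update.

    Entries in `main` are skipped when they are flagged in `ignored` or
    when an entry with the same name exists in `update`; every entry in
    `update` is then returned.
    """
    for name, record in main.items():
        if name in ignored or name in update:
            continue
        yield name, record
    for name, record in update.items():
        yield name, record
```

The same function would cover the SWISSPROT and SWISSNEW case, since only the entry names and the ignore list are consulted.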

Other points

The USA should specify "dbname-id:name" or "dbname-acc:name"; otherwise both "id" and "acc" are checked, if available.
Other search options could be useful, e.g. "des" could search descriptions in FASTA format too. At present, "des" is checked for but not yet used. Some searches could become very complicated.
When a database (or its index) is being updated, a test for a "lock" file would be helpful. By default, this should be in the database index directory.
If existing database formats are inefficient, an EMBOSS index format could be a useful addition. Areas where this could help include:
1. Including "id" and "acc" in a single index.
2. Removing deleted entries, for example by keeping a list of valid entries to check against a full parse of the database.
3. Indexing a database and update (EMBL and EMBLNEW) together.
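The "dbname-id:name" and "dbname-acc:name" forms mentioned at the start of this section could be split apart as in the sketch below. It is only an illustration of the rule as stated: the function name parse_db_usa and the returned tuple are hypothetical, and "des" is recognised only because the text says it is checked for.

```python
KNOWN_FIELDS = ("id", "acc", "des")


def parse_db_usa(usa):
    """Split "dbname[-field]:name" into (dbname, fields_to_search, name).

    Without an explicit field, both "id" and "acc" are searched.
    """
    db_part, sep, name = usa.partition(":")
    if not sep or not db_part or not name:
        raise ValueError("expected dbname:name, got %r" % usa)
    dbname, dash, field = db_part.rpartition("-")
    if dash and dbname and field in KNOWN_FIELDS:
        return dbname, [field], name
    # No recognised field suffix: the whole prefix is the database name.
    return db_part, ["id", "acc"], name
```

Checking the suffix against a fixed set of field names keeps database names that themselves contain a hyphen intact.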