EMBOSS: C2 Sequence Databases


Databases are defined in the control files "emboss.default" and "$HOME/.embossrc".

The keys to database definition are the query level and the access method. Database entries can be referred to at three levels: "id" asks for a specific entry; "query" asks for a wildcard entry name, or uses some other query mechanism, and can return one or more entries; "all" reads every entry from the database. Each access method can handle one or more of these levels. For example, query methods using URLs can usually read only single entries, while databases defined as flat files without indexing can, in practice, only be used to read all entries.
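The relationship between access methods and query levels can be pictured as a small capability table. The following is only a minimal Python sketch of that idea, not EMBOSS code: the method names "url", "flatfile" and "indexed" are hypothetical labels chosen for illustration, and the levels assigned to "indexed" are an assumption; only the "url" and "flatfile" rows follow the description above.

```python
from enum import Enum


class QueryLevel(Enum):
    ID = "id"        # one specific entry
    QUERY = "query"  # wildcard or other query, returns one or more entries
    ALL = "all"      # every entry in the database


# Hypothetical access methods mapped to the levels each can serve.
# "url" and "flatfile" follow the text; "indexed" is an assumed example
# of a method that can serve all three levels.
METHOD_LEVELS = {
    "url": {QueryLevel.ID},
    "flatfile": {QueryLevel.ALL},
    "indexed": {QueryLevel.ID, QueryLevel.QUERY, QueryLevel.ALL},
}


def supports(method: str, level: QueryLevel) -> bool:
    """Return True if the named access method can serve the given level."""
    return level in METHOD_LEVELS.get(method, set())
```

With a table like this, a database definition could be checked up front: a request at the "id" level against a database whose only method is "flatfile" would be rejected before any file is opened.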

Current formats:

Planned formats:

Issues:

  1. Some care is needed to make the query level work so that access methods can continue when the next sequence is requested. At the ACD processing stage the first sequence must be returned. Later calls will request any other sequences from the same "input stream".
  2. Queries are identified by wildcards in the name, using "*" or "?" as for SRS (and VMS).
  3. The definition of "query level" is part of the USA (Uniform Sequence Address). When databases are involved, it must be possible to identify the query level from the USA in order to know which access method to use.
  4. Can CORBA be easily supported? There have been many problems trying to use CORBA from C while avoiding proprietary ORBs, but there appear to be usable free ORBs available now. A Java client communicating with an EMBOSS application is another method we can try. Ideally, sites with CORBA servers would provide client code.
  5. Database subsets are not fully supported yet. We can add these for many of the database access methods as simple lists of files to be included. We can make subindices for some methods, for example the EMBL-CD index files.
  6. Database supersets are not supported yet. There are interesting possibilities, for example:
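Returning to issues 1 to 3 above: the query level can be read off the USA, and a query has to behave as an input stream that returns the first entry and then continues on later calls. The Python sketch below illustrates both points under stated assumptions and is not EMBOSS code. It assumes a simplified USA of the form "dbname" or "dbname:entry", assumes that a bare database name or a lone "*" means "all", and uses made-up entry names. The function names are hypothetical. Note that Python's fnmatch also accepts "[...]" character sets, which goes beyond the "*" and "?" wildcards described above.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Iterator


def query_level(usa: str) -> str:
    """Classify a simplified 'dbname' or 'dbname:entry' USA.

    Assumption: a bare database name, or 'dbname:*', means every entry.
    """
    dbname, sep, entry = usa.partition(":")
    if not sep or entry in ("", "*"):
        return "all"
    if "*" in entry or "?" in entry:
        return "query"  # wildcard name, may match several entries
    return "id"         # one specific entry


def iter_matches(entry_names: Iterable[str], pattern: str) -> Iterator[str]:
    """Yield matching entry names one at a time.

    The generator keeps its position between calls, so the first match
    can be returned early and the rest requested later from the same
    stream.
    """
    for name in entry_names:
        if fnmatchcase(name, pattern):
            yield name


# Usage: the first call returns the first entry, later calls continue.
stream = iter_matches(["hsfau", "hsfau1", "ecolac"], "hsfa*")
first = next(stream)   # first matching entry
rest = list(stream)    # remaining matches from the same stream
```

A generator is used here only because it makes the "continue where the last call stopped" requirement explicit; an access method written in C would need to keep the equivalent state (open file, index position) in the query structure between calls.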

Other points