EMBOSS: Project Meeting (Monday 27th Sep 2010)

Attendees

EBI: Peter Rice, Alan Bleasby, Jon Ison, Mahmut Uludag, Michael Schuster
Visitors:
Apologies:

1. Minutes of the last meeting

Minutes of the meeting of 20th September 2010 are here.

2. Maintenance etc.

2.1 Applications

None.

2.2 Libraries

Mahmut has been working on next-generation sequence data.

2.3 Other

Alan looked at compilation and optimisation, especially to tidy up 64-bit pointers on 32-bit machines. When set to allow large files, the compiler still checks for file offset bits but doesn't add command-line options to use the 64-bit calls directly. There are some options:

to turn on file offset bits so the compiler uses ftello and fseeko for ftell and fseek
--enable-large-files also uses stat64 in the code. It is not clear if stat is converted to stat64 automatically.
--have-ftello is also a possibility, as a flag to check that fseeko and ftello exist.

Alan will check again for a consistent solution. Testing is limited as we no longer have test systems for AIX, HP-UX, Compaq or Solaris with the native compiler. We can also seek help from the emboss-dev mailing list to test on other platforms.
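As an illustration of the large-file calls discussed above, here is a minimal sketch in C. It assumes that "file offset bits" refers to the _FILE_OFFSET_BITS feature macro, which, together with _LARGEFILE_SOURCE, makes off_t 64 bits wide and exposes fseeko and ftello on glibc-style systems. The function name file_size_bytes is hypothetical and is not part of EMBOSS.

```c
/* Feature macros must precede every system header.
   Assumption: the target C library honours these macros (glibc does). */
#define _LARGEFILE_SOURCE 1
#define _FILE_OFFSET_BITS 64

#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper, not an EMBOSS function: return the size of a
   file in bytes using the off_t-based calls, or -1 on error. */
off_t file_size_bytes(const char *path)
{
    FILE *fp;
    off_t size;

    assert(path != NULL);

    fp = fopen(path, "rb");
    if (fp == NULL)
        return (off_t) -1;

    /* fseeko takes an off_t offset where fseek takes a long */
    if (fseeko(fp, (off_t) 0, SEEK_END) != 0) {
        fclose(fp);
        return (off_t) -1;
    }

    /* ftello returns an off_t where ftell returns a long */
    size = ftello(fp);
    fclose(fp);
    return size;
}
```

On a 32-bit platform, long is normally 32 bits, so ftell cannot report positions beyond 2 GB, whereas off_t with 64 file offset bits can. Whether stat is mapped to a 64-bit variant in the same way is the open question noted above and is not addressed by this sketch.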

Alan reported on the new configure options for gsoap library configuration.

Installing gsoap requires C source files to be generated from a given set of WSDL definitions, e.g. EBI's Wsebeye, by invoking two programs defined in gsoap.m4. This macro is used when --with-gsoap is defined, and runs the two programs to create the C source files in ajaxdb before AJAX is compiled.

But gsoap also looks at HAVE_CONFIG_H so we also have to set config.h for EMBOSS, EMBASSY, and EMBASSY with EMBOSS installed, and we need to get the correct search order for config.h files.
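For context, the conventional autoconf guard around config.h looks like the sketch below. It is a generic illustration, not code taken from EMBOSS or gsoap. PACKAGE_NAME is a macro that autoconf normally writes into config.h, while the fallback value and the function name config_package_name are assumptions made for this example.

```c
/* config.h is included only when the build defines HAVE_CONFIG_H,
   typically with -DHAVE_CONFIG_H on the compiler command line.
   Which config.h is picked up depends on the -I search order. */
#ifdef HAVE_CONFIG_H
#include <config.h>
#endif

#include <string.h>

/* Fallback for builds without a config.h (assumption for this sketch). */
#ifndef PACKAGE_NAME
#define PACKAGE_NAME "unknown"
#endif

/* Hypothetical helper: report which package's config.h was seen. */
const char *config_package_name(void)
{
    return PACKAGE_NAME;
}
```

Because the angle-bracket include is resolved through the include path, the first directory on that path containing a config.h wins, which is why the search order matters when EMBASSY is built against an installed EMBOSS.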

Alan needs to do more work to get gsoap working on Windows. It will be useful for remote data access as the EBI SRS server is the usual method on Windows but has a very poor response time.

The gsoap code generation creates '//' comments which some compilers will not like.

Michael reported configure problems on MacOSX which will be fixed later today.

3. New developments

3.1 Access methods

Michael continues to clean up the code documentation and function names for the Ensembl API.

Mahmut has checked in a partial implementation of DAS sequence access and is looking at how to handle features.

The UCSC genome browser and CHADO are possible new data sources. The "Data Resources" wiki page has details of UCSC access through a REST API, mainly MySQL but using a completely different schema with a new table for each new database. Michael described CHADO's data table, which is configured to store generic objects and is so flexible that retrieving specific objects is difficult.

3.2 Database configuration

Peter has created server definitions to define attributes common to a set of databases. A DB definition can include a server: attribute naming a server definition which can be used for common attributes. Most DB attributes are also valid in a server definition. A detailed description is on the EMBOSS wiki.

3.3 Data types

Peter has implemented OBO and outobo as new ACD types, with a -oformat option for output, and an OBO format parser for input. This is needed to use EDAM and other OBO ontologies in EMBOSS.
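To show the kind of input an OBO parser deals with: OBO flat files consist of stanzas such as [Term], each followed by "tag: value" lines. The sketch below splits one such line. It is a simplified illustration and not the EMBOSS parser: the function name obo_split_tag_value is hypothetical, and trailing "! comment" text, escapes and modifiers are not handled.

```c
#include <stddef.h>
#include <string.h>

/* Split one OBO "tag: value" line into its two parts.
   Returns 1 on success, 0 if the line is not of that shape
   (e.g. a "[Term]" stanza header) or a buffer is too small. */
int obo_split_tag_value(const char *line,
                        char *tag, size_t tagsize,
                        char *value, size_t valuesize)
{
    const char *colon = strchr(line, ':');
    const char *start;
    size_t taglen;
    size_t vallen;

    if (colon == NULL || colon == line)
        return 0;

    /* the tag is everything before the first colon */
    taglen = (size_t) (colon - line);
    if (taglen >= tagsize)
        return 0;

    /* skip whitespace between the colon and the value */
    start = colon + 1;
    while (*start == ' ' || *start == '\t')
        start++;

    /* the value runs to the end of the line */
    vallen = strcspn(start, "\r\n");
    if (vallen >= valuesize)
        return 0;

    memcpy(tag, line, taglen);
    tag[taglen] = '\0';
    memcpy(value, start, vallen);
    value[vallen] = '\0';
    return 1;
}
```

Only the first colon separates tag from value, so an identifier that itself contains a colon, as OBO term identifiers do, is kept whole in the value.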

Peter has extended the "USA" syntax to more general access for any datatype, including sequence, feature and OBO. Query fields need to be made general (not only the sequence-specific ones hardcoded in EMBOSS).

Text access can now be made general for any data type. Where there is no text (e.g. Ensembl data retrieval as an object) we should report that text is not supported.

3.4 EDAM

Jon suggested a converter to generate JSON files from ACD to help GUI developers. Peter proposed waiting for a request from a GUI developer.

PURLs have been created for all EDAM terms. The submission script works, but there is a server timeout if too many are created.

The script uses a directory of XML files, with one term per file, and can be rerun to make sure that all terms have been submitted.

The scripts are committed to the edamontology CVS server.

Online EDAM documentation is updated for the new formats, including html, xml and json representations of OBO.

The edamclean utility is updated to write modified files.

Peter noted there are two copies of EDAM on the NCBO portal, as the beta_08 release is still there.

3.5 DRCAT

Database categories have been adopted from BioCatalogue, and defined for all entries. These will be replaced by EDAM topic terms. Most were easy to map. A few needed new topic terms defined. In some cases the BioCatalogue categories were too specific to be useful.

Jon is now working through the EDAM topic terms to simplify them. A maximum of 3 levels deep should be sufficient. For data curators, a moderate number of well-structured terms is needed.

Once completed, Jon will review the annotation of 1700 services in BioCatalogue.

Jon reported on Dmitri's development of the "BioNemus" WSDL editor, which reads in an OWL version of EDAM to define data structures. EDAM will continue to be maintained in OBO, but this can be used to generate a purely semantic OBO and OWL version for BioNemus to use.

4. Administration

4.1 Windows

Alan will rebuild mEMBOSS.

4.2 Hardware

A replacement for the emboss7 server is installed, and waiting for the Systems Group to move it to the machine room.

5. Documentation and Training

5.1 Books

The copy editor's version of the Developer's Guide has been checked and amendments noted.

Jon will take a break before checking the next book.

Indexing is an issue still to be addressed.

6. User queries and answers

All done.

7. AOB

None.

8. Date Of Next Meeting

The next EMBOSS meeting will be on Monday 4th October.