EMBOSS: Project Meeting (Mon 10th May 2010)


Attendees

EBI: Peter Rice, Alan Bleasby, Jon Ison, Mahmut Uludag
Visitors: Michael Schuster
Apologies:

1. Minutes of the last meeting

Minutes of the meeting of 26th April 2010 are here.

2. Maintenance etc.

2.1 Applications

Mahmut reported progress on the results of supermatcher and wordfinder. The aim is to use the Rabin-Karp algorithm for the initial word-matching phase of supermatcher.
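For reference, the general Rabin-Karp idea for word matching can be sketched as below. This is a minimal illustration in Python, not the supermatcher code: the function names, the hash base and the modulus are assumptions chosen for the example. Every word of the query is hashed with a rolling hash and indexed, the target is scanned with the same rolling hash, and each hash hit is verified by direct comparison to rule out collisions.

```python
from collections import defaultdict


def rolling_hashes(seq, wordlen, base=256, mod=(1 << 61) - 1):
    """Yield (position, hash) for every window of length wordlen in seq."""
    if wordlen <= 0 or len(seq) < wordlen:
        return
    # Weight of the leading character in a window, used when sliding.
    high = pow(base, wordlen - 1, mod)
    h = 0
    for ch in seq[:wordlen]:
        h = (h * base + ord(ch)) % mod
    yield 0, h
    for i in range(1, len(seq) - wordlen + 1):
        # Drop the character leaving the window, then append the new one.
        h = (h - ord(seq[i - 1]) * high) % mod
        h = (h * base + ord(seq[i + wordlen - 1])) % mod
        yield i, h


def find_word_matches(query, target, wordlen):
    """Return (query_pos, target_pos) pairs of exact word matches."""
    index = defaultdict(list)
    for qpos, h in rolling_hashes(query, wordlen):
        index[h].append(qpos)
    matches = []
    for tpos, h in rolling_hashes(target, wordlen):
        for qpos in index.get(h, ()):
            # Verify the hit, since different words can share a hash.
            if query[qpos:qpos + wordlen] == target[tpos:tpos + wordlen]:
                matches.append((qpos, tpos))
    return matches
```

Each window hash is updated in constant time, so the scan is linear in the sequence lengths apart from the verification of hits.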

Mahmut noted that supermatcher uses a -filter option which applies to the sequence stream input, and suggested that this should be standardized as the first input.

Mahmut has further tested needle. A bug in alignment traceback has been fixed and added as a QA test. The application also now runs faster as debug code has been disabled.

Mahmut has run validation tests on stretcher, whose results may not agree with the full global alignment from needle. Further tests are needed. The algorithm works in linear space and is documented as having a problem where gap penalties cross segment boundaries. The version of align in the FASTA2 distribution can be used to check that the output is consistent with the original version.
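As an aid to such comparisons, the optimal global alignment score can itself be computed in linear space by keeping only two rows of the dynamic programming matrix. The sketch below is illustrative only and is not the stretcher or needle code: the function name, the scoring values and the simple linear gap penalty are assumptions for the example (the EMBOSS programs use a substitution matrix with gap open and extend penalties). It returns the score only, which is enough to check whether two programs agree on the optimum.

```python
def global_score_linear_space(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b using two rows of memory.

    Assumes a linear gap penalty: every gap position costs `gap`.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against the empty prefix of b.
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
        prev = curr
    return prev[len(b)]
```

Memory use is proportional to the length of the second sequence rather than to the product of the two lengths.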

2.2 Libraries

Alan has enabled PDF and SVG graphics in plplot on both Unix and Windows. On Unix the configure script tests for the presence of the libharu package (linked as the libhpdf library) and will ignore PDF if the library is not found. Where the library is in a non-standard location the --with-libhpdf option can specify the path.

SVG output needs no special library. It simply writes an SVG file.

Peter will update ajgraph.c with the default output sizes so that applications can scale plots on the new devices. These have to be explicitly set as usage in plplot varies across device drivers. The figure for SVG will probably be a dummy value.

For Windows, Alan has bundled libhpdf as a DLL which is also committed to CVS and included in the mEMBOSS build.

2.3 mEMBOSS

Alan has updated the mEMBOSS build utility bundlewin to compile on Windows systems under Borland (this is no longer free from Borland, but may be available from elsewhere) and under Microsoft Visual Studio 10, with 3 files added to the distribution.

When saved with CVSNT, the DOS files had their end-of-line formatting incorrectly converted to Mac line endings.

2.4 Other

When Jemboss is first started, the batch job manager reports "No current jobs" until a background job has been started. This message might be confusing and can be improved.

There was discussion of the possible ways to test Jemboss. Some general tests are needed, for example menus, launching clustalw through emma, and graphical output. Mahmut can set up junit tests to be launched from ant. We also need tests of the GUI boxes.

Mahmut has added PDF and SVG graphics output options. Output can be viewed using the Java Desktop API, but this is only available from Java 1.6. Possibly the Java version could be tested at run time, using Java Desktop to launch a viewer for version 1.6 and some user-provided solution for earlier versions.

3. New developments

3.1 Ensembl access

Michael reported on the latest updates to the Ensembl library code. This currently runs with the previous release (56) of Ensembl. No major changes are needed to update to version 57.

Michael now has access to an Intel compiler at Sanger which generates messages for unused static variables from ajdefine.h. There are also type mismatches of integers and longs on 64-bit systems, for example in the ajstr string functions. Alan will review the Intel compiler output.

One compiler warning for 'parameter order undefined' is C++ specific and should be turned off according to Intel.

Michael has found a memory leak in ajStrNewS when called with an empty string.

3.2 BioMart access

Peter described the first implementation of BioMart access in the ajseqdb code. Results are returned by BioMart as a tab-separated record. This is now added as a new sequence format called "biomart". The database access method requires two attributes to be defined in the database definition. One is "identifier", which is used as the sequence name and returned as the first field in the record. The other is "sequence", which is the sequence attribute and is returned as the last field in the record. All other attributes are defined as "return" in the database definition, are returned in the central fields of the record, and are assumed for now to be sequence description. We may consider ways to define these attributes as more meaningful (gene name or database cross-reference, for example) in future.

Because this is a sequence database definition, and there can be many sequence attributes and species in a Mart, there could be many possible sequence databases defined.
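The record layout described above can be sketched as follows. This is a Python illustration of the field convention only, not the ajseqdb implementation: the MartRecord type and parse_mart_record function are hypothetical names, and joining the central fields with spaces is an assumption about how a description might be assembled.

```python
from typing import NamedTuple


class MartRecord(NamedTuple):
    """One parsed record (hypothetical type for this example)."""
    identifier: str
    description: str
    sequence: str


def parse_mart_record(line):
    """Split one tab-separated record into identifier, description, sequence."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 2:
        raise ValueError("record needs at least identifier and sequence fields")
    identifier = fields[0]   # first field: the "identifier" attribute
    sequence = fields[-1]    # last field: the "sequence" attribute
    # Central fields: the "return" attributes, treated here as description.
    description = " ".join(f for f in fields[1:-1] if f)
    return MartRecord(identifier, description, sequence)
```

For example, a made-up record "ID1<TAB>some gene<TAB>13<TAB>ACGT" would give the name "ID1", the description "some gene 13" and the sequence "ACGT".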

Peter plans to simplify BioMart access by defining a "server" where the server name is the start of a USA and the rest of the USA could specify the Mart (species), sequence field, and identifier filter.

Mart database definitions can include a "filter" attribute which is an additional BioMart query to, for example, limit the database to a single chromosome within an organism.

3.3 EDAM ontology

Jon has revised the validation code in edamclean to reflect changes to the top level terms and structure of the ontology.

Further edits and additions are planned. All ACD files for EMBOSS are re-annotated to the latest EDAM version. For the data definitions only 130 qualifiers had no specific annotation and have to use a tool-specific parameter term.

Sequence types have 2 terms, one for the type (gapped, etc., based on the EMBOSS sequence types) and one for "raw sequence", "sequence record", "sequence set" or "sequence stream". These are nested terms in EDAM. There is also a separate set of feature table terms.

Terms for data formats are now in a separate "syntax" branch to use for file and later for XML formats.

Jon has discussed EDAM with the BioCatalogue team at EBI. There are clear applications in tagging, ontology markup, and replacing existing service categories. This can start once the cleaned EDAM version is announced.

Jon will announce the new EDAM later today. The documentation will need revision.

3.4 Ontology access

Peter is working on code to parse OBO format files, including EDAM, GO and SO.
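For orientation, the basic structure of an OBO file (a header followed by stanzas such as [Term], each made up of "tag: value" lines) can be read with a few lines of code. The sketch below is a simplified Python illustration and not the EMBOSS parser: parse_obo is a hypothetical name, only [Term] stanzas are kept, and escape sequences, trailing modifiers and quoted strings containing " ! " are not handled.

```python
def parse_obo(text):
    """Return a list of dicts, one per [Term] stanza, mapping tag -> values."""
    terms = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("!"):
            continue  # blank line or whole-line comment
        if line.startswith("[") and line.endswith("]"):
            # A new stanza begins; keep only [Term] stanzas.
            if line == "[Term]":
                current = {}
                terms.append(current)
            else:
                current = None
            continue
        if current is None:
            continue  # header lines or a stanza type we are skipping
        tag, sep, value = line.partition(":")
        if not sep:
            continue  # not a "tag: value" line
        # Drop a trailing " ! comment" (simplified, see lead-in).
        value = value.split(" ! ")[0].strip()
        current.setdefault(tag.strip(), []).append(value)
    return terms
```

Values are stored as lists because tags such as is_a can occur more than once in a stanza.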

3.5 Other

Mahmut has generated SAWSDL annotation using the first relation for each application in ACD. This will be fixed by updates to acdxsd which Peter is working on. The SAWSDL has been tested with the EMBRACE registry scripts. Scripts were updated and changes will be checked with the External Services BioCatalogue team.

3.6 Release 6.3.0

There was discussion of features to be included in release 6.3.0 (July). These included:

4. Administration

Alan and Peter have tested the server configuration set up by the systems group and found performance to be very good. The server has strong I/O performance and is configured to support multiple users with twin quad-core processors and direct storage. CPU power is similar to the workstations for floating point performance.

Peter will make inquiries on pricing.

Alan noted that Fedora 13 is due for release on May 18th. No noticeable changes are anticipated. NFS4 is now the default. There were some beta release issues with opening NFS ports.

Jon will send details of software licenses needed for further EDAM work.

Peter will start planning for the first Scientific Advisory Board meeting.

5. Documentation and Training

5.1 Books

Jon aims to finish work on the Developer Guide before next week's EDAM effort. We are waiting for a reply from the publishers.

6. User queries and answers

None new.

7. AOB

Peter will attend the first Galaxy Developers Meeting in Cold Spring Harbor this weekend.

8. Date Of Next Meeting

The next meeting will be on Monday 17th May. Peter will still be away at the Galaxy meeting.