EMBOSS: Project Meeting (Monday 26th October 2009)

Attendees

EBI: Peter Rice, Alan Bleasby, Jon Ison, Mahmut Uludag
Visitors: Michael Schuster (EBI/Ensembl)
Apologies:

1. Minutes of the last meeting

Minutes of the meeting of 12th October 2009 are here.

2. Maintenance etc.

2.1 Applications

Mahmut has committed needleall. Calculations use single precision floats.

2.2 Libraries

Peter noted that a user has pointed out data access is remarkably fast from ACNUC servers, although the data itself may be rather old if the server has not been recently updated. He will look into adding ACNUC as a data access method.

Peter will look into issues with plotting in mEMBOSS. prettyplot produces poorly scaled text and boxes in a 'win3' graphics window under mEMBOSS.

Mahmut has updated the ajAlignDefineSS function to specify 'const' sequences as the originals are unchanged. The constructors have been refactored to make further constructors and ajAlignDefine functions easier to add.

2.3 Other

Mahmut suggested quoting any wildcard characters in command lines reported by Jemboss. There was a recent example where a user specified '*' as a search term for extractfeat and reported very confusing error messages.
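As an illustration of the quoting suggestion, the sketch below uses Python's standard shlex.quote to render an argument list as a command line that can be pasted into a POSIX shell without wildcard expansion. It is not Jemboss code (Jemboss is written in Java); the helper name report_command and the example arguments are assumptions made up for this note.

```python
import shlex


def report_command(args):
    """Render an argument list as a shell-safe command line (hypothetical helper)."""
    # shlex.quote leaves plain words untouched and single-quotes anything
    # that contains shell metacharacters such as '*' or '?'.
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    # Illustrative arguments only; not taken from the user's report.
    print(report_command(["extractfeat", "-sequence", "input.seq", "-type", "*"]))
    # prints: extractfeat -sequence input.seq -type '*'
```

Quoting every argument that needs it, rather than special-casing '*', also covers '?', spaces and other characters the shell would otherwise interpret.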

Mahmut will implement the proposed Jemboss jar file changes. Two additional jar files for 3D output are not well supported and will be removed. Another jar file to print DNA alignments can be merged into the main jar file.

Mahmut has checked in a file of Illumina adaptor sequences obtained from users at Sanger. He is using a BioPerl example data file for testing as it is publicly available data.

3. New developments

Alan has completed work on the ensembl/ library under Unix, and is now testing with mEMBOSS. There are no issues with DLL file compatibility. The SQL application runs. The ensembl application fails because there is a schema change to the exon handling fields in a recent ensembl release.

Michael explained that schema changes are only allowed in alternate releases, around 5 per year. We can test the latest development server and compare results with the Ensembl BioPerl API. Peter will incorporate these tests into the QA suite.

Michael will update the ensembl library code after the next ensembl release. This is his turn as release coordinator so he will have no time until next month.

Peter explained the modifications needed to function and datatype naming to conform to EMBOSS library standards. There was some discussion about the readability of code with long multi-word datatypes. As these can be renamed with a script (renaming datatypes and functions, and adjusting @namrule numbers), the current cleanup will continue, followed by a code review when it is completed.

Michael has split one of the source files as it was exceptionally large. He will send details of the split so that we can update the EMBOSS copy of the library.

Michael explained the preference for ENSEMBL to serve whole result sets rather than chunks. The server receives a very large number of requests and would be more strained by repeated queries to serve chunks of data.

Michael will work through the 'FIXME' comments in the code. Many of these are use cases where the native ensembl API may need changes.

Michael suggested that for complex queries it may be faster to use a local copy of the ENSEMBL data. We can try both local and remote access.

Alan is now looking into the BioMart (Perl) API. This uses XML, for which we will need to use the expat library. As we aim to avoid additional dependencies, the current expat source code will be included in an AJAX sub-library with the name 'libeexpat'. Functions will be renamed using macros to avoid name clashes with the native library which may be linked in automatically on some systems (possibly with X11 under CYGWIN).

Alan recommends the macro-based renaming approach as adding m4 macro tests for multiple libraries would be excessively complicated.

Alan noted a similar renaming is needed in pcre where the 'regcomp' function has a conflict warning on MacOSX.

Peter will look into reading large next generation sequence data assemblies from SAM or BAM data files. This will require a lighter version of the AjPAlign data structure to limit the memory requirements. BAM format will also need a compression/decompression library. Further information is in the samtools documentation.
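To make the memory concern concrete, here is a minimal Python sketch of a lighter alignment record. It is not EMBOSS code and says nothing about the AjPAlign internals: it parses one SAM alignment line and keeps only a handful of the eleven mandatory tab-separated fields. The class name LightRead, the function names, and the choice of retained fields are assumptions for illustration; the column order follows the SAM format description in the samtools documentation.

```python
class LightRead:
    """Hypothetical slimmed-down alignment record."""

    # __slots__ avoids a per-object dictionary, which matters when
    # millions of reads are held in memory at once.
    __slots__ = ("name", "flag", "ref", "pos", "cigar")

    def __init__(self, name, flag, ref, pos, cigar):
        self.name = name
        self.flag = flag
        self.ref = ref
        self.pos = pos
        self.cigar = cigar


def parse_sam_line(line):
    """Parse one SAM alignment line, discarding sequence and quality."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has 11 mandatory fields")
    # Columns 1-4 are read name, flag, reference name and position;
    # column 6 is the CIGAR string.
    return LightRead(fields[0], int(fields[1]), fields[2],
                     int(fields[3]), fields[5])


def read_sam(lines):
    """Yield light records from an iterable of lines, skipping '@' headers."""
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue
        yield parse_sam_line(line)
```

BAM input would additionally need the decompression step mentioned above before records reach a parser of this kind; that part is not sketched here.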

Jon is preparing the first beta release of the EDAM ontology. The model has been checked by Tony Burdett. Meetings have been held with the EBI BioCatalogue team and the links to the ontology and documentation have been passed to the EMBRACE registry and other interested EMBRACE partners. An issue remains in the relations between reports and datatypes.

The BioCatalogue team at EBI are keen to use EDAM to supplement or replace their internal ontology, especially for use in registry searches. Their main concern is the sustainability of the ontology beyond the end of EMBRACE.

UCL will check new terms for the beta release. They have also provided a list of protein domain/family proposed terms.

Jon has been in communication with Paul Gordon in Calgary. Paul has recommended syntax for SAWSDL annotations that conform to practice in Canadian BioMOBY services. The annotations can use PURL persistent URLs or LSID life science identifiers, with the PURL option preferred to avoid issues with LSID resolution and so that users get more useful messages from links that have since moved.

Paul also suggested adding some annotation for the data schema in SAWSDL, especially useful for defining workflow inputs and outputs but also useful for linking services more accurately.

Jon reported from a meeting with EBI External Services that they would like to see the EB-eye interface included as it is particularly good at defining and reporting cross-referenced data. Peter plans to define a "SERVER" object in emboss.defaults which represents access to a set of data resources which can be defined individually with reference to the server, or referenced by the user through a server name and database name combination.

4. Administration

Alan will look into converting the code repository to SVN. This would allow branches and make it easier to delete or move files. We currently do not use CVS tags or branches so a move should be straightforward.

Alan will look into options for Windows 7, including the need for separate 32 and 64 bit versions.

5. Documentation and Training

5.1 Books

Jon had a reply from the typesetters. There are a few required fonts which we need to check for. In some cases stylesheet modifications will be needed.

The main issue the publishers have is the timing of physical production as they would like to have a target conference for the book launch.

The timeline to produce the books is about 8 months. We would clearly have a new release out by then. We can provide an addendum to the books for each release to keep them up to date.

A deadline of December 24th was agreed for the final book text, with 2 weeks to test the generated text before passing to CUP at around the time of the next EMBOSS release.

5.2 Other

Peter reported that Thunderbird 3.0b4 is more stable after the latest bugfix although it does not specifically refer to the inbox display problems.

6. User queries and answers

Mahmut tested the prettyseq problem but found it worked in the current CVS version.

Peter will look into extractfeat and features on the reverse strand under Linux, then Alan can check on mEMBOSS.

Mahmut checked on a report of problems with tfm and the location of documentation files. This was a local environment variable issue.

Peter is working on extending the Phylip formats for sequence input and output.

A user reported that gap penalty limits are not consistent. We will review all applications and set similar limits for each of them.

A user has requested an option to define disulphide bridges between Cysteines in iep for molecular weight calculation.
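The arithmetic behind this request is small: each disulphide bridge joins two cysteine thiol groups and releases two hydrogen atoms, so the molecular weight falls by about 2.016 Da per bridge. The Python sketch below only illustrates that correction. The function name and its parameters are hypothetical, and it does not reflect how any EMBOSS application computes weights.

```python
HYDROGEN_MASS = 1.00794  # average atomic mass of hydrogen, in Da


def weight_with_disulphides(reduced_weight, n_bridges, n_cysteines):
    """Correct a fully reduced molecular weight for disulphide bridges.

    Hypothetical helper: reduced_weight is the weight with all cysteines
    as free thiols, n_bridges the number of Cys-Cys bridges requested.
    """
    if n_bridges < 0:
        raise ValueError("number of bridges cannot be negative")
    if 2 * n_bridges > n_cysteines:
        raise ValueError("each bridge needs two cysteines")
    # Two hydrogens are lost for every bridge formed.
    return reduced_weight - 2 * HYDROGEN_MASS * n_bridges
```

Validating the bridge count against the number of cysteines gives the user an immediate error rather than a silently wrong weight.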

The pepstats documentation fails to describe the local data file Emolwt.dat.

Mahmut has corrected vectorstrip to trim the sequence quality scores.
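The underlying idea is that quality scores must stay in step with the trimmed sequence. A minimal Python sketch, assuming one quality score per base; the function trim_with_quality and its zero-based, half-open coordinates are illustrative and are not the vectorstrip implementation.

```python
def trim_with_quality(seq, quals, start, end):
    """Trim a sequence and its per-base quality scores to [start, end).

    Hypothetical helper: seq is a string, quals a list with one score
    per base, and start/end are zero-based half-open coordinates.
    """
    if len(seq) != len(quals):
        raise ValueError("need exactly one quality score per base")
    if not 0 <= start <= end <= len(seq):
        raise ValueError("trim coordinates out of range")
    # Apply the same slice to both so base i keeps quality i.
    return seq[start:end], quals[start:end]
```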

A user has sent a patch to water to align multiple sequence inputs.

7. AOB

Peter reported from the Large Database and Networks meeting in Beijing. He gained the impression that the EMBOSS updatable indexing capability could be particularly important for remote sites that need to limit the network traffic used for database updates. He also discussed the WebLab interface with the development/support team.

8. Date Of Next Meeting

The next meeting will be on Monday 2nd November.