EMBOSS: Project Meeting (Mon 14th September 2009)


Attendees

EBI: Peter Rice, Alan Bleasby, Jon Ison, Mahmut Uludag
Visitors:
Apologies:

1. Minutes of the last meeting

Minutes of the meeting of 7th September 2009 are here.

2. Maintenance etc.

2.1 Applications

Mahmut has checked in an updated version of needle with end-weighted gaps. Some of the code needs reformatting to fit the code guidelines (e.g. maximum line length). The end-weighted code is maintained in one function for now. One user has requested end gaps for one sequence only. This may lead to a complicated set of gap options and perhaps a new application.
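
The effect of weighting end gaps separately can be sketched as follows. This is a minimal Python illustration, not the needle code: it assumes simple linear gap penalties rather than the gap scoring needle actually uses, and the function name and parameters (gap, end_gap, match, mismatch) are hypothetical.

```python
def global_align_score(a, b, match=1, mismatch=-1, gap=-2, end_gap=0):
    """Global alignment score with separate penalties for end gaps.

    Linear penalties only: each gap position costs `gap` inside the
    alignment and `end_gap` when it lies before the first or after the
    last residue of the other sequence.
    """
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    # Leading gaps are always end gaps.
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + end_gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + end_gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Gap in b: an end gap once all of b has been consumed.
            up = dp[i - 1][j] + (end_gap if j == m else gap)
            # Gap in a: an end gap once all of a has been consumed.
            left = dp[i][j - 1] + (end_gap if i == n else gap)
            dp[i][j] = max(diag, up, left)
    return dp[n][m]
```

Supporting end gaps for one sequence only would mean replacing the single end_gap value with one value per sequence (and possibly per end), which shows how quickly the set of options could grow.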

Peter has modified mse for a former GCG user who wants key-based commands as used in the GCG "seqed" editor. Mse has additional functions as a multiple alignment editor, so some of these commands are hard to define because the keys have already been allocated. A proposed solution has been sent to the user for testing.

Peter noted that there is a new release of Phylip (3.69). Peter will apply the changes to the EMBOSS version.

2.2 Libraries

Peter has, with help from Peter Cock, further improved the speed of FASTQ reading and writing, using lookup tables for quality score conversions. The overhead for writing Solexa quality scores is now only 10% of the total run time.
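
The lookup-table approach can be illustrated with a short Python sketch. It is not the EMBOSS C code: the function and table names are hypothetical, and it assumes the usual conventions of ASCII offset 33 for Sanger (Phred) qualities, offset 64 for Solexa qualities, and a minimum Solexa score of -5. The logarithm is evaluated once per possible score when the table is built, so each quality character costs only a table lookup.

```python
import math

SANGER_OFFSET = 33   # ASCII offset for Phred scores in Sanger FASTQ
SOLEXA_OFFSET = 64   # ASCII offset for Solexa scores in Solexa FASTQ
MIN_SOLEXA = -5      # lowest Solexa score, used here for Phred 0 and 1


def phred_to_solexa(phred):
    """Convert one Phred score to the nearest Solexa score."""
    if phred <= 0:
        return MIN_SOLEXA
    return max(MIN_SOLEXA, round(10 * math.log10(10 ** (phred / 10) - 1)))


# Built once: maps each Sanger quality character (as an ordinal)
# to the corresponding Solexa quality character.
SANGER_TO_SOLEXA = {
    SANGER_OFFSET + q: chr(SOLEXA_OFFSET + phred_to_solexa(q))
    for q in range(94)
}


def sanger_to_solexa_quality(quality):
    """Translate a Sanger quality string using the precomputed table."""
    return quality.translate(SANGER_TO_SOLEXA)
```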

Peter will investigate a report of problems with SwissProt output format (note after the meeting - all fixed in the current release).

Peter is working through a few remaining ajGraph function names.

Peter noted that one bug identified in the mse testing was that optional sequence inputs do not work properly. It appears that mse is the only application with an optional sequence input at present. The ACD processing has been changed to define a value for $acdprotein and other attributes of the missing sequence data.

A phylipnew user has been finding problems with distance matrix files containing '-1.000' where a value could not be calculated. These will be readable with a warning, and treated as missing values. Peter will check with Joe Felsenstein to see what we should do to handle these files correctly. It may depend on the application being used.
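
The intended reading behaviour (accept the file, warn, treat the value as missing) can be sketched in Python as below. This is an illustration under simplifying assumptions, not the phylipnew code: it assumes a square matrix with one row per line, names without embedded spaces, and whitespace-separated values. The function name is hypothetical.

```python
import warnings

MISSING_SENTINEL = -1.0  # the '-1.000' written where no distance could be calculated


def read_distance_matrix(text):
    """Parse a simple square distance matrix, mapping -1.000 to None."""
    lines = [line for line in text.splitlines() if line.strip()]
    count = int(lines[0].split()[0])
    names, rows = [], []
    for line in lines[1:count + 1]:
        fields = line.split()
        names.append(fields[0])
        row = []
        for column, token in enumerate(fields[1:]):
            value = float(token)
            if value == MISSING_SENTINEL:
                # Keep reading, but tell the user which entry is missing.
                warnings.warn(
                    f"missing distance for {fields[0]}, column {column + 1}"
                )
                row.append(None)
            else:
                row.append(value)
        rows.append(row)
    return names, rows
```

What an application then does with a missing (None) entry is the open question noted above, and may differ between applications.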

2.3 Other

Alan pointed out that the "install jembossserver" step fails at "make install" looking for client.jar when run from a CVS checkout, and that "make jnlp.sh" had not copied the org directory tree.

Peter noted that the "see also" section in the documentation could be improved. For example, the section for needle lists only global alignment applications. This could be extended to show other alignment applications by navigating the groups hierarchy.

3. New developments

Alan has reviewed most of Michael Schuster's Ensembl SQL API code, about 150k lines in total. The code quality is very good. Some coding style and documentation reformatting is needed. Initial uses would be for sequence retrieval. There are open questions about how to extend the USA syntax to include Ensembl coordinates, and how to use the code to retrieve other data types. The code has already been checked for memory leaks. Some function names are quite long but should be acceptable. Function headers follow EMBOSS style and have additional links to the BioPerl API equivalents.

Some care is needed to check that lists of strings are correctly managed.

Inclusion in EMBOSS raises the issue of adding foreign code and dependencies to the libraries. Alan will try building the code as a separate library, in a directory below ajax. If successful, other candidate code to split out includes ajgraph and ajhist (already built separately), the pcre imported code, and future interfaces to BioMart. We could also split out ajacd as this is usually where the dependencies arise.

Michael Schuster has a few test applications that we can use to try the code.

Jon has updated ACD files for EMBOSS and EMBASSY with relations describing the application functions. The knowntypes.standard file is reformatted to include an EDAM term description for each known type. These need to be added as "relations" attributes for each input and output data type in the ACD files. About 100 new EDAM terms have been added to cover the fine-grain detail.

Peter will analyse the ACD files to check which known types are most used, and which are redundant.

4. Administration

Alan and Peter met with the EBI systems group last week. We expect to be able to purchase two servers with 8 cores and 32GB memory within our budget. Peter will supplement the disk space with other funds to aim for a complete set of SAS disks up to 20TB (or possibly SATA disks with higher capacity). The disk controllers have 2 ports on the server which allow us to split indexing and interactive work. We can also order a spare controller.

The workstation order is (we hope) now being manufactured. There was a delay in availability of the monitors, and nobody had contacted us to ask about possible substitutes.

SATA disk cables, spare mice and a replacement external backup drive have been ordered.

5. Documentation and Training

5.1 Books

Ongoing.

5.2 Website

Mahmut noted some broken links which are being fixed.

6. User queries and answers

Jon reported that one user has asked about alternative position-specific weight matrix formats. We currently read only the formats generated by other EMBOSS applications.

7. AOB

Peter gave a brief overview of the ELIXIR project, as described in Janet Thornton's seminar last week.

Peter reported on last week's BBSRC meeting.

8. Date Of Next Meeting

The next meeting will be on Monday 21st September.