EMBOSS: Project Meeting (Monday 25th July 2011)

Attendees

EBI: Peter Rice, Mahmut Uludag, Michael Schuster
Visitors:
Apologies: Alan Bleasby, Jon Ison

1. Minutes of the last meeting

Last week was the BOSC and ISMB conferences in Vienna.

Minutes of the meeting of 11th July 2011 are here.

2. Maintenance etc.

2.1 Release

Alan announced the 6.4.0 release on 15th July.

Peter will update the web sites and announce to the Advisory Board, and to the emboss-dev list, once library documentation pages have been cleaned up.

2.2 Applications

Alan is looking into a reported issue with tcode results.

2.3 Libraries

Michael noted that some Ensembl databases are not usable. These are multi-species bacterial databases where each has an enumerated list of species. We need to find a way to pass these values through the query and the database definition.

2.4 mEMBOSS

In mEMBOSS, ajlong was set back to 32 bits, which caused problems printing ajlong values. This affected only a few applications that are not generally used on Windows (e.g. dbigcg). A new mEMBOSS release will be made to provide a fix.

When running the mEMBOSS QA tests, about 50 tests failed. Most were due to test databases such as tdas not being defined in the test environment.

2.4 SoapLab

Mahmut found other jobs on the EBI cluster were breaking the batch queue.

SoapLab is running smoothly at the London Data Centres. A monitor service needed some tidying up. It runs via LSF and so takes time to return a result. The cluster load balancers need a faster response time so Mahmut is working on a "health check" servlet to test the availability of a node, and looking into Log4J.
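
A minimal sketch of the "health check" idea, written in Python rather than as a Java servlet. The slow LSF-based monitor records its last success, and the health endpoint only reads that cached timestamp, so the load balancer gets an immediate answer. The names NodeState and HealthHandler, the /health path, the port and the 300-second freshness window are all assumptions made for illustration.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

MAX_AGE_SECONDS = 300  # assumed freshness window for the last monitor result


class NodeState:
    """Holds the time of the last successful (slow) monitor run."""

    def __init__(self):
        self.last_success = None

    def record_success(self, when=None):
        # Called by the slow monitor job whenever a test run succeeds.
        self.last_success = time.time() if when is None else when

    def is_healthy(self, now=None, max_age=MAX_AGE_SECONDS):
        # Healthy only if a success was recorded recently enough.
        if self.last_success is None:
            return False
        now = time.time() if now is None else now
        return (now - self.last_success) <= max_age


STATE = NodeState()


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /health from cached state, without submitting a job."""

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        healthy = STATE.is_healthy()
        body = b"OK\n" if healthy else b"UNAVAILABLE\n"
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    # Hypothetical entry point; blocks until interrupted.
    HTTPServer(("localhost", port), HealthHandler).serve_forever()
```

Separating the slow check from the fast response means the load balancer never waits on the batch queue.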

2.5 EDAM

2.6 DRCAT

3. New developments

3.1 Reference sequences

Peter outlined the remaining developments needed under the BBR grant. Most involve the annotation of a reference sequence, either to load a file of mapped reads or to generate genome browser track data in a variety of formats. We also need to handle variation data describing alternatives to the reference sequence.

Reading data could use the concept of screen resolution to limit the amount of data loaded into memory, with detail at the sequence level only needed for a short region of the reference. Reference sequences need start and end positions defined before the input is read, which may require a new ACD input type.
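
The screen resolution idea can be sketched as follows: mapped reads are reduced to one count per pixel column, so memory use depends on the display width rather than on the number of bases in the region. The function name bin_reads and the half-open (start, end) read tuples are assumptions for illustration, and the sketch assumes the region holds at least one base per pixel.

```python
def bin_reads(reads, region_start, region_end, n_pixels):
    """Count reads overlapping each pixel column.

    Coordinates are 0-based and half-open: [start, end).
    """
    span = region_end - region_start
    if span <= 0 or n_pixels <= 0:
        raise ValueError("region and pixel count must be positive")
    counts = [0] * n_pixels
    for start, end in reads:
        # Clip the read to the requested region.
        lo = max(start, region_start)
        hi = min(end, region_end)
        if lo >= hi:
            continue  # read lies entirely outside the region
        # Map the first and last covered base to pixel columns.
        first = (lo - region_start) * n_pixels // span
        last = (hi - 1 - region_start) * n_pixels // span
        for px in range(first, last + 1):
            counts[px] += 1
    return counts
```

For a 1000-base region drawn 10 pixels wide, only 10 integers are kept regardless of how many reads are scanned.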

Many of the data formats use BAM style indexing to allow remote access (FTP or HTTP) to a small region of the input file.
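
As background on BAM style indexing, the sketch below is a Python transcription of the reg2bin binning function given in the SAM/BAM specification. Each alignment is assigned to the smallest bin that fully contains it, so a query for a small region only needs to consult a few bins. Whether EMBOSS would reuse this exact scheme is not stated in the minutes.

```python
def reg2bin(beg, end):
    """Return the bin number for a 0-based, half-open interval [beg, end).

    Transcribed from the binning scheme in the SAM/BAM specification:
    bin sizes run from 16 kbp (finest) up to 512 Mbp (bin 0).
    """
    end -= 1  # work with the last covered position
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)  # 16 kbp bins
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)  # 128 kbp bins
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)   # 1 Mbp bins
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)   # 8 Mbp bins
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)   # 64 Mbp bins
    return 0  # spans everything: the single 512 Mbp bin
```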

Michael suggested a chunked approach to reading sequences, which may be needed for some genomes, for example the Opossum genome, which is defined in CON files that refer to individual sequence entries.
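
A chunked reader over a constructed sequence might look like the sketch below. The segment list and the fetch callback are hypothetical stand-ins for whatever a CON entry and the database access layer would provide; parsing of the CON file itself is not shown. Each segment is still fetched whole here, so a real implementation might also need to split large segments.

```python
def iter_chunks(segments, fetch, chunk_size):
    """Yield fixed-size pieces of a sequence assembled from segments.

    segments: iterable of (entry_id, start, end), 1-based inclusive.
    fetch:    callable returning the bases of one segment as a string.
    Only one segment plus a partial chunk is held in memory at a time.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    buffer = ""
    for entry_id, start, end in segments:
        buffer += fetch(entry_id, start, end)
        # Emit as many full chunks as the buffer currently holds.
        while len(buffer) >= chunk_size:
            yield buffer[:chunk_size]
            buffer = buffer[chunk_size:]
    if buffer:
        yield buffer  # final, possibly shorter, chunk
```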

Mahmut has looked at Mira assembly format as a way to load assembled read data.

Michael suggested that useful formats include VCF, which is compressed to hold a large number of variants, and dbSNP.

3.2 Support for interfaces and GUIs

Peter has promised to update the Galaxy wrappers for the new release, and to generate XML files for the Mobyle project.

Jon proposed BioXSD, which is now becoming stable, as a new supported format. It would help if we can match BioXSD types to EMBOSS data structures.

3.3 Data access

At the BOSC code fest, Peter was looking into the new BioPython data indexing scheme which uses an SQL database of file starts and lengths to index entries. This fits very well with EMBOSS text access as the length allows code to know when to stop reading an entry.
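
The offset and length approach can be illustrated with a minimal sketch using Python's sqlite3 module on a FASTA-style file. The table name entry_index, its column names and the two helper functions are assumptions for illustration; this is not the actual BioPython schema.

```python
import sqlite3

INSERT_SQL = (
    "INSERT INTO entry_index (entry_id, start_offset, byte_length) "
    "VALUES (?, ?, ?)"
)


def build_index(path, db):
    """Record the byte offset and length of every '>' entry in the file."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS entry_index ("
        "entry_id TEXT PRIMARY KEY, start_offset INTEGER, byte_length INTEGER)"
    )
    with open(path, "rb") as handle:
        entry_id, start, offset = None, 0, 0
        for line in handle:
            if line.startswith(b">"):
                if entry_id is not None:
                    # The previous entry ends where this header begins.
                    db.execute(INSERT_SQL, (entry_id, start, offset - start))
                entry_id = line[1:].split()[0].decode("ascii")
                start = offset
            offset += len(line)
        if entry_id is not None:
            db.execute(INSERT_SQL, (entry_id, start, offset - start))
    db.commit()


def fetch_entry(path, db, entry_id):
    """Seek to the entry and read exactly its recorded length."""
    row = db.execute(
        "SELECT start_offset, byte_length FROM entry_index WHERE entry_id = ?",
        (entry_id,),
    ).fetchone()
    if row is None:
        raise KeyError(entry_id)
    with open(path, "rb") as handle:
        handle.seek(row[0])
        return handle.read(row[1]).decode("ascii")
```

Because the length is stored, the reader does not have to scan for the start of the next entry to know where to stop.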

will look into access to ontology data from Entrez. This has not yet been tested.

Mahmut noted that we should add further EDAM annotations to some of the databases defined in emboss.standard.

4. Administration

Alan reported that Linux has now declared release 3.0.0.

Alan noted there had been problems printing from the EMBOSS machines. These resolved themselves after some delay. Mahmut reported some problems using the updated Eclipse in Fedora 15. These were issues with Maven when working on SoapLab. The issues are now fixed.

5. Documentation and Training

5.1 Web server

Alan reported that the new emboss.open-bio.org web site is available. Peter has loaded the release 6.4.0 and latest CVS documentation for applications and library documentation. Alan has contacted Open-Bio about enabling server-side includes on the new web server.

Jon is ready to update the XML book source files. Alan will create a CVS branch to preserve the original source files for the first editions. Updates will be used to maintain the new web site pages.

Jon will document how to generate the HTML web pages from the books using XMLmind. Stylesheets may need some adjustment for table formats. Peter suggested a simple Perl post-processor as an alternative if the stylesheets are tricky to manage.

Peter needs to generate new sections for the new data types in the latest release. It should be possible to generate the library descriptions by incorporating additional book text into the source code "section" documentation blocks.

5.2 Books

Peter reported the books were the highlight of the Cambridge University Press stand at ISMB in Vienna.

6. User queries and answers

All done.

7. AOB

Alan reported a query from Tim Carver on handling of large translation attributes in GFF3 format. Peter will investigate.

Alan is looking into a user report of a small conflict between some tcode results and the original paper.

8. Date Of Next Meeting

The next EMBOSS meeting will be on Monday 1st August.