EMBOSS: Project Meeting (Mon 7th June 10)


Attendees

EBI: Peter Rice, Alan Bleasby, Jon Ison, Mahmut Uludag
Visitors:
Apologies:

1. Minutes of the last meeting

Minutes of the meeting of 24th May 2010 are here.

2. Maintenance etc.

2.1 Applications

Mahmut has further tested the alignment applications. It was possible for the traceback functions to start just beyond the end of the path matrix array. Memory boundaries for bounded alignments are now adjusted correctly in supermatcher and wordfinder.

Mahmut tested supermatcher with 10k short reads against 32 Illumina adaptor sequences. After implementing the Rabin-Karp algorithm the run was twice as fast. Some further improvement may be possible.
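For reference, the Rabin-Karp idea can be sketched as below. This is an illustrative Python outline, not the supermatcher code: the function name, hash base and modulus are assumptions chosen for the example. A rolling hash of each text window is compared against the hashes of all words at once, and a direct string comparison confirms each candidate hit.

```python
def rabin_karp_multi(text, patterns, base=256, mod=1_000_000_007):
    """Return (position, pattern) hits, in position order, for equal-length patterns."""
    if not patterns:
        return []
    m = len(next(iter(patterns)))
    if any(len(p) != m for p in patterns):
        raise ValueError("all patterns must have the same length")
    if m == 0 or len(text) < m:
        return []

    def full_hash(s):
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % mod
        return h

    # Map each pattern hash to the patterns sharing it (collisions are possible).
    by_hash = {}
    for p in set(patterns):
        by_hash.setdefault(full_hash(p), []).append(p)

    high = pow(base, m - 1, mod)  # weight of the character leaving the window
    h = full_hash(text[:m])
    hits = []
    for i in range(len(text) - m + 1):
        if h in by_hash:
            window = text[i:i + m]
            # Verify by direct comparison to rule out hash collisions.
            for p in by_hash[h]:
                if p == window:
                    hits.append((i, p))
        if i + m < len(text):
            # Roll the hash: drop text[i], append text[i + m].
            h = ((h - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return hits
```

The cost of each window is one hash update plus a table lookup, regardless of how many words are being searched for, which is where a speed-up over comparing each word separately would be expected to come from.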

Peter has reviewed all the EFUNC and EDATA messages. acdrelation has been updated. The main program has been moved to the top of the source file, and the functions are now static with a program name prefix.

Peter will update acdvalid to check the knowntype against the EDAM term in the knowntypes.standard file.

2.2 Libraries

Peter has updated the handling of sequence databases in ajnam.c so that information on whether a database access method supports entry, query and all access is now taken from the registered definitions in ajseqdb.c. Database types are now automatically converted to either "Protein" or "Nucleotide" from the lower case or single letter alternatives. This is reflected in the output of showdb.
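The type conversion could look something like the following Python sketch. The alias table is hypothetical and only illustrates the behaviour described; the set of alternatives actually accepted by ajnam.c may differ.

```python
# Hypothetical alias table; the real set accepted by ajnam.c may differ.
_TYPE_ALIASES = {
    "p": "Protein",
    "protein": "Protein",
    "n": "Nucleotide",
    "nucleotide": "Nucleotide",
}


def normalise_db_type(value):
    """Map a lower case or single letter type to its canonical spelling."""
    key = value.strip().lower()
    try:
        return _TYPE_ALIASES[key]
    except KeyError:
        raise ValueError(f"unknown database type: {value!r}") from None
```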

Peter has reviewed all the EFUNC and EDATA messages. White space before functions is cleaned and fixed at 4 lines. Any ifdef blocks now start before the function documentation. "Fixme" comments are moved into the top of the function source code.

Messages from the Ensembl library code are cleaned except for those from the namrule, argrule and valrule definitions. These require function names to be standardized, functions to be sorted alphabetically, and standard naming then defined to fit the new names.

Mahmut found an issue with the name calculated attribute where the first input was a seqall. There is also an issue with capitalisation of default file naming when the first input is a seqset. Peter will investigate.

Alan cleaned up some code in ajobo and ajtax for mEMBOSS builds.

2.3 Other

Alan noted that a fix for making Jemboss with no Java available needs testing.

Peter noted that there is a need to test sharing of binary files on big-endian systems. Alan will identify a suitable test system to check code that has endian tests.
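As a reminder of why this matters, the Python sketch below (not EMBOSS code; the helper names are made up for the example) shows the same four bytes giving different integers depending on the byte order assumed by the reader.

```python
import struct
import sys


def write_u32(value, byteorder):
    """Pack an unsigned 32-bit integer; byteorder is '<' (little) or '>' (big)."""
    return struct.pack(byteorder + "I", value)


def read_u32(data, byteorder):
    """Unpack an unsigned 32-bit integer using the given byte order."""
    return struct.unpack(byteorder + "I", data)[0]


record = write_u32(1, "<")            # as written on a little-endian host
same_order = read_u32(record, "<")    # reader assumes the writer's byte order
wrong_order = read_u32(record, ">")   # reader assumes the opposite byte order

print(sys.byteorder, same_order, wrong_order)
```

A binary file is portable only if the byte order is fixed by the format or recorded in the file, so that the reader can swap bytes when its native order differs.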

3. New developments

3.1 BioMart

Alan has updated the processing of server URLs to be IPv6 compatible. This is a drop-in replacement that could be coded more efficiently with some modifications to other functions that call it. Some URL processing functions should move to ajstr.c.
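The underlying difficulty is that a literal IPv6 address contains colons, so it is written in square brackets in a URL and the host cannot be separated from the port by splitting at the first colon. The Python sketch below illustrates the required behaviour using the standard urllib.parse module; it is not the AJAX code, and the helper name, default port and example URLs are assumptions.

```python
from urllib.parse import urlsplit


def host_port_path(url, default_port=80):
    """Split a URL into (host, port, path), accepting bracketed IPv6 hosts."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else default_port
    # hostname has the square brackets of an IPv6 literal already removed.
    return parts.hostname, port, parts.path


# Made-up URLs; 2001:db8::/32 is the IPv6 prefix reserved for documentation.
print(host_port_path("http://[2001:db8::1]:8080/biomart/martservice"))
print(host_port_path("http://www.example.org/biomart/martservice"))
```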

Alan described BioMart's rules for ordering of results. A sequence attribute is always reported first, followed by the first filter term which in current EMBOSS use is the identifier. BioMart format can be altered to expect the sequence and identifier as the first two fields in the tab-delimited record.

Alan could also add a column header record by sending one extra attribute query to the BioMart server.

There are cases where 4 sequences are returned, but each with the same identifier. As this is how BioMart works, we should simply use the BioMart duplicated identifiers for now.
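A reader for records laid out this way might be sketched as follows. This is a Python illustration only: the function name and the sample data are invented, and the only assumption taken from the discussion is that the sequence is the first tab-delimited field and the identifier the second. Duplicated identifiers are passed through unchanged.

```python
def parse_biomart_records(text):
    """Yield (identifier, sequence, extra_fields) from tab-delimited lines."""
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"expected at least 2 fields, got: {line!r}")
        # Sequence attribute first, identifier (first filter term) second.
        sequence, identifier, *extra = fields
        yield identifier, sequence, extra


# Invented sample: two sequences returned under the same identifier.
sample = "ACGT\tGENE1\tchr1\nGGCC\tGENE1\tchr1\n"
records = list(parse_biomart_records(sample))
```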

3.2 EDAM ontology

Jon attended a successful EMBRACE workshop last week at DTU in Copenhagen. There were 14 attendees, ranging from experienced developers with services to annotate to new developers. Most of the workshop was based on presentations.

Issues of BioCatalogue support for SAWSDL were discussed. BioCatalogue could, for example, use widgets to suggest annotations based on EDAM terms. BioCatalogue categories could be merged with the EDAM topics branch.

BioCatalogue makes use of social tagging, as did BioMoby. The tags are a source of new terms, and could be cleaned up by annotating them with EDAM terms and merging spelling variants.

BioCatalogue currently adds extra annotation to services. This could be exported as suggested WSDL annotations and offered to the service providers.

EDAM could also be used in rank ordering services.

SoapLab was also discussed. Some of the attendees using SoapLab are interested in access to the latest WSDL annotations and ACD relations. Mahmut will check on the possibility of a SoapLab release.

There was discussion, but no conclusion, on the best tools to support annotated WSDLs. It would be useful to maintain a list of tools that have been tested and found to work or fail for some reason.

Batch PURL submission failed halfway through. There were also cases where "tombstoned" terms remained available, and vice versa. The support was not very helpful. Jon will look at alternatives, e.g. OBO-foundry compliant URLs, though these are not yet standard. The NCBO portal could create these URLs. We need to check whether there would be support for alternative formats (OBO, HTML, etc.).

Jon is waiting for a reply from the Ontology Lookup Service team.

Several reviewers have been asked to look through the EDAM terms in their areas. More are needed.

Terms for specific databases and ontology names have been removed from EDAM to make navigation simpler.

Terms were added to cover the services provided by workshop attendees.

3.3 BAM sequence data

Peter has completed the reading of BAM sequence data, using code that follows the pattern of samtools. All functions are in ajsambam.c. Other SAM source files that were included for 6.2.0 have now been deleted. Internals use AJAX strings, tables and lists.
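For orientation, the start of a BAM file can be read with a few lines of Python, as sketched below. This is not the ajsambam.c code. It relies on two properties of the format: BAM is BGZF compressed, which is readable as ordinary gzip, and the uncompressed stream begins with the magic bytes "BAM\1", a little-endian header text length, the header text and the reference sequence count. The function name and the in-memory sample are made up for the example.

```python
import gzip
import io
import struct


def read_bam_header(stream):
    """Return (header_text, n_references) from a binary BAM stream."""
    with gzip.GzipFile(fileobj=stream) as bam:
        if bam.read(4) != b"BAM\x01":
            raise ValueError("not a BAM file: bad magic")
        (l_text,) = struct.unpack("<i", bam.read(4))   # header text length
        text = bam.read(l_text).rstrip(b"\x00").decode("ascii")
        (n_ref,) = struct.unpack("<i", bam.read(4))    # number of references
    return text, n_ref


# Minimal in-memory sample: a header line and no reference sequences.
header = b"@HD\tVN:1.0\n"
payload = b"BAM\x01" + struct.pack("<i", len(header)) + header + struct.pack("<i", 0)
sample = io.BytesIO(gzip.compress(payload))
print(read_bam_header(sample))
```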

BAM file input appeared efficient. Benchmarks against other packages need to be run.

BAM output is needed in time for the release. SAM output format was partly implemented for the last release and will also be checked.

Future plans for BAM and SAM formats include the reading of reference alignments. This will probably be after the release.

3.4 New applications

The databases list has standard names, and query fields defined with semantic descriptions. These will soon include EDAM terms.

Jon proposed a generator to use the databases list to create query tools for selected data resources. One example would be to return expression data from relevant resources.

Peter will write a parser for the databases list.

3.5 Non-sequence data resources

Peter plans to add access methods for text and HTML data, and to add new database types to ajnam.c. The existing functions to test database definitions will become sequence-specific.

4. Administration

Alan installed Fedora 13. The upgrade went smoothly.

Fedora 13 now includes EMBOSS. A test installation could be added on emboss2.

Alan has added EDAM to the Open-Bio CVS.

5. Documentation and Training

5.1 Books

Jon will do the final updates for the Developers Guide. The website needs attention, but the deadline is publication of the books.

6. User queries and answers

One query from Charles Plessey on zlib. Alan will reply.

7. AOB

Peter will attend a next-generation sequencing workshop in Bari in 2 weeks.

8. Date Of Next Meeting

The next meeting will be on Monday 14th June.