EMBOSS: Project Meeting (Monday 24th May 2010)

Attendees

EBI: Peter Rice, Alan Bleasby, Jon Ison, Mahmut Uludag
Visitors:
Apologies:

1. Minutes of the last meeting

Minutes of the meeting of 17th May 2010 are here.

2. Maintenance etc.

2.1 Applications

Mahmut has tested that the results from stretcher are correct when end gap penalties are included. It would be possible to allow different gap penalties at the ends. After recent enhancements needle is no longer slower than stretcher for large input sequences, so the contrary reference in the documentation has been removed.

Mahmut has tested matcher against the latest version of the original lalign program from the fasta2 package. Results were slightly different.

Mahmut discovered that the alignment scores from supermatcher and wordfinder did not agree over the word match region. This is now corrected.

Mahmut suggested it would be useful to have an application which would read an existing alignment and recalculate the score. Peter will look into the possibilities. Some existing alignment formats would need to be used for input, with the gap penalties and comparison matrix parsed or provided by the user.
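
As a rough illustration of what such a rescoring application would compute, the sketch below walks two aligned rows and sums the score. It is not EMBOSS code: the function name rescore_alignment is hypothetical, flat match/mismatch scores stand in for a comparison matrix, and a gap of n positions is assumed to cost gapopen + (n - 1) * gapextend.

```c
#include <stddef.h>

/* Hypothetical helper: rescore a gapped pairwise alignment.
 * a and b are the two aligned rows ('-' marks a gap position).
 * Assumptions: flat match/mismatch scores replace a comparison matrix,
 * and a gap of n positions costs gapopen + (n - 1) * gapextend. */
static double rescore_alignment(const char *a, const char *b,
                                double match, double mismatch,
                                double gapopen, double gapextend)
{
    double score = 0.0;
    int in_gap_a = 0;   /* inside a run of '-' in row a */
    int in_gap_b = 0;   /* inside a run of '-' in row b */
    size_t i;

    for (i = 0; a[i] != '\0' && b[i] != '\0'; i++) {
        if (a[i] == '-' && b[i] == '-') {
            continue;   /* column of two gaps contributes nothing */
        }
        if (a[i] == '-') {
            score -= in_gap_a ? gapextend : gapopen;
            in_gap_a = 1;
            in_gap_b = 0;
        } else if (b[i] == '-') {
            score -= in_gap_b ? gapextend : gapopen;
            in_gap_b = 1;
            in_gap_a = 0;
        } else {
            score += (a[i] == b[i]) ? match : mismatch;
            in_gap_a = 0;
            in_gap_b = 0;
        }
    }
    return score;
}
```

With match 5, mismatch -4, gap open 10 and gap extend 0.5, the rows "AC--GT" and "ACTTGT" score 5 + 5 - 10 - 0.5 + 5 + 5 = 9.5 under these assumptions.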

Mahmut suggested water may give non-ideal results where there is a gap immediately after the first residue.

Mahmut will add QA tests for each of these use cases.

Mahmut would like to report kmer frequencies from wordmatch in graphical form.

2.2 Libraries

Alan has agreed with Michael Schuster to add a conditional include of a MacOSX header file to avoid ajjava.c compilation warnings.

Alan has corrected print formats for unsigned integers to use %u instead of %d.
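
A minimal sketch of the kind of change involved, using standard C rather than the EMBOSS formatting functions. The helper name format_count is hypothetical, and the example value assumes unsigned int is at least 32 bits wide.

```c
#include <stdio.h>

/* Hypothetical helper: write an unsigned count into buf.
 * %u matches the unsigned int argument. %d expects a signed int, so
 * values above INT_MAX would not be printed reliably with it. */
static int format_count(char *buf, size_t buflen, unsigned int count)
{
    return snprintf(buf, buflen, "%u", count);
}
```

For a count of 3000000000 (too large for a 32-bit signed int) the %u form writes the ten digits as expected.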

2.3 Other

AS_HELP_STRING macros have been added by Alan to configure.in to pretty format the text produced by the --help option.

Alan has cleared up Intel compiler warnings for variables "set but not used", except some in ajfeat.c.

A major issue reported by Alan is the equality testing of floating point variables. The Intel compiler warns in all cases. New macros in ajdefine.h are E_FPEQ(a,b,e) and E_FPZERO(a,e) where 'e' is an epsilon value optionally set to U_FEPS for floats and U_DEPS for double precision, defined to be the minimal change for numbers close to 1. These may be revised if higher values are needed. Warnings in ajax/core are resolved. Developers should check by searching for the macros and testing.
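
For developers checking their own code, the idea behind the macros can be sketched as below. These are illustrative stand-ins with hypothetical SKETCH_ names, not the definitions in ajdefine.h, and FLT_EPSILON and DBL_EPSILON from float.h are assumed as the epsilon values because they match the "minimal change for numbers close to 1" description.

```c
#include <float.h>

/* Illustrative stand-ins only, not the ajdefine.h definitions. */
#define SKETCH_FEPS FLT_EPSILON   /* assumed epsilon for float  */
#define SKETCH_DEPS DBL_EPSILON   /* assumed epsilon for double */

/* True when a and b differ by less than e in either direction. */
#define SKETCH_FPEQ(a, b, e) ((((a) - (b)) < (e)) && (((b) - (a)) < (e)))

/* True when a lies strictly between -e and +e. */
#define SKETCH_FPZERO(a, e)  ((-(e) < (a)) && ((a) < (e)))

/* Example: 0.1 + 0.2 is not exactly representable, so a tolerance test
 * is used here in place of the == operator. */
static int tenths_agree(void)
{
    double sum = 0.1 + 0.2;
    return SKETCH_FPEQ(sum, 0.3, SKETCH_DEPS);
}
```

An epsilon this small only suits values near 1, which is one reason higher values may be needed for larger numbers.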

Other Intel compiler warnings included definitions in ajdefine.h of ajFalse and ajTrue as "const int". These are now #define because the compiler objected whenever they were unused by a source file that included the header.

Another issue is the casting of size_t (for example from a strlen call) to ajuint, where precision may be lost.
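
One possible way to make such a narrowing explicit is a range check before the cast, sketched below. The helper name size_to_uint is hypothetical, and plain unsigned int is used on the assumption that ajuint is an unsigned int typedef.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: narrow a size_t to unsigned int.
 * Returns 1 and stores the value in *out when it fits,
 * or returns 0 and leaves *out untouched when it does not. */
static int size_to_uint(size_t n, unsigned int *out)
{
    if (n > UINT_MAX) {
        return 0;   /* the cast would lose the high-order bits */
    }
    *out = (unsigned int) n;
    return 1;
}

/* Example use: string length as an unsigned int, 0 if it does not fit. */
static unsigned int checked_strlen(const char *s)
{
    unsigned int len = 0;
    if (!size_to_uint(strlen(s), &len)) {
        return 0;
    }
    return len;
}
```

On platforms where size_t and unsigned int have the same width the check can never fail, and the cast is then harmless.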

In ajgraph.c the variable currentcharht is no longer used.

Peter will rerun the QA tests.

3. New developments

3.1 EDAM ontology

Jon has committed persistent URLs (PURLs) for EDAM terms. Alan wrote utilities to batch commit when curl failed. After discussions with purl.org support it appears that an initial POST is needed to obtain a cookie, and that the content-type of the second POST must be "text/xml". PURLs set as XML must not be URL-encoded. With the added details it is now possible to use curl for batch submissions.

Hamish McWilliam has created the end-points for EDAM PURLs. These can be plain OBO format, or HTML with relations as links. The is_a relations can be navigated up and down in SRS.

EDAM is now included in the NCBO Portal (http://bioportal.bioontology.org/). This was very easy to do. Jon has also contacted EBI's Ontology Lookup Service (http://www.ebi.ac.uk/ontology-lookup/).

Jon has revised the EDAM documentation with new guidelines for annotators. This was needed in preparation for the EMBRACE course on web service development and annotation in Copenhagen next month, with 14 expected attendees. It will include the BioCatalogue registry and the use of BioXSD datatypes.

Hamish McWilliam has helped with suggesting data formats as EDAM terms. Some new format-related terms were added.

There has also been discussion of terms for sequence annotation. For sequences in database entries, "sequence record" or "sequence record lite" can refer to an EMBL record or a FASTA file. References in ACD files always use "sequence record" as fully formatted database entries are accepted and parsed.

Terms have been added for sequence branches, covering protein, nucleotide, gapped, etc.

Cardinality terms have been added for single sequences, pairs, streams, etc.

EMBOSS has other datatypes that can be considered as sequences, for example features or reports can be output as sequence formats. These can be marked as "sequence annotation".

EDAM-specific relations (not defined in the OBO standard) are now referenced through "relationship" tags. This allows them to be visualised in OBO-Edit.

In OBO-Edit a root is defined as any term with no relations defined. Extra relations have been removed from the intended top-level terms so that this now works correctly.

Jon attended a final EMBRACE workshop on EDAM in Amsterdam. Most of the workshop was presentations on EDAM, BioXSD and the EMBRACE registry. Agreement was reached on the top level term structure of EDAM.

Peter has validated the latest EDAM release. The use of EDAM:EBI, EDAM:WHATIF and similar terms to annotate the origins of annotations was agreed as a similar non-standard policy is used by GO.

The Data resource branch in EDAM is now only categories of database, e.g. "sequence" or "structure". Individual databases are removed and ontologies will be removed later this week to make the overall structure cleaner.

The Swiss Institute of Bioinformatics will review EDAM to test the basic simplicity of the content.

The EMBRACE partners who helped develop EDAM are looking for funding to continue working and meeting.

The BioNemus project can read a schema or ontology to build a WSDL file given any input or output datatypes. Other datatypes are generated from these I/O types. The project would like to use EDAM to define data structures via "has_a" relations, for example "sequence record has_a sequence" or "accession has_a string". This can easily extend too far. EDAM will try to add the minimal definitions needed for the terms and cardinality.

The latest EDAM release is beta 7, available from http://edamontology.sf.net

3.2 Ontology access

Peter has committed code that can parse OBO format ontologies. The data is loaded into memory for validation, which is too much of an overhead for general use. The code will be reused to index ontologies, with identifiers and parents (is_a relations) indexed to allow traversal up and down the tree.
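
The indexing idea can be sketched as a single pass over the OBO text that records each term identifier with its is_a parent. This is not the committed EMBOSS code: the names term_link, copy_token and index_is_a are hypothetical, only the "id:" and "is_a:" tags are handled, and identifiers are assumed to be shorter than 64 characters.

```c
#include <string.h>

#define MAX_ID 64

/* Hypothetical record: one term identifier and one is_a parent. */
struct term_link {
    char id[MAX_ID];
    char parent[MAX_ID];
};

/* Copy the first token of value into dest, skipping leading blanks and
 * stopping at whitespace or the '!' that starts a trailing comment. */
static void copy_token(char *dest, const char *value)
{
    size_t n;

    value += strspn(value, " \t");
    n = strcspn(value, " \t\r\n!");
    if (n >= MAX_ID) {
        n = MAX_ID - 1;
    }
    memcpy(dest, value, n);
    dest[n] = '\0';
}

/* Scan OBO text line by line and store (id, parent) pairs.
 * Returns the number of pairs stored, at most max_links. */
static size_t index_is_a(const char *obo, struct term_link *links,
                         size_t max_links)
{
    char current[MAX_ID] = "";
    size_t count = 0;
    const char *line = obo;

    while (*line != '\0') {
        size_t len = strcspn(line, "\n");

        if (line[0] == '[') {
            current[0] = '\0';              /* new stanza starts */
        } else if (strncmp(line, "id:", 3) == 0) {
            copy_token(current, line + 3);
        } else if (strncmp(line, "is_a:", 5) == 0 &&
                   current[0] != '\0' && count < max_links) {
            strcpy(links[count].id, current);
            copy_token(links[count].parent, line + 5);
            count++;
        }
        line += len;
        if (*line == '\n') {
            line++;
        }
    }
    return count;
}
```

Sorting the resulting pairs by id gives the upward lookups, and sorting a copy by parent gives the downward ones.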

Peter has also committed code in ajtax to parse the NCBI taxonomy in its dump format (available from NCBI). The NCBI taxonomy has a hierarchical structure but is complicated by using 7 separate files to cover deleted and merged nodes, names, GenBank divisions, genetic codes, citations and other details.
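
The hierarchy itself comes from the nodes file of the dump, where fields are separated by a tab, a vertical bar and another tab, and the first three fields are the node identifier, the parent identifier and the rank. A minimal sketch of reading one such line follows. It is not the ajtax code, and the function name parse_node_line is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: read the first three fields of one nodes line.
 * The format string relies on blanks in a scanf format matching any run
 * of whitespace, including the tabs around each '|' delimiter.
 * rank must have room for at least 32 characters.
 * Returns 1 when all three fields were read, 0 otherwise. */
static int parse_node_line(const char *line, unsigned long *id,
                           unsigned long *parent, char *rank)
{
    int fields = sscanf(line, "%lu | %lu | %31[^\t|]", id, parent, rank);
    return fields == 3;
}
```

Storing the parent identifier for every node is enough to walk up the tree; walking down needs a second index keyed on the parent.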

The indexing applications are intended for inclusion in the July release.

Peter has added EDAM ontology parsing to ajacd. If the variable EMBOSS_EDAM defines the location of an OBO file, acdvalid will read the file and validate any "relations:" attributes in the ACD file.

3.3 BAM sequence data

Peter is attending an EMBnet workshop on next generation sequence data in June. In preparation he will complete the coding for BAM sequence format, which was started but abandoned before the previous release. The existing code in the samtools package provides guidance on interpreting the format, but using the sources directly involves too many extra include files and data types that are new in C99.

4. Administration

Alan expects Fedora 13 to be released later this week. It includes the latest autoconf. We have been using version 2.63 for the past 2 years. It can be installed on the EMBOSS machines as soon as Alan has tested it.

5. Documentation and Training

5.1 Books

Jon will do the final updates.

6. User queries and answers

None new.

7. AOB

Peter attended the Galaxy Developers Meeting at Cold Spring Harbor. Galaxy interfaces will be automatically defined immediately following EMBOSS 6.3.0 to ensure Galaxy stays up to date. Many Galaxy users report that the 5.0.0 definitions are working reasonably well with EMBOSS 6.2.0 applications so only minor changes should be needed - the advantage will be automation of the procedures.

There was a report from one group in the USA interested in using an ontology to annotate the BioCatalogue WSDL files. They are now in contact with Jon for EDAM and with the BioCatalogue team at EBI.

8. Date Of Next Meeting

May 31st is a public holiday. The next meeting will be on Monday 7th June.