EMBOSS: Project Meeting (Mon 24th May 10)
Mahmut has tested matcher against the latest version of the original lalign program from the fasta2 package. Results were slightly different.
Mahmut discovered that the alignment scores from supermatcher and wordfinder did not agree over the word match region. This is now corrected.
Mahmut suggested it would be useful to have an application which would read an existing alignment and recalculate the score. Peter will look into the possibilities. Some existing alignment formats would need to be used for input, with the gap penalties and comparison matrix parsed or provided by the user.
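As a rough sketch of what such an application might compute (not a committed design): the function below rescores two gapped alignment rows. It assumes '-' marks a gap, that a flat match/mismatch score stands in for a full comparison matrix, and that a gap of length n costs gapopen + (n-1)*gapextend. The name rescore_alignment and all of its parameters are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: recalculate the score of a gapped pairwise alignment.
 * a and b are the two aligned rows, equal in length, with '-' for gaps.
 * Returns 0 on success, or -1 if the rows differ in length or a column
 * holds a gap in both rows. */
int rescore_alignment(const char *a, const char *b,
                      double match, double mismatch,
                      double gapopen, double gapextend,
                      double *score)
{
    size_t len = strlen(a);
    size_t i;
    int ingap_a = 0;   /* currently inside a gap in row a */
    int ingap_b = 0;   /* currently inside a gap in row b */
    double total = 0.0;

    if (strlen(b) != len)
        return -1;

    for (i = 0; i < len; i++) {
        if (a[i] == '-' && b[i] == '-')
            return -1;
        if (a[i] == '-') {
            /* first gap position pays gapopen, later ones gapextend */
            total -= ingap_a ? gapextend : gapopen;
            ingap_a = 1;
            ingap_b = 0;
        } else if (b[i] == '-') {
            total -= ingap_b ? gapextend : gapopen;
            ingap_b = 1;
            ingap_a = 0;
        } else {
            /* assumption: flat scores instead of a comparison matrix lookup */
            total += (a[i] == b[i]) ? match : mismatch;
            ingap_a = 0;
            ingap_b = 0;
        }
    }

    *score = total;
    return 0;
}
```

A real application would replace the match/mismatch test with a lookup in the comparison matrix named in the alignment file or supplied by the user.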
Mahmut suggested water may give non-ideal results where there is a gap immediately after the first residue.
Mahmut will add QA tests for each of these use cases.
Mahmut would like to report kmer frequencies from wordmatch in graphical form.
Alan has corrected print formats for unsigned integers to use %u instead of %d.
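For illustration only, a minimal example of the kind of change involved, using the standard C snprintf rather than any AJAX print function. The helper name format_count is hypothetical. With %d, a value above INT_MAX would not be printed as intended.

```c
#include <stdio.h>

/* Hypothetical helper: format an unsigned count with the matching
 * %u conversion. Returns the number of characters written, as snprintf. */
int format_count(char *buf, size_t buflen, unsigned int count)
{
    return snprintf(buf, buflen, "%u", count);
}
```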
Alan has cleared up Intel compiler warnings for variables "set but not used", except some in ajfeat.c.
A major issue reported by Alan is the equality testing of floating point variables. The Intel compiler warns in all cases. New macros in ajdefine.h are E_FPEQ(a,b,e) and E_FPZERO(a,e) where 'e' is an epsilon value optionally set to U_FEPS for floats and U_DEPS for double precision, defined to be the minimal change for numbers close to 1. These may be revised if higher values are needed. Warnings in ajax/core are resolved. Developers should check by searching for the macros and testing.
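The definitions in ajdefine.h are not reproduced here. As a sketch of how macros of this shape are commonly written, the example below assumes an absolute-difference test and uses FLT_EPSILON and DBL_EPSILON from <float.h> as stand-ins for U_FEPS and U_DEPS. The SKETCH_ names are hypothetical and are not the EMBOSS macros.

```c
#include <float.h>

/* Hypothetical stand-ins for E_FPEQ and E_FPZERO.
 * Note: the arguments are evaluated more than once, so they should
 * not have side effects. */
#define SKETCH_FPEQ(a, b, e) \
    ((((a) > (b)) ? ((a) - (b)) : ((b) - (a))) <= (e))
#define SKETCH_FPZERO(a, e) \
    ((((a) < 0) ? -(a) : (a)) <= (e))

/* Example use: compare a computed sum with an expected value
 * instead of testing with ==. Returns 1 if they agree within eps. */
int sum_matches(double x, double y, double expected, double eps)
{
    double sum = x + y;
    return SKETCH_FPEQ(sum, expected, eps) ? 1 : 0;
}
```

An epsilon of this size is only suitable for values close to 1; comparisons of much larger or much smaller numbers would need a larger or scaled tolerance, which is why the note above says the values may be revised.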
Other Intel compiler warnings included definitions in ajdefine.h of ajFalse and ajTrue as "const int". These are now #define because the compiler objected whenever they were unused by a source file that included the header.
Another issue is the casting of size_t (for example from a strlen call) to ajuint, which can lose precision.
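One possible way to make such a conversion explicit is sketched below. It assumes ajuint is equivalent to unsigned int; the helper name size_to_uint is hypothetical and is not an AJAX function.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert a size_t to unsigned int.
 * Returns 1 and stores the value if it fits, 0 if it would be truncated. */
int size_to_uint(size_t n, unsigned int *out)
{
    if (n > UINT_MAX)
        return 0;               /* value does not fit in unsigned int */
    *out = (unsigned int) n;    /* safe: range already checked */
    return 1;
}
```

On platforms where size_t and unsigned int have the same width the check never fails, so the cost of keeping it is small.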
In ajgraph.c the variable currentcharht is no longer used.
Peter will rerun the QA tests.
Hamish McWilliam has created the end-points for EDAM PURLs. These can be plain OBO format, or HTML with relations as links. The is_a relations can be navigated up and down in SRS.
EDAM is now included in the NCBO Portal (http://bioportal.bioontology.org/). This was very easy to do. Jon has also contacted EBI's Ontology Lookup Service (http://www.ebi.ac.uk/ontology-lookup/).
Jon has revised the EDAM documentation with new guidelines for annotators. This was needed in preparation for the EMBRACE course on web service development and annotation in Copenhagen next month, with 14 expected attendees. It will include the BioCatalogue registry and the use of BioXSD datatypes.
Hamish McWilliam has helped with suggesting data formats as EDAM terms. Some new format-related terms were added.
There has also been discussion of terms for sequence annotation. For sequences in database entries, "sequence record" or "sequence record lite" can refer to an EMBL record or a FASTA file. References in ACD files always use "sequence record" as fully formatted database entries are accepted and parsed.
Terms have been added for sequence branches, covering protein, nucleotide, gapped, etc.
Cardinality terms have been added for single sequences, pairs, streams, etc.
EMBOSS has other datatypes that can be considered as sequences, for example features or reports can be output as sequence formats. These can be marked as "sequence annotation".
EDAM-specific relations (not defined in the OBO standard) are now referenced through "relationship" tags. This allows them to be visualised in OBO-Edit.
In OBO-Edit a key is defined as any term with no relations defined. Extra relations have been removed from the intended top-level terms so that this now works correctly.
Jon attended a final EMBRACE workshop on EDAM in Amsterdam. Most of the workshop was presentations on EDAM, BioXSD and the EMBRACE registry. Agreement was reached on the top level term structure of EDAM.
Peter has validated the latest EDAM release. The use of EDAM:EBI, EDAM:WHATIF and similar terms to annotate the origins of annotations was agreed, as a similar non-standard policy is used by GO.
The Data resource branch in EDAM is now only categories of database, e.g. "sequence" or "structure". Individual databases have been removed, and ontologies will be removed later this week, to make the overall structure cleaner.
The Swiss Institute of Bioinformatics will review EDAM to test the basic simplicity of the content.
The EMBRACE partners who helped develop EDAM are looking for funding to continue working and meeting.
The BioNemus project can read a schema or ontology to build a WSDL file given any input or output datatypes. Other datatypes are generated from these I/O types. The project would like to use EDAM to define data structures via "has_a" relations, for example "sequence record has_a sequence" or "accession has_a string". This can easily extend too far. EDAM will try to add the minimal definitions needed for the terms and cardinality.
The latest EDAM release is beta 7, available from http://edamontology.sf.net
Peter has also committed code in ajtax to parse the NCBI taxonomy in its dump format (available from NCBI). The taxonomy has a hierarchical structure but is complicated by using 7 separate files to cover deleted and merged nodes, names, GenBank divisions, genetic codes, citations and other details.
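To illustrate the dump format (this is not the ajtax code): in the NCBI taxonomy dump files, fields are separated by tab, '|', tab, and each row ends with tab, '|', newline. The sketch below reads the first three fields of a nodes.dmp row (tax_id, parent tax_id, rank). The TaxNode structure and the function names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define RANK_MAX 64

/* Hypothetical record, not the structure used in ajtax */
typedef struct {
    unsigned long taxid;
    unsigned long parent;
    char rank[RANK_MAX];
} TaxNode;

/* Copy the next field (terminated by tab then '|') from *cursor into buf
 * and advance *cursor to the start of the following field.
 * Returns 0 on success, -1 if no terminator is found or buf is too small. */
static int next_field(const char **cursor, char *buf, size_t buflen)
{
    const char *end = strstr(*cursor, "\t|");
    size_t n;

    if (end == NULL)
        return -1;
    n = (size_t) (end - *cursor);
    if (n >= buflen)
        return -1;
    memcpy(buf, *cursor, n);
    buf[n] = '\0';
    *cursor = end + 2;        /* skip the tab and '|' */
    if (**cursor == '\t')     /* skip the tab before the next field */
        (*cursor)++;
    return 0;
}

/* Parse tax_id, parent tax_id and rank from one nodes.dmp row.
 * Returns 0 on success, -1 on a malformed row. */
int parse_node_line(const char *line, TaxNode *node)
{
    char field[RANK_MAX];
    const char *p = line;

    if (next_field(&p, field, sizeof field) != 0)
        return -1;
    node->taxid = strtoul(field, NULL, 10);

    if (next_field(&p, field, sizeof field) != 0)
        return -1;
    node->parent = strtoul(field, NULL, 10);

    if (next_field(&p, node->rank, sizeof node->rank) != 0)
        return -1;
    return 0;
}
```

The other dump files use the same separators with different columns, so a reader for them could reuse the same field-splitting step.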
The indexing applications are intended for inclusion in the July release.
Peter has added EDAM ontology parsing to ajacd. If the variable EMBOSS_EDAM defines the location of an OBO file, acdvalid will read the file and validate any relations: attributes in the ACD file.
There was a report from one group in the USA interested in using an ontology to annotate the BioCatalogue WSDL files. They are now in contact with Jon for EDAM and with the BioCatalogue team at EBI.