EMBOSS: Project Meeting (Mon 22nd Aug 11)


EBI: Peter Rice, Alan Bleasby, Jon Ison, Mahmut Uludag, Michael Schuster

1. Minutes of the last meeting

Minutes of the meeting of 15th August 2011 are here.

2. Maintenance etc.

2.1 Applications

2.2 Libraries

Peter has implemented parsing of EMBL/GenBank CON entries for reference sequences and for sequence data. CON entries have features, but refer to whole genome shotgun (WGS) entries for the sequence data. Several errors were found in the current EMBL release where the WGS entries have moved but the CON entry has not been updated. Peter has been in discussions on the GFF3 feature format on the GMOD and Sequence Ontology mailing lists.

EMBOSS GFF3 handling has been improved to correct a few feature names, support circular features, and follow GFF3 style for tags and parent links. Sequence Ontology (SO) terms still fail to cover all protein features. GFF3 format has no way to formally identify features for a protein sequence, so EMBOSS will continue to use a comment in the header for this.
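
For reference, a minimal Python sketch of the GFF3 tag and parent-link conventions mentioned above. It is not the EMBOSS implementation; the function name and the sample records are hypothetical. It assumes only what the GFF3 specification states: column 9 holds ";"-separated tag=value pairs, multiple values are comma-separated, values are percent-encoded, and "ID", "Parent" and "Is_circular" are reserved tags.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9):
    """Split a GFF3 attributes column into {tag: [values]} (illustrative helper)."""
    attributes = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing ';'
        tag, _, value = field.partition("=")
        # Multiple values are comma-separated; each one is percent-encoded.
        attributes[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return attributes


# Hypothetical records: a circular region and an exon with two parents.
region = parse_gff3_attributes("ID=chrM;Is_circular=true")
exon = parse_gff3_attributes("ID=exon1;Parent=mRNA1,mRNA2;Note=5%27 end")
```

The "Parent" list is what links a child feature to one or more parent features by their "ID" values.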

Peter has extended database definitions to use a new "special:" attribute where library code can look for known prefixes in the tag value. These will replace the current specially formatted comment attributes in ensembl code.

Peter will add LGPL and CVS tags in the header files. Mahmut asked whether the use of header files could be more specific in the libraries, as any header change currently causes a complete rebuild in Eclipse.

Michael would like to document enumerated types and constants. Peter will extend the documentation scripts to cover these declarations.

2.3 Other

Michael is working on updates to configuration and will be ready to commit soon. Unused symbols are being removed and definitions relocated as appropriate in ajdefine.h and ajarch.h.

Mahmut noted that Jemboss now requires the Java 1.6 AWT Desktop API, so compilation fails with Java 1.5. Tests could be added to the Makefile, which currently states Java 1.4 as the requirement.

Michael noted that some CYGWIN tests in configure appear to be unused.

Michael offered to demonstrate a Solaris virtual machine that can be used in VirtualBox for testing.

3. New developments

3.1 Assembly data

Mahmut has a parser for MAF format assembly data from MIRA, and for data in SAM format. Peter Cock's maf2sam script was particularly helpful in resolving differences between the formats. The SAM format extends the sequence parsing to include mapping and header information.
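
As an illustration of the mapping and header information that SAM adds to plain sequence data, here is a minimal Python sketch. It is not Mahmut's parser; the function and dictionary key names are hypothetical. It assumes the layout given in the SAM specification: header lines start with "@", and alignment lines have 11 mandatory tab-separated fields followed by optional fields.

```python
SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")
INT_FIELDS = ("flag", "pos", "mapq", "pnext", "tlen")


def parse_sam_line(line):
    """Parse one SAM line into a dict (illustrative helper, no validation)."""
    line = line.rstrip("\n")
    if line.startswith("@"):
        # Header line: record type (e.g. SQ) followed by TAG:VALUE fields.
        tag, *fields = line.split("\t")
        return {"header": tag[1:], "fields": fields}
    columns = line.split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("alignment line has fewer than 11 fields")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    record["optional"] = columns[11:]
    return record


header = parse_sam_line("@SQ\tSN:ref\tLN:45")
read = parse_sam_line(
    "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*")
```

The "rname", "pos", "mapq" and "cigar" fields carry the mapping of the read onto the reference; "seq" and "qual" are the sequence data proper.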

3.2 Ensembl

Michael has data structures for assemblies and mapping coordinates at the chromosome, clone and contig levels. These can be used to map and remap features at each level.
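
A rough Python sketch of the kind of coordinate mapping described above, for one contig placed on a chromosome. It is not the Ensembl API or Michael's data structures; the class, its fields and the sample values are hypothetical, and it assumes 1-based inclusive coordinates with a strand of +1 or -1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One contig region placed on a chromosome (hypothetical structure)."""
    contig: str
    contig_start: int   # 1-based, inclusive
    contig_end: int     # 1-based, inclusive
    chrom_start: int    # chromosome position of the placed region's left end
    strand: int         # +1 forward, -1 reverse

    def to_chromosome(self, pos):
        """Map a contig position to its chromosome position."""
        if not self.contig_start <= pos <= self.contig_end:
            raise ValueError("position lies outside the mapped segment")
        offset = pos - self.contig_start
        if self.strand == 1:
            return self.chrom_start + offset
        # Reverse strand: count back from the right end of the placed region.
        chrom_end = self.chrom_start + (self.contig_end - self.contig_start)
        return chrom_end - offset


forward = Segment("contig1", 1, 100, 1001, 1)
reverse = Segment("contig2", 1, 100, 2001, -1)
```

Mapping a feature then amounts to mapping its start and end positions, swapping them and flipping the feature strand when the segment is on the reverse strand.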

Michael will compare the CIGAR alignment string handling in Ensembl with Mahmut's latest code to check for consistency, especially in the use of non-standard characters.
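
One way such a consistency check could be approached is sketched below in Python. It is not the Ensembl or EMBOSS code; the function name is hypothetical. It assumes count-then-operator tokens and treats the operators defined in the SAM specification (M, I, D, N, S, H, P, =, X) as standard, reporting anything else.

```python
import re

STANDARD_OPS = set("MIDNSHP=X")  # operators defined by the SAM specification
_TOKEN = re.compile(r"(\d+)(\D)")


def parse_cigar(cigar):
    """Return ([(length, op), ...], set of non-standard operators)."""
    ops = [(int(n), op) for n, op in _TOKEN.findall(cigar)]
    # Rebuild the string to detect characters the tokens did not cover,
    # such as an operator with no count or a trailing number.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    unknown = {op for _, op in ops if op not in STANDARD_OPS}
    return ops, unknown


ops, unknown = parse_cigar("8M2I4M1D3M")
_, odd = parse_cigar("5M3Z")
```

Running both code bases' output through one check of this kind would list the non-standard characters each produces.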

Michael would like to add Wiki pages to describe the Ensembl API code at a high level.


3.3 DRCAT

Jon has a set of 1700 resource definitions from the Nucleic Acids Research website which could be added to DRCAT. It would be helpful if adding these could be automated, but there are many special cases.

3.4 EDAM

EDAM topics will need extending to have a wider scope, and some topic terms are tool-centric rather than data-centric. A refactoring could be helpful.

Jon reported a request for annotation of web services with EDAM. He was referred to the EDAM website and BioCatalogue, and will be invited to the next EDAM meeting.

5. Documentation and Training

5.1 Web server

Peter and Jon will work on possibilities for a revised home page.

5.2 Books

We should meet soon with CUP to discuss the books.

Jon suggested contacting training providers and asking for feedback.

6. User queries and answers

All done.

8. Date Of Next Meeting

Monday 29th August is a public holiday. The next EMBOSS meeting will be on Monday 5th September.