EMBOSS: Project Meeting (Mon 22nd March 10)
Peter fixed a problem with loading very large (Human chromosome 1) FASTA sequences. The sequence format auto-detection required the input to be buffered, and each buffered record had a reserved string size of over 2000 bytes. Several fixes were applied. The buffered records now use the minimum string size (ajStrAssignS no longer copies the original reserved size). More significantly, as FASTA format cannot fail once it has accepted the ID record, buffering is now turned off for FASTA format at that point, and for any other format once the ajseqread.c function that parses it can no longer "return ajFalse". Run times improved dramatically as a result.
Alan has rewritten the HttpGet functions/sub-functions in ajseqdb to make them IPv6-compliant. This also involved additions to ajsys for socket handling. The HttpGet functions now contain no ifdefs. Committed to CVS.
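As an illustration of what IPv6-compliant socket code generally looks like, the sketch below resolves a host string with getaddrinfo() and AF_UNSPEC, so the same call path serves IPv4 and IPv6 without ifdefs. It is not the ajsys or ajseqdb code: the function name numeric_host_family is hypothetical, and the sketch assumes a POSIX system and restricts itself to numeric addresses so that no DNS lookup is needed.

```c
#define _POSIX_C_SOURCE 200112L

#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* Hypothetical helper (not an EMBOSS function): resolve a numeric host
 * string and report its address family. The caller never needs to know
 * in advance whether the address is IPv4 or IPv6.
 * Returns AF_INET, AF_INET6, or -1 if the string is not a valid address. */
static int numeric_host_family(const char *host, const char *port)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    int family;

    assert(host != NULL);

    memset(&hints, 0, sizeof hints);
    hints.ai_family   = AF_UNSPEC;    /* accept either IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;  /* TCP, as used for HTTP */
    hints.ai_flags    = AI_NUMERICHOST | AI_NUMERICSERV; /* no DNS in this sketch */

    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;

    family = res->ai_family;  /* first (and here only) result */
    freeaddrinfo(res);
    return family;
}
```

A real HTTP client would drop AI_NUMERICHOST, walk the returned list, and try socket() and connect() on each entry until one succeeds.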
Alan spent some time looking at SIGALRM equivalents for WIN32. Preliminary investigation shows that some Microsoft example code snippets do not appear to work quite as advertised using the Express compiler.
A new application martseqs queries a registry and marks those marts/datasets/attributes that can return sequence information. The library has been extended to cope. Code is committed.
Alan wrote routines to parse HTTP URLs. This is done in an IPv6-compliant way following W3C recommendations. They are currently in ajmart.c but can be moved to (e.g.) ajstr.c at some stage (or equivalents added). Code committed.
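The main complication IPv6 adds to URL parsing is that an address literal contains colons, so it is written in square brackets to keep it distinct from the port separator. The sketch below shows one way to split host and port under that convention. It is not the ajmart.c code: split_http_url is a hypothetical name, only the http scheme is handled, and an empty port is treated as an error for simplicity.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not an EMBOSS function): split
 * "http://host[:port]/path" into host and port. IPv6 literals must be
 * bracketed, e.g. http://[2001:db8::1]:8080/x
 * Returns 0 on success, -1 on malformed input or if host is too small. */
static int split_http_url(const char *url, char *host, size_t hostlen,
                          int *port)
{
    const char *start;  /* first character of the host */
    const char *hend;   /* first character after the host (and bracket) */
    const char *end;    /* end of the authority part */
    char *stop;
    size_t n;
    long v;

    assert(url != NULL && host != NULL && port != NULL);

    if (strncmp(url, "http://", 7) != 0)
        return -1;

    start = url + 7;
    end = start + strcspn(start, "/?#");
    *port = 80;  /* default HTTP port */

    if (*start == '[') {
        /* bracketed IPv6 literal: colons inside belong to the address */
        hend = memchr(start, ']', (size_t)(end - start));
        if (hend == NULL)
            return -1;
        start++;
        n = (size_t)(hend - start);
        hend++;  /* step past the closing bracket */
    } else {
        hend = memchr(start, ':', (size_t)(end - start));
        if (hend == NULL)
            hend = end;
        n = (size_t)(hend - start);
    }

    if (n == 0 || n >= hostlen)
        return -1;
    memcpy(host, start, n);
    host[n] = '\0';

    if (hend < end) {
        /* anything left in the authority must be ":" followed by digits */
        if (*hend != ':' || !isdigit((unsigned char)hend[1]))
            return -1;
        v = strtol(hend + 1, &stop, 10);
        if (stop != end || v < 1 || v > 65535)
            return -1;
        *port = (int)v;
    }
    return 0;
}
```

The host string returned for a bracketed literal has the brackets removed, which is the form that resolver functions such as getaddrinfo() expect.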
The Swiss Institute of Bioinformatics are reviewing EDAM and will report back to Jon.
Peter would like to implement a parser and indexing for OBO format ontology data (EDAM, GO, SO, and others) and for the NCBI taxonomy so that these can be used to enhance EMBOSS results.
Peter will look into modifying the AjPFeature structure so that child features are stored in a list within the parent. The current approach of holding all features in one table is problematic when sorting and when processing results. The code changes should be minimal and the results would be much cleaner. At the same time, GFF3 format needs some attention to enforce stricter rules, including the escaping of characters in tag values.
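For reference, the GFF3 specification requires percent-encoding of tab, newline, carriage return, other control characters and the percent sign, and within the attributes column also the semicolon, equals sign, ampersand and comma. A minimal sketch of escaping a tag value under those rules follows. The function name gff3_escape_value is hypothetical and this is not the EMBOSS feature-writing code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not an EMBOSS function): percent-encode a GFF3
 * attribute (tag) value from 'in' into 'out'.
 * Returns 0 on success, -1 if 'out' is too small. */
static int gff3_escape_value(const char *in, char *out, size_t outlen)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t o = 0;
    unsigned char c;

    assert(in != NULL && out != NULL);

    for (; *in != '\0'; in++) {
        c = (unsigned char)*in;

        /* control characters, plus the characters reserved in column 9 */
        if (c < 0x20 || c == 0x7F || strchr(";=&,%", c) != NULL) {
            if (o + 3 >= outlen)   /* need 3 bytes plus the terminator */
                return -1;
            out[o++] = '%';
            out[o++] = hex[c >> 4];
            out[o++] = hex[c & 0x0F];
        } else {
            if (o + 1 >= outlen)   /* need 1 byte plus the terminator */
                return -1;
            out[o++] = (char)c;
        }
    }

    if (o >= outlen)
        return -1;
    out[o] = '\0';
    return 0;
}
```

Spaces are left unescaped here, which the specification permits in attribute values.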
Peter has worked through the latest 'to do' records for the Developers Manual. Most of the 'new' items were repeats of tasks already done and committed.
Jon reported that the Developers Guide is currently around 450 pages. The ACD syntax documentation could be too long for the book size and may need trimming to fit (with the full document on the web site).
We are waiting for news from the publishers on the printed size of the books. The page count estimates are for the Word version.