EMBOSS: Project Meeting (Mon 16th April 2007)
Peter has (again) fixed the efficiency of setting long string values. Testing showed that very long sequences were causing far too many string extensions and copies. The string library functions now correctly double the length of long strings (the code added for this was being bypassed). Reading EMBL and GenBank formats now preallocates the sequence string to the length given on the first line of the entry, so the string does not need to be expanded while the sequence is read.
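As an illustration only, the two ideas can be sketched in Python rather than in the EMBOSS C string library. The function names, the regular expression and the starting capacity of 16 are assumptions made for this sketch and are not taken from the EMBOSS source. The first function reads the length declared on an EMBL "ID" line or a GenBank "LOCUS" line; the second counts how many doublings a buffer would need to reach a given length.

```python
import re

# Declared length at the end of an EMBL "ID" line ("... 1859 BP.")
# or inside a GenBank "LOCUS" line ("... 5028 bp ...").
# This pattern is an assumption for the sketch, not the EMBOSS parser.
_LENGTH_RE = re.compile(r"(\d+)\s+(?:bp|aa)\b", re.IGNORECASE)


def declared_length(first_line):
    """Return the sequence length stated on the first line, or None."""
    if not first_line.startswith(("ID ", "LOCUS ")):
        return None
    match = _LENGTH_RE.search(first_line)
    return int(match.group(1)) if match else None


def count_doublings(final_length, initial_capacity=16):
    """Count how many times a buffer must double to hold final_length.

    initial_capacity=16 is a hypothetical starting size.
    """
    capacity = initial_capacity
    doublings = 0
    while capacity < final_length:
        capacity *= 2
        doublings += 1
    return doublings


embl_id = "ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP."
length = declared_length(embl_id)

# Growing from a small buffer needs several doublings;
# starting at the declared length needs none.
grown = count_doublings(length)
presized = count_doublings(length, initial_capacity=length)
```

Under these assumptions, a sequence of 1,000,000 residues needs 16 doublings from a 16-character buffer, whereas extending by a fixed small amount each time would need tens of thousands of extensions. Presizing from the first line of the entry avoids growth altogether.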
Peter has updated all the test databases with a script that picks up the latest versions of all entries. Some entries have been removed; for example, three related sequences are now merged into a single entry in EMBL/GenBank. Wildcard searches in the program examples depended on sequence IDs, which are no longer available for EMBL entries. These have been replaced by accession number wildcards or by list files.
Peter would like to review the list of test databases. We currently include wormpep as an example FASTA format database, but it is becoming dated. We should probably include REFSEQ. We also need to generate GCG database files for the dbigcg and dbxgcg tests. The University of Cambridge can help generate the files.
Alan noted that we still need a patch file for the first fixes.
Jon has checked through the ACD datatype and attribute documentation. He will now concentrate on ACD syntax, ACD files and command-line behaviour, which will form two chapters and an appendix section.
The next meeting is on Monday 30th April.