EMBOSS: Project Meeting (Mon 10th August 09)
Peter reported discussions with the other Open-Bio projects on FASTQ format. An outstanding issue is whether to allow zero length sequences after trimming of adaptors and low quality bases. As FASTQ files may be paired (where the data came from paired-end reads) it will be necessary to keep such zero length sequences. This will require a change to the current requirement for at least 1 base in a sequence. Applications will need to define a minimum sequence length which may default to zero or 1, whichever causes the least disruption to existing ACD files. Peter also reported from the GMOD meeting. In discussions with the samtools author Heng Li it was agreed that EMBOSS would concentrate on interconversion of alignment formats to/from SAM and BAM formats. This is a much requested extension to samtools that EMBOSS can most appropriately cover.
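The FASTQ pairing issue can be illustrated with a minimal Python sketch. It is not EMBOSS code: the function names, the quality threshold of 20 and the Phred+33 offset are assumptions chosen for the example. It shows why a minimum length of 1 can put two paired files out of step, while a minimum length of zero keeps them in step.

```python
def trim_low_quality(seq, qual, min_q=20, offset=33):
    """Trim low quality bases from the 3' end. May return empty strings."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def trim_records(records, min_length=0, min_q=20):
    """Trim every (name, seq, qual) record; drop those shorter than min_length.

    With min_length=0 nothing is ever dropped, so the record count of a
    file is unchanged and its mate file stays in step with it.
    """
    kept = []
    for name, seq, qual in records:
        seq, qual = trim_low_quality(seq, qual, min_q)
        if len(seq) >= min_length:
            kept.append((name, seq, qual))
    return kept


# Hypothetical paired data: the second read of file 1 is all low quality.
reads1 = [("r1/1", "ACGT", "IIII"), ("r2/1", "ACGT", "!!!!")]
reads2 = [("r1/2", "TTGA", "IIII"), ("r2/2", "GGCC", "IIII")]

in_step = len(trim_records(reads1, 0)) == len(trim_records(reads2, 0))
out_of_step = len(trim_records(reads1, 1)) != len(trim_records(reads2, 1))
```

In this example a minimum length of 1 removes "r2/1" from the first file only, so the remaining records no longer line up with their mates.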
Peter described the SAM tab delimited format and the BAM binary version. BAM is increasingly used to store unaligned short read data as the compression is very efficient and BAM files can be very efficiently indexed and accessed using samtools remote protocols (ftp and http).
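As an illustration of the tab delimited layout, the sketch below splits one SAM alignment line into its 11 mandatory fields. The field names follow the published SAM specification and may differ from the draft discussed at the meeting; the record type and function name are assumptions made for the example. The sample line is an unaligned read, as would be stored when SAM/BAM is used for raw short read data.

```python
from collections import namedtuple

# Field names as in the published SAM specification (11 mandatory columns),
# plus any optional TAG:TYPE:VALUE columns collected in "tags".
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)


def parse_sam_line(line):
    """Split one tab delimited SAM alignment line into a SamRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("SAM line has fewer than 11 mandatory fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(
        qname, int(flag), rname, int(pos), int(mapq), cigar,
        rnext, int(pnext), int(tlen), seq, qual, fields[11:],
    )


# An unaligned read: flag bit 0x4 is set and the reference fields are "*".
record = parse_sam_line("read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII")
is_unmapped = bool(record.flag & 0x4)
```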
The job timeout is not working when jobs are submitted through LSF. Terminating a job when the timeout is reached returns a "completed" job status instead of "terminated". Mahmut will fix this before extending the run time.
Mahmut has updated the Taverna plugin for SoapLab to support Taverna 1.7.2. The version update is necessary because Taverna's use of Raven requires plugin versions to match the Taverna release number.
Mahmut will also modify the SoapLab plugin to support Taverna 2.0. Discussions with Taverna support have resolved an issue with logging error messages (logging only works in the lib directory). The plugin API is similar to the Taverna 1.7.2 interface.
Peter proposed an extension to the DB definitions in emboss.default to define a SERVER with an access method that could return all databases and their formats. Applications could query any server to find the database names and formats that it can access.
The USA syntax will require an extension to access a database from a known server. Peter proposed a syntax of "dbname@server:" as the database prefix. The emboss.default file would only need a definition of the server.
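A minimal sketch of how the proposed prefix might be split is given below. This is an illustration only: the function name and the example database and server names are hypothetical, and the sketch ignores other USA forms such as "format::file".

```python
def parse_db_prefix(usa):
    """Split 'dbname@server:entry' or 'dbname:entry' into its parts.

    Returns (dbname, server, entry); server is None when no '@server'
    part is given, i.e. the database is defined locally.
    """
    prefix, colon, entry = usa.partition(":")
    if not colon:
        raise ValueError("no database prefix in %r" % usa)
    dbname, at, server = prefix.partition("@")
    return dbname, (server if at else None), entry


# Hypothetical names: "uniprot" on a server called "ebi", and a local "embl".
remote = parse_db_prefix("uniprot@ebi:P12345")
local = parse_db_prefix("embl:X65923")
```

With this split, only the server part would need to be looked up in emboss.default. The database name would be resolved by querying the server.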
Peter suggested creating shared directories on either the current RAID server or the proposed new server. Alan pointed out that the new server may be accessed externally so that the internal RAID server is the best place. Directories will be created for databases and for software. The software directory can be included in users' paths once packages have been installed.
Peter has run spell checking on the Word versions and updated all except the last part of the Developer's manual. All changes have been committed so the spell check run will be completed this afternoon. Peter has the accepted words in the CUSTOM.DIC Word dictionary file on his laptop.
Jon has created a new file IndexTerms.txt with proposed indexable terms and their synonyms. These need to be reviewed and any unwanted terms deleted before the list is supplied to the publishers.
Jon has checked section titles, corrected capitalisation and cleaned up cases where the chapter and first section title were identical. Some sections have been reordered and some paragraph tags have been replaced by section tags. "NB" in the text has been replaced by note tags. When reviewing the text additional blocks can be created for "caution", "important", "note" and "tip". These will be highlighted in the final book format. Some informal tables have been replaced with formally defined table blocks.
All files need to be checked for consistent use of the 6.1.0 version number.
Jon has removed the NUCLEUS chapter from the Developer's Guide as there is a lack of detailed material.
Jon has generated complete User and Developer Reference Manuals using the new master files.
Peter proposed a "fridge magnet" motif for the book covers, using EMBOSS application names with appropriate juxtapositions (perhaps "seqret syco"). This could be used together with any graphics design ideas.
EMBOSS will be able to interpret any phylip file by trying each possible phylip format in turn on input. The first line of the file defines the number of sequences and their length, so a wrong guess at the format (for example, misinterpreting sequence data as names) will give results that do not match these values and can be rejected.
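The retry approach can be sketched in Python as follows. This is not the EMBOSS implementation: the two simplified parsers, their order and the strict 10 character name field are assumptions for the example. Each parser is tried in turn and its result is accepted only if it agrees with the counts on the first line.

```python
def parse_interleaved(lines, nseq, length):
    """First nseq lines carry names; later lines continue each sequence in turn."""
    body = [ln for ln in lines[1:] if ln.strip()]
    names, seqs = [], []
    for i, ln in enumerate(body):
        if i < nseq:
            names.append(ln[:10].strip())
            seqs.append(ln[10:].replace(" ", ""))
        else:
            seqs[i % nseq] += ln.replace(" ", "")
    return list(zip(names, seqs))


def parse_sequential(lines, nseq, length):
    """Each sequence is read in full (name line plus continuations) before the next."""
    body = [ln for ln in lines[1:] if ln.strip()]
    records, i = [], 0
    while i < len(body) and len(records) < nseq:
        name, seq = body[i][:10].strip(), body[i][10:].replace(" ", "")
        i += 1
        while len(seq) < length and i < len(body):
            seq += body[i].replace(" ", "")
            i += 1
        records.append((name, seq))
    # Unread lines mean the file does not fit this layout.
    return records if i == len(body) else []


def read_phylip(text):
    """Try each variant in turn; accept the first that matches the header counts."""
    lines = text.splitlines()
    nseq, length = (int(x) for x in lines[0].split()[:2])
    if nseq < 1:
        raise ValueError("header must declare at least one sequence")
    for label, parser in (("interleaved", parse_interleaved),
                          ("sequential", parse_sequential)):
        records = parser(lines, nseq, length)
        if len(records) == nseq and all(len(s) == length for _, s in records):
            return label, records
    raise ValueError("no phylip variant matches the header counts")


# A sequential file: read as interleaved, "seqB" would be absorbed into the
# first sequence, giving lengths 12 and 4 instead of 8, so that guess fails.
example = "\n".join([" 2 8", "seqA".ljust(10) + "ACGT", "ACGT",
                     "seqB".ljust(10) + "TTGA", "CCAA"])
variant, records = read_phylip(example)
```

When every sequence fits on a single line the two layouts are identical, so the order in which the variants are tried does not matter in that case.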