EMBOSS: Project Meeting (Mon 22nd June 09)
Peter described the new FASTQ sequence formats for reading next-generation short-read files. The files have a simple format that can be parsed correctly by first reading the sequence up to a line starting with a plus character, and then reading exactly the same number of quality characters, ignoring any character below ASCII 33. This avoids the error made by some parsers, which look explicitly for an at sign ("@") to start a new sequence and become confused because "@" is also a valid quality character.
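The parsing rule described above can be sketched in Python as follows. This is a minimal illustration, not the EMBOSS implementation (which is written in C); the function name read_fastq and the error messages are assumptions made for the example.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from a text handle.

    Hypothetical helper: the sequence is read up to the '+' line, then
    exactly len(sequence) quality characters are read, ignoring any
    character below ASCII 33 (spaces, newlines, control characters).
    """
    line = handle.readline()
    while line:
        line = line.rstrip("\r\n")
        if not line:                      # skip blank lines between records
            line = handle.readline()
            continue
        if not line.startswith("@"):
            raise ValueError("expected '@' at start of record: %r" % line)
        ident = line[1:]

        # Sequence: everything up to the line that starts with '+'.
        seq_parts = []
        line = handle.readline()
        while line and not line.startswith("+"):
            seq_parts.append("".join(line.split()))
            line = handle.readline()
        if not line:
            raise ValueError("no '+' line found for record %r" % ident)
        seq = "".join(seq_parts)

        # Quality: count characters rather than looking for '@', because
        # '@' (ASCII 64) is itself a valid quality character.
        qual_chars = []
        while len(qual_chars) < len(seq):
            line = handle.readline()
            if not line:
                raise ValueError("truncated quality for record %r" % ident)
            qual_chars.extend(c for c in line if ord(c) >= 33)
        if len(qual_chars) > len(seq):
            raise ValueError("quality longer than sequence in %r" % ident)

        yield ident, seq, "".join(qual_chars)
        line = handle.readline()


# Made-up test data: the second quality line of r1 starts with '@'.
example = "@r1\nACGT\nACGT\n+\n!!!!\n@III\n@r2\nGG\n+r2\n@@\n"
records = list(read_fastq(io.StringIO(example)))
```

A parser that treated every line starting with "@" as a new record would split r1 at its second quality line; counting quality characters keeps both records intact.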
Because there are two overlapping standards for quality scores, EMBOSS will read "fastq" as a pure sequence format, ignoring scores. If given a format of "fastqsanger" it will read Sanger-encoded phred scores. Format "fastqillumina" will expect Illumina encoding of phred scores, with character 64 representing zero. Format "fastqsolexa" will expect Solexa/Illumina 1.0 format, which has Solexa quality values. These differ from phred values mainly for low scores (the two scales converge above about 15), and can go as low as -5 for single reads. This is because the Solexa formula includes the probability of correctness: it uses the ratio of the error probability to the probability that the call is correct, where phred uses the error probability alone. A score of -5 is equivalent to a random base call in Solexa scoring. Solexa values are converted to phred values on input.
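The encodings and the Solexa-to-phred conversion can be illustrated with a short Python sketch. The formulas are the standard definitions (phred Q = -10 log10(p), Solexa Q = -10 log10(p / (1 - p)) for error probability p). The function names, the offset table and the choice to round to the nearest integer are assumptions made for the example, not a description of the EMBOSS code.

```python
import math

# ASCII offsets for each format name used in these minutes.
OFFSETS = {"fastqsanger": 33, "fastqillumina": 64, "fastqsolexa": 64}


def solexa_from_error_prob(p):
    """Solexa score for error probability p (includes 1 - p, the
    probability that the call is correct)."""
    return -10.0 * math.log10(p / (1.0 - p))


def solexa_to_phred(q_solexa):
    """Convert a Solexa score to a phred score.

    Rounding to the nearest integer is an assumption of this sketch.
    """
    return int(round(10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)))


def decode_quality(qual, fmt):
    """Return a list of phred scores for a quality string in format fmt."""
    offset = OFFSETS[fmt]
    scores = [ord(c) - offset for c in qual]
    if fmt == "fastqsolexa":
        scores = [solexa_to_phred(q) for q in scores]
    return scores


# A random base call is wrong 3 times in 4, which gives about -4.77,
# i.e. the Solexa minimum of -5 after rounding.
random_call = solexa_from_error_prob(0.75)
```

With this rounding, a Solexa score of -5 becomes phred 1 and a Solexa score of 40 stays at phred 40, which shows that the two scales only diverge at the low end.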
The same format names can be used on output. "fastq" will write Sanger quality scores for now. This may change if another version of the format is more generally used.
A further format "fastqint" reads integer quality scores. This format is accepted by MAQ. It is not clear if it is still in use.
Peter noted that good example files for FASTQ formats are hard to find. Many of the on-line format examples have unlikely distributions of quality scores.
Launching mwcontam was broken by the use of file lists separated by newlines. Fixing this needs some care, as the use of comma-separated lists breaks one of the SoapLab QA tests.
Mahmut has checked in documentation for the SoapLab2 typed interface, including new features and deployment with the Tomcat manager.
Mahmut reported interest in an updated SoapLab2 plugin for Taverna.
Mahmut has also been looking into Jemboss installation issues. One user installs with a different form of the command line and finds broken documentation links.
Peter noted that EMBOSS will need to add feature annotation for alignments. This would be an extension to the current sequence set object. We will need to consider combining alignment features with features for the individual aligned sequences.
Peter has signed up for a further EBI interfaces course in October.
Peter has added documentation of FASTQ short read formats to the Wiki.
Jon is working through responses for the secondary databases. Responses from the database providers have been highly positive, and include information on how to download complete database releases. SwissProt have given permission to distribute a modified UniProt dbxref.txt file. Jon is also adding library code to read dbxref.txt into a data structure.
Peter suggested checking with the ELIXIR team to compare responses with their WP2 Data Resources survey.
Peter will add code to pick cross-references from the DR lines and /db_xref qualifiers in the nucleotide feature table and return a list of cross-reference objects. Implicit references can be included at a later stage.
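As an illustration of the kind of extraction planned here, the following Python sketch collects explicit cross-references from an EMBL-style flat-file entry. It assumes the usual layouts ("DR   database; primary; secondary." lines and /db_xref="database:identifier" qualifiers on FT lines). The CrossRef type, the function name and the entry text are all hypothetical, and the real EMBOSS code would be a C library function.

```python
import re
from collections import namedtuple

# Hypothetical cross-reference object for this sketch.
CrossRef = namedtuple("CrossRef", ["database", "primary_id", "secondary_id"])

# /db_xref="database:identifier" as it appears in the feature table.
DB_XREF_RE = re.compile(r'/db_xref="([^:"]+):([^"]+)"')


def pick_cross_references(entry_text):
    """Return a list of CrossRef objects found in DR and FT lines."""
    xrefs = []
    for line in entry_text.splitlines():
        if line.startswith("DR   "):
            # Drop the trailing full stop, then split on semicolons.
            fields = line[5:].rstrip().rstrip(".").split(";")
            fields = [f.strip() for f in fields if f.strip()]
            if len(fields) >= 2:
                secondary = fields[2] if len(fields) > 2 else None
                xrefs.append(CrossRef(fields[0], fields[1], secondary))
        elif line.startswith("FT "):
            for database, ident in DB_XREF_RE.findall(line):
                xrefs.append(CrossRef(database, ident, None))
    return xrefs


# Made-up entry fragment with illustrative identifiers only.
entry = """\
ID   EX000001; SV 1; linear; mRNA; STD; HUM; 100 BP.
DR   ExampleDB; EX0001; SEC0001.
DR   OtherDB; OT0002.
FT   source          1..100
FT                   /db_xref="taxon:9606"
FT   CDS             1..99
FT                   /db_xref="UniProtKB/Swiss-Prot:P00000"
"""
found = pick_cross_references(entry)
```

Implicit references, such as identifiers mentioned in free-text comment lines, are deliberately left out of this sketch, matching the plan to add them at a later stage.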
Alan requested that all changes for the 6.1.0 release are committed and tested in good time. We need to check carefully for new files that should be explicitly named in the Makefile.am files to be included in the release.
Alan will test the release on Fedora 11, Mac OS X (Snow Leopard beta), Solaris (with gcc), SGI O2, and SUSE. We have no Alphas to test on, and no IBM loan systems. The HP loan systems need an operating system upgrade. mEMBOSS will be tested on 64-bit XP and on 32-bit and 64-bit Vista.
EBI systems have not yet reinstalled the RAID server. Alan will install Fedora 11 on the existing machines once the RAID is available.
The kemboss author has sent new documentation to be included in the User's Guide.
Peter will be away next week, giving a 15-minute talk at BOSC, a 25-minute demo at ISMB, a poster (sent to Sanger for printing) and a lunchtime Birds of a Feather session. BOSC will be a good place to discuss common issues with the other Open-Bio projects.