EMBOSS: Project Meeting (Mon 22nd June 09)


EBI: Peter Rice, Alan Bleasby, Jon Ison, Mahmut Uludag

1. Minutes of the last meeting

Minutes of the meeting of 15th June 2009 are here.

2. Current release

2.1 Applications

Alan has modified embossdata to sort filenames before reporting.

2.2 Libraries

Peter has updated MEGA format to produce files compatible with MEGA version 4.1 beta. One of the example files contained 45 alignments of proteins with "gene" and "domain" command labels. These are treated by EMBOSS input as 45 separate alignments, and can be correctly copied by seqretsetall. Testing with MEGA 4.1 shows that the file is treated as one alignment with the genes and domains as alignment annotation. EMBOSS has no alignment annotation at present.

Peter described the new FASTQ sequence formats for reading next-generation short read files. The files have a simple format which can be parsed correctly by first reading the sequence up to a line starting with a plus character, and then reading exactly the same number of quality characters, ignoring characters lower than ASCII 33. This avoids the errors in some parsers that explicitly look for an at sign to start a new sequence and become confused, as the at sign is also a valid quality character.
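The parsing rule described above can be sketched as follows. This is a minimal illustration in Python, not the EMBOSS library code (which is written in C); the function name read_fastq and its return shape are assumptions made for the example.

```python
from typing import Iterator, List, Tuple


def read_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) records from FASTQ text lines.

    Hypothetical helper: the sequence is read up to the '+' line, then exactly
    len(sequence) quality characters are read, so an '@' or '+' at the start
    of a quality line is never mistaken for a new record.
    """
    i = 0
    n = len(lines)
    while i < n:
        header = lines[i].rstrip("\r\n")
        i += 1
        if not header:
            continue  # skip blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        name = header[1:]

        # Read sequence lines up to the line starting with '+'.
        seq_parts = []
        while i < n and not lines[i].startswith("+"):
            seq_parts.append(lines[i].strip())
            i += 1
        if i >= n:
            raise ValueError(f"record {name!r} has no '+' separator line")
        sequence = "".join(seq_parts)
        i += 1  # step over the '+' line

        # Read exactly len(sequence) quality characters, ignoring anything
        # below ASCII 33 (spaces, tabs, line ends).
        qual_chars: List[str] = []
        while len(qual_chars) < len(sequence):
            if i >= n:
                raise ValueError(f"record {name!r} has too few quality values")
            qual_chars.extend(c for c in lines[i] if ord(c) >= 33)
            i += 1
        if len(qual_chars) > len(sequence):
            raise ValueError(f"record {name!r} has too many quality values")

        yield name, sequence, "".join(qual_chars)
```

Counting quality characters rather than searching for the next at sign is what keeps a quality line such as "@III" from being misread as a record header.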

Because there are two overlapping standards for quality scores, EMBOSS will read "fastq" as a pure sequence format, ignoring scores. If given a format of "fastqsanger" it will read Sanger encoded phred scores. Format "fastqillumina" will expect Illumina encoding of phred scores, with character 64 representing zero. Format "fastqsolexa" will expect Solexa/Illumina 1.0 format, which has Solexa quality values. These differ from phred values for low scores (roughly 15 and below), and can go as low as -5 for single reads. This is because the Solexa formula uses the odds of an error (the error probability divided by the probability of a correct call) rather than the error probability alone. A score of -5 is approximately equivalent to a random base call in Solexa scoring. Solexa values are converted to phred values on input.
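A small sketch of the three encodings and the Solexa to phred conversion may help. It is written in Python for illustration only and is not the EMBOSS code. The names OFFSETS, solexa_to_phred and decode_quality are hypothetical, the Solexa characters are assumed to use the same offset of 64 as the Illumina encoding, and rounding the converted value to the nearest integer is an assumption of this example.

```python
import math
from typing import List

# ASCII offset of a zero score for each format name used in these minutes.
OFFSETS = {"fastqsanger": 33, "fastqillumina": 64, "fastqsolexa": 64}


def solexa_to_phred(q_solexa: float) -> float:
    """Convert a Solexa score to the phred scale.

    Solexa: Q = -10 log10(p / (1 - p)); phred: Q = -10 log10(p).
    Eliminating p gives Q_phred = 10 log10(10^(Q_solexa / 10) + 1).
    """
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)


def decode_quality(quality: str, fmt: str) -> List[int]:
    """Return phred scores for a quality string in the named format."""
    offset = OFFSETS[fmt]
    scores = [ord(c) - offset for c in quality]
    if fmt == "fastqsolexa":
        # Assumption of this sketch: round to the nearest integer phred score.
        scores = [round(solexa_to_phred(q)) for q in scores]
    return scores
```

With these definitions a Solexa score of 40 stays at phred 40, while a Solexa score of -5 maps to a phred score of about 1.2, which shows that the two scales only diverge at the low end.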

The same format names can be used on output. "fastq" will write Sanger quality scores for now. This may change if another version of the format is more generally used.

A further format "fastqint" reads integer quality scores. This format is accepted by MAQ. It is not clear if it is still in use.

Peter noted that good example files for FASTQ formats are hard to find. Many of the on-line format examples have unlikely distributions of quality scores.

2.3 SoapLab

Mahmut has merged development branch changes to SoapLab into the main CVS branch. This version is now deployed on the EBI test servers.

Launch of mwcontam was broken by using file lists separated by newlines. This needs some care in fixing as the use of comma separated lists breaks one of the SoapLab QA tests.

Mahmut has checked in documentation for the SoapLab2 typed interface, including new features and deployment with the tomcat manager.

Mahmut reported interest in an updated SoapLab2 plugin for Taverna.

2.4 Other

Alan has been making minor tweaks to the configuration of Jemboss.

Mahmut has also been looking into Jemboss installation issues. One user installs with a different form of the command line and finds broken documentation links.

3. New developments

Alan continues to work on interfaces to Ensembl and to investigate the internals of BioPerl. Jon reported that some secondary databases are also using Ensembl interfaces. Some data resources have a public MySQL query interface which could use extensions to the planned Ensembl interface code.

Peter noted that EMBOSS will need to add feature annotation for alignments. This would be an extension to the current sequence set object. We will need to consider combining alignment features with features for the individual aligned sequences.

Peter has signed up for a further EBI interfaces course in October.

Peter has added documentation of FASTQ short read formats to the Wiki.

Jon is working through responses for the secondary databases. Responses from the database providers have been highly positive, and include information on how to download complete database releases. SwissProt have given permission to distribute a modified UniProt dbxref.txt file. Jon is also adding library code to read dbxref.txt into a data structure.

Peter suggested checking with the ELIXIR team to compare responses with their WP2 Data Resources survey.

Peter will add code to pick cross-references from the DR lines and /db_xref qualifiers in the nucleotide feature table and return a list of cross-reference objects. Implicit references can be included at a later stage.
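As an illustration of the planned behaviour, the sketch below collects explicit cross-references from an EMBL-style flat file entry. It is written in Python and is not the EMBOSS design: the CrossReference class, the extract_xrefs function and the field names are hypothetical. It assumes DR lines of the form "DR   database; primary; secondary." and qualifiers of the form /db_xref="database:identifier".

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CrossReference:
    """Hypothetical cross-reference object for this sketch."""
    database: str
    primary_id: str
    secondary_id: Optional[str] = None
    source: str = "DR"  # "DR" line or "db_xref" qualifier


# Database name runs up to the first colon; the identifier is the remainder.
DB_XREF = re.compile(r'/db_xref="([^":]+):([^"]+)"')


def extract_xrefs(entry_lines: List[str]) -> List[CrossReference]:
    """Return explicit cross-references found in DR lines and FT qualifiers."""
    xrefs: List[CrossReference] = []
    for line in entry_lines:
        if line.startswith("DR "):
            body = line[2:].strip().rstrip(".")
            fields = [f.strip() for f in body.split(";") if f.strip()]
            if len(fields) >= 2:
                secondary = fields[2] if len(fields) > 2 else None
                xrefs.append(
                    CrossReference(fields[0], fields[1], secondary, "DR"))
        elif line.startswith("FT"):
            match = DB_XREF.search(line)
            if match:
                xrefs.append(
                    CrossReference(match.group(1), match.group(2),
                                   None, "db_xref"))
    return xrefs
```

Recording where each reference came from (the source field) is one possible way to keep DR line and feature table references distinguishable if implicit references are added later.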

4. Administration

Alan has produced workstation specifications. Peter is in contact with our Dell account representative and will order the workstations this week. The specification is for an Intel Quad Q9550 at 3GHz (top of the core duo range and with a better price-performance than the i7). The workstations include 8GB of fast DDR memory, 2.5TB of local disk, 24 inch widescreen monitors and the usual peripherals. Alan will install Fedora 11 as a dual boot with the 64-bit Windows Vista provided by Dell.

Alan requested that all changes for the 6.1.0 release are committed and tested in good time. We need to check carefully for new files that should be explicitly named in the Makefile.am files to be included in the release.

Alan will test the release on Fedora 11, Mac OS X (Snow Leopard beta), Solaris (with gcc), SGI O2, and SUSE. We have no Alphas to test on, and no IBM loan systems. The HP loan systems need an operating system upgrade. mEMBOSS will be tested on 64-bit XP and on 32-bit and 64-bit Vista.

EBI systems have not yet reinstalled the RAID server. Alan will install Fedora 11 on the existing machines once the RAID is available.

5. Documentation and Training

5.1 Books

Jon reminded everyone of the 24th August deadline.

The kemboss author has sent new documentation to be included in the User's Guide.

5.2 Website

Peter has updated pages on the Wiki.

6. User queries and answers

All done this week.

7. AOB

Peter announced the new grant on LinkedIn and has had some good responses. This looks to be a good way to announce news as all contacts get a 2-line report in their update email when the "status" message changes.

Peter will be away next week. He is giving a 15-minute talk at BOSC and a 25-minute demo at ISMB, presenting a poster (sent to Sanger for printing) and running a lunchtime Birds of a Feather session. BOSC will be a good place to discuss common issues with the other Open-Bio projects.

8. Date Of Next Meeting

The next meeting will be on Monday 29th June in the usual meeting room A2-106. Peter will be away at BOSC and ISMB.