EMBOSS: Project Meeting (Monday 15th June 2009)

Attendees

EBI: Peter Rice, Alan Bleasby, Jon Ison, Mahmut Uludag
Sanger:
Visitors:
Apologies:

1. Minutes of the last meeting

Minutes of the meeting of 8th June 2009 are here.

2. Current release

2.1 Applications

Peter suggested adding new applications to support the processing of multiple sequence sets, as indicated by issues in reading sets of data in MEGA format. These include an application to return the nth set from an input file, and versions of infoseq to report on a sequence set or multiple sequence sets.

2.2 Libraries

Peter has committed a number of code changes.

Peter has extended support for MEGA sequence format. Support for output in MEGA format is in place. There are a number of issues when reading the example data files provided by the developers of MEGA. Several of these files include multiple sets of aligned sequences. Although EMBOSS can read these sets, it is complicated to write them in a format that MEGA will accept. For example, the standard "format" header should include the number of sequences and the length of the alignment, but it appears that only one file header is allowed per output file, so it is better to omit these fields from the "format" command records. MEGA does not gracefully handle some incorrectly formatted data: the application may terminate.

Peter has extended the "infile" datatype to have a new attribute "directory". The attribute may also be useful for "datafile" data types, but this remains to be investigated.

Peter has added a new ACD function "@(value:variable)" to allow an environment variable to define the default directory from which an input file is read. This will allow the HMMERNEW applications to support HMMER's use of HMMERDB, BLASTDB and BLASTMAT as the locations of input files. Processing gracefully handles undefined input directories, and explicit paths specified by the user. A cleanup of the handling of empty paths by the file name functions will make the ACD code cleaner.
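The intended behaviour can be illustrated with a short Python sketch. It is not the ACD implementation: the function name "resolve_input_path" and its arguments are hypothetical, and it assumes that any file name with a directory component counts as an explicit path.

```python
import os


def resolve_input_path(filename, env_var, environ=None):
    """Return the path to open for an input file.

    Hypothetical helper, not part of EMBOSS. `env_var` names the
    environment variable (for example HMMERDB) that holds the default
    input directory.
    """
    env = os.environ if environ is None else environ

    # An explicit path given by the user is used unchanged.
    if os.path.dirname(filename):
        return filename

    # An undefined or empty variable falls back to the bare file name.
    directory = env.get(env_var, "")
    if not directory:
        return filename

    return os.path.join(directory, filename)
```

For example, with HMMERDB set to a directory, a bare file name is looked up in that directory, while a name such as "./local/globins.hmm" is left alone.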

Peter has fixed an issue in ACD command line processing where a short qualifier name such as "-i" could be misinterpreted as an associated qualifier "-odirectory". Where an associated qualifier has a single match, a second check is now made against all qualifiers. If there is an exact match to another qualifier then associated qualifier processing rejects the original match. This issue probably dates back to the original EMBOSS 1.0.0 release.
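The two-step check could look roughly like the following Python sketch. The function name, the qualifier lists and the assumption that abbreviations are matched as prefixes are all illustrative and are not taken from the EMBOSS ACD code.

```python
def match_associated(name, associated, all_qualifiers):
    """Return the associated qualifier matched by `name`, or None.

    Hypothetical sketch: `associated` holds the associated qualifier
    names and `all_qualifiers` holds every qualifier the application
    defines. Prefix matching is an assumption made for illustration.
    """
    # Step 1: the abbreviation must match exactly one associated qualifier.
    candidates = [q for q in associated if q.startswith(name)]
    if len(candidates) != 1:
        return None
    match = candidates[0]

    # Step 2: an exact match to a different qualifier rejects the match.
    for qualifier in all_qualifiers:
        if qualifier == name and qualifier != match:
            return None

    return match
```

In this sketch an abbreviation that is also the full name of another qualifier is never taken as an associated qualifier, which is the effect of the fix described above.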

Jon has added missing quotes in HMMERNEW ACD files. These could also be tested by acdvalid.

Mahmut suggested supporting FASTQ data format for sequence input. Peter will extend FASTA format to include FASTQ processing, without explicitly processing the sequence quality data for now.
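As a rough illustration of reading FASTQ while skipping the quality data, here is a minimal Python sketch. It is not the planned EMBOSS code: the function name is hypothetical, and it assumes the simple four-line record layout (header, sequence, "+" line, quality) with no wrapped sequence lines.

```python
def read_fastq_sequences(lines):
    """Return (header, sequence) pairs from four-line FASTQ records.

    Hypothetical helper. Quality lines are read but discarded, and
    records are assumed not to be wrapped over several lines.
    """
    # Drop line endings and skip blank lines.
    cleaned = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("incomplete FASTQ record")

    records = []
    for i in range(0, len(cleaned), 4):
        header, sequence, plus, _quality = cleaned[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record at line %d" % (i + 1))
        # Keep the header text after "@"; ignore the quality string.
        records.append((header[1:], sequence))
    return records
```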

2.3 SoapLab

Mahmut has checked out the latest Subversion code for the Taverna SoapLab plugin.

Mahmut reported some issues with the SoapLab server monitoring servlet code from External Services. Useful extensions include checking for a delayed job launch before reporting a 2 minute timeout error as a server problem.

Garbage collection was effectively turned off until memory was full because the minimum and maximum values for JVM memory were both set to 1GB. Open file errors can be avoided by making sure SoapLab closes all open pipes explicitly rather than leaving them to be closed by garbage collection.

Mahmut has updated the SoapLab2 abstract submitted to BOSC.

Mahmut is working with Martin Senger on a new SoapLab release before the presentation at BOSC at the end of this month.

2.4 Other

Alan has modified the "make dist" commands to avoid error messages on Linux when removing CVS directories from the distribution tree.

Alan has been working on shell script configuration to enable the Unix version to build Jemboss by default. There have been some GNU makefile issues.

Alan has also committed changes to allow compilation without warnings with gcc 4.4. With --dev-warnings enabled, some local variables shadowed global names. These were renamed with a 'v' prefix. Peter noted similar issues with the latest Cygwin.

Alan discussed plans to update the configure and make procedures for developers to support libtool 2 under Fedora 11 and Cygwin. It is possible to update the current Fedora 10 installations and run "autoreconf -fiv" to allow early testing of the changes, or to wait for installation of Fedora 11. Libtool 2 will generate additional files that will need to be included in CVS.

Peter has installed BioLegato and the BIRCH package from Brian Fristensky in Manitoba. There are some issues in the installation procedure, which can redefine environment variables and the user prompt, and which adds a welcome message on login. Changes have been requested to allow better support of BIRCH and BioLegato with standalone versions of other packages. Peter will try generating BioLegato configuration files using the GDE configuration file syntax.

Alan has built and successfully tested mEMBOSS using the latest commits.

3. New developments

Alan continues to look into the Ensembl API and BioPerl, using examples from the recent EBI workshop on application programming interfaces for EBI resources.

Jon has completed working through the list of cross-referenced data resources in EMBL and SwissProt. We now have an updated list of contact addresses and have some replies. A number of resources are providing BioMart and DAS based interfaces which we will be able to use when these are added to EMBOSS. Plain text is available for some resources, but the official URLs can be out of date. In many cases a single data resource has multiple entities available, and may have multiple pages per entry. We can start with selecting a single start point and reporting the URL. The Wiki is being updated with the latest details for each resource.

Jon has started development of a new application xrefall to return all data cross-references for a sequence. This will use a data file "dbtypes.dat" which will list the EDAM data resource type and all data resource names used by the sequence databases. Peter proposed extending this to be a list with a recommended name and a list of alternative names used, stored as a table internally as we already do for feature keys. The file will also be used for new applications that report specific types of cross-references. The output file will report by data resource types, listing the primary and secondary identifiers and the URL.
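The proposed table of recommended and alternative names could be modelled along the following lines. This Python sketch is illustrative only: the format of "dbtypes.dat" is not yet defined, so the tuple layout, the function names and the example entries (including the type labels) are assumptions.

```python
def build_name_index(entries):
    """Map every known name (case-insensitive) to its type and recommended name.

    Each entry is (resource_type, recommended_name, alternative_names).
    This layout is hypothetical and is not the dbtypes.dat format.
    """
    index = {}
    for resource_type, recommended, alternatives in entries:
        for name in [recommended, *alternatives]:
            index[name.lower()] = (resource_type, recommended)
    return index


def lookup(index, name):
    """Return (resource_type, recommended_name) or None if unknown."""
    return index.get(name.lower())


# Illustrative entries; the type labels are placeholders, not EDAM terms.
EXAMPLE_ENTRIES = [
    ("type-A", "InterPro", ["INTERPRO", "IPR"]),
    ("type-B", "Pfam", ["PFAM"]),
]
```

A name used by any sequence database then resolves to a single recommended name, in the same spirit as the existing internal table of feature keys.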

4. Administration

Alan reported that the new SCSI card successfully resolved the RAID server error issues. We are now waiting for systems to reinstall the RAID server and emboss7.

The old disks are available for reuse. They may be up to 7 years old.

Alan has installed Fedora 11 on 7 home machines. Some installation bugs have been reported, for example issues when installing with 2 graphics cards, and problems with print dialogues. Installation on the EMBOSS machines can wait until these have been fixed.

The cost centre for the new grant has been opened. Alan will propose specifications for the 3 developer workstations.

Peter will check on pricing for XMLspy.

Mahmut suggested making more use of Eclipse and NetBeans for project editing. We can arrange joint training sessions on these and Emacs after the release. Alan noted that Fedora 11 includes both Eclipse and NetBeans.

Alan has subscribed a new user to the mailing lists to test digests. It is not clear where the digest frequency is defined. It was suggested that a daily digest would be best for users who select the "digest" option.

5. Documentation and Training

5.1 Books

The full version of XMLmind and the converter have been installed.

Jon has removed the "new applications" file from XML as not relevant to the books. The page contents are now on the Wiki.

5.2 Website

Jon has updated bad external links, and updated the ACD syntax documentation to include the recent changes. Peter will copy the new text into the master ACD documentation template in CVS.

6. User queries and answers

All done this week.

7. AOB

Peter has announced the new grant on the mailing lists. This has been picked up by GenomeWeb, which reported on the news this week.

8. Date Of Next Meeting

The next meeting will be on Monday 22nd June in the usual meeting room A2-106.