EMBOSS: Project Meeting (Mon 17th September 07)
Alan has extended amino acid property calculations to use monoisotopic molecular weights in addition to the current average weights. Monoisotopic weights are used in the analysis of mass spectrometry data. The embProp functions now read a new file Emolwt.dat and pass new objects EmbPPropAmino and EmbPPropMolwt. A command line switch -mono has been added to all programs that use this local data file.
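The difference between the two weight types can be sketched as follows. This is a minimal illustration in Python, not the embProp code: the table holds standard residue masses for only three amino acids, and the names RESIDUE_MASS, WATER and peptide_mass are hypothetical. It does not reproduce the layout of Emolwt.dat.

```python
# Residue masses in daltons for an illustrative subset of amino acids.
# Each entry is (average, monoisotopic). Hypothetical table, not Emolwt.dat.
RESIDUE_MASS = {
    "G": (57.0519, 57.02146),
    "A": (71.0788, 71.03711),
    "S": (87.0782, 87.03203),
}

# One water is added per peptide for the free N- and C-termini.
WATER = (18.01528, 18.01056)


def peptide_mass(sequence, mono=False):
    """Return the peptide mass, average by default or monoisotopic if mono is set."""
    idx = 1 if mono else 0
    total = WATER[idx]
    for residue in sequence.upper():
        total += RESIDUE_MASS[residue][idx]
    return total
```

Calling peptide_mass("GAS") gives the average weight and peptide_mass("GAS", mono=True) the monoisotopic weight, which mirrors the choice the -mono switch offers.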
The local data files need comments added to their headers to describe the source of the data values.
Alan has added a new application density to report and plot high and low densities for individual bases (-quad) and AT+GC (-dual) as requested by Tim Carver at Sanger.
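As a rough illustration of the two modes, the sketch below computes per-window base fractions in Python. The sliding window, the function name window_densities and its parameters are assumptions made for this example; the minutes do not describe how density itself calculates or reports its values.

```python
def window_densities(sequence, window, dual=False):
    """Return (start, densities) for each window, with 1-based start positions.

    By default each of A, C, G and T gets its own fraction (as in -quad).
    With dual=True the fractions are combined into AT and GC (as in -dual).
    """
    seq = sequence.upper()
    results = []
    for start in range(0, len(seq) - window + 1):
        chunk = seq[start:start + window]
        # Fraction of the window occupied by each base.
        counts = {base: chunk.count(base) / window for base in "ACGT"}
        if dual:
            counts = {
                "AT": counts["A"] + counts["T"],
                "GC": counts["G"] + counts["C"],
            }
        results.append((start + 1, counts))
    return results
```

High and low density regions could then be picked out by comparing each window's fractions against chosen thresholds.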
Alan has made very general wrappers for the emira and emiraest integration which could be used for other applications.
Peter has modified pepcoil to produce a report as output. This included extensions to the "motif" report format. To give clean output by default, regions that are not coiled-coil are not reported. The frames are reported as additional columns. In the "motif" format these appear only if a frame is set by the program. In table formats the frame columns appear with missing ('.') values.
Peter has further updated GFF3 output to require a reading frame for coding sequence features. This included adding a new function to identify CDS features using the Sequence Ontology internal code. Similar functions can be added for other feature classes.
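In GFF3 the reading frame of a CDS feature is carried in the phase column: the number of bases to skip at the start of the feature to reach the first base of the next codon. The Python sketch below shows how phases follow from segment lengths. It is not the EMBOSS implementation; the function name cds_phases is hypothetical, and it assumes the segments are given in transcription order and that the first segment starts on a complete codon.

```python
def cds_phases(segment_lengths):
    """Return the GFF3 phase (0, 1 or 2) for each CDS segment.

    Assumes segments are in transcription order and the first one
    begins at the first base of a codon.
    """
    phases = []
    phase = 0
    for length in segment_lengths:
        phases.append(phase)
        # Bases left over after this segment's complete codons decide
        # how many bases the next segment must supply to finish a codon.
        phase = (3 - (length - phase) % 3) % 3
    return phases
```

For example, a 10-base first segment leaves one base of an incomplete codon, so the next segment has phase 2.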
Peter has corrected the meaning of "Case" in function names for string processing so that it always refers to case-insensitive processing. For some functions the naming was inconsistent.
Mahmut has identified applications suitable for generating DASGFF annotations. To fit with current servers, these need to accept a sequence as the only input, for example applications which search for matches to a standard database of motifs such as prosite, jaspar or transfac. This excludes applications such as fuzzpro which need a user-defined pattern.
Peter identified a need for a new output report attribute to indicate whether features are over the entire sequence or are for regions with specific properties. The whole: attribute can be used for applications such as dan which report all windows over the sequence. Such applications are of little interest as DASGFF annotation servers.
Mahmut has found further references to XML feature and sequence formats. None have a recommended standard, although AGAVE is top of a list from Peter Ernst (DKFZ).
Mahmut noted that DASGFF format includes the concept of a feature group where related features (e.g. a set of features for a single transcript/gene) have a common group name with annotation. Peter will look into reading DASGFF as input and preserving the group annotation in the feature table so that it can be reproduced on output.
Mahmut will circulate links to DAS2 applications, including IGB.
Mahmut asked whether object files could be built in a subdirectory, as they currently appear in the list of files in Eclipse (or perhaps there is a way to filter them out in Eclipse). He also asked whether large source files (over 10,000 lines) could be split, as they can be slow to scroll in Eclipse in C++ mode.
Mahmut found a problem with the AppLab server running estgenome with large input sequences. The problem was fixed by killing the jobs after more than 12 hours. We have a new version of LSF installed so the server will be able to run jobs in batch again.
A user has produced an emacs mode for editing ACD files. Alan has contacted him to merge in the EMBOSS C style. This is now being tested. It may be necessary to reintroduce the automatic brace insertion.
Alan is waiting for confirmation of the fix for the primer3_core launch problem on Windows.
Alan will rebuild on Windows with the 3 new applications added since the release.
Jon has committed the latest revisions of guidelines for the documentation of applications and source code, and has added details of user-defined values for ACD data types such as lists.
Jon noted that we still need to update the list of required applications, both to describe those already available elsewhere (FASTA and BLAST) and to remove some that are already in EMBOSS.
The next meeting is on Monday 1st October.