EMBOSS: C2 Summary Report
The workshop topics were presented by Peter Rice, followed by discussion on the issues raised.
The following is a brief summary of the discussion. For more detail, see the individual topics, which will be updated with notes on the discussion and on progress since the workshop.
The list of input formats continues to grow. There were requests for the following formats to be included:
MASE is a multiple sequence alignment editor originally written by D. Faulkner and J. Jurka (TIBS 13, 321-322, 1988), and since then maintained and upgraded at Los Alamos.
Example:
;
CPZANT
ATGGGAGCGGGGGCGTCTGTTTTGAGGGGAGAGAAGCTAGATACATGGGA
AAGTATCAGGCTTCGGCCCGGTGGCAAGAAAAAGTACATGATAAAACATC
TGGTTTGGGCAAGATCGGAGCTGCAGCGTTTTGCGCTCAGCTCCTCCCTT
CTAGAAACATCAGAAGGTTGTGAAAAGGCTATCCATCAATTGAGCCCTTC
CATAGAAATAAGATCCCCTGAAATAATATCTTTGTTTAACACCATTTGTG
;
U455
ATGGGTGCGAGAGCGTCAGTATTAAGCGGGAAAAAATTAGATTCATGGGA
GAAAATTCGGTTAAGGCCAGGGGGAAACAAAAAATATAGACTGAAACATT
TAGTATGGGCAAGCAGGGAGCTGGAAAAATTCACACTTAACCCTGGCCTT
TTAGAAACAGCAGAAGGATGTCAGCAAATACTGGGACAATTACAACCAGC
TCTCCAGACAGGAACAGAAGAACTTAGATCATTATATAATACAGTAGCAG
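A reader for this format could be quite small. The following Python sketch is only an illustration, not EMBOSS code: it assumes the usual MASE layout in which one or more lines starting with ";" are followed by a name line and then by sequence lines, and the function name `parse_mase` is hypothetical.

```python
def parse_mase(text):
    """Parse MASE-style text into a list of (name, sequence) pairs.

    Assumed layout: comment line(s) starting with ';', then a name
    line, then one or more sequence lines.
    """
    records = []
    name = None
    chunks = []
    expect_name = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(";"):
            # A comment line closes any open record and announces a new one.
            if name is not None:
                records.append((name, "".join(chunks)))
                name, chunks = None, []
            expect_name = True
        elif expect_name:
            # First non-comment line after the comments is the name.
            name = line
            expect_name = False
        elif name is not None:
            # Everything else belongs to the current sequence.
            chunks.append(line.replace(" ", ""))
    if name is not None:
        records.append((name, "".join(chunks)))
    return records
```

Sequence lines are simply concatenated, so gap characters inside an alignment are passed through unchanged.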
USAs (sequence specifications) should include URLs. This can be done by looking for http: in the USA, but it may clash with checks for sequence format (which would be required as well). This needs some more thought but should definitely be done.
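One possible way round the clash is to test for a URL scheme before any other parsing of the USA. The Python sketch below is an assumption-laden illustration, not the EMBOSS implementation: it assumes the format is given as a `format::` prefix, that only http, https and ftp URLs matter, and the function name `classify_usa` is hypothetical.

```python
import re

# Optional "format::" prefix, then a URL with an explicit "scheme://".
# Requiring "://" keeps "embl:hsfau" style database references from
# being mistaken for URLs.
_URL_RE = re.compile(
    r"^(?:(?P<format>[A-Za-z0-9]+)::)?(?P<url>(?:https?|ftp)://\S+)$"
)


def classify_usa(usa):
    """Return a dict describing whether a USA points at a URL."""
    match = _URL_RE.match(usa)
    if match:
        return {
            "kind": "url",
            "format": match.group("format"),
            "target": match.group("url"),
        }
    # Not a URL: split off an optional format prefix and leave the
    # rest (file name or database:entry) for the normal USA parser.
    fmt, sep, rest = usa.partition("::")
    if not sep:
        fmt, rest = None, usa
    return {"kind": "local", "format": fmt, "target": rest}
```

With this ordering, "fasta::http://example.org/seq.fasta" is seen as a URL in fasta format, while "embl:hsfau" is still handled as a database reference.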
It could be useful to hold sequences as objects for use in more than one algorithm or program. At present this seems to be best achieved by writing a single application for all the required functions, but there could be other options (see ACD files below).
Gap character conversion still needs attention. Gap characters should be converted to an internal representation (probably -) and back to the appropriate gap characters for each output format.
Gap character conversion is needed on output to make sure that any gaps in a sequence are correctly converted to what the format requires.
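The two conversions could be sketched as follows in Python. This is only an illustration: the set of input gap characters and the per-format output table are assumptions made for the example, and the names `to_internal` and `to_output` are hypothetical.

```python
INTERNAL_GAP = "-"

# Assumed set of characters that may mark a gap on input.
INPUT_GAP_CHARS = ".~-"

# Illustrative table only: gap character to write for each output
# format. Formats not listed keep the internal character.
OUTPUT_GAP = {
    "msf": ".",
    "fasta": "-",
}


def to_internal(seq, gap_chars=INPUT_GAP_CHARS):
    """Convert every recognised gap character to the internal one."""
    return "".join(INTERNAL_GAP if c in gap_chars else c for c in seq)


def to_output(seq, fmt):
    """Convert internal gaps to the character the output format uses."""
    return seq.replace(INTERNAL_GAP, OUTPUT_GAP.get(fmt, INTERNAL_GAP))
```

Keeping one internal character means each input or output format only needs an entry in a table, rather than special handling inside every application.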
GCG database format, using Bill Pearson's code, is in great demand. The GCG index files are not well suited to EMBOSS, being proprietary and slow, but a Staden/EMBL-CD index would be fast and would allow a simple way to specify subsets of the database as wildcards or lists of file names in the divisions file.
Some method of incremental indexing, for EMBL and SwissProt/SpTrEMBL, would be welcome. No clear consensus was reached on how best to implement it.
Blast results in GFF format need to have all score elements in the tag value fields, in addition to a "score" value.
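As a rough illustration of what such a line might look like, the Python sketch below writes one tab-separated GFF record with the individual score elements repeated as tag-value pairs in the last field. The tag names ("bits", "expect", "identity") and the function name `gff_line` are assumptions chosen for the example, not an agreed vocabulary.

```python
def gff_line(seqname, source, feature, start, end, score, strand, frame,
             attributes):
    """Format one GFF record, with attributes as 'tag value' pairs."""
    # Attributes are written in the order given, separated by " ; ".
    attr_text = " ; ".join(
        f"{tag} {value}" for tag, value in attributes.items()
    )
    fields = [
        seqname, source, feature,
        str(start), str(end), str(score),
        strand, frame, attr_text,
    ]
    return "\t".join(fields)
```

For example, `gff_line("seq1", "blastn", "similarity", 10, 90, 152.0, "+", ".", {"bits": 152.0, "expect": "2e-40", "identity": 98})` keeps 152.0 in the score column and also lists all three values in the tag-value field.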
Frame would be better as "1, 2, 3" with "0" for "frame ignored". However, the frame values are now fixed by the GFF format definition and cannot be changed, so "." must be used to ignore the frame.
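If applications prefer to work with "1, 2, 3" internally, the conversion to and from the GFF field is trivial. The Python sketch below assumes that internal frames 1, 2, 3 correspond to GFF frames 0, 1, 2 and that internal 0 means "frame ignored"; the function names are hypothetical.

```python
def to_gff_frame(frame):
    """Map an internal frame (0 = ignored, 1-3) to the GFF frame field."""
    if frame == 0:
        return "."
    if frame in (1, 2, 3):
        return str(frame - 1)
    raise ValueError(f"invalid frame: {frame!r}")


def from_gff_frame(field):
    """Map a GFF frame field ('.', '0', '1', '2') to an internal frame."""
    if field == ".":
        return 0
    if field in ("0", "1", "2"):
        return int(field) + 1
    raise ValueError(f"invalid GFF frame field: {field!r}")
```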
A text output format is needed. No special requirements were identified. Anything should be more useful than separate report formats for each application.
A strictly controlled vocabulary will be needed. The GFF format suggests the EMBL feature table, but EMBOSS may need some extensions.
An interesting suggestion was to provide defaults in a project file, either for all applications (in an emboss.default or .embossrc file) or application-specific (in a defaults.program file, for example). The syntax could be:
OPTION program.qualifier "value"
All values are strings internally. Examples would be default local data files such as codon usage tables. Maybe this is best implemented in the emboss.default file first, with other file(s) added later.
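To show how little is needed to read such lines, here is a minimal Python sketch of a parser for the suggested syntax. It is an assumption-based illustration only: treating lines starting with "#" as comments is not part of the suggestion, and the function name `parse_defaults` and the program and qualifier names in the usage note are hypothetical.

```python
import shlex


def parse_defaults(text):
    """Parse lines of the form: OPTION program.qualifier "value".

    Returns a dict mapping (program, qualifier) to the value, which is
    always kept as a string.
    """
    defaults = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        # Assumption: blank lines and '#' comment lines are skipped.
        if not line or line.startswith("#"):
            continue
        # shlex handles the quoted value, including embedded spaces.
        tokens = shlex.split(line)
        if len(tokens) != 3 or tokens[0] != "OPTION" or "." not in tokens[1]:
            raise ValueError(
                f'line {lineno}: expected OPTION program.qualifier "value"'
            )
        program, qualifier = tokens[1].split(".", 1)
        defaults[(program, qualifier)] = tokens[2]
    return defaults
```

A line such as `OPTION myprog.codonfile "my codon table.cut"` would then be stored under the key `("myprog", "codonfile")`, with later lines overriding earlier ones for the same key.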
This topic should be raised on the emboss@sanger.ac.uk mailing list to get feedback from the wider user community.
A specific filename extension for output would be useful for post processing. There is a default in the ACD processing which could be redefined for Web interfaces.
The filename extension point (see Web Interfaces) applies here too.
NCBI's Vibrant interface was suggested as a possible GUI and graphics engine.
There was discussion of editors. Current plans are to use existing editors and to make sure that they can read and write formats which EMBOSS understands. We are already in contact with the authors of CINEMA, Artemis and JalView.
Some editors can start external applications, and could be more closely integrated with EMBOSS.
There is also the possibility of integrating a simple editor from the public domain. One suggestion was Will Gilbert's MSE but the source code is not included in the distribution.
There were offers to help with documentation, and discussion of a formal "EMBOSS Documentation" project.
Documentation should include training material with EMBOSS applications as examples. Where EMBOSS has no application to cover a particular example, one should be developed for completeness.
For Linux systems, an RPM distribution would be very useful. There are already plans to produce this with other Linux developers at the Sanger Centre.
There is little feedback so far on which platforms are most used by the EMBOSS beta test sites. When the new installation procedure is ready, it will be announced together with a request for beta testers to fill in a simple registration form with this information.
Some data file directories can be very large, for example codon usage tables from CUTG. It would be useful to have a database query for these, for example using SRS to extract a single table from CUTG on request instead of keeping individual table files. This should be a simple syntax issue.
Prettyplot (the EGCG version) can run very slowly at times on pen plotters. The cause is changing pen colour for each letter, which causes the plotter to switch pens hundreds of times. This will be fixed by processing one colour at a time, as far as possible.
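The idea behind the fix can be sketched in a few lines of Python. This is an illustration of the grouping technique only, not the prettyplot code: each letter to draw is assumed to be a tuple of (x, y, letter, colour), and the function names are hypothetical.

```python
from collections import defaultdict


def order_by_colour(glyphs):
    """Reorder glyphs so that all letters of one colour are drawn together.

    Colours are visited in the order they first appear, and the
    original order is kept within each colour.
    """
    by_colour = defaultdict(list)
    for glyph in glyphs:
        by_colour[glyph[3]].append(glyph)
    ordered = []
    for colour in by_colour:
        ordered.extend(by_colour[colour])
    return ordered


def count_pen_selections(glyphs):
    """Count how often the plotter must pick up a pen, including the first."""
    selections = 0
    current = None
    for glyph in glyphs:
        if glyph[3] != current:
            selections += 1
            current = glyph[3]
    return selections
```

After reordering, the number of pen selections equals the number of distinct colours, however many letters are plotted.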
There have been requests to separate the graphics library from the rest of the package. Work is still in progress.
There was discussion of an EMBOSS Usenet newsgroup. For now, postings will continue to embnet.general and bionet.software (and bionet.announce where appropriate).