EMBOSS: C2 Summary Report

The workshop topics were presented by Peter Rice, followed by discussion on the issues raised.

The following is a brief summary of the discussion. For more detail, see the individual topics, which will be updated with notes on the discussion and on progress since the workshop.

    Sequences and Databases

  1. Input Sequence Formats

    The list of input formats continues to grow. There were requests for the following formats to be included:

    USAs (sequence specifications) should include URLs. This can be done by looking for http: in the USA, but it may clash with the checks for sequence format (which would be required as well). This needs more thought but should definitely be done.
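
    The possible clash between a format prefix and a URL can be illustrated with a minimal Python sketch. It assumes a USA may start with an optional "format::" prefix and that a URL is recognised by a leading http:// or https://. The function name split_usa and the returned tuple are hypothetical and are not part of EMBOSS.

```python
import re

# Assumption: an optional "format::" prefix precedes the rest of the USA.
_FORMAT_PREFIX = re.compile(r"^([A-Za-z0-9]+)::(.*)$", re.DOTALL)
# Assumption: only http(s) URLs are recognised, as suggested in the discussion.
_URL_SCHEME = re.compile(r"^https?://", re.IGNORECASE)


def split_usa(usa):
    """Return (format, kind, target), where kind is 'url' or 'local'.

    The format prefix is removed first, so "http:" is never mistaken
    for a format name or for a "database:entry" reference.
    """
    fmt = None
    rest = usa
    match = _FORMAT_PREFIX.match(usa)
    if match:
        fmt, rest = match.group(1), match.group(2)
    kind = "url" if _URL_SCHEME.match(rest) else "local"
    return fmt, kind, rest
```

    Checking for the double colon before checking for the URL scheme is one way to keep the two tests from interfering. Other orderings may be equally valid.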

    It could be useful to hold sequences as objects for use in more than one algorithm or program. At present this seems to be best achieved by writing a single application for all the required functions, but there could be other options (see ACD Files below).

    Gap character conversion still needs attention. Gap characters should be converted to an internal representation (probably -) and back to the appropriate gap characters for each output format.
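
    The round trip described above might look like the following Python sketch. The format names and their gap characters in GAP_TABLE are purely illustrative assumptions, not a statement of what any real format requires. Only the internal "-" comes from the discussion.

```python
INTERNAL_GAP = "-"

# Illustrative table only: gap characters accepted on input and the single
# character written on output, keyed by a hypothetical format name.
GAP_TABLE = {
    "dot_format": {"input": ".~", "output": "."},
    "dash_format": {"input": "-", "output": "-"},
}


def gaps_to_internal(seq, fmt):
    """Convert every input gap character of `fmt` to the internal gap."""
    table = str.maketrans({ch: INTERNAL_GAP for ch in GAP_TABLE[fmt]["input"]})
    return seq.translate(table)


def gaps_from_internal(seq, fmt):
    """Convert the internal gap to the gap character `fmt` writes."""
    return seq.replace(INTERNAL_GAP, GAP_TABLE[fmt]["output"])
```

    Reading "AC..GT~~" as dot_format gives "AC--GT--" internally, and writing it back as dot_format gives "AC..GT..". The distinction between the two input gap characters is lost, which may or may not be acceptable for a given format.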

  2. Output Sequence Formats

    Gap character conversion is needed on output to make sure that any gaps in a sequence are correctly converted to what the format requires.

  3. Sequence Databases

    GCG database format, using Bill Pearson's code, is in great demand. The GCG index files are not well suited to EMBOSS, being proprietary and slow, but a Staden/EMBL-CD index would be fast and would allow a simple way to specify subsets of the database as wildcards or as a list of file names in the divisions file.

    Some method of incremental indexing, for EMBL and SwissProt/SpTrEMBL, would be welcome. No clear consensus was reached on how best to implement it.

  4. Sequence Features

    Blast results in GFF format need to have all score elements in the tag value fields, in addition to a "score" value.
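
    As a rough illustration, the Python sketch below writes one hit as a tab-separated GFF-style line in which a chosen primary score fills the score column and every score element is repeated as a tag and value. The source and feature names ("blast", "similarity"), the tag names and the function blast_hit_to_gff are all assumptions made for the example.

```python
def blast_hit_to_gff(seqname, start, end, strand, scores, primary="bits"):
    """Format one hit as a tab-separated GFF-style line.

    scores: dict of score name -> value. The element named by `primary`
    goes in the score column, and every element (including the primary
    one) is also written as a "tag value" pair in the last column.
    """
    attributes = " ; ".join(f"{tag} {value}" for tag, value in scores.items())
    fields = [
        seqname,
        "blast",        # source (hypothetical name)
        "similarity",   # feature (hypothetical name)
        str(start),
        str(end),
        str(scores[primary]),
        strand,
        ".",            # frame ignored
        attributes,
    ]
    return "\t".join(fields)
```

    A call such as blast_hit_to_gff("seq1", 10, 90, "+", {"bits": 52.3, "expect": 1e-5, "identity": 87}) would keep all three score elements available to downstream tools, while still providing a single "score" value.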

    Frame would be better as "1, 2, 3" with "0" for "frame ignored". However, the frame values are now fixed in the GFF format and cannot be changed, so "." must be used to ignore the frame.

    A text output format is needed. No special requirements were identified. Anything should be more useful than separate report formats for each application.

    A strictly controlled vocabulary will be needed. The GFF format suggests the EMBL feature table, but EMBOSS may need some extensions.

    Command Line Interface

  5. ACD Files

  6. Command Line Syntax

  7. User Prompts

    An interesting suggestion was to provide defaults in a project file, either for all applications (in an emboss.default or .embossrc file) or application-specific (in a defaults.program file, for example). The syntax could be:

    OPTION program.qualifier "value"

    All values are strings internally. Examples would be default local data files such as codon usage tables. This is probably best implemented in the emboss.default file first, with other files added later.
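
    A parser for the proposed syntax could be sketched in Python as follows. It assumes one OPTION entry per line, shell-style quoting of the value, and "#" comments. The comment convention, the function name parse_defaults and the program and qualifier names in the usage note are assumptions, not part of the proposal.

```python
import shlex


def parse_defaults(text):
    """Parse lines of the form: OPTION program.qualifier "value".

    Returns a dict mapping (program, qualifier) to the value, which is
    always kept as a string.
    """
    defaults = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        # shlex handles the quoted value and drops "#" comments (assumption).
        tokens = shlex.split(line, comments=True)
        if not tokens:
            continue  # blank line or comment only
        if len(tokens) != 3 or tokens[0] != "OPTION" or "." not in tokens[1]:
            raise ValueError(
                f'line {lineno}: expected OPTION program.qualifier "value"'
            )
        program, qualifier = tokens[1].split(".", 1)
        defaults[(program, qualifier)] = tokens[2]
    return defaults
```

    For example, a file containing the line OPTION someprog.codonfile "local.cut" would yield the entry ("someprog", "codonfile") with the string value "local.cut". Later entries for the same key overwrite earlier ones, which would let an application-specific file override a site-wide default if the files are read in that order.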

    This topic should be raised on the emboss@sanger.ac.uk mailing list to get feedback from the wider user community.

  8. Error Messages

    Other User Interfaces

  9. Web Interfaces

    A specific filename extension for output would be useful for post-processing. There is a default in the ACD processing which could be redefined for Web interfaces.

  10. GUI Interfaces

    The filename extension point (see Web Interfaces) applies here too.

    NCBI's Vibrant interface was suggested as a possible GUI and graphics engine.


  11. Existing Applications

    There was discussion of editors. Current plans are to use existing editors and to make sure that they can read and write formats which EMBOSS understands. We are already in contact with the authors of CINEMA, Artemis and JalView.

    Some editors can start external applications, and could be more closely integrated with EMBOSS.

    There is also the possibility of integrating a simple editor from the public domain. One suggestion was Will Gilbert's MSE but the source code is not included in the distribution.

  12. New Applications

  13. User Documentation

    There were offers to help with documentation, and discussion of a formal "EMBOSS Documentation" project.

    Documentation should include training material with EMBOSS applications as examples. Where EMBOSS has no application to cover a particular example, one should be developed for completeness.

    Installing and Beta Testing

  14. User Testing

  15. Installation

    For Linux systems, an RPM distribution would be very useful. There are already plans to produce this with other Linux developers at the Sanger Centre.

    There is little feedback so far on which platforms are most used by the EMBOSS beta test sites. When the new installation procedure is ready, it will be announced together with a request for beta testers to fill in a simple registration form with this information.

  16. Environment Variables

  17. Local Data Files

    Some data file directories can be very large, for example codon usage tables from CUTG. It would be useful to have a database query for these, for example using SRS to extract a table from CUTG on demand instead of keeping individual table files. This should be a simple syntax issue.


  18. Technical Support

  19. Support for Developers

  20. Library Source Files

  21. Programming Language Support

  22. Software Documentation

  23. Graphics Library

    Prettyplot (the EGCG version) can run very slowly at times on pen plotters. The cause is changing pen colour for each letter, which causes the plotter to switch pens hundreds of times. This will be fixed by processing one colour at a time, as far as possible.
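
    The idea of processing one colour at a time can be sketched in Python as below. The tuple layout for letters and the "pen"/"draw" operation names are hypothetical. The sketch only shows the reordering, not any real plotting calls.

```python
from collections import defaultdict


def order_by_colour(letters):
    """Group drawing operations so that each colour is selected once.

    letters: iterable of (x, y, char, colour) tuples.
    Returns a list of ("pen", colour) and ("draw", x, y, char) operations.
    """
    by_colour = defaultdict(list)
    for x, y, char, colour in letters:
        by_colour[colour].append((x, y, char))
    ops = []
    for colour, items in by_colour.items():
        ops.append(("pen", colour))
        ops.extend(("draw", x, y, char) for x, y, char in items)
    return ops


def pen_changes(letters):
    """Count pen switches when letters are drawn in the given order."""
    changes, current = 0, None
    for _, _, _, colour in letters:
        if colour != current:
            changes += 1
            current = colour
    return changes
```

    With letters alternating between two colours, drawing in the original order needs one pen change per letter, while the grouped order needs only one per colour. Since positions are absolute, reordering the letters should not change the finished plot.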

    There have been requests to separate the graphics library from the rest of the package. Work is still in progress.

  24. Functions Returning Objects

  25. Perl Support


  26. Future Funding

  27. Workshops and Training

  28. Publications

    There was discussion of an EMBOSS Usenet newsgroup. For now, postings will continue to embnet.general and bionet.software (and bionet.announce where appropriate).