EMBOSS: C2 Summary Report

The workshop topics were presented by Peter Rice, followed by discussion on the issues raised.

The following is a brief summary of the discussion. For more detail, see the individual topics, which will be updated with notes on the discussion and on progress since the workshop.

Sequences and Databases

Input Sequence Formats
The list of input formats continues to grow. There were requests for the following formats to be included:
- RSF GCG Rich Sequence Format
- ABI trace files
- SCF 3 trace files and previous versions
- MASE Multiple Alignment Sequence Editor format, which is very similar to the IntelliGenetics format.
  MASE is a multiple sequence alignment editor originally written by D. Faulkner and J. Jurka (TIBS 13, 321-322 (1988)) and since maintained and upgraded at Los Alamos.
  Example:
```
;
CPZANT
ATGGGAGCGGGGGCGTCTGTTTTGAGGGGAGAGAAGCTAGATACATGGGA
AAGTATCAGGCTTCGGCCCGGTGGCAAGAAAAAGTACATGATAAAACATC
TGGTTTGGGCAAGATCGGAGCTGCAGCGTTTTGCGCTCAGCTCCTCCCTT
CTAGAAACATCAGAAGGTTGTGAAAAGGCTATCCATCAATTGAGCCCTTC
CATAGAAATAAGATCCCCTGAAATAATATCTTTGTTTAACACCATTTGTG
;
U455
ATGGGTGCGAGAGCGTCAGTATTAAGCGGGAAAAAATTAGATTCATGGGA
GAAAATTCGGTTAAGGCCAGGGGGAAACAAAAAATATAGACTGAAACATT
TAGTATGGGCAAGCAGGGAGCTGGAAAAATTCACACTTAACCCTGGCCTT
TTAGAAACAGCAGAAGGATGTCAGCAAATACTGGGACAATTACAACCAGC
TCTCCAGACAGGAACAGAAGAACTTAGATCATTATATAATACAGTAGCAG
```
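As a rough illustration of how little is needed to read the example above, here is a minimal Python sketch. The layout it assumes (one or more lines starting with ";", then a name line, then sequence lines) is inferred only from that example and is not a full description of the MASE format. The function name parse_mase is hypothetical.

```python
def parse_mase(text):
    """Parse MASE-style text into a list of (name, sequence) pairs.

    Assumed layout, inferred only from the example in this report:
    one or more lines starting with ';' (comments), then a name line,
    then sequence lines up to the next ';' line or the end of input.
    """
    records = []
    name = None
    chunks = []
    expect_name = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(";"):
            # A comment line closes any record in progress.
            if name is not None:
                records.append((name, "".join(chunks)))
                name, chunks = None, []
            expect_name = True
        elif expect_name:
            # First non-comment line after ';' is taken as the name.
            name = line
            expect_name = False
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records
```

Called on the example, this would return two records named CPZANT and U455, each with its sequence lines joined into one string.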
USAs (sequence specifications) should accept URLs. This could be done by looking for http: in the USA, but that may clash with the checks for sequence format, which would still be required. The approach needs more thought, but it should definitely be done.
It could be useful to hold sequences as objects for use in more than one algorithm or program. At present this seems to be best achieved by writing a single application for all the required functions, but there could be other options (see ACD Files below).
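One possible way to avoid the clash is to check for the format first and only then look for http: in what remains. The Python sketch below assumes the format is given as a "format::" prefix and that format names are plain alphanumeric. Both are assumptions made for illustration, and split_usa is a hypothetical name, not an EMBOSS function.

```python
def split_usa(usa):
    """Split a USA into (format, location, is_url).

    Assumption: an optional format is written as a 'format::' prefix.
    It is checked first, so 'http:' is only looked for in the rest.
    """
    fmt = None
    rest = usa
    head, sep, tail = usa.partition("::")
    # Accept the prefix as a format name only if it is alphanumeric,
    # so a URL that happens to contain '::' later is not misread.
    if sep and head.isalnum():
        fmt, rest = head.lower(), tail
    is_url = rest.lower().startswith("http:")
    return fmt, rest, is_url
```

With this ordering, a specification with a format prefix followed by a URL yields both the format and the URL, while a bare URL is recognised with no format set.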
Gap character conversion still needs attention. Gap characters should be converted to an internal representation (probably -) and back to the appropriate gap characters for each output format.
Output Sequence Formats
Gap character conversion is needed on output to make sure that any gaps in a sequence are correctly converted to what the format requires.
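The two conversions could look roughly like the Python sketch below: every recognised gap character becomes "-" on input, and "-" is mapped to the character a given format expects on output. The set of input gap characters and the per-format table are illustrative assumptions, not a list agreed at the workshop.

```python
INTERNAL_GAP = "-"

# Illustrative assumption: characters treated as gaps on input.
INPUT_GAP_CHARS = ".~-"

# Illustrative assumption: gap character written for each output format.
OUTPUT_GAP = {"fasta": "-", "msf": "."}


def gaps_to_internal(seq, gap_chars=INPUT_GAP_CHARS):
    """Replace every recognised gap character with the internal one."""
    table = str.maketrans(gap_chars, INTERNAL_GAP * len(gap_chars))
    return seq.translate(table)


def gaps_to_output(seq, fmt):
    """Replace the internal gap with the character the format expects.

    Formats missing from the table keep the internal gap character.
    """
    return seq.replace(INTERNAL_GAP, OUTPUT_GAP.get(fmt, INTERNAL_GAP))
```

Keeping a single internal character means each application only has to test for "-", and the format-specific knowledge stays in the input and output code.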
Sequence Databases
GCG database format, using Bill Pearson's code, is in great demand. The GCG index files are not well suited to EMBOSS, being proprietary and slow, but a Staden/EMBL-CD index would be fast and would allow a simple way to specify subsets of the database as wildcards or lists of file names in the divisions file.
Some method of incremental indexing, for EMBL and SwissProt/SpTrEMBL, would be welcome. No clear consensus was reached on how best to implement it.
Sequence Features
Blast results in GFF format need to have all score elements in the tag value fields, in addition to a "score" value.
Frame would be better as "1, 2, 3" with "0" for "frame ignored". However, the frame values are now fixed in the GFF format and cannot be changed, so "." must be used to ignore the frame.
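A Blast hit written this way might be produced as in the Python sketch below. It assumes a tab-separated, nine-column GFF line whose last column holds "tag value" pairs separated by " ; ". The source name "blast", the feature name "similarity" and the tag names are placeholders chosen for illustration, and blast_hit_to_gff is a hypothetical function.

```python
def blast_hit_to_gff(seqname, start, end, strand, score, extra_scores):
    """Format one hit as a tab-separated GFF-style line.

    Assumptions: nine columns, with 'tag value' pairs joined by ' ; '
    in the last one. The frame column is '.', meaning frame ignored.
    """
    attributes = " ; ".join(
        f"{tag} {value}" for tag, value in extra_scores.items()
    )
    fields = [
        seqname,
        "blast",        # placeholder source name
        "similarity",   # placeholder feature name
        str(start),
        str(end),
        str(score),     # the single "score" value
        strand,
        ".",            # frame ignored
        attributes,     # all other score elements as tag values
    ]
    return "\t".join(fields)
```

The single score column is kept, and the remaining score elements travel in the tag value field so that nothing is lost.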
A text output format is needed. No special requirements were identified. Anything should be more useful than separate report formats for each application.
A strictly controlled vocabulary will be needed. The GFF format suggests the EMBL feature table, but EMBOSS may need some extensions.

Command Line Interface
ACD Files
Command Line Syntax
User Prompts
An interesting suggestion was to provide defaults in a project file, either for all applications (in an emboss.default or .embossrc file) or application specific (in a defaults.program file for example). The syntax could be:
```
OPTION program.qualifier "value"
```
All values are strings internally. Examples would be default local data files such as codon usage tables. Maybe this is best implemented in the emboss.default file first, with other files added later.
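To show how simple such a file would be to read, here is a minimal Python sketch of a parser for the suggested syntax. It keeps every value as a string, as described above. Support for "#" comments and blank lines is an assumption added for convenience, and parse_defaults is a hypothetical name.

```python
import shlex


def parse_defaults(text):
    """Return {(program, qualifier): value} from OPTION lines.

    Assumptions: blank lines and '#' comments are allowed; every
    other line must be: OPTION program.qualifier "value"
    """
    defaults = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        # shlex handles the quoted value and strips '#' comments.
        tokens = shlex.split(line, comments=True)
        if not tokens:
            continue
        if len(tokens) != 3 or tokens[0] != "OPTION" or "." not in tokens[1]:
            raise ValueError(
                f'line {lineno}: expected OPTION program.qualifier "value"'
            )
        program, qualifier = tokens[1].split(".", 1)
        defaults[(program, qualifier)] = tokens[2]  # always a string
    return defaults
```

A later line for the same program and qualifier overrides an earlier one, which would also allow an application-specific file to be read after the general one.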
This topic should be raised on the emboss@sanger.ac.uk mailing list to get feedback from the wider user community.
Error Messages

Other User Interfaces
Web Interfaces
A specific filename extension for output would be useful for post-processing. There is a default in the ACD processing which could be redefined for Web interfaces.
GUI Interfaces
The filename extension point (see Web Interfaces) applies here too.
NCBI's Vibrant interface was suggested as a possible GUI and graphics engine.

Applications
Existing Applications
There was discussion of editors. Current plans are to use existing editors and to make sure that they can read and write formats which EMBOSS understands. We are already in contact with the authors of CINEMA, Artemis and JalView.
Some editors can start external applications, and could be more closely integrated with EMBOSS.
There is also the possibility of integrating a simple editor from the public domain. One suggestion was Will Gilbert's MSE, but the source code is not included in the distribution.
New Applications
User Documentation
There were offers to help with documentation, and discussion of a formal "EMBOSS Documentation" project.
Documentation should include training material with EMBOSS applications as examples. Where EMBOSS has no application to cover a particular example, one should be developed for completeness.

Installing and Beta Testing
User Testing
Installation
For Linux systems, an RPM distribution would be very useful. There are already plans to produce this with other Linux developers at the Sanger Centre.
There is little feedback so far on which platforms are most used by the EMBOSS beta test sites. When the new installation procedure is ready, it will be announced together with a request for beta testers to fill in a simple registration form with this information.
Environment Variables
Local Data Files
Some data file directories can be very large, for example codon usage tables from CUTG. It would be useful to have a database query for these, for example using SRS to extract a table from CUTG rather than keeping individual table files. This should be a simple syntax issue.

Developers
Technical Support
Support for Developers
Library Source Files
Programming Language Support
Software Documentation
Graphics Library
Prettyplot (the EGCG version) can run very slowly at times on pen plotters. The cause is changing pen colour for each letter, which causes the plotter to switch pens hundreds of times. This will be fixed by processing one colour at a time, as far as possible.
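The idea behind the fix can be sketched in Python as follows: collect the drawing operations, group them by colour, and draw each group in turn, so the pen changes once per colour rather than once per letter. The tuple layout (colour, x, y, letter) and both function names are assumptions made for illustration and are not taken from the Prettyplot source.

```python
from collections import defaultdict


def order_by_colour(ops):
    """Group drawing operations by colour, in first-seen colour order.

    Each op is assumed to be a tuple (colour, x, y, letter).
    """
    groups = defaultdict(list)  # dicts keep insertion order (Python 3.7+)
    for op in ops:
        groups[op[0]].append(op)
    return [op for colour_ops in groups.values() for op in colour_ops]


def count_pen_changes(ops):
    """Count how often the pen must be selected or switched."""
    changes = 0
    current = None
    for colour, *_ in ops:
        if colour != current:
            changes += 1
            current = colour
    return changes
```

For a row of letters that alternates between two colours, the original order needs a pen change for every letter, while the grouped order needs only two.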
There have been requests to separate the graphics library from the rest of the package. Work is still in progress.
Functions Returning Objects
Perl Support

Future
Future Funding
Workshops and Training
Publications
There was discussion of an emboss Usenet newsgroup. For now, postings will continue to embnet.general and bionet.software (and bionet.announce where appropriate).
