EMBOSS: Project Meeting (Mar 15th 1999)


Sanger Centre: Peter Rice, Ian Longden, Richard Bruskiewich
HGMP: Alan Bleasby, Thon de Boer, Mark Faller, Gary Williams
Apologies: Rodrigo Lopez, Martin Senger, Rob Andrews, Ewan Birney, Val Curwen, Sinead O'Leary

1. Matters Arising

See below.

2. ACD extensions

Peter has implemented a new command line default qualifier "-options" and a new default attribute "optional:" for all data types. Any ACD item that specifies "opt: Y" will be prompted for if "-options" is on the command line, or if the same behaviour is turned on by the "emboss_options" environment variable.
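
The prompting rule can be sketched as follows. This is only an illustration in Python, not the ACD code itself: the function name "should_prompt", the dictionary of attributes, the short form "req: Y" for required items, and the assumption that any non-empty value of "emboss_options" turns the behaviour on are all hypothetical.

```python
import os


def should_prompt(item, argv, environ=os.environ):
    """Decide whether an ACD item that has no value yet is prompted for.

    `item` is a hypothetical dict of ACD attributes, e.g. {"opt": "Y"}.
    """
    # Assumption: required items are always prompted for.
    if item.get("req") == "Y":
        return True
    # Optional items are prompted for only when -options is in effect,
    # either from the command line or from the environment variable.
    options_on = "-options" in argv or bool(environ.get("emboss_options"))
    return item.get("opt") == "Y" and options_on
```

With this sketch, an item marked "opt: Y" is skipped for a plain command line and prompted for once "-options" is added.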

Peter proposed that the "parameter" attribute should change from an integer to a boolean. This would make it consistent with "required" and "optional", and would ensure that all parameters are defined in the correct order. As integers, it is possible to define (and prompt for) parameter 2 before parameter 1. This is inconsistent, and also makes the task of writing out help text much more difficult. The change was agreed.
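
A small sketch of why the boolean form is simpler: the position of a parameter on the command line becomes its position in the ACD file, so no numbering can disagree with the definition order. The data layout and the name "parameter_order" are assumptions made for illustration only.

```python
def parameter_order(definitions):
    """Return the names of parameters in command line order.

    `definitions` is a list of (name, attributes) pairs in the order they
    appear in a hypothetical ACD file. With a boolean "parameter"
    attribute, command line position is simply definition order.
    """
    return [name for name, attrs in definitions if attrs.get("parameter") == "Y"]


defs = [
    ("sequence", {"parameter": "Y"}),
    ("gap", {"optional": "Y"}),
    ("outseq", {"parameter": "Y"}),
]
```

Here "parameter_order(defs)" lists "sequence" before "outseq" because that is how they were defined, which is also the order in which help text would be written.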

Peter has implemented, but not yet committed, definition of a "documentation" attribute for applications. All applications should now have square brackets containing a "doc:" definition, which will be printed when the application starts if user interaction is allowed.

Peter has also implemented a "help" attribute as a default for all data types. This is intended as the help text when the application usage is printed. This will be turned on with "-help" on the command line.

Both these new texts need some way to provide simple formatting. Peter proposed implementing a plain backslash as a required newline for both "documentation" and "help". This can be extended later by giving some other meaning to other strings that start with a backslash. It also has the advantage of being relatively easy to read.
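
As a rough sketch of the proposed formatting rule, the function below turns every backslash in a text into a line break. The function name is hypothetical, and trimming the spaces around each break is an assumption, as is treating every backslash as "plain" (no other backslash sequences are defined yet).

```python
def format_help_text(text):
    """Turn each backslash in ACD help or documentation text into a newline."""
    # Assumption: spaces around the break are dropped, so that
    # "one \ two" and "one\two" are rendered the same way.
    parts = [part.strip() for part in text.split("\\")]
    return "\n".join(parts)
```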

The "-help" command line qualifier should convert the ACD definition into something similar to the current style for command line syntax in the application documentation. The format was discussed, and the consensus was for a layout as follows:

% appname sequence outseq
   Mandatory arguments:
      [-sequence]    Default text or help text
      [-outseq]      Default text or help text
       -quala        Default text or help text
       -qualb        Default text or help text
   Optional arguments:
       -optqualc     Default text or help text
       -optquald     Default text or help text

   Advanced arguments:
       -hiddenquale  Default text or help text
       -hiddenqualf  Default text or help text

   Associated qualifiers:
       -sformat1     Default text
       -sprompt1     Default text

       -osformat1    Default text
       -sprompt1     Default text

This will probably need "-help" to take a string value with the help level needed, or a set of -help qualifiers.
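
The layout above could be produced along these lines. This is a sketch under stated assumptions, not the planned implementation: the function name, the tuple format for qualifiers, and the fixed 15 character name column are all invented for illustration, and blank lines between sections are left out.

```python
def render_help(appname, sections):
    """Render a usage summary in the layout agreed at the meeting.

    `sections` is a list of (title, qualifiers) pairs; each qualifier is a
    (name, is_parameter, text) tuple. Parameters are shown in square
    brackets; other qualifiers get one leading space so the names line up.
    """
    params = [name for _, quals in sections for name, is_param, _ in quals if is_param]
    lines = ["% " + " ".join([appname] + params)]
    for title, quals in sections:
        lines.append("   " + title + ":")
        for name, is_param, text in quals:
            label = "[-" + name + "]" if is_param else " -" + name
            # Pad the name column to a fixed width before the help text.
            lines.append("      " + label.ljust(15) + text)
    return "\n".join(lines)
```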

Associated qualifier help text would be defined as an additional field within the ACD source definitions.

If more than one qualifier with the same associated qualifiers is used (e.g. two input sequences), the additional qualifiers will be summarized as a list of "-sformat2" etc. with the note "(see above)".

There is a design problem with boolean qualifiers. After some discussion, it was proposed that:

Gary's utility programs define sequence regions as a series of start and end locations in a string. Peter proposed to make this a new "regions" data type so that standard syntax and validation could be applied.
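
To illustrate what standard syntax and validation for a "regions" type might involve, here is a minimal parser. The syntax is an assumption, since the minutes do not record the format used by Gary's programs: regions are taken to be written as "start-end" pairs separated by commas, with positions counted from 1.

```python
def parse_regions(text):
    """Parse "start-end,start-end,..." into a list of (start, end) tuples."""
    regions = []
    for field in text.split(","):
        start_text, sep, end_text = field.strip().partition("-")
        if not sep:
            raise ValueError("region %r is not of the form start-end" % field)
        start, end = int(start_text), int(end_text)
        # Validation: positions start at 1 and a region may not run backwards.
        if start < 1 or end < start:
            raise ValueError("region %r has invalid bounds" % field)
        regions.append((start, end))
    return regions
```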

The "transeq" program has its own list of translation tables. Peter proposed a new "translation" data type that would read the NCBI tables and use standard naming. There should also be an internal default of the standard genetic code which all applications can use. If alternative tables are allowed, this should be defined as "-translate" in the ACD file, usually as a hidden qualifier because only advanced users will need it.
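
A sketch of an internal default built from the standard genetic code follows. The amino acid string is NCBI translation table 1 with bases in TCAG order; the names "STANDARD_CODE" and "translate", and the choice of "X" for codons that are not in the table, are assumptions for illustration.

```python
from itertools import product

BASES = "TCAG"
# NCBI translation table 1 (the standard genetic code), codons in TCAG order.
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), STANDARD_AAS)
}


def translate(dna, table=STANDARD_CODE):
    """Translate a nucleotide string codon by codon, ignoring a partial last codon."""
    dna = dna.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3
    codons = (dna[i:i + 3] for i in range(0, usable, 3))
    # Assumption: codons with ambiguity codes are reported as "X".
    return "".join(table.get(codon, "X") for codon in codons)
```

An application that allows alternative tables would pass a different dictionary as "table"; everything else would use the default.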

3. ajPosreg: POSIX 1003.2 Regular Expressions

Peter described the differences between the ajreg (egrep) regular expression functions and the ajposreg extended (POSIX 1003.2) regular expression functions.

The ajposreg functions have been successfully tested against a set of several hundred test expressions. They automatically detect the number of substrings defined by a regular expression, and always save them unless the Nosub version of the compile function is used. There are also case insensitive versions and versions which can handle multiple lines within the search string.

POSIX 1003.2 defines both extended regular expressions, which ajposreg implements, and basic (grep) regular expressions, which it does not cover (the flag is ignored) because the ajreg library should be faster and better for them.

The main reason to implement ajposreg is the support for "bounds" on repeat sizes, for example ".{2,5}" to match 2 to 5 characters. This can be useful in PROSITE motifs.
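
The bound syntax can be tried out with a short example. It uses Python's "re" module, which is not a POSIX 1003.2 implementation and is not ajposreg, but which accepts the same "{m,n}" bound notation. The motif itself (two cysteines separated by 2 to 5 residues) is made up for illustration.

```python
import re

# Two cysteines separated by 2 to 5 residues of any kind.
motif = re.compile(r"C.{2,5}C")


def find_motif(sequence):
    """Return (start, matched_text) for the first occurrence, or None."""
    match = motif.search(sequence)
    return (match.start(), match.group()) if match else None
```

Without bounds the same motif would need four alternatives spelled out by hand, one for each allowed spacing.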

POSIX 1003.2 includes "bracket expressions" for character classes. These include [:lower:] for lower case, [=e=] for accented and unaccented 'e' characters, and [.eszet.] for special characters like the German "ss" character.

The library also supports back references, where a backslash followed by a number n refers to the nth substring already matched by the same regular expression.
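
A back reference can be illustrated in the same way. Again this uses Python's "re" module rather than ajposreg, on the assumption that the "\1" notation behaves alike in both for a simple case; the pattern and test sequence are invented.

```python
import re

# \1 must repeat exactly what the first parenthesised substring matched,
# so this finds a three-base unit that occurs twice in a row.
repeat = re.compile(r"([ACGT]{3})\1")

match = repeat.search("TTGCAGCAAT")
unit = match.group(1)       # the substring captured by the parentheses
whole = match.group()       # the unit followed by its repeat
substrings = repeat.groups  # number of substrings the expression defines
```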

4. Library Documentation

Peter has made a draft document for library documentation which was circulated at the meeting. It follows the style of Thon's ACD document, and aims to introduce the AJAX and NUCLEUS libraries and their defined data types and objects.

Example code for using the library functions and the data objects will be included in this document, and can be cut and pasted from the HTML version. This should be easier than trying to include them in the source code.

On the other hand, it is easier to keep the main function documentation up to date by embedding it in the source file function headers, so these will remain the primary reference material for individual functions.

5. Graph Output

Ian has added many new graphics functions. There is a new set of ajHist functions for histograms, and extensions to the ajGraph functions for line and text drawing, error bars, and temporary changes to graph settings.

Graphics test applications are "treetypedisplay" for the line and text drawing functions and "histogramtest" for the histogram functions.

Some graphics programs are not using the graph data type in ACD files. This still needs to be merged with the graphics options implemented to date.

6. General progress on release 0.0.4

Alan is still testing CVS workarounds for the PLplot binary file problem. Peter will check with Phil Butcher at Sanger.

Rodrigo's build problems were solved. It was a local compiler problem.

Alan has added four new applications, "ant", "nab", "sig" and "cpg", which are replacements for the EGCG programs "antigenic", "helixturnhelix", "sigcleave" and "cpgreport" respectively.

There was discussion about application naming. Peter proposed posting this topic to the emboss mailing list and to bionet to get some user views.

7. Any other Business


8. Next meeting

Next meeting Monday 22nd March, usual time and place.
Peter Rice, Informatics Division, The Sanger Centre, Wellcome Trust Genome Campus, Hinxton Hall, Cambridge, CB10 1SA, UK.