EMBOSS: Project Meeting (Jun 21st 1999)
Peter has added a new "matrix" type in ACD. Previously, comparison matrices were passed as strings. "matrix" has one attribute, "protein", which is a boolean: it is either "Y" or "N" if the type is fixed, or a function "$(sequence.protein)" for applications like "matcher" that can use DNA or protein. For protein the default matrix is "BLOSUM62"; for DNA it is "DNAMAT", which is a version of the NCBI DNA_MAT matrix.
Gap penalties for "matcher" and "stretcher" are set to depend on the sequence type with a simple function test, "@($(sequence.protein) ? 14:16)". It seems best to define them explicitly in this way. GCG try to put a gap penalty in their matrix files, but matcher and stretcher in their original versions use different default penalties for BLOSUM62, and GCG 9.1 has a different definition of gap penalties in the matrix files.
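The type-dependent defaults described above can be sketched in a few lines of Python. This is only an illustration of the logic, not ACD syntax and not EMBOSS code: the function names are hypothetical, and the values are the ones quoted in these notes (BLOSUM62 or DNAMAT, and 14 or 16 from the ACD test).

```python
def default_matrix(is_protein: bool) -> str:
    """Return the default comparison matrix name for the sequence type.

    Hypothetical helper: mirrors the defaults stated in the notes.
    """
    return "BLOSUM62" if is_protein else "DNAMAT"


def default_gap_penalty(is_protein: bool) -> int:
    """Mirror the ACD test @($(sequence.protein) ? 14:16).

    The first value (14) applies when the sequence is protein,
    the second (16) when it is nucleic acid.
    """
    return 14 if is_protein else 16


if __name__ == "__main__":
    for is_protein in (True, False):
        kind = "protein" if is_protein else "DNA"
        print(kind, default_matrix(is_protein), default_gap_penalty(is_protein))
```

Keeping the penalty next to the sequence type, rather than inside the matrix file, means the same matrix file can be shared by programs that want different penalties.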
There is a new data type AjPMatrix in ajmatrices.h which handles comparison matrices. For reasons of efficiency there are functions that return the matrix itself as "int**" and the sequence character conversion table as an AjPSeqCvt object. The conversion table is built automatically from the sequence characters in the matrix file.
Matrix files at present must be in NCBI format with one matrix per file. Alternatives may be added later.
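As a rough illustration of what reading such a file involves, the sketch below parses NCBI-style matrix text into an integer table plus a character-to-index conversion table built from the characters in the file. It is written in Python for brevity and is not the AjPMatrix implementation; the function name and the toy matrix are assumptions made for this example, and it assumes the usual NCBI layout ("#" comment lines, one header row of characters, then one labelled row of integer scores per character).

```python
from typing import Dict, List, Tuple


def parse_ncbi_matrix(text: str) -> Tuple[List[List[int]], Dict[str, int]]:
    """Parse NCBI-style matrix text (hypothetical helper).

    Returns (scores, index), where index maps each sequence character
    to its row/column number and scores[i][j] is the comparison score.
    """
    # Drop blank lines and '#' comment lines, then split on whitespace.
    rows = [ln.split() for ln in text.splitlines()
            if ln.strip() and not ln.lstrip().startswith("#")]
    if not rows:
        raise ValueError("no matrix data found")

    # The first remaining row lists the column characters.
    columns = rows[0]
    index = {ch: i for i, ch in enumerate(columns)}

    scores = [[0] * len(columns) for _ in columns]
    for fields in rows[1:]:
        row_char, values = fields[0], fields[1:]
        if row_char not in index or len(values) != len(columns):
            raise ValueError("malformed row: " + " ".join(fields))
        scores[index[row_char]] = [int(v) for v in values]
    return scores, index


# Toy matrix for illustration only; these are not the DNAMAT values.
EXAMPLE = """\
# Toy DNA matrix
   A  C  G  T
A  5 -4 -4 -4
C -4  5 -4 -4
G -4 -4  5 -4
T -4 -4 -4  5
"""

scores, index = parse_ncbi_matrix(EXAMPLE)
print(scores[index["A"]][index["C"]])
```

Returning the plain table and the conversion table separately follows the same idea as the "int**" and AjPSeqCvt pair: an alignment loop can convert each sequence to indices once and then do only array lookups.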
Alan is starting work on creating frequency matrices (profiles) from a set of aligned proteins and using these for database searches.
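For readers unfamiliar with the idea, a frequency matrix can be sketched as per-column residue frequencies over the alignment. The Python below is a minimal illustration only: the notes do not describe Alan's actual method, the function names are hypothetical, and the sketch ignores sequence weighting, pseudocounts and special handling of gaps (a gap character is counted like any other symbol).

```python
from collections import Counter
from typing import Dict, List


def frequency_profile(alignment: List[str]) -> List[Dict[str, float]]:
    """Build per-column residue frequencies from aligned sequences."""
    if not alignment:
        raise ValueError("empty alignment")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")

    profile = []
    # zip(*alignment) yields one tuple of residues per alignment column.
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({res: n / len(alignment) for res, n in counts.items()})
    return profile


def score_segment(profile: List[Dict[str, float]], segment: str) -> float:
    """Score an ungapped segment by summing its column frequencies."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, segment))


if __name__ == "__main__":
    profile = frequency_profile(["ACD", "ACE", "GCD", "ACD"])
    print(profile[0])
    print(score_segment(profile, "ACD"))
```

A database search along these lines would slide the profile along each database sequence and keep the best-scoring segments.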
Peter is starting work on the Phylip programs. They will be built as a separate package under an "embassy" directory and will probably need their own CVS storage for each application for licensing purposes.
Michael Shmitz at LBL has some modifications to the libraries for "dan" and some updated nucleic acid parameter values. He also has some suggestions for the output of restrict, and has contributed a "shuffle" application to introduce changes in sequences. This will be the first contribution to EMBOSS from outside Europe.
There is a need for example code, which Peter will cover in the demonstration programs.
The main change needed was the addition of a comparison matrix data type (see above).
Peter will have a visitor from EMBnet Italy in July, who will bring some code for parsing the EMBL feature table that could be included.
Reading of features with sequences was discussed. There is a need for a way to specify feature files such as GFF if they are not part of the sequence format.
Peter proposed a special syntax for this, similar to the USA but for features. This led to the proposal for a "Uniform Feature Object" or "UFO" to be defined and added to the ACD and command line processing. Sequences with features could use the UFO specification to load features from a GFF file or to read the EMBL (or SwissProt) feature table. A similar syntax could be used for output.