European Molecular Biology Open Software Suite

1998

EMBOSS: History

In the beginning there were the

"GCGEMBL utilities"

Then there was the

"Extended GCG Package"

Now there is the

"EMBOSS Suite"

EMBOSS: Aims

Developing new tools for sequence analysis
Replacing popular but obsolete EGCG applications
Integrating with SRS and ACEDB
Integrating with popular user interface packages
Integrating with other publicly available packages and tools
Encouraging developers to use the EMBOSS software libraries

Target Users

Each of the following groups has its own special needs, which EMBOSS aims to satisfy:

Sanger Centre genomic sequencing and analysis groups
EMBnet service providers in 25 countries with over 30,000 users
Academic users
Pharmaceutical industry

Integration Issues

Data formats
Data locations
Output formats
Inter-operability within each package
Using one application within another package
Applications as sources of data
Choice of user interfaces
Client-Server architectures

Integration Issues (2)

Local specialist data
Batch processing
Automation
Data mining
Size of the user community
Specialization of the user community
Size of the development team
Stability of the development team
Documentation standards
Training needs and availability
Technical support
Specialized local development

Integrating EMBOSS

Controlled application interface
Specialized software libraries
Data objects with extensible features
Standard coding language
Planned inter-operability
Large scale user testing
Collaborations with developers
Consultations with end users

EMBOSS: Libraries

1. AJAX

General purpose
User interface
File handling
Sequence formats
High level graphics

2. NUCLEUS

Sequence analysis specific
Algorithms and methods

Applications

Target areas for Sanger Centre users:

EST clustering
Rapid pattern matching
Rapid sequence database searches
Repeat identification
Nucleotide analysis (e.g. CpG islands)
Codon usage for small genomes
Gene identification tools
Sequence patterns in large data sets
Protein motif and domain identification
Presentation tools for publication

EMBOSS: Standards

All code in ANSI standard C
Concepts from the ANSI C++ draft
Support for all common Unix platforms
Future support for non-Unix platforms
Choice of sequence formats
Choice of database formats
GNU Library GPL licensing of libraries
GNU GPL licensing of applications
Site specific customizing
User interface definitions

Why use ANSI C ?

Portability
Reuse of legacy code
Ease of maintenance
Avoidance of overloading
Data structures can represent objects
Can implement using the C++ Standard Library as a model
Linking with other languages
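
As a rough illustration of "data structures can represent objects", the sketch below pairs a C struct with constructor, method and destructor functions. The names DemoStr, demoStrNew, demoStrAppend and demoStrDel are hypothetical and are not taken from the AJAX library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical string "object": the data plus the functions that own it. */
typedef struct DemoStr {
    char  *text;   /* NUL-terminated contents */
    size_t len;    /* characters in use */
    size_t cap;    /* bytes allocated */
} DemoStr;

/* Constructor: returns NULL if memory is exhausted. */
DemoStr *demoStrNew(const char *init)
{
    DemoStr *s;

    assert(init != NULL);
    s = malloc(sizeof *s);
    if (s == NULL) return NULL;
    s->len = strlen(init);
    s->cap = s->len + 1;
    s->text = malloc(s->cap);
    if (s->text == NULL) { free(s); return NULL; }
    memcpy(s->text, init, s->cap);
    return s;
}

/* Method: append text, growing the buffer when needed. Returns 1 on success. */
int demoStrAppend(DemoStr *s, const char *more)
{
    size_t add = strlen(more);

    if (s->len + add + 1 > s->cap) {
        size_t newcap = 2 * (s->len + add + 1);
        char *p = realloc(s->text, newcap);
        if (p == NULL) return 0;
        s->text = p;
        s->cap = newcap;
    }
    memcpy(s->text + s->len, more, add + 1);   /* copies the NUL as well */
    s->len += add;
    return 1;
}

/* Destructor: frees the object and clears the caller's pointer. */
void demoStrDel(DemoStr **ps)
{
    if (ps == NULL || *ps == NULL) return;
    free((*ps)->text);
    free(*ps);
    *ps = NULL;
}
```

A caller would create the object with demoStrNew("seq"), extend it with demoStrAppend(s, "ret") and release it with demoStrDel(&s), never touching the fields directly.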

Code Documentation

We are developing our own source code documentation standard

Based on JavaDoc
Extensions from WISE 2
Eventual conversion to HTML
Control by perl scripts

EMBOSS: ACD Files

AJAX Command Definition (ACD) files control all EMBOSS applications.

Complete user interface definition
Simple definition syntax
Flexible command line syntax
Automatic processing
Information provided at startup
Conversion possibilities

EMBOSS: ACD Example

This file called "seqret.acd" defines the application "seqret" :

application: seqret

sequence: sequence [ param: 1
    info: "Input sequence" ]

seqout:   outseq   [ param: 2
    info: "Write to" ]

EMBOSS: seqret

"seqret" simply reads in a sequence and writes it out again.
All of its complexity is hidden in the library code.

The source code is very short:

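#include "emboss.h"

int main (int argc, char* argv[]) {

  AjPSeq seq;          /* input sequence object */
  AjPSeqout seqout;    /* output sequence stream */

  /* Process seqret.acd and the command line, prompting where needed */
  embInit ("seqret", argc, argv);

  /* Fetch the values named in the ACD file */
  seq = ajAcdGetSeq ("sequence");
  seqout = ajAcdGetSeqout ("outseq");

  ajSeqWrite (seqout, seq);

  ajExit ();
  return 0;            /* not reached if ajExit ends the program */
}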

EMBOSS In Action

"seqret" can be run in many ways, for example:

% cat laci.tfa
>LACI_ECOLI P03023 lactose repressor
MKPVTLYDVAEYAGV

% seqret
What sequence : laci.tfa
Write to [stdout] :
MKPVTLYDVAEYAGV

% seqret   laci.tfa   -auto
MKPVTLYDVAEYAGV

% seqret  laci.tfa  -osf fasta  -auto
>LACI_ECOLI P03023 lactose repressor
MKPVTLYDVAEYAGV

% seqret   laci.tfa   -sf gcg   -auto
MKPVTLYDVAEYAGV

More on ACD

This test file uses other data types and shows how dependencies are handled in ACD:

appl: ajtest

sequence: seq [para=1]

bool: test [def=y]

float: fval [req=y def=2.5 max=100.0]

int: aval [req=y def=10 max=$seq.length]

int: bval [req=y def=$aval max=$aval.max]

outfile: out [para=2]
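
One possible way to handle such dependencies is to keep every value as a string and follow "$name" references only when a value is needed. The sketch below is an assumption about the general idea, not the AJAX implementation; DemoValue and demoResolve are hypothetical names, and attribute references such as "$aval.max" are treated simply as table keys.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One stored value, kept as a string until it is used. */
typedef struct DemoValue {
    const char *name;   /* e.g. "aval" or "aval.max" */
    const char *text;   /* a literal such as "10", or a reference such as "$aval" */
} DemoValue;

/* Follow "$name" references until a literal is found.
   Returns NULL for an unknown name, or when more hops are needed than
   the table has entries (which can only happen if references loop). */
const char *demoResolve(const DemoValue *tab, size_t n, const char *text)
{
    size_t hops = 0;

    assert(tab != NULL || n == 0);
    while (text != NULL && text[0] == '$') {
        const char *next = NULL;
        size_t i;

        if (hops++ > n) return NULL;
        for (i = 0; i < n; i++) {
            if (strcmp(tab[i].name, text + 1) == 0) {
                next = tab[i].text;
                break;
            }
        }
        text = next;
    }
    return text;
}
```

With a table holding "aval" = "10", "aval.max" = "$seq.length" and a sequence length supplied by the caller, resolving "$aval.max" yields the sequence length, which mirrors how bval depends on aval in the test file above.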

EMBOSS: Sequences

Uniform Sequence Address

URL-style sequence naming

database : entryname

embl : paamir

SW : AMIR_PSEAE

format :: filename

fasta :: /nfs/users/pmr/paamir.fa

gcg :: paamir.em_ba

format :: filename : entryname

fasta :: unfinished : cf18b6
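
The three forms above can be split apart with a few string searches. The sketch below is only an illustration: DemoUsa and demoUsaParse are hypothetical names, the buffer sizes are arbitrary, and the input is assumed to be written without the spaces shown on the slide (as in "fasta::laci.tfa"). A string with no colon is assumed to be a plain filename.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical parsed form of a Uniform Sequence Address. */
typedef struct DemoUsa {
    char format[32];     /* e.g. "fasta"; empty if not given */
    char database[32];   /* e.g. "embl"; empty if not given */
    char filename[128];  /* empty for database:entryname */
    char entry[64];      /* entry name; empty if not given */
} DemoUsa;

/* Copy len characters into dst; fails if they do not fit. */
static int demoCopy(char *dst, size_t cap, const char *src, size_t len)
{
    if (len >= cap) return 0;
    memcpy(dst, src, len);
    dst[len] = '\0';
    return 1;
}

/* Returns 1 on success, 0 if any part is too long for its buffer. */
int demoUsaParse(const char *usa, DemoUsa *out)
{
    const char *sep;

    assert(usa != NULL && out != NULL);
    memset(out, 0, sizeof *out);

    sep = strstr(usa, "::");
    if (sep != NULL) {                       /* format::filename[:entryname] */
        const char *rest = sep + 2;
        const char *colon = strchr(rest, ':');
        size_t flen = colon ? (size_t) (colon - rest) : strlen(rest);

        if (!demoCopy(out->format, sizeof out->format, usa, (size_t) (sep - usa)))
            return 0;
        if (!demoCopy(out->filename, sizeof out->filename, rest, flen))
            return 0;
        if (colon != NULL &&
            !demoCopy(out->entry, sizeof out->entry, colon + 1, strlen(colon + 1)))
            return 0;
        return 1;
    }

    sep = strchr(usa, ':');
    if (sep != NULL) {                       /* database:entryname */
        return demoCopy(out->database, sizeof out->database, usa, (size_t) (sep - usa))
            && demoCopy(out->entry, sizeof out->entry, sep + 1, strlen(sep + 1));
    }

    /* assumption: anything else is a plain filename */
    return demoCopy(out->filename, sizeof out->filename, usa, strlen(usa));
}
```

Looking for "::" before ":" keeps the format and database forms apart, since both use a colon as separator.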

Sequence Qualifiers

For any sequence, extra command line qualifiers are available. For example:

-sformat=fasta: input sequence format
-sbegin=1: first base to be used
-send=999: last base to be used
-[no]sreverse: reverse complement for DNA
-supper: convert to upper case
-slower: convert to lower case
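
To make the effect of these qualifiers concrete, the sketch below applies them to a sequence held as a plain C string. It is an assumption about the behaviour, not EMBOSS code: DemoSeqOpts and demoSeqApply are hypothetical names, out-of-range positions are clipped, the range is selected before reversing, and only a, c, g and t are complemented.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for the qualifier values. */
typedef struct DemoSeqOpts {
    long sbegin;    /* first base to use, counting from 1 */
    long send;      /* last base to use; clipped to the sequence length */
    int  sreverse;  /* non-zero: reverse complement */
    int  supper;    /* non-zero: convert to upper case */
    int  slower;    /* non-zero: convert to lower case */
} DemoSeqOpts;

/* Complement one DNA base; other characters pass through unchanged. */
static char demoComplement(char c)
{
    switch (c) {
    case 'a': return 't';
    case 'c': return 'g';
    case 'g': return 'c';
    case 't': return 'a';
    case 'A': return 'T';
    case 'C': return 'G';
    case 'G': return 'C';
    case 'T': return 'A';
    default:  return c;
    }
}

/* Returns a newly allocated string (caller frees), or NULL if out of memory. */
char *demoSeqApply(const char *seq, const DemoSeqOpts *o)
{
    long len, b, e, n, i;
    char *out;

    assert(seq != NULL && o != NULL);
    len = (long) strlen(seq);
    b = o->sbegin < 1 ? 1 : o->sbegin;
    e = o->send > len ? len : o->send;
    n = e < b ? 0 : e - b + 1;

    out = malloc((size_t) n + 1);
    if (out == NULL) return NULL;

    for (i = 0; i < n; i++) {
        /* read backwards from the end of the range when reversing */
        char c = o->sreverse ? demoComplement(seq[e - 1 - i]) : seq[b - 1 + i];
        if (o->supper)      c = (char) toupper((unsigned char) c);
        else if (o->slower) c = (char) tolower((unsigned char) c);
        out[i] = c;
    }
    out[n] = '\0';
    return out;
}
```

For example, with sbegin=3, send=999 and the input "acgtTGCA", the range is clipped to bases 3 to 8 and the result is "gtTGCA".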

EMBOSS: Databases

Databases are defined centrally, and (optionally) added by users.

Access methods

SRS

Staden

EMBL CDrom

HTTP

other ...

Definition fields

The Command Line

Command line syntax is made extremely flexible.
ACD files allow EMBOSS to find qualifiers by name.
These examples use seqret to read file laci.tfa starting at base 25:

seqret fasta::laci.tfa -sbegin 25
seqret laci.tfa -sb 25 -sf fasta
seqret -sb1 25 -sf1 fasta laci.tfa
seqret -sbegin=25 laci.tfa -sformat=fasta
seqret sbegin=25 laci.tfa sf=fasta
seqret /SBEGIN=25 /SEQ=laci.tfa /SF=FASTA
This seems rather confusing, but only because there is no enforced standard.
There will be a recommended syntax in the first full release.
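
One way such flexible matching could work is to normalise each command line word and then look for a unique prefix among the known qualifier names. The sketch below is an assumption for illustration only; demoQualMatch is a hypothetical function, the list of names is supplied by the caller, and extracting the value after "=" is left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Match one command line word against the known qualifier names.
   Accepts an optional leading '-' or '/', any letter case, an optional
   "=value" part and up to three trailing digits (as in "-sb1").
   Returns the index of the matching name, -1 if nothing matches,
   or -2 if the abbreviation is ambiguous. The trailing number, if any,
   is stored in *num (0 when absent). */
int demoQualMatch(const char *token, const char *const names[], size_t n, int *num)
{
    char buf[64];
    size_t len = 0, digits = 0, i;
    int found = -1, hits = 0, mult = 1;

    assert(token != NULL && num != NULL);
    *num = 0;

    if (*token == '-' || *token == '/') token++;

    /* copy the name part in lower case, stopping at '=' */
    while (token[len] != '\0' && token[len] != '=' && len < sizeof buf - 1) {
        buf[len] = (char) tolower((unsigned char) token[len]);
        len++;
    }

    /* peel off a trailing number such as the 1 in "sb1" */
    while (len > 0 && digits < 3 && isdigit((unsigned char) buf[len - 1])) {
        *num += (buf[len - 1] - '0') * mult;
        mult *= 10;
        len--;
        digits++;
    }
    buf[len] = '\0';
    if (len == 0) return -1;

    for (i = 0; i < n; i++) {
        if (strncmp(names[i], buf, len) == 0) {
            if (names[i][len] == '\0') return (int) i;   /* exact name wins */
            found = (int) i;
            hits++;
        }
    }
    if (hits == 0) return -1;
    return hits == 1 ? found : -2;
}
```

Given the names sequence, sbegin, send, sformat, sreverse, supper and slower, the words "-sb", "-sbegin=25" and "/SBEGIN=25" all select sbegin, while "-s" is reported as ambiguous.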

ACD Processing

A single call (embInit) handles:
Parsing the ACD file
Parsing the command line
Prompting the user
Validation
Opening files
Reading sequences etc.
All values are treated as strings until they are used.
All sequences are read, and passed to applications as sequence objects.
No further interaction with the user is expected.

Extending ACD Types

Define attributes
(e.g. sequence type)
Define qualifiers
(e.g. -sformat)
Write an AcdSet function:
1. pick up qualifier values
2. pick up attribute values
3. set default value
4. prompt if needed
5. validate (e.g. read sequence)
6. further prompts if needed
7. set special attributes
Write an AcdGet function
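
The steps above can be sketched for a simple integer type. This is a simplified illustration under stated assumptions, not the AJAX code: DemoAcdInt, DemoPromptFn, demoAcdSetInt and demoAcdGetInt are hypothetical names, the prompt is passed in as a callback, and steps 6 and 7 (further prompts, special attributes) are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record for one integer ACD item. */
typedef struct DemoAcdInt {
    const char *name;       /* qualifier name, e.g. "minrep" */
    const char *qualText;   /* value given on the command line, or NULL */
    const char *defText;    /* "def" attribute, or NULL */
    long min, max;          /* "min" and "max" attributes */
    long value;             /* validated value */
    int  isSet;             /* non-zero once value is valid */
} DemoAcdInt;

/* Callback that asks the user for a value; may return NULL. */
typedef const char *(*DemoPromptFn)(const char *name);

/* "Set" function: returns 1 if a valid value was stored, 0 otherwise. */
int demoAcdSetInt(DemoAcdInt *item, DemoPromptFn prompt)
{
    const char *text;
    char *end;
    long v;

    assert(item != NULL);
    text = item->qualText;                        /* 1. qualifier value */
    if (text == NULL) text = item->defText;       /* 2-3. attributes, default */
    if (text == NULL && prompt != NULL)
        text = prompt(item->name);                /* 4. prompt if needed */
    if (text == NULL) return 0;

    v = strtol(text, &end, 10);                   /* 5. validate */
    if (end == text || *end != '\0') return 0;    /* not a whole number */
    if (v < item->min || v > item->max) return 0; /* outside min..max */

    item->value = v;
    item->isSet = 1;
    return 1;
}

/* "Get" function: hands the validated value to the application. */
long demoAcdGetInt(const DemoAcdInt *item)
{
    assert(item != NULL && item->isSet);
    return item->value;
}
```

Keeping all prompting and validation in the set function means the get function can stay trivial, which matches the idea that no further interaction with the user is expected once the application has its values.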

Another application

"tandem" reads in a sequence and identifies simple tandem repeats.

The EMBOSS version has this ACD file:

appl: tandem
 sequence: sequence [param=1]
 outfile:  outfile  [param=2]
 int: minrep [req=y def=2 min=2 max=10]
 int: maxrep [req=y def=$minrep
              min=$minrep max=10]
 int: threshold [def=20]
 bool: mismatch [ ]
 bool: uniform  [ ]

Another application (2)

But the original program has this interface:

Usage: tandem {options} file nmin nmax
 options: -T n threshold, default 20
          -N   treat N's as mismatches
          -U   allow uniform consensus

The following ACD file works this way, without changing the program's source code:

appl: tandem

  sequence: sequence [ param=1 ]
  outfile: outfile [ ]
  int: minrep [ param=2 def=2 min=2 ]
  int: maxrep [ param=3 def=$minrep ]
  int: t threshold [ def=20 ]
  bool: n mismatch [ ]
  bool: u uniform  [ ]

Interfaces

We would like to fully automate the generation of "external application" definitions for the following set of example Web, GUI and other interfaces:

AppLab and W2H (M. Senger, EBI)
www2gcg (M. Colet, EMBnet Belgium)
SeqPup (D. Gilbert, Indiana)
Staden (R. Staden, LMB)
GMS (EMBnet Netherlands)
ANGIS (EMBnet Australia)

Potential contributors

EMBnet Germany, Italy, France, Netherlands
EMBnet Australia, Russia, Switzerland, Israel
EMBnet Spain
EMBnet Norway
Other "academic" authors
Other common "free" packages

Adding to EMBOSS

All contributions are welcomed
Applications must be linked to the EMBOSS libraries
For now, code should be in ANSI C (other languages will follow)
Preferably under a GPL licence, but ...
The "EMBOSS Associated Applications Directory" covers software contributed with non-GPL licences
Even commercial software could be included one day

EMBASSADIRS

Many developers prefer non-GPL licences
EMBASSADIRS are applications or packages with other licences
They should be free for academic use
They still link to the EMBOSS libraries
They still use EMBOSS ACD files
The applications look exactly like normal EMBOSS applications
Some sites may install only a selection of these extra packages

Project Schedule

Feb-97: Funding application
Nov-97: Funding approved
Apr-98: Coding effort started
Aug-98: First library release for developers to comment on
Sep-98: First developers' workshop for potential contributors in EMBnet
Dec-98: First release of a complete package with preliminary documentation
Jun-99: Second release with full documentation

Project Coordination

Central coordination is by Peter Rice and his group at the Sanger Centre
The AJAX library will be coordinated by Alan Bleasby at SEQNET (the UK EMBnet node)
HGMP in Hinxton have contributed to library development
Thure Etzold (EBI) collaborates on integration with SRS
Richard Durbin (Sanger) and Jean Thierry-Mieg (Montpellier) collaborate on integration with ACEDB
Many of the packages integrated with EMBOSS will remain the responsibility of their original authors

Funding

EMBOSS is funded by the Wellcome Trust on a 3-year research grant. This covers 2 staff positions at the Sanger Centre.
EMBnet national service nodes have their own budgets for software development, and are expected to make major code contributions.
Specific application areas can become projects in their own right.
Additional funding is always welcome.

Technical Support

All EMBOSS code will be fully tested and documented.
EMBnet provides a Technical Manager support group with expertise in software installation and maintenance.
EMBnet provides regular bioinformatics training courses.
EMBnet has publications expertise which will be used to produce user documentation to a good standard and at a reasonable price.
The same model has been in use since 1994 for the "Extended GCG" package.