  • European

  • Molecular

  • Biology

  • Open

  • Software

  • Suite

http://emboss.sourceforge.net/

1998


EMBOSS: History

In the beginning there were the

"GCGEMBL utilities"

Then there was the

"Extended GCG Package"

Now there is the

"EMBOSS Suite"


EMBOSS: Aims

  • Developing new tools for sequence analysis

  • Replacing popular but obsolete EGCG applications

  • Integrating with SRS and ACEDB

  • Integrating with popular user interface packages

  • Integrating with other publicly available packages and tools

  • Encouraging developers to use the EMBOSS software libraries


Target Users

Each of the following groups has its own special needs, which EMBOSS aims to satisfy:

  • Sanger Centre genomic sequencing and analysis groups

  • EMBnet service providers in 25 countries with over 30,000 users

  • Academic users

  • Pharmaceutical industry


Integration Issues

  • Data formats

  • Data locations

  • Output formats

  • Inter-operability within each package

  • Using one application within another package

  • Applications as sources of data

  • Choice of user interfaces

  • Client-Server architectures


Integration Issues (2)

  • Local specialist data

  • Batch processing

  • Automation

  • Data mining

  • Size of the user community

  • Specialization of the user community

  • Size of the development team

  • Stability of the development team

  • Documentation standards

  • Training needs and availability

  • Technical support

  • Specialized local development


Integrating EMBOSS

  • Controlled application interface

  • Specialized software libraries

  • Data objects with extensible features

  • Standard coding language

  • Planned inter-operability

  • Large scale user testing

  • Collaborations with developers

  • Consultations with end users



EMBOSS: Libraries

1. AJAX

  • General purpose

  • User interface

  • File handling

  • Sequence formats

  • High level graphics

2. NUCLEUS

  • Sequence analysis specific

  • Algorithms and methods


Applications

Target areas for Sanger Centre users:

  • EST clustering

  • Rapid pattern matching

  • Rapid sequence database searches

  • Repeat identification

  • Nucleotide analysis (e.g. CpG islands)

  • Codon usage for small genomes

  • Gene identification tools

  • Sequence patterns in large data sets

  • Protein motif and domain identification

  • Presentation tools for publication


EMBOSS: Standards

  • All code in ANSI standard C

  • Concepts from the ANSI C++ draft

  • Support for all common Unix platforms

  • Future support for non-Unix platforms

  • Choice of sequence formats

  • Choice of database formats

  • GNU Library GPL licensing of libraries

  • GNU GPL licensing of applications

  • Site specific customizing

  • User interface definitions


Why use ANSI C?

  • Portability

  • Reuse of legacy code

  • Ease of maintenance

  • Avoidance of overloading

  • Data structures can represent objects

  • Can implement using the C++ Standard Library as a model

  • Linking with other languages
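The point that data structures can represent objects can be illustrated with a minimal sketch in ANSI C. The type `DemoStr` and the functions `demoStrNew`, `demoStrAppend` and `demoStrDel` are hypothetical names invented for this example; they are not the AJAX library's own string type or API.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical string "object": the data plus its bookkeeping, meant to
   be accessed only through the functions below. */
typedef struct DemoStr {
    char  *text;   /* NUL-terminated contents */
    size_t len;    /* current length, excluding the NUL */
    size_t cap;    /* allocated size of text */
} DemoStr;

/* Constructor: returns a new object holding a copy of src, or NULL. */
DemoStr *demoStrNew(const char *src)
{
    DemoStr *s = malloc(sizeof *s);

    if (s == NULL) return NULL;
    s->len = strlen(src);
    s->cap = s->len + 1;
    s->text = malloc(s->cap);
    if (s->text == NULL) {
        free(s);
        return NULL;
    }
    memcpy(s->text, src, s->cap);
    return s;
}

/* "Method": append src, growing the buffer if needed.
   Returns 1 on success, 0 if memory could not be allocated. */
int demoStrAppend(DemoStr *s, const char *src)
{
    size_t add = strlen(src);

    if (s->len + add + 1 > s->cap) {
        size_t newcap = (s->len + add + 1) * 2;
        char *p = realloc(s->text, newcap);

        if (p == NULL) return 0;
        s->text = p;
        s->cap = newcap;
    }
    memcpy(s->text + s->len, src, add + 1);   /* copies the NUL too */
    s->len += add;
    return 1;
}

/* Destructor: frees the object and clears the caller's pointer. */
void demoStrDel(DemoStr **ps)
{
    if (ps == NULL || *ps == NULL) return;
    free((*ps)->text);
    free(*ps);
    *ps = NULL;
}
```

A caller would write `DemoStr *s = demoStrNew("MKPV"); demoStrAppend(s, "TLYD"); demoStrDel(&s);`. Because every operation has its own distinct function name, no overloading is needed.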


Code Documentation

We are developing our own source code documentation standard

  • Based on JavaDoc

  • Extensions from WISE 2

  • Eventual conversion to HTML

  • Control by perl scripts


EMBOSS: ACD Files

AJAX Command Definition (ACD) files control all EMBOSS applications.

  • Complete user interface definition

  • Simple definition syntax

  • Flexible command line syntax

  • Automatic processing

  • Information provided at startup

  • Conversion possibilities


EMBOSS: ACD Example

  • The file "seqret.acd" defines the application "seqret":

    application: seqret
    
    sequence: sequence [ param: 1
        info: "Input sequence" ]
    
    seqout:   outseq   [ param: 2
        info: "Write to" ]
    

EMBOSS: seqret

  • "seqret" gets its name from the hidden library code involved in its complexity.

  • The program simply reads in a sequence and writes it out again.

  • The source code is very short:

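    #include "emboss.h"

    int main (int argc, char* argv[]) {

      AjPSeq seq;          /* input sequence object */
      AjPSeqout seqout;    /* output sequence stream */

      /* Process seqret.acd and the command line */
      embInit ("seqret", argc, argv);

      /* Fetch the values by the names used in the ACD file */
      seq = ajAcdGetSeq ("sequence");
      seqout = ajAcdGetSeqout ("outseq");

      ajSeqWrite (seqout, seq);

      ajExit ();
      return 0;            /* main is declared to return int */
    }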
    

EMBOSS In Action

"seqret" can be run in many ways, for example:

% cat laci.tfa
>LACI_ECOLI P03023 lactose repressor
MKPVTLYDVAEYAGV

% seqret
What sequence : laci.tfa
Write to [stdout] :
MKPVTLYDVAEYAGV

% seqret   laci.tfa   -auto
MKPVTLYDVAEYAGV

% seqret  laci.tfa  -osf fasta  -auto
>LACI_ECOLI P03023 lactose repressor
MKPVTLYDVAEYAGV

% seqret   laci.tfa   -sf gcg   -auto
MKPVTLYDVAEYAGV


More on ACD

  • This test file uses other data types and shows how dependencies are handled in ACD:

    appl: ajtest
    
    sequence: seq [para=1]
    
    bool: test [def=y]
    
    float: fval [req=y def=2.5 max=100.0]
    
    int: aval [req=y def=10 max=$seq.length]
    
    int: bval [req=y def=$aval max=$aval.max]
    
    outfile: out [para=2]
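One way to picture the dependency handling is a lookup of `$name` references against values that have already been resolved. The sketch below assumes exactly that and nothing more: `DemoVar` and `demoResolve` are hypothetical names, and the real ACD processing in AJAX may work quite differently.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical table entry: an already resolved value such as "aval"
   or "seq.length". Not a real AJAX data structure. */
typedef struct DemoVar {
    const char *name;
    double      value;
} DemoVar;

/* Resolve an attribute expression: "$name" is looked up in vars,
   anything else must be a plain number.
   Returns 1 and stores the result in *out on success, 0 otherwise. */
int demoResolve(const char *expr, const DemoVar *vars, int count, double *out)
{
    char *end;
    int i;

    if (expr == NULL || *expr == '\0') return 0;

    if (expr[0] == '$') {
        for (i = 0; i < count; i++) {
            if (strcmp(vars[i].name, expr + 1) == 0) {
                *out = vars[i].value;
                return 1;
            }
        }
        return 0;               /* reference to an unknown value */
    }

    *out = strtod(expr, &end);
    return *end == '\0';        /* reject trailing junk such as "10x" */
}
```

With a table holding `seq.length`, `aval` and `aval.max`, the attributes `def=$aval` and `max=$aval.max` of `bval` resolve by lookup, while `def=2.5` is read as a literal number.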
    
    

EMBOSS: Sequences

Uniform Sequence Address

URL-style sequence naming:

  • database:entryname

    embl:paamir

    SW:AMIR_PSEAE

  • format::filename

    fasta::/nfs/users/pmr/paamir.fa

    gcg::paamir.em_ba

  • format::filename:entryname

    fasta::unfinished:cf18b6
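The three address forms above can be told apart by their separators. The following is a minimal sketch of such a split, assuming the address is written without embedded spaces (as in `fasta::laci.tfa`) and that file names contain no colon. `UsaParts` and `usaParse` are hypothetical names, not AJAX functions.

```c
#include <string.h>

#define USA_MAX 256

/* Hypothetical holder for the parts of a sequence address.
   Parts that are absent are left as empty strings. */
typedef struct UsaParts {
    char format[USA_MAX];
    char database[USA_MAX];
    char filename[USA_MAX];
    char entry[USA_MAX];
} UsaParts;

/* Copy the first n characters of src into dst (capacity USA_MAX),
   truncating if necessary. */
static void copyPart(char *dst, const char *src, size_t n)
{
    if (n >= USA_MAX) n = USA_MAX - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';
}

/* Split usa into its parts. Returns 1 on success, 0 if usa is empty. */
int usaParse(const char *usa, UsaParts *out)
{
    const char *rest = usa;
    const char *sep;

    memset(out, 0, sizeof *out);
    if (usa == NULL || *usa == '\0') return 0;

    /* "format::..." : everything before the double colon is the format */
    sep = strstr(usa, "::");
    if (sep != NULL) {
        copyPart(out->format, usa, (size_t) (sep - usa));
        rest = sep + 2;
    }

    sep = strchr(rest, ':');
    if (sep == NULL) {
        /* no further colon: a plain file name */
        copyPart(out->filename, rest, strlen(rest));
    } else if (out->format[0] != '\0') {
        /* format::filename:entryname */
        copyPart(out->filename, rest, (size_t) (sep - rest));
        copyPart(out->entry, sep + 1, strlen(sep + 1));
    } else {
        /* database:entryname */
        copyPart(out->database, rest, (size_t) (sep - rest));
        copyPart(out->entry, sep + 1, strlen(sep + 1));
    }
    return 1;
}
```

In this sketch a single colon with no format prefix is always read as database:entryname; a real implementation would also need to check whether the name is a defined database.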


Sequence Qualifiers

For any sequence, extra command line qualifiers are available. For example:

    -sformat=fasta    input sequence format
    -sbegin=1         first base to be used
    -send=999         last base to be used
    -[no]sreverse     reverse complement for DNA
    -supper           convert to upper case
    -slower           convert to lower case
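As an illustration of what the range, reverse and case qualifiers do to a DNA sequence, here is a minimal sketch. It assumes that positions are 1-based and inclusive, that they refer to the forward strand, and that an out-of-range request is rejected; the slides do not state how EMBOSS treats these cases. `seqSelect` is a hypothetical helper, not an AJAX function.

```c
#include <ctype.h>
#include <string.h>

/* Complement of one nucleotide, preserving case. Other characters
   (for example N) are returned unchanged. */
static char complementBase(char c)
{
    switch (c) {
    case 'A': return 'T';
    case 'a': return 't';
    case 'T': return 'A';
    case 't': return 'a';
    case 'C': return 'G';
    case 'c': return 'g';
    case 'G': return 'C';
    case 'g': return 'c';
    default:  return c;
    }
}

/* Copy bases begin..end (1-based, inclusive) of seq into out,
   optionally reverse-complemented and case-converted.
   ccase: 'u' for upper case, 'l' for lower case, anything else
   leaves the case alone.
   Returns the number of bases written, or -1 if the range is invalid
   or out is too small. */
long seqSelect(const char *seq, long begin, long end, int reverse,
               char ccase, char *out, size_t outsize)
{
    long len = (long) strlen(seq);
    long n, i;

    if (begin < 1 || end > len || begin > end) return -1;
    n = end - begin + 1;
    if ((size_t) n + 1 > outsize) return -1;

    for (i = 0; i < n; i++) {
        char c = reverse ? complementBase(seq[end - 1 - i])
                         : seq[begin - 1 + i];

        if (ccase == 'u') c = (char) toupper((unsigned char) c);
        else if (ccase == 'l') c = (char) tolower((unsigned char) c);
        out[i] = c;
    }
    out[n] = '\0';
    return n;
}
```

For the sequence `AAACCG`, selecting bases 1 to 6 with the reverse option yields `CGGTTT`.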

EMBOSS: Databases

Databases are defined centrally, and (optionally) added by users.

Access methods:

  • SRS

  • Staden

  • EMBL CDrom

  • HTTP

  • other ...

Definition fields:

  • Directory

  • Filename(s)

  • Database alias

  • URL

  • Sequence format


The Command Line

  • Command line syntax is made extremely flexible.

  • ACD files allow EMBOSS to find qualifiers by name.

  • These examples all use seqret to read the file laci.tfa starting at base 25:

    seqret fasta::laci.tfa -sbegin 25

    seqret laci.tfa -sb 25 -sf fasta

    seqret -sb1 25 -sf1 fasta laci.tfa

    seqret -sbegin=25 laci.tfa -sformat=fasta

    seqret sbegin=25 laci.tfa sf=fasta

    seqret /SBEGIN=25 /SEQ=laci.tfa /SF=FASTA

  • This seems rather confusing, but only because there is no enforced standard.

  • There will be a recommended syntax in the first full release.
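The variants above differ only in their prefix ("-", "/" or none), in whether "=" or a space separates name and value, in case, and in how far the name is abbreviated. A minimal sketch of normalising one argument follows. It assumes that abbreviations are resolved as unique prefixes and that a "/" prefix counts as a qualifier only when "=" is present, so that absolute paths are not mistaken for qualifiers. `QualToken`, `qualSplit` and `qualMatch` are hypothetical names, not part of AJAX.

```c
#include <ctype.h>
#include <string.h>

#define QNAME_MAX  64
#define QVALUE_MAX 256

/* Hypothetical result of splitting one command line argument. */
typedef struct QualToken {
    char name[QNAME_MAX];    /* lower-cased name, possibly abbreviated */
    char value[QVALUE_MAX];  /* text after '=', or empty */
    int  index;              /* number after the name (1 in -sb1), else 0 */
} QualToken;

/* Split arg into name, number and value.
   Returns 1 for a qualifier, 0 for a plain parameter such as a file
   name. An empty value means the value, if any, is the next argument. */
int qualSplit(const char *arg, QualToken *tok)
{
    const char *eq = strchr(arg, '=');
    const char *p = arg;
    size_t n = 0;

    tok->name[0] = '\0';
    tok->value[0] = '\0';
    tok->index = 0;

    if (*p == '-') p++;
    else if (*p == '/' && eq != NULL) p++;
    else if (eq == NULL) return 0;

    while (isalpha((unsigned char) *p) && n + 1 < QNAME_MAX)
        tok->name[n++] = (char) tolower((unsigned char) *p++);
    tok->name[n] = '\0';
    if (n == 0) return 0;            /* e.g. "-25" is a value, not a name */

    while (isdigit((unsigned char) *p))
        tok->index = tok->index * 10 + (*p++ - '0');

    if (*p == '=') {
        strncpy(tok->value, p + 1, QVALUE_MAX - 1);
        tok->value[QVALUE_MAX - 1] = '\0';
    } else if (*p != '\0') {
        return 0;                    /* unexpected characters in the name */
    }
    return 1;
}

/* Find the one name that abbrev is a prefix of. An exact match always
   wins. Returns its position in names, or -1 if none or ambiguous. */
int qualMatch(const char *abbrev, const char *const names[], int count)
{
    size_t len = strlen(abbrev);
    int i, found = -1, hits = 0;

    for (i = 0; i < count; i++) {
        if (strncmp(names[i], abbrev, len) != 0) continue;
        if (names[i][len] == '\0') return i;
        found = i;
        hits++;
    }
    return hits == 1 ? found : -1;
}
```

With the names `sequence`, `sbegin`, `send` and `sformat`, the arguments `-sb1`, `/SBEGIN=25` and `sbegin=25` all resolve to `sbegin`, while `se` is rejected as ambiguous.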


ACD Processing

  • A single call (embInit) handles:

    • Parsing the ACD file
    • Parsing the command line
    • Prompting the user
    • Validation
    • Opening files
    • Reading sequences etc.

  • All values are treated as strings until they are used.

  • All sequences are read, and passed to applications as sequence objects.

  • No further interaction with the user is expected.


Extending ACD Types

  • Define attributes

    (e.g. sequence type)

  • Define qualifiers

    (e.g. -sformat)

  • Write an AcdSet function:

    1. pick up qualifier values
    2. pick up attribute values
    3. set default value
    4. prompt if needed
    5. validate (e.g. read sequence)
    6. further prompts if needed
    7. set special attributes

  • Write an AcdGet function
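The order of steps in an AcdSet function can be sketched for a simple integer type. This is only an illustration under stated assumptions: `DemoAcdInt`, `DemoPromptFn` and `demoAcdSetInt` are hypothetical names, the attributes (step 2) are taken to be already stored in the structure, prompting is delegated to a caller-supplied callback, and steps 6 and 7 are left out.

```c
#include <stdlib.h>

/* Hypothetical integer item with its attributes already read (step 2). */
typedef struct DemoAcdInt {
    const char *name;
    int  required;      /* req=y: prompt when no value was given */
    long def;           /* default value */
    long min;           /* smallest valid value */
    long max;           /* largest valid value */
    long value;         /* result, set by demoAcdSetInt */
} DemoAcdInt;

/* Callback that asks the user for a value. Returns the reply, or NULL
   or an empty string to accept the default. */
typedef const char *(*DemoPromptFn)(const char *name, long def);

/* Convert text to a long. Returns 1 on success, 0 on bad input. */
static int parseLong(const char *text, long *out)
{
    char *end;
    long v;

    if (text == NULL || *text == '\0') return 0;
    v = strtol(text, &end, 10);
    if (*end != '\0') return 0;
    *out = v;
    return 1;
}

/* Set item->value from the command line value (may be NULL), the
   default, or a prompt. Returns 1 on success, 0 if the value is not
   an integer or falls outside min..max. */
int demoAcdSetInt(DemoAcdInt *item, const char *cmdValue, DemoPromptFn prompt)
{
    long v = item->def;                              /* step 3: default */

    if (cmdValue != NULL) {                          /* step 1: qualifier */
        if (!parseLong(cmdValue, &v)) return 0;
    } else if (item->required && prompt != NULL) {   /* step 4: prompt */
        const char *reply = prompt(item->name, item->def);

        if (reply != NULL && *reply != '\0' && !parseLong(reply, &v))
            return 0;
    }

    if (v < item->min || v > item->max) return 0;    /* step 5: validate */
    item->value = v;
    return 1;
}
```

The matching AcdGet function would then only need to return `item->value` for a given name.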


Another application

  • "tandem" reads in a sequence and identifies simple tandem repeats.

  • The EMBOSS version has this ACD file:

    appl: tandem
     sequence: sequence [param=1]
     outfile:  outfile  [param=2]
     int: minrep [req=y def=2 min=2 max=10]
     int: maxrep [req=y def=$minrep
                  min=$minrep max=10]
     int: threshold [def=20]
     bool: mismatch [ ]
     bool: uniform  [ ]
    
    

Another application (2)

  • But the original program has this interface:

    Usage: tandem {options} file nmin nmax
     options: -T n threshold, default 20
              -N   treat N's as mismatches
              -U   allow uniform consensus
    
    
  • The following ACD file works this way, without changing the program's source code:

    appl: tandem
    
      sequence: sequence [ param=1 ]
      outfile: outfile [ ]
      int: minrep [ param=2 def=2 min=2 ]
      int: maxrep [ param=3 def=$minrep ]
      int: t threshold [ def=20 ]
      bool: n mismatch [ ]
      bool: u uniform  [ ]
    
    

Interfaces

We would like to fully automate the generation of "external application" definitions for the following set of example Web, GUI and other interfaces:

  • AppLab and W2H (M. Senger, EBI)

  • www2gcg (M. Colet, EMBnet Belgium)

  • SeqPup (D. Gilbert, Indiana)

  • Staden (R. Staden, LMB)

  • GMS (EMBnet Netherlands)

  • ANGIS (EMBnet Australia)


Potential contributors

  • EMBnet Germany, Italy, France, Netherlands

  • EMBnet Australia, Russia, Switzerland, Israel

  • EMBnet Spain

  • EMBnet Norway

  • Other "academic" authors

  • Other common "free" packages


Adding to EMBOSS

  • All contributions are welcomed

  • Applications must be linked to the EMBOSS libraries

  • For now, code should be in ANSI C (other languages will follow)

  • Preferably under a GPL licence, but ...

  • The "EMBOSS Associated Applications Directory" covers software contributed with non-GPL licences

  • Even commercial software could be included one day


EMBASSADIRS

  • Many developers prefer non-GPL licences

  • EMBASSADIRS are applications or packages with other licences

  • They should be free for academic use

  • They still link to the EMBOSS libraries

  • They still use EMBOSS ACD files

  • The applications look exactly like normal EMBOSS applications

  • Some sites may install only a selection of these extra packages


Project Schedule

  • Feb-97: Funding application

  • Nov-97: Funding approved

  • Apr-98: Coding effort started

  • Aug-98: First library release for developers to comment

  • Sep-98: First developers workshop for potential contributors in EMBnet

  • Dec-98: First release of a complete package with preliminary documentation

  • Jun-99: Second release with full documentation


Project Coordination

  • Central coordination is by Peter Rice and his group at the Sanger Centre

  • The AJAX library will be coordinated by Alan Bleasby at SEQNET (the UK EMBnet node)

  • HGMP in Hinxton have contributed to library development

  • Thure Etzold (EBI) collaborates on integration with SRS

  • Richard Durbin (Sanger) and Jean Thierry-Mieg (Montpellier) collaborate on integration with ACEDB

  • Many of the packages integrated with EMBOSS will remain the responsibility of their original authors


Funding

  • EMBOSS is funded by the Wellcome Trust on a 3-year research grant. This covers 2 staff positions at the Sanger Centre.

  • EMBnet national service nodes have their own budgets for software development, and are expected to make major code contributions.

  • Specific application areas can become projects in their own right.

  • Additional funding is always welcome.


Technical Support

  • All EMBOSS code will be fully tested and documented.

  • EMBnet provides a Technical Manager support group with expertise in software installation and maintenance.

  • EMBnet provides regular bioinformatics training courses.

  • EMBnet has publications expertise which will be used to produce user documentation to a good standard and at a reasonable price.

  • The same model has been in use since 1994 for the "Extended GCG" package.