- European
- Molecular
- Biology
- Open
- Software
- Suite
http://emboss.sourceforge.net/
1998
EMBOSS: History
In the beginning there were the
"GCGEMBL utilities"
Then there was the
"Extended GCG Package"
Now there is the
"EMBOSS Suite"
EMBOSS: Aims
- Developing new tools for sequence analysis
- Replacing popular but obsolete EGCG applications
- Integrating with SRS and ACEDB
- Integrating with popular user interface packages
- Integrating with other publicly available packages and tools
- Encouraging developers to use the EMBOSS software libraries
Target Users
Each of the following groups has its own special needs, which EMBOSS
aims to satisfy:
- Sanger Centre genomic sequencing and analysis groups
- EMBnet service providers in 25 countries with over 30,000 users
- Academic users
- Pharmaceutical industry
Integration Issues
- Data formats
- Data locations
- Output formats
- Inter-operability within each package
- Using one application within another package
- Applications as sources of data
- Choice of user interfaces
- Client-Server architectures
Integration Issues (2)
- Local specialist data
- Batch processing
- Automation
- Data mining
- Size of the user community
- Specialization of the user community
- Size of the development team
- Stability of the development team
- Documentation standards
- Training needs and availability
- Technical support
- Specialized local development
Integrating EMBOSS
- Controlled application interface
- Specialized software libraries
- Data objects with extensible features
- Standard coding language
- Planned inter-operability
- Large scale user testing
- Collaborations with developers
- Consultations with end users
EMBOSS: Libraries
1. AJAX
   - General purpose
   - User interface
   - File handling
   - Sequence formats
   - High level graphics
2. NUCLEUS
   - Sequence analysis specific
   - Algorithms and methods
Applications
Target areas for Sanger Centre users:
- EST clustering
- Rapid pattern matching
- Rapid sequence database searches
- Repeat identification
- Nucleotide analysis (e.g. CpG islands)
- Codon usage for small genomes
- Gene identification tools
- Sequence patterns in large data sets
- Protein motif and domain identification
- Presentation tools for publication
EMBOSS: Standards
- All code in ANSI standard C
- Concepts from the ANSI C++ draft
- Support for all common Unix platforms
- Future support for non-Unix platforms
- Choice of sequence formats
- Choice of database formats
- GNU Library GPL licensing of libraries
- GNU GPL licensing of applications
- Site specific customizing
- User interface definitions
Why use ANSI C?
- Portability
- Reuse of legacy code
- Ease of maintenance
- Avoidance of overloading
- Data structures can represent objects
- Can implement using the C++ Standard Library as a model
- Linking with other languages
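The point that data structures can represent objects can be illustrated with a small string type written in plain ANSI C. This is only a sketch: the names `Str`, `str_new`, `str_append` and `str_del` are hypothetical and are not taken from the EMBOSS libraries.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical string "object": data plus functions that act on it. */
typedef struct {
    char  *text;   /* NUL-terminated contents */
    size_t len;    /* number of characters, excluding the NUL */
    size_t cap;    /* bytes allocated for text */
} Str;

/* Constructor: returns a new object holding a copy of init, or NULL. */
Str *str_new(const char *init)
{
    Str *s = malloc(sizeof *s);
    if (s == NULL) return NULL;
    s->len = strlen(init);
    s->cap = s->len + 1;
    s->text = malloc(s->cap);
    if (s->text == NULL) { free(s); return NULL; }
    memcpy(s->text, init, s->cap);
    return s;
}

/* Method: append more to s, growing the buffer when needed.
   Returns 0 on success, -1 if memory could not be allocated. */
int str_append(Str *s, const char *more)
{
    size_t add = strlen(more);
    size_t need = s->len + add + 1;
    if (need > s->cap) {
        size_t cap = s->cap * 2;
        char *p;
        if (cap < need) cap = need;
        p = realloc(s->text, cap);
        if (p == NULL) return -1;
        s->text = p;
        s->cap = cap;
    }
    memcpy(s->text + s->len, more, add + 1);
    s->len += add;
    return 0;
}

/* Destructor: frees the object and clears the caller's pointer. */
void str_del(Str **ps)
{
    if (ps != NULL && *ps != NULL) {
        free((*ps)->text);
        free(*ps);
        *ps = NULL;
    }
}
```

Callers only touch the object through these functions, which is the same discipline a C++ class would enforce, but without overloading.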
Code Documentation
We are developing our own source code documentation standard
- Based on JavaDoc
- Extensions from WISE 2
- Eventual conversion to HTML
- Control by perl scripts
EMBOSS: ACD Files
AJAX Command Definition (ACD) files control all EMBOSS applications.
- Complete user interface definition
- Simple definition syntax
- Flexible command line syntax
- Automatic processing
- Information provided at startup
- Conversion possibilities
EMBOSS: ACD Example
EMBOSS: seqret
- "seqret" is named for the complexity hidden away in the library code
it calls.
- The program simply reads in a sequence and writes it out again.
- The source code is very short:
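#include "emboss.h"

int main (int argc, char* argv[]) {
    AjPSeq seq;          /* input sequence object */
    AjPSeqout seqout;    /* output sequence stream */

    /* Process the ACD file and command line, prompting if needed */
    embInit ("seqret", argc, argv);

    /* Fetch the values defined in the ACD file */
    seq = ajAcdGetSeq ("sequence");
    seqout = ajAcdGetSeqout ("outseq");

    ajSeqWrite (seqout, seq);   /* write the sequence out again */
    ajExit ();
    return 0;
}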
EMBOSS In Action
"seqret" can be run in many ways, for example:
% cat laci.tfa
>LACI_ECOLI P03023 lactose repressor
MKPVTLYDVAEYAGV
% seqret
What sequence : laci.tfa
Write to [stdout] :
MKPVTLYDVAEYAGV
% seqret laci.tfa -auto
MKPVTLYDVAEYAGV
% seqret laci.tfa -osf fasta -auto
>LACI_ECOLI P03023 lactose repressor
MKPVTLYDVAEYAGV
% seqret laci.tfa -sf gcg -auto
MKPVTLYDVAEYAGV
More on ACD
EMBOSS: Sequences
Uniform Sequence Address
URL-style sequence naming
- database:entryname
  - embl:paamir
  - SW:AMIR_PSEAE
- format::filename
  - fasta::/nfs/users/pmr/paamir.fa
  - gcg::paamir.em_ba
- format::filename:entryname
  - fasta::unfinished:cf18b6
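The three address forms listed above can be split apart with a few lines of C. The sketch below is not the AJAX implementation: `SeqAddress` and `parse_address` are hypothetical names, and it assumes that a single colon always means a database entry, whereas a real parser would presumably consult the database definitions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical container for the parts of a sequence address. */
typedef struct {
    char format[32];    /* e.g. "fasta"; empty if not given */
    char source[256];   /* database name or file name */
    char entry[64];     /* entry name; empty if not given */
    int  is_file;       /* 1 if source is a file, 0 if a database */
} SeqAddress;

/* Copy at most dstsize-1 characters of src[0..len) into dst. */
static void copy_part(char *dst, size_t dstsize, const char *src, size_t len)
{
    if (len >= dstsize) len = dstsize - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/* Split "database:entry", "format::file", "format::file:entry" or a
   bare file name. Returns 0 on success, -1 on empty input. */
int parse_address(const char *usa, SeqAddress *out)
{
    const char *sep;

    if (usa == NULL || *usa == '\0') return -1;
    memset(out, 0, sizeof *out);

    sep = strstr(usa, "::");
    if (sep != NULL) {
        /* format::filename, optionally followed by :entryname */
        const char *rest = sep + 2;
        copy_part(out->format, sizeof out->format, usa, (size_t)(sep - usa));
        out->is_file = 1;
        sep = strchr(rest, ':');
        if (sep != NULL) {
            copy_part(out->source, sizeof out->source, rest, (size_t)(sep - rest));
            copy_part(out->entry, sizeof out->entry, sep + 1, strlen(sep + 1));
        } else {
            copy_part(out->source, sizeof out->source, rest, strlen(rest));
        }
        return 0;
    }

    sep = strchr(usa, ':');
    if (sep != NULL) {
        /* database:entryname (assumption: one colon means a database) */
        copy_part(out->source, sizeof out->source, usa, (size_t)(sep - usa));
        copy_part(out->entry, sizeof out->entry, sep + 1, strlen(sep + 1));
        out->is_file = 0;
    } else {
        /* bare file name, as in "seqret laci.tfa" */
        copy_part(out->source, sizeof out->source, usa, strlen(usa));
        out->is_file = 1;
    }
    return 0;
}
```

Looking for the double colon first keeps the format prefix from being mistaken for a database name.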
Sequence Qualifiers
For any sequence, extra command line qualifiers are available. For example:
- -sformat=fasta : input sequence format
- -sbegin=1 : first base to be used
- -send=999 : last base to be used
- -[no]sreverse : reverse complement for DNA
- -supper : convert to upper case
- -slower : convert to lower case
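The effect of these qualifiers on a DNA sequence can be sketched in C as follows. The `SeqOptions` structure and `apply_options` function are hypothetical stand-ins for whatever the libraries do internally. The sketch assumes 1-based inclusive positions and handles only the four unambiguous bases when complementing.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for the per-sequence qualifiers. */
typedef struct {
    long begin;     /* like -sbegin; values below 1 mean "from the start" */
    long end;       /* like -send; 0 or past the end means "to the end" */
    int  reverse;   /* like -sreverse */
    int  upper;     /* like -supper */
    int  lower;     /* like -slower */
} SeqOptions;

/* Complement one base, preserving case; other characters pass through. */
static char complement_base(char b)
{
    switch (b) {
    case 'A': return 'T';  case 'a': return 't';
    case 'C': return 'G';  case 'c': return 'g';
    case 'G': return 'C';  case 'g': return 'c';
    case 'T': return 'A';  case 't': return 'a';
    default:  return b;
    }
}

/* Returns a newly allocated copy of the selected range with the
   options applied, or NULL if memory runs out. Caller frees. */
char *apply_options(const char *seq, const SeqOptions *opt)
{
    long len = (long) strlen(seq);
    long begin = opt->begin < 1 ? 1 : opt->begin;
    long end = (opt->end < 1 || opt->end > len) ? len : opt->end;
    long n = begin > end ? 0 : end - begin + 1;
    long i;
    char *out = malloc((size_t) n + 1);

    if (out == NULL) return NULL;
    for (i = 0; i < n; i++) {
        char c;
        if (opt->reverse)
            c = complement_base(seq[end - 1 - i]);   /* walk backwards */
        else
            c = seq[begin - 1 + i];
        if (opt->upper)
            c = (char) toupper((unsigned char) c);
        else if (opt->lower)
            c = (char) tolower((unsigned char) c);
        out[i] = c;
    }
    out[n] = '\0';
    return out;
}
```

Because the range is selected before the reverse complement is taken, the begin and end positions always refer to the sequence as it was read.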
EMBOSS: Databases
Databases are defined centrally, and (optionally) added by users.
- Access methods
  - SRS
  - Staden
  - EMBL CDrom
  - HTTP
  - other ...
- Definition fields
  - Directory
  - Filename(s)
  - Database alias
  - URL
  - Sequence format
The Command Line
- Command line syntax is made extremely flexible.
- ACD files allow EMBOSS to find qualifiers by name.
- These examples use seqret to read file laci.tfa starting at base 25:
seqret fasta::laci.tfa -sbegin 25
seqret laci.tfa -sb 25 -sf fasta
seqret -sb1 25 -sf1 fasta laci.tfa
seqret -sbegin=25 laci.tfa -sformat=fasta
seqret sbegin=25 laci.tfa sf=fasta
seqret /SBEGIN=25 /SEQ=laci.tfa /SF=FASTA
- This seems rather confusing, but only because there is no enforced
standard.
- There will be a recommended syntax in the first full release.
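Several of the forms above rely on abbreviated, case-insensitive qualifier names (-sb for sbegin, /SF for sformat). One way such matching could work is a unique-prefix search over the names known from the ACD file. The sketch below is an assumption about the approach, not the AJAX code: `match_qualifier` is a hypothetical function, and it ignores the numbered forms such as -sb1.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive test: do the first n characters of abbrev form a
   prefix of name? */
static int is_prefix(const char *abbrev, size_t n, const char *name)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (name[i] == '\0') return 0;
        if (tolower((unsigned char) abbrev[i]) !=
            tolower((unsigned char) name[i])) return 0;
    }
    return 1;
}

/* Find the qualifier that arg names or abbreviates. A leading '-' or
   '/' is skipped, and *value is set to the text after '=' (or NULL).
   Returns the index into names, -1 if nothing matches, or -2 if the
   abbreviation is ambiguous. */
int match_qualifier(const char *arg, const char *const names[],
                    size_t count, const char **value)
{
    const char *eq;
    size_t n, i;
    int found = -1;
    int ambiguous = 0;

    if (*arg == '-' || *arg == '/') arg++;
    eq = strchr(arg, '=');
    n = eq != NULL ? (size_t)(eq - arg) : strlen(arg);
    *value = eq != NULL ? eq + 1 : NULL;
    if (n == 0) return -1;

    for (i = 0; i < count; i++) {
        if (!is_prefix(arg, n, names[i])) continue;
        if (names[i][n] == '\0') return (int) i;   /* exact name wins */
        if (found >= 0) ambiguous = 1;
        else found = (int) i;
    }
    return ambiguous ? -2 : found;
}
```

With the names sbegin, send, sformat and sreverse, the argument -sb resolves to sbegin, while -s alone is reported as ambiguous.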
ACD Processing
- A single call (embInit) handles:
  - Parsing the ACD file
  - Parsing the command line
  - Prompting the user
  - Validation
  - Opening files
  - Reading sequences etc.
- All values are treated as strings until they are used.
- All sequences are read, and passed to applications as sequence objects.
- No further interaction with the user is expected.
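Treating values as strings until they are used means that conversion and its error handling happen at the point of retrieval. A minimal sketch of that idea follows; `StoredValue` and `get_int_value` are hypothetical names, not part of the ACD code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical name/value pair kept exactly as text. */
typedef struct {
    const char *name;
    const char *text;   /* value as typed or defaulted */
} StoredValue;

/* Look up name and convert its text to an integer only now, at the
   point of use. Returns 0 on success, -1 if the name is missing or
   the text is not a whole number. */
int get_int_value(const StoredValue *values, size_t count,
                  const char *name, long *out)
{
    size_t i;

    for (i = 0; i < count; i++) {
        char *end;
        long n;

        if (strcmp(values[i].name, name) != 0) continue;
        if (values[i].text == NULL || values[i].text[0] == '\0') return -1;
        errno = 0;
        n = strtol(values[i].text, &end, 10);
        if (errno != 0 || *end != '\0') return -1;   /* overflow or junk */
        *out = n;
        return 0;
    }
    return -1;
}
```

Keeping the text form around also means the same stored value can be echoed back in a prompt or an error message unchanged.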
Extending ACD Types
- Define attributes (e.g. sequence type)
- Define qualifiers (e.g. -sformat)
- Write an AcdSet function:
  - pick up qualifier values
  - pick up attribute values
  - set default value
  - prompt if needed
  - validate (e.g. read sequence)
  - further prompts if needed
  - set special attributes
- Write an AcdGet function
Another application
Another application (2)
- But the original program has this interface:
Usage: tandem {options} file nmin nmax
options: -T n threshold, default 20
-N treat N's as mismatches
-U allow uniform consensus
- The following ACD file works this way, without changing the program's source
code:
appl: tandem
sequence: sequence [ param=1 ]
outfile: outfile [ ]
int: minrep [ param=2 def=2 min=2 ]
int: maxrep [ param=3 def=$minrep ]
int: t threshold [ def=20 ]
bool: n mismatch [ ]
bool: u uniform [ ]
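In this ACD file the default for maxrep is written as $minrep, which reads as "use whatever value minrep ends up with". The sketch below shows one way such a reference could be resolved. It is an assumption about the behaviour rather than the ACD implementation: `Field` and `resolve_value` are hypothetical, and limits such as min=2 are not checked.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record for one ACD definition. */
typedef struct {
    const char *name;
    const char *def;     /* default text, may be "$othername", or NULL */
    const char *given;   /* value from the command line, or NULL */
} Field;

/* Return the field called name, or NULL if there is none. */
static const Field *find_field(const Field *fields, size_t count,
                               const char *name)
{
    size_t i;
    for (i = 0; i < count; i++)
        if (strcmp(fields[i].name, name) == 0) return &fields[i];
    return NULL;
}

/* Return the effective text for name: the given value if present,
   otherwise the default, following "$other" references. Returns NULL
   for unknown names, missing defaults or circular references. */
const char *resolve_value(const Field *fields, size_t count,
                          const char *name)
{
    size_t hops;

    /* More than count hops can only happen if references form a loop. */
    for (hops = 0; hops <= count; hops++) {
        const Field *f = find_field(fields, count, name);
        if (f == NULL) return NULL;
        if (f->given != NULL) return f->given;
        if (f->def == NULL) return NULL;
        if (f->def[0] != '$') return f->def;
        name = f->def + 1;   /* follow the reference */
    }
    return NULL;
}
```

Under these assumptions, running tandem with only minrep set to 5 would give maxrep the value 5 as well, while an explicit maxrep on the command line would take precedence.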
Interfaces
We would like to fully automate the generation of "external application"
definitions for the following set of example Web, GUI and other interfaces:
- AppLab and W2H (M. Senger, EBI)
- www2gcg (M. Colet, EMBnet Belgium)
- SeqPup (D. Gilbert, Indiana)
- Staden (R. Staden, LMB)
- GMS (EMBnet Netherlands)
- ANGIS (EMBnet Australia)
Potential contributors
- EMBnet Germany, Italy, France, Netherlands
- EMBnet Australia, Russia, Switzerland, Israel
- EMBnet Spain
- EMBnet Norway
- Other "academic" authors
- Other common "free" packages
Adding to EMBOSS
- All contributions are welcomed
- Applications must be linked to the EMBOSS libraries
- For now, code should be in ANSI C (other languages will follow)
- Preferably under a GPL licence, but ...
- The "EMBOSS Associated Applications Directory" covers software contributed
with non-GPL licences
- Even commercial software could be included one day
EMBASSADIRS
- Many developers prefer non-GPL licences
- EMBASSADIRS are applications or packages with other licences
- They should be free for academic use
- They still link to the EMBOSS libraries
- They still use EMBOSS ACD files
- The applications look exactly like normal EMBOSS applications
- Some sites may install only a selection of these extra packages
Project Schedule
- Feb-97: Funding application
- Nov-97: Funding approved
- Apr-98: Coding effort started
- Aug-98: First library release for developers to comment on
- Sep-98: First developers' workshop for potential contributors in EMBnet
- Dec-98: First release of a complete package with preliminary documentation
- Jun-99: Second release with full documentation
Project Coordination
- Central coordination is by Peter Rice and his group at the Sanger Centre
- The AJAX library will be coordinated by Alan Bleasby at SEQNET (the UK EMBnet
node)
- HGMP in Hinxton have contributed to library development
- Thure Etzold (EBI) collaborates on integration with SRS
- Richard Durbin (Sanger) and Jean Thierry-Mieg (Montpellier) collaborate on
integration with ACEDB
- Many of the packages integrated with EMBOSS will remain the responsibility of
their original authors
Funding
- EMBOSS is funded by the Wellcome Trust on a 3-year research grant, which
covers 2 staff positions at the Sanger Centre.
- EMBnet national service nodes have their own budgets for software
development, and are expected to make major code contributions.
- Specific application areas can become projects in their own right.
- Additional funding is always welcome.
Technical Support
- All EMBOSS code will be fully tested and documented.
- EMBnet provides a Technical Manager support group with expertise in software
installation and maintenance.
- EMBnet provides regular bioinformatics training courses.
- EMBnet has publications expertise which will be used to produce user
documentation to a good standard and at a reasonable price.
- The same model has been in use since 1994 for the "Extended GCG" package.