Guide to writing EMBOSS applications

Copyright Alan Bleasby 2000
ajb@ebi.ac.uk

1. INTRODUCTION

This booklet is an introduction to programming EMBOSS and assumes the reader knows, in essence, what the package is and does. The package is available from http://emboss.sourceforge.net and will have been downloaded and installed from there. This guide begins with a general introduction to the internal structure of the package. It then moves on to the command line as a prelude to writing ACD files and finally gets to grips with programming in C per se.

2. PACKAGE STRUCTURE

There are 5 main levels in the EMBOSS hierarchy:

AJAX: This directory contains all low-level library functions. For example, sequence reading and writing, file handling, mathematical operations, string handling, memory handling, list processing etc are all here.

NUCLEUS: This directory contains high-level library functions. These are almost exclusively molecular biology algorithms for alignments, pattern matching, restriction, isoelectric point calculation etc.

PLPLOT: This directory contains the graphical output routines. It is an LGPL library developed outside the EMBOSS project. There are higher-level calls to this library in AJAX. This library may be replaced at some stage; however, the AJAX interface to it should remain the same.

EMBOSS: This directory contains the applications (programs) released with the package. The source code is a useful repository of examples of how to do things although, of course, some examples are better than others; we all have our bad habits.

DOC: This directory, or rather its subdirectories, contains the online documentation for the programs as both html and text, the latter being used by the tfm program.

Some, but not all, of these levels have sublevels. The AJAX and NUCLEUS libraries do not; this makes it easy for the programmer to find things. As for the others, here is a brief description.

emboss/acd: every EMBOSS application has an associated ACD (Ajax Command Definition) file and this directory is where they live. ACD files are described later. So, for the program fred.c the file fred.acd must be added to this directory.

emboss/data: many of the EMBOSS applications have associated data files, for example BLOSUM matrices. This is where such files are kept. Putting files in here has the great advantage of the directory being in the search path the AJAX library uses. Data files for some common molecular biology databases are numerous and these appear in separate directories within the data file hierarchy, so there are directories emboss/data/PROSITE, emboss/data/REBASE and emboss/data/PRINTS. The EMBOSS applications prosextract, rebaseextract and printsextract are used to populate these directories.

plplot/lib: an important run-time directory for the graphics applications. This is where EMBOSS currently looks for any font files. The environment variable PLPLOT_LIB is used to point here.

doc/programs/*: everybody always writes the documentation before even starting on the program. In an ideal world at least. These directories are where the finely crafted prose is kept. The documentation must adhere to the format used by existing applications otherwise the online manual commands, like tfm, won't work.

There are other directories in EMBOSS but the above are the ones the programmer is most directly concerned with.

3. HOW TO ADD AN APPLICATION TO EMBOSS

This is easy: write the program, add the relevant files to some or all of the above directories, alter the Makefile-ish whatzits and you're done. EMBOSS compilation is controlled by GNU utilities and therefore uses GNU-style Makefiles. The golden rule is don't touch the Makefile files. The only "makefile" you need to alter is:

emboss/Makefile.am

This contains a list of the applications supplied in the version of EMBOSS you compiled. If you edit the above file then the GNU utilities will recognise it has changed the next time you type make and recreate the Makefile file, which is the behaviour you want. Before you edit the Makefile.am you'll have put your application wibble.c in the emboss subdirectory. Here is a cut-down version of Makefile.am:

SUBDIRS = acd data

check_PROGRAMS = ajtest ajtest2 ajtest3 ajtestajnam \
cluster \
demofeatures demolist demosequence demostring demotable \
entrails histogramtest messtest \
patmattest plplottest proteinmotifsearch \
seqinfo \
seqretallfeat seqretfeat seqtofeat \
simplesw testplot treetypedisplay 
bin_PROGRAMS = acdc antigenic \
backtranseq banana \
chaos checktrans chips cirdna codcmp complex \
* lines deleted for clarity *
tmap transeq trimseq \
vectorstrip \
water wobble wordcount wordmatch \
wossname


#FLAGS=-lX11


INCLUDES = -I$(top_srcdir)/nucleus -I$(top_srcdir)/ajax -I$(top_srcdir)/plplot


acdc_SOURCES = acdc.c
antigenic_SOURCES = antigenic.c
backtranseq_SOURCES = backtranseq.c
banana_SOURCES = banana.c
chaos_SOURCES = chaos.c
checktrans_SOURCES = checktrans.c
chips_SOURCES = chips.c
* lines deleted for clarity *
trimseq_SOURCES = trimseq.c
vectorstrip_SOURCES = vectorstrip.c
water_SOURCES = water.c
wobble_SOURCES = wobble.c
wordcount_SOURCES = wordcount.c
wordmatch_SOURCES = wordmatch.c
wossname_SOURCES = wossname.c



* lines deleted for clarity *

 

LDADD = ../nucleus/libnucleus.la ../ajax/libajaxg.la ../ajax/libajax.la ../plplot/libplplot.la $(XLIB)


if PURIFY

  LINK = purify $(LIBTOOL) --mode=link $(CC) $(CFLAGS) $(LDFLAGS) -o $@

else


endif

There are two sections which concern us for development: the bin_PROGRAMS line and the _SOURCES lines. The former is the place to put the name of the executable which, in this example, will be wibble and the latter is where the name of the source file wibble.c needs to go. It is recommended that you add these in alphabetical order. The edited file will look like the following:

SUBDIRS = acd data


check_PROGRAMS = ajtest ajtest2 ajtest3 ajtestajnam \
cluster \
demofeatures demolist demosequence demostring demotable \
entrails histogramtest messtest \
patmattest plplottest proteinmotifsearch \
seqinfo \
seqretallfeat seqretfeat seqtofeat \
simplesw testplot treetypedisplay 


bin_PROGRAMS = acdc antigenic \
backtranseq banana \
chaos checktrans chips cirdna codcmp complex \
* lines deleted for clarity *
tmap transeq trimseq \
vectorstrip \
water wibble wobble \
wordcount wordmatch \
wossname


#FLAGS=-lX11


INCLUDES = -I$(top_srcdir)/nucleus -I$(top_srcdir)/ajax -I$(top_srcdir)/plplot


acdc_SOURCES = acdc.c
antigenic_SOURCES = antigenic.c
backtranseq_SOURCES = backtranseq.c
banana_SOURCES = banana.c
chaos_SOURCES = chaos.c
checktrans_SOURCES = checktrans.c
chips_SOURCES = chips.c
* lines deleted for clarity *
trimseq_SOURCES = trimseq.c
vectorstrip_SOURCES = vectorstrip.c
water_SOURCES = water.c
wibble_SOURCES = wibble.c
wobble_SOURCES = wobble.c
wordcount_SOURCES = wordcount.c
wordmatch_SOURCES = wordmatch.c
wossname_SOURCES = wossname.c
* lines deleted for clarity *


LDADD = ../nucleus/libnucleus.la ../ajax/libajaxg.la ../ajax/libajax.la ../plplot/libplplot.la $(XLIB)

if PURIFY

  LINK = purify $(LIBTOOL) --mode=link $(CC) $(CFLAGS) $(LDFLAGS) -o $@

else

endif

There are other interesting things you can do with this file but the above is all you need to know to develop an application. The operation of the file can safely be left as a black box. One gotcha is worth a mention though. The bin_PROGRAMS section is one logical line; it has continuation (\) characters purely for ease of editing and beautification. Strange and wonderful things can happen if you accidentally delete a continuation character or, where one is necessary, forget to add one. On the other hand the _SOURCES section has discrete lines which should not have continuation characters.

4. THE COMMAND LINE

Before describing programming of both ACD and applications it is necessary to describe the EMBOSS command line so you can see how everything fits together. I'll mention this in little detail as all the grisly details are more the scope of how to use the applications rather than how to program one.

The command line is pretty standard. All values that can be passed to an application (i.e. the application parameters) are defined as being either a parameter or a qualifier. Here is an example command line:

restrict embl:ecompa -nosticky results.restrict

The simple difference is that a qualifier is the thing beginning with a hyphen; everything else apart from the program name is a parameter. In the above example embl:ecompa is a sequence to input and results.restrict is the name for an output file. -nosticky happens to tell the program that restriction enzymes producing sticky ends aren't wanted.

Qualifier: these always have a qualifier name which is preceded on the command line by a hyphen. Qualifiers can have associated values which appear immediately after them.

Parameter: these have an optional parameter name (beginning with a hyphen). If this is used they can appear anywhere on the command line. If the optional name isn't given they must appear in the order they are given in the ACD file for the program.
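The hyphen rule can be sketched in a few lines of plain C. This is only an illustration of the distinction drawn above and is not how AJAX parses the command line; the function names token_is_qualifier and count_qualifiers are made up for the example. It makes no attempt to recognise a value that follows a qualifier, which needs the ACD file to decide, and it would misread a negative number as a qualifier.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of AJAX: returns 1 if a command-line
   token looks like a qualifier (it starts with a hyphen), 0 otherwise. */
static int token_is_qualifier(const char *token)
{
    assert(token != NULL);
    return token[0] == '-';
}

/* Counts the tokens in argv[1..argc-1] that look like qualifiers.
   argv[0] is the program name and is skipped. */
static int count_qualifiers(int argc, char **argv)
{
    int i;
    int n = 0;

    for(i = 1; i < argc; ++i)
        if(token_is_qualifier(argv[i]))
            ++n;

    return n;
}
```

Given the tokens of the restrict example, count_qualifiers would report one qualifier (-nosticky), leaving embl:ecompa and results.restrict as the two parameters.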

5. THE ACD FILES

AJAX Command definition files are programmed in the emboss/acd directory. They control all the user input operations. All EMBOSS programs have all their user input prompted for before the main part of the application begins. An EMBOSS application cannot ask the user for more information after several hours of processing!

An important thing to remember about any EMBOSS application is all input is read and held somewhere in memory before the application per se starts with a vengeance.

This document tells you just enough ACD to write most applications. For a full description of ACD please read the ACD syntax document available on the EMBOSS web pages.

5.1 The acdc application

EMBOSS provides the programmer with an application to test their ACD files as they're being written. This application is called acdc, which stands for "ACD compiler". Assuming you are writing the wibble application you'll have a file in the emboss/acd directory called wibble.acd. This acd file can be tested as follows:

acdc wibble

The rule that all input is read in by EMBOSS applies to acdc in that, if you've specified an input sequence, then that sequence must exist otherwise the acdc application will exit with an error. In short, you'll see, using acdc, exactly what the user will see if they type the same thing as you at the prompts.

This means that in EMBOSS you can write the acd before the application itself! This is good programming practice but usually the acd file will be written concurrently with the C source code file.

5.2 The Application Line

All ACD files must start with an appl construction. This construction gives three pieces of information. The first is the name of the application. Then, in the body of the construct, there are doc and groups lines. The doc line provides a string of text which will be printed to stdout whenever the application is run. The groups line associates the application to programs which do similar things or different things in the same general area. This type of information is used by the seealso application. Here is an example:

# AJAX COMMAND DEFINITION (ACD) FILE

# ajb 14th April 1999

appl: restrict [
        doc: "Finds restriction enzyme cleavage sites"
        groups: "DNA:sequence features, restriction enzymes"
]

Empty lines are ignored and lines beginning with '#' are comments. The application is called "restrict" and the string "Finds restriction enzyme cleavage sites" is printed whenever the application is run. The application operates on the subgroup of DNA sequences and the groups it belongs to are "sequence features" and "restriction enzymes". Applications can belong to more than one group; the groups are separated by commas.

You can use either DNA: or PROTEIN: if they are appropriate. If they are not just leave them out and type in the relevant groups. The most recent set of groups can be found on the EMBOSS home page; at the time of writing they are:

Alignment
Alignment global
Alignment local
Alignment multiple
Coding regions
Comparison
CpG islands
Database entry extraction
Database indexing
Database information
Display alignment multiple
DNA sequence composition
DNA sequence features
DNA sequence properties
Enzyme kinetics
Gene finding
Hydropathy
Motifs
Mutation
Pattern matching
Primers
Profiles
Protein sequence composition
Protein sequence features
Protein sequence properties
Reformatting
Repeats
Restriction enzymes
Secondary structure
Sequence comparison
Sequence composition
Sequence display
Sequence editing
Sequence information
Text search
Transcription
Translation
Utilities
Utilities help
Utilities keyword search

If your application doesn't fall into any general group then by all means write in a new group type, but don't make a habit of it.

5.3 General ACD syntax

All ACD definitions have the same general syntax which is:

datatype: name [ optional arguments ]

The datatypes are built in to EMBOSS and include sequence, integer, float types and many more. The name can be anything you like within reason. This will be the name by which qualifiers (and optionally parameters) are known on the command line. The optional arguments allow you to specify information strings, default values, maxima and minima, and more. Any whitespace is ignored. There is, though, an option available to all applications which will print the ACD in the required/desired EMBOSS format. That application switch is called acdpretty.

Within ACD, all application parameters are defined via the appropriate optional argument to be one of "parameter", "standard" or "additional". If none is specified the default of "advanced" is used. Their behaviour is as follows:

parameter: "Y" Specifies this is a parameter. EMBOSS will always prompt for the value if it isn't given on the command line.

On the other hand qualifiers are specified using the following two arguments:

standard: "Y" Specifies this is a standard qualifier. EMBOSS will always prompt for the value if it isn't given on the command line.

additional: "Y" Specifies this is an additional qualifier. EMBOSS will not usually prompt for the value if it isn't given on the command line, which means a default value should be specified in the ACD file. If, however, the application is run with "-option" (an in-built qualifier available to all EMBOSS applications) then values for all additional qualifiers will be prompted for. You should never specify "N" after parameter, standard or additional.

It is, of course, the case that the application should use or test everything from the ACD specification; any unused definitions shouldn't be in the ACD file at all. A default value can be set:

def: 5.6 Which sets the default value of a floating point datatype to 5.6. What follows the default definition depends on the kind of datatype you've defined. Similarly you can use:

max: To specify a maximum value and

min: To specify a minimum value.

All the EMBOSS datatypes have a default prompt associated with them. For specifying the more fundamental datatypes such as sequences and input/output files you should generally let EMBOSS use its defaults. It is good practice with qualifiers to give them some text which will be used as a prompt. This is usually done with

info: "Enter a friggin displacement"

or somesuch.

We can take the simple integers, floats and booleans, as examples:

int: garibaldi [ standard: Y min: -7 max: 38 def: 10 info: "Enter a garibaldi value" ]

float: vir [ additional: Y def: 5.6 info: "Set your vir here" ]

bool: londo [ additional: Y def: Y ]

The first will be prompted for if not given on the command line. The last two won't, unless the application is run with "-option", and will otherwise take their default values (5.6 and true). When using the help option available to all applications the first one will be shown as a standard qualifier and the last two as additional qualifiers. There are many tricks with this sort of thing and some of the more common ones appear below.

5.4 Specifying sequence files

You will usually specify sequences as parameters and you'll want them prompted for if not specified on the command line. Their specification is generally:

sequence: delenn [ param: Y ]

seqout: sheridan [ param: Y]

The first of these specifies that a single input sequence is required, the second that a sequence output file is a prerequisite. The fixed datatypes are "sequence" and "seqout"; the names (which in this case are optional on the command line as these have been specified as parameters) are "delenn" and "sheridan". Whilst this is legal naming it isn't entirely clear. We therefore have a convention that the following definitions are used instead:

sequence: sequence [ param: Y ]

seqout: outseq [param: Y ]

After all, a user will be expecting a sequence to be called "sequence".

Often an application will either allow or require more than one sequence to be specified. In this case, instead of the sequence datatype you should use the seqall datatype. You will frequently see the following definition in EMBOSS ACD files:

seqall: sequence [ param: Y ]

Sequences also have an optional argument called type: which allows you to tell the ACD processing that a DNA or protein must be read. This argument even allows you to specify whether the sequence is allowed to have any ambiguity codes or unknown residues. The common types are:

dna
rna
puredna
purerna
nucleotide
purenucleotide
protein
pureprotein

Others are available for allowing sequences with gaps or any sequence at all. See the ACD syntax document for these.

5.5 Specifying other input and output files

Sequence files are rather special in that, not only does the file need to be opened, one sequence has to be read in before ACD processing finishes, no matter what format that sequence is in or where it is. Sometimes, though, you just need a general input file or a general output file. This can be achieved by the following datatypes:

infile: gkar [ standard: Y ]

outfile: lennier [ standard: Y]

As usual you can give these default values which will be treated as strings or any info prompt you wish.

5.6 Sequences have built in attributes

A common requirement is that some ACD values need to be specified in terms of other values which will not be known until the data (most notably the sequence) is read in. Sequences therefore have attributes. The most commonly used of these are its length, start position and end position. These are accessed using the "$" and "." characters. Suppose that you want an integer to be input which is allowed to have a value which does not exceed the sequence length. You would use something like the following to achieve this:

sequence: zocolo [param: Y ]

integer: marcus [ standard: y min: 1 max: $(zocolo.length) ]

Other useful attributes for sequences are:

$(xxx.begin) The sequence begin value

$(xxx.end) The sequence endpoint

$(xxx.protein) A boolean saying whether this sequence is a protein

Again, for less frequently used attributes see the ACD syntax formal specification.

5.7 Strings and Lists

By far the most common "other" datatypes are the strings, lists and select lists. The use of the string datatype is obvious:

string: sinclair [ additional: Y def: "EBLOSUM50" ]

The list datatypes are used whenever the user needs to be presented with a choice. The "select" lists give numerical options from 1->n whereas the "list" lists allow text to be typed as the keys. As you can use list to perform both types of operation this is the most frequently used. An example is:

list: table [
        additional: Y
        default: "0"
        min: 1
        max: 1
        header: "Genetic codes"
        values:
        "0:Standard"; "1:Vertebrate Mitochondrial"; "2:Bacterial"
        delim: ";"
        codedelim: ":"
        info: "Code to use"
      ]

This will produce a list from which one and only one value must be selected (min: 1 max:1). The delim and codedelim fields refer to the list specification in the values field. The values you can select are really strings and are "0", "1" or "2". This is an example of how a list can be used to emulate a "select list".

5.8 Graphics

For these the use of the xygraph datatype is recommended. An example of its use would be:

xygraph: graph [ standard: Y multi: 4 ]

which says you want something that will allow you to display 4 graphs on the same plot.

5.9 Performing calculations in ACD

Sometimes you'll want, for example, a maximum value set to the sequence length plus or minus a certain value. For calculations the @() syntax is used. Here is an example:

int: window [ standard: Y max: @($(sequence.length)-30) ]

This sets the maximum value to be 30 residues less than the sequence length. Addition, subtraction, multiplication and division are allowed and all operations can be nested.

5.10 Variables

Sometimes a calculation can get a bit messy. For this, and other reasons, the ACD syntax allows you to use variables. A short example will help here:

var: mulexp: [ "@($(sequence.length)*2)" ]

int: bridge [ def: 100 max: @($(mulexp)-30) ]

5.11 Conditionals

Our brief tour of "common ACD for the programmer" ends with a look at conditional operations. A very common requirement is to prompt for a text output file only if the user hasn't selected a plotting option. Let's assume the default is to prompt for a text output file and the user needs to type -plot on the command line to force a plot. The first part is easy:

bool: plot [ additional: Y def: N ]

Here is one way to control the prompting:

outfile: outf [ standard: @(!$(plot)) nullok: Y]

The negation operator (!) is effectively a calculation so the @ syntax has to be used. I mentioned earlier that you should never specify "N" after "parameter", "standard" or "additional". Well, strictly speaking you shouldn't but if an "N" is given, in this case by way of a calculation, then that will force EMBOSS not to prompt for a value for that option. This is a handy method if you have multi-mode programs and only need to prompt for certain options in certain modes. Such programs are, however, very good candidates for splitting into simpler, single-mode programs. The general rule is "one program, one basic function" and hence no requirement for complicated, multi-modal programs!

What is the nullok term for? Well, files always need a name and nothing is usually not a valid name. In this case it has to be. To stop ACD processing complaining the nullok term has to be added.

The ternary operator is often used to, for example, set a default value differently depending on whether the sequence is a protein or a DNA one. Here is an example:

def: "@($(sequence.protein) ? 30 : 7)"

This will set the default to 30 if the sequence is a protein; otherwise it will be set to 7. The ternary operation is another calculation, hence the @ syntax again.

6. PROGRAMMING PHILOSOPHY OF EMBOSS

A remark made by Richard Stallman, the author of EMACS, sums up the most important idea.

"Any program using arbitrary limits is fundamentally bugged"

If anyone submitted a program for inclusion in EMBOSS that had such a line as:

char sequence[10000];

within it then the main development team would need a lie down to recover from the horror. The second notable feature of EMBOSS programming is that it borrows concepts from C++. It is, however, written in the C programming language. The main reason for this is that when the EMBOSS project began there was no ANSI C++ standard. The library does look rather like C++. It uses objects and includes constructors, iterators, destructors and even incorporates the idea of data hiding.
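As a rough illustration of the alternative to a fixed-size array such as char sequence[10000], here is a minimal sketch in plain C of a buffer that grows on demand. It does not use AJAX at all; the GrowBuf type and the growbuf_append and growbuf_free functions are hypothetical names invented for this example, and the doubling policy is just one reasonable choice.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer, for illustration only (not an AJAX type). */
typedef struct GrowBuf
{
    size_t len;   /* characters currently stored       */
    size_t res;   /* characters of storage reserved    */
    char  *ptr;   /* heap storage, kept NUL terminated */
} GrowBuf;

/* Appends n characters from src, enlarging the storage when needed.
   Returns 1 on success, 0 if memory could not be obtained. */
static int growbuf_append(GrowBuf *buf, const char *src, size_t n)
{
    assert(buf != NULL && src != NULL);

    if(buf->len + n + 1 > buf->res)
    {
        size_t newres = (buf->len + n + 1) * 2;    /* ask for spare room */
        char  *newptr = realloc(buf->ptr, newres); /* NULL ptr acts as malloc */

        if(!newptr)
            return 0;

        buf->ptr = newptr;
        buf->res = newres;
    }

    memcpy(buf->ptr + buf->len, src, n);
    buf->len += n;
    buf->ptr[buf->len] = '\0';

    return 1;
}

/* Releases the storage and resets the buffer to empty. */
static void growbuf_free(GrowBuf *buf)
{
    assert(buf != NULL);
    free(buf->ptr);
    buf->ptr = NULL;
    buf->len = buf->res = 0;
}
```

A buffer initialised as GrowBuf b = {0, 0, NULL}; can then accept a sequence of any length, limited only by available memory rather than by a number somebody guessed at compile time.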

Every object in EMBOSS is dynamically allocated. The programmer should not dig into the library objects themselves; that is purely the domain of the library functions. You will mainly be dealing with string objects and sequence objects. There are many other objects in EMBOSS though and we'll meet them later.

Saying that you shouldn't look at objects is like a red rag to a bull as far as C programmers are concerned, myself included. So, let's have a look at the string object both as an example of how EMBOSS manages to emulate C++ and to introduce object nomenclature. The string object is one of the simplest ones and internally its definition is:

typedef struct AjSStr
{
   ajint Len;
   ajint Res;
   ajint Use;
   char *Ptr;
}
AjOStr, *AjPStr;

The object is the AjPStr definition. As C programmers will see, this is a pointer to a structure. All objects are referred to with pointers. For the AJAX pointers these start with "AjP". This nomenclature is maintained for all objects so it's easy to guess that a single sequence object is an AjPSeq.

Back to the AjPStr object though. The char *Ptr; is just a standard C pointer which holds a character string and the ajint Len; is its length. The character string may or may not be null terminated and this has both gotchas and benefits. For example, the library functions for printing AjPStr objects look at the length field for how many characters to print; they won't stop at the first null character if there is one.

The ajint Res; item internally lets the library know how much reserved dynamic memory is associated with the object. This will obviously be at least equal to Len but often more. Res is and should be outside your direct control. If you use a library call to add anything to the string and it will fit within the memory given by Res then the operation is performed immediately; if the memory required is larger than Res then more memory is allocated and the Res item is updated. A little more memory than required is usually allocated.

That just leaves the ajint Use; item to describe. It is a usage counter. You will sometimes want to have two objects pointing to exactly the same data. In that case it is pointless duplicating the string. Instead, the usage counter is incremented. When destroying a string object the usage counter is first decremented. Only if the usage counter is then zero will the object be deleted.

Of course no one can prevent you from accessing the internals directly. All we can advise is that if you intend altering the contents of an object then safety is guaranteed if you use the library functions for the purpose in hand. If you don't use the library functions and dabble without fully understanding the library then it's either segmentation fault or bus error time.
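The usage counter idea can be sketched with a toy type in plain C. This is not the AJAX implementation: CntSStr, cntStrNewC, cntStrRef and cntStrDel are hypothetical names invented here, and the sketch leaves out the Res field to keep attention on sharing and deletion.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counted string: illustrates the idea only, not AJAX. */
typedef struct CntSStr
{
    int   len;   /* number of characters held       */
    int   use;   /* how many owners share this data */
    char *ptr;   /* the characters                  */
} CntOStr, *CntPStr;

/* Constructor: copies txt into a new object with a usage count of 1.
   Returns NULL if memory could not be obtained. */
static CntPStr cntStrNewC(const char *txt)
{
    CntPStr s;
    size_t  n;

    assert(txt != NULL);

    n = strlen(txt);
    s = malloc(sizeof(*s));
    if(!s)
        return NULL;

    s->ptr = malloc(n + 1);
    if(!s->ptr)
    {
        free(s);
        return NULL;
    }

    memcpy(s->ptr, txt, n + 1);
    s->len = (int) n;
    s->use = 1;

    return s;
}

/* Shares the data: no copy is made, the counter is incremented. */
static CntPStr cntStrRef(CntPStr s)
{
    assert(s != NULL);
    ++s->use;
    return s;
}

/* Drops one owner and frees the object only when the last owner has gone.
   The caller's pointer is always cleared. */
static void cntStrDel(CntPStr *ps)
{
    assert(ps != NULL);

    if(*ps && --(*ps)->use == 0)
    {
        free((*ps)->ptr);
        free(*ps);
    }

    *ps = NULL;
}
```

With this toy, taking a second reference and then deleting it leaves the original string intact; the memory only goes away when the final owner calls the destructor.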

After that look at the internals I'll now introduce an ajax library function that will catenate two string objects. That library function has the prototype:

AjBool ajStrApp(AjPStr *str, AjPStr str2);

The function will append str2 to str leaving the result in str. The question is, since the objects are pointers anyway why is the first argument a pointer to a pointer? You should be able to work that out given the explanation above. Suffice it to say that if use of a function can possibly mean more memory has to be allocated for an object then a pointer to that object must be passed. The string that is being added to the end of the first one is not going to be altered in any way, therefore the object alone can be passed. In short:

If a function can change how much memory is allocated to an object then a pointer to that object must be passed.

IF THERE IS ONE RULE TO REMEMBER FROM READING THIS TEXT THEN PLEASE REMEMBER THAT ONE

The library functions are well documented as to whether an object or pointer to an object is required but they know what they're doing. Hopefully you do too now.
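To see why the rule matters, here is a minimal sketch in plain C99 using a toy string whose characters live in the same heap block as its header, so that growing it can move the whole object. That layout is an assumption chosen to make the point visible; it is not how AjPStr is laid out, and ToySStr, toyStrAppC and toyStrDel are hypothetical names, not AJAX calls.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy string, not the real AjPStr: header and characters
   share one heap block, so growing the string may move the object. */
typedef struct ToySStr
{
    size_t len;      /* characters in use           */
    size_t res;      /* characters reserved in data */
    char   data[];   /* C99 flexible array member   */
} ToyOStr, *ToyPStr;

/* Appends the C string txt to *pstr. The object may be reallocated (or
   created if *pstr is NULL), so the caller's pointer is updated through
   pstr. Returns 1 on success, 0 on allocation failure. */
static int toyStrAppC(ToyPStr *pstr, const char *txt)
{
    ToyPStr s;
    size_t  add;
    size_t  oldlen;
    size_t  need;

    assert(pstr != NULL && txt != NULL);

    s      = *pstr;
    add    = strlen(txt);
    oldlen = s ? s->len : 0;
    need   = oldlen + add + 1;

    if(!s || need > s->res)
    {
        ToyPStr tmp = realloc(s, sizeof(ToyOStr) + need);

        if(!tmp)
            return 0;

        tmp->res = need;
        s        = tmp;
        *pstr    = s;   /* the caller now sees the (possibly moved) object */
    }

    memcpy(s->data + oldlen, txt, add + 1);   /* copies the trailing NUL too */
    s->len = oldlen + add;

    return 1;
}

/* Destructor: frees the object and clears the caller's pointer. */
static void toyStrDel(ToyPStr *pstr)
{
    assert(pstr != NULL);
    free(*pstr);
    *pstr = NULL;
}
```

Had toyStrAppC taken a plain ToyPStr, the caller would be left holding the old address after a reallocation, which is exactly the segmentation fault waiting to happen that the rule guards against.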

6.1 And the rock cried out "No hiding place!"

We can now move onto the subject of data hiding. A reasonable working definition of data hiding is that the data held within that object are private. Only functions which are "friends" of the object are allowed to see the data. Any functions which are going to alter the data should not alter the original, rather they should get their own copy of the data and alter that. EMBOSS provides these ideas.

However, the concept of a char * character pointer is so fundamental to C programmers that the developers of EMBOSS thought it would be a real hassle not to provide access to the raw data. Angry C programmers were not the only reason for providing such access; a major consideration was to make porting of molbiol code out in the big wide world easier. A function you'll see a lot in EMBOSS programs is:

char *ajStrStr(AjPStr str);

This returns a pointer to the contents of a character string held within an AjPStr object. It is just so unbelievably useful to be able to get this sometimes. That's not to say that EMBOSS library functions are slow though, just that getting a pointer might help you sometimes. Once you have the pointer you can manipulate the internal data contents of an object directly but on your own head be it. We do not guarantee that internal representations will stay the same!

So, which do you use, the full data hiding capabilities of EMBOSS or the more direct "get the pointer" approach? If you are a purist then this is a rhetorical question. If you use common sense it is perfectly acceptable to mix the two and, after all, you do document your code don't you? I make the following recommendation. There is no real stigma in getting the char* pointer of a string object BUT, whenever the size of the data in an object can increase then use the library functions! The rule of no arbitrarily sized arrays must never be broken but the data hiding rule can be bent if it makes things more efficient or more intuitive. Use good judgement, the choice is yours. As you program more in EMBOSS you'll find yourself using the library functions in preference to the standard C library. A lot of code will continue to use pointers though, we are dealing with genome sized sequences and the lower the computational overhead the better.

For most objects there is a so-called cast function provided that lets you get at the data directly. For example the AjPInt object for dynamic integer arrays has an associated casting function called ajIntInt.

6.2 Constructing and Destructing

It is good and recommended programming practice to construct objects before you use them. You might even say it's essential but there you'd be partially wrong. Take the following code segment:

   AjPStr str=NULL;

   str = ajStrNew();

   ajStrAssC(&str, "This is a string");

   ajStrDel(&str);

First a string object pointer called str is declared. It is initialised to point to null. That itself is good practice. To create the object a constructor function is called. That is the line containing ajStrNew(). After that another library function is called which copies (assigns) a character string to the string object. Some work is done after this and then, as the string is no longer required the object is deleted and thereby its associated memory is freed. That is the proper way to write the code. Here is the wrong way:

   AjPStr str=NULL;

   ajStrAssC(&str, "This is another string");

   ajStrDel(&str);

The interesting thing is that this will still work. Whenever they can be, the library routines tend to be smart. If a NULL pointer is passed to them they'll recognise the fact and construct a string for you. It can get you out of trouble; it can also cause trouble. So, please say what you mean when writing code. Explicitly construct, use and destruct. There is no advantage to not using the constructor (that just means the assignment function has to call it internally anyway). Calling it is a good reminder to you that memory has been allocated and will most likely cause you to spot that it needs deleting when you've finished with it. In short, good programming practice helps stop applications leaking memory like a sieve.
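One way to convince yourself that every constructor has a matching destructor is to count live objects while developing. The following is a minimal sketch in plain C of that idea; it is not an AJAX facility, and ToySObj, toyObjNew, toyObjDel and toyLiveCount are hypothetical names made up for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Objects constructed but not yet destroyed (illustration only). */
static int live_objects = 0;

/* Hypothetical toy object, not an AJAX type. */
typedef struct ToySObj
{
    int value;
} ToyOObj, *ToyPObj;

/* Constructor: returns a zeroed object, or NULL if out of memory. */
static ToyPObj toyObjNew(void)
{
    ToyPObj o = calloc(1, sizeof(*o));

    if(o)
        ++live_objects;

    return o;
}

/* Destructor: frees the object, clears the caller's pointer and is
   harmless if called again on the same (now NULL) pointer. */
static void toyObjDel(ToyPObj *po)
{
    assert(po != NULL);

    if(*po)
    {
        free(*po);
        *po = NULL;
        --live_objects;
    }
}

/* Number of objects still awaiting destruction. */
static int toyLiveCount(void)
{
    return live_objects;
}
```

If toyLiveCount() is not zero just before the program exits then a destructor call has been forgotten somewhere, which is the sieve the text warns about.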

7. A SIMPLE EMBOSS PROGRAM

As a bit of light relief let's put a bit of all this into action. This will lead to a description of the structure of an EMBOSS program. Here is an application that will read in a sequence and tell you the GC fraction. We'll call this program gcf.

The first thing to do is to create an ACD file. We require a sequence and an output file for the results. As they are both essential we'll make them parameters:

appl: gcf 
[
   doc: "Work out GC fraction"
   groups: "DNA: sequence composition"
]

sequence: green [ param: Y type: DNA ]

outfile: boggo [ param: Y ]

And that's it for the ACD. Call the file gcf.acd and put it in the emboss/acd directory. Now for the application itself.

#include "emboss.h"

int main(int argc, char **argv)
{
    AjPSeq seq=NULL;
    AjPStr str=NULL;
    AjPFile outf=NULL;
    char *p;
    ajint  len;
    ajint  count=0;
    ajint  begin;
    ajint  end;
    ajint  c=0;



    embInit("gcf",argc,argv);

    seq  = ajAcdGetSeq("green");
    outf = ajAcdGetOutfile("boggo");


    str = ajStrNew();

    begin = ajSeqBegin(seq);
    end   = ajSeqEnd(seq);

    p = ajSeqChar(seq);


    ajStrAssSubC(&str,p,--begin,--end);
    ajStrToUpper(&str);
    p = ajStrStr(str);  /* see section 10.2.6 for the "proper" data hiding method */
    len = ajStrLen(str);

    while(*p)
    {
          c = *p++;
          if(c=='G' || c=='C')
             ++count;
    }

    ajFmtPrintF(outf,"GCFraction = %f\n", (float)((float)count/(float)len));

    ajStrDel(&str);
    ajExit();
    return 0;
}

Call this file gcf.c and put it in the emboss directory. Now edit emboss/Makefile.am and add gcf to the bin_PROGRAMS line and add a line saying gcf_SOURCES = gcf.c. Now type make. Try it on a DNA sequence and check that it works.

7.1 The include directive

The compiler directive #include "emboss.h" imports the entire EMBOSS interface i.e. makes all the EMBOSS library calls available to you. This must be included at the start of every EMBOSS program.

7.2 Importing the command line

The command line must be available to the application so the main function must include it. This is done in the parameter list using int main(int argc, char **argv)

7.3 Declarations

Here we meet the objects for the first time. We said in the ACD that a sequence is required so we need a sequence object (AjPSeq). An output file is required as well so a file object is needed (AjPFile). Why an object for a file? It's so that EMBOSS can get or put input and output from anywhere using the same system. A file could be local or it could be an internet connection. A string object (AjPStr) is also needed for some processing. Note their initialisation to NULL: not really required but good practice.

The remaining declarations are plain ANSI C and should be familiar.

7.4 Processing the ACD

The next line, or one like it, appears in all EMBOSS programs. embInit("gcf",argc,argv); This one line hides a wealth of activity. It reads in local database definitions, finds the right ACD file to use (from the "gcf" parameter), processes the command line (it uses argc and argv from main) and, by the time the call returns, it has read in the sequence and put it somewhere in memory and has also opened the output file. Remember I've said before that all of the user input processing is done before the application starts with a vengeance; well, that one line does it.

N.B.: If you are doing graphics then use ajGraphInit("prog",argc,argv) instead.

7.5 Retrieving the ACD values

The ACD has placed all the values we want somewhere in memory. We now need to retrieve them. This is done by the ajAcdGet family of functions. Taking the sequence as an example, note that the sequence is referred to by the name it was given in the ACD file and not by the datatype. I gave the sequence such a strange name just to make this point clear. It would have been less obvious if the ACD had declared sequence: sequence [ param:Y type: DNA ] as it should have done. The output file pointer is received similarly.

Saying that the call has "put the values somewhere in memory" is just another way of saying that embInit and the ajAcdGet functions allocate all the memory required for all the application parameters. There is therefore no need to explicitly create a new object: be careful not to do this as it would create a memory leak (==bad!)

7.6 Constructors

It is a good habit to put all the constructors together at the beginning of the application. They are therefore called immediately after the ajAcdGet calls. In this case the only object that has not yet been created is the string object. This is done by the str = ajStrNew(); call.

7.7 The body of the application

How you program the body of the application depends on style. This example shows a reasonable mixture of EMBOSS code and standard C. There are points to note though. The ajAcdGetSeq call will always return the whole sequence. The user might have specified a start and end position on the command line though (using the sbegin and/or send built-in qualifiers). A substring of the sequence must therefore be processed. That's why the string object is needed. I've written this example long-windedly to make the process clear. First, the begin and end values are received. These will be between 1 and "n" where n is the length of the sequence; that means, for C purposes, the returned begin and end values will need decrementing. Next a char* pointer to the sequence is returned. Finally the substring is assigned to the string object. The rest should be clear, the ajFmtPrintF call is the EMBOSS equivalent of fprintf but uses the file object.
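The index arithmetic is the part most likely to go wrong, so here it is in isolation as a standard C sketch. The function gc_fraction is a hypothetical helper written for this illustration (it is not an EMBOSS call); it takes begin and end as 1-based inclusive positions, which is how the text above describes the values returned for the sequence, and shifts them down by one for C.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of AJAX: GC fraction of the region
   begin..end of seq, where begin and end are 1-based and inclusive.
   The caller must make sure end does not exceed the sequence length. */
double gc_fraction(const char *seq, int begin, int end)
{
    int count = 0;
    int len;
    int i;

    assert(seq != NULL && begin >= 1 && end >= begin);

    len = end - begin + 1;                  /* inclusive range */

    for(i = begin - 1; i <= end - 1; ++i)   /* shift to 0-based C indices */
    {
        int c = toupper((unsigned char) seq[i]);

        if(c == 'G' || c == 'C')
            ++count;
    }

    return (double) count / (double) len;
}
```

For example, gc_fraction("AATTGC", 5, 6) looks only at the final "GC" and so returns 1.0.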

7.8 Destructing

We are clean programmers, right? So any memory we consume we're going to tidy up at the end and not rely on the operating system to clear up our mess. Destruct any objects you've created at the end of the program with the exception of any ACD-constructed objects, the latter should be one of the jobs of the ajExit(); call. In this case only the string object needs deleting.

7.9 Exit gracefully

main was declared as an int and it is good practice to return zero on success. That is done at the end with the return statement.

8. THE LIBRARIES AND PROGRAMMING

8.1 Introduction

It is not really within the scope of this guide to provide you with a full list of all the library functions. The tropical rain forests are one consideration but tedium is the main one. So, given that there are currently about a million lines of code, where does the programmer start? The best solution is for you to download the function and datatype listings (see later), use the online indexing of the source code, and then use this guide which will give examples of how to use the functions.

8.2 The Function Listings

I strongly recommend that you go to the web page http://emboss.sourceforge.net/developers/ where you will find two important pages. These are Ajax Library Documentation and Nucleus Library Documentation. They are both important but you will be using AJAX most of the time so I'll use the former page as an example, they both have the same features. On that page you will see the library split into logical sections such as strings, lists, tables, arrays, memory etc. These sections follow the way the code is supplied in the distribution so, for example, for the string functions there are the two files ajax/ajstr.c and ajax/ajstr.h.

These sections have associated function and datatype links. Go to each of these sections and save them out as postscript and print them (or use some other appropriate method for printing). Then, what you probably don't want to do is read them from start to finish. You can if you like but the best thing is to use them as a reference manual! On these pages, which are generated automatically and are therefore up-to-date, the functions are, where appropriate, split into logical sections concerning constructors, destructors, assignments, operators, casts etc. Have a flick through to get the general idea. You'll hardly ever need to look at the datatypes unless you intend writing a library function. For applications it is the functions that are important.

The source code itself is pretty much self-documenting and one useful thing you can do is to look at the ajax/*.h files which often list the functions alphabetically and give a short description.

One of the quickest ways to find something or see if it exists (which it probably will) is to use the online indexed facility.

8.3 The Online Indexed Library

The libraries are indexed using a piece of software called SRS. This enables you to browse the functions or search for one. Go to the Ajax Library Documentation page on the webserver (see 8.2). At the top of the page you'll see two databases highlighted. These are EFUNC and EDATA. The EFUNC database holds information about the functions; the EDATA database holds information about the datatypes.

Taking the simple program of section 7 as an example, suppose you'd forgotten what the ajStrAssSubC function was called but you knew that it dealt with substrings and char arrays and that it did an assignment. Click on EFUNC. From there click on Search at the top right of the page. On the form that is presented type:

substring & cop

and then click on Submit Query. You'll see that two functions are found: ajStrAssSub and ajStrAssSubC. Take a look at both of them. You'll see that the source code is indexed as well as the description line. Clicking on a function that another calls will take you to that function.

8.5 Nomenclature

All the ajax functions start with aj. These letters are followed by a short indicator of what general section they belong to so ajStr functions deal with strings and ajSeq functions deal with sequences.

Then follows a short description of what they do so ajStrAss is an assignment function, in C terms it is a form of strcpy. If the last letter of the name is a capital C then that function deals with char* strings.

The NUCLEUS library has similar nomenclature except that the function names start with emb standing for EMBOSS.

Now that you know how to find the functions, the next sections show you how to use them.

9. TRUE OR FALSE

EMBOSS has its own names for true and false values. These are values an AjBool datatype can hold. The values are:

ajTrue

ajFalse

10. STRINGS

10.1 Datatype, constructors and destructors

The main datatype is the AjPStr and there are two constructors. Here are code segments for them both.

AjPStr str=NULL;
str = ajStrNew();

AjPStr str=NULL;
str = ajStrNewC("Hello World");

The first just allocates a new string object, the second does the same but initialises it with a given character string. There is one destructor.

ajStrDel(&str);

10.2 Useful, commonly used, string functions.

There are tens and tens of these functions. They are the workhorses of the ajax library. Just take a look at your printed function library and you'll see the ajstr section is rather thick. Here are some of the very commonly used functions with some helpful hints.

10.2.1 Assignment (copying) and Appending (catenating)

AjBool ajStrAss(AjPStr *string1, AjPStr string2);

AjBool ajStrAssC(AjPStr *string1, char *string2);

These functions both return an AjBool which says whether or not more memory had to be allocated to the string. That isn't usually that useful so the return value is ignored. They are therefore usually called like:

ajStrAss(&string1, string2);

ajStrAssC(&string1, "Hello world");

but if you are programming properly you should explicitly say you want to ignore the return value by writing:

(void) ajStrAss(&string1,string2);

(void) ajStrAssC(&string1,"Hello World");

We tend to insist on that in library functions as it is proper ANSI C. For applications we are more generous but occasionally go through code and tidy it up.

For assignment of substrings use:

ajStrAssSub(AjPStr *string1, AjPStr string2,ajint start, ajint end);

ajStrAssSubC(AjPStr *string1,char *string2, ajint start, ajint end);

There are three ways of catenating with a string object:

ajStrApp(AjPStr *string1, AjPStr string2);

ajStrAppC(AjPStr *string1, char *string2);

ajStrAppK(AjPStr *string1, char c);

These are the equivalents of the C strcat function.

10.2.2 Case changes

These functions will change the case of a string object.

ajStrToUpper(AjPStr *string);

ajStrToLower(AjPStr *string);

They can be indispensable as sequence reading, quite rightly, doesn't change the case of a string but some databases have their sequences in lower case and others in upper case.

10.2.3 String Comparison

There are many of these functions. Again I'll limit this to the common ones.

These are the equivalents to the C strcmp function. They will return true if the strings are of equal length and exactly match.

AjBool ajStrMatch(AjPStr string1, AjPStr string2);

AjBool ajStrMatchC(AjPStr string1, char *string2);

They have two equivalents which are case insensitive, these being:

AjBool ajStrMatchCase(AjPStr string1, AjPStr string2);

AjBool ajStrMatchCaseC(AjPStr string1, char *string2);

Wild card matching is also provided by the functions:

AjBool ajStrMatchWild(AjPStr string1, AjPStr string2);

AjBool ajStrMatchWildC(AjPStr string1, char *string2);

The equivalent functions to the C library call strncmp which matches N characters are:

ajint ajStrNCmpO(AjPStr string1, AjPStr string2, ajint n);

ajint ajStrNCmpC(AjPStr string1, char *string2, ajint n);

but note that they return 0 (ajFalse) if the strings match. They are mainly used by the library itself.

Other matching routines allow you to test prefixes, suffixes etc.

10.2.4 String Length

That of course had to be in the library. And is:

ajint ajStrLen(AjPStr string);

Unlike the C function strlen this is a very speedy function as the string object already contains the length of the string.

10.2.5 Tokenising a string

AJAX has its own tokenisation functions. These include a constructor and a destructor. As an example, to tokenise the string "lets:tokenise:this" into its three colon-delimited components a code segment example would be:

AjPStr string=NULL;
AjPStr result=NULL;
AjPStrTok token=NULL;

result = ajStrNew();
string = ajStrNewC("lets:tokenise:this");
token = ajStrTokenInit(string, ":");

ajStrToken(&result, &token, ":");    /* result = lets */
ajStrToken(&result, &token, ":");    /* result = tokenise */
ajStrToken(&result, &token, ":");    /* result = this */

ajStrTokenClear(&token);
ajStrDel(&result);
ajStrDel(&string);

You can see that this is a bit like the C function strtok.

The three ajStrToken() calls return the three elements one at a time. The program would probably want to do something with them but this is just an example code segment. The AjPStrTok is the string tokenisation object, ajStrTokenInit is its constructor and ajStrTokenClear is its destructor.

Handy hint:

The function ajStrTokenCount will tell you how many tokens a string will produce.

Handy hint 2:

Tokenisation is a valuable technique for reading in data. It can be used, with other functions, to emulate the C function sscanf.
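To make the sscanf comparison concrete, here is a sketch in standard C of pulling one field out of a delimited record. The helper nth_token is hypothetical and written only for this illustration; it is not an AJAX function, and it follows strtok in treating a run of delimiters as a single separator, which may or may not match what the AJAX tokeniser does.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical plain-C helper (not an AJAX call): copy the n-th (0-based)
   token of src into out. Returns 1 on success, 0 if there is no such token.
   Like strtok, runs of delimiters are treated as one separator. */
int nth_token(const char *src, const char *delim, int n,
              char *out, size_t outsize)
{
    const char *p = src;
    size_t len;

    assert(src && delim && out && outsize > 0);

    for(;;)
    {
        p += strspn(p, delim);          /* skip leading delimiters  */
        if(!*p)
            return 0;                   /* ran out of tokens        */
        len = strcspn(p, delim);        /* length of this token     */
        if(n-- == 0)
            break;                      /* this is the one we want  */
        p += len;
    }

    if(len >= outsize)
        len = outsize - 1;              /* truncate rather than overflow */
    memcpy(out, p, len);
    out[len] = '\0';

    return 1;
}
```

Unlike strtok this does not modify the source string, so the same record can be picked apart as many times as you like.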

10.2.6 String Iterators

String iterators allow you to step through the contents of a string one position at a time. These iterators have their own constructors and destructor. The datatype is AjIStr, the constructor is ajStrIter and the destructor is ajStrIterFree. Its use is best illustrated by showing how the simple GC content program in section 7 should have been written to avoid violating data hiding principles.

This is the modified program:

#include "emboss.h"

int main(int argc, char **argv)
{
    AjIStr iter;
    AjPSeq seq=NULL;
    AjPStr str;
    AjPFile outf=NULL;
    ajint  count=0;
    ajint  begin;
    ajint  end;
    ajint  len;
    char c='\0';

    embInit("gcf",argc,argv);

    seq  = ajAcdGetSeq("green");
    outf = ajAcdGetOutfile("boggo");

    str = ajStrNew();

    begin = ajSeqBegin(seq);
    end   = ajSeqEnd(seq);

    ajStrAssSubC(&str,ajSeqChar(seq),--begin,--end);
    ajStrToUpper(&str);

    len = ajStrLen(str);

    iter = ajStrIter(str);
    while(!ajStrIterDone(iter))
    {
        c = ajStrIterGetK(iter);
        if(c=='G' || c=='C')
            ++count;
        ajStrIterNext(iter);
    }

    ajFmtPrintF(outf,"GC Fraction = %f\n",(float)((float)count/(float)len));

    ajStrIterFree(&iter);
    ajStrDel(&str);
    ajExit();

    return 0;
}

Note that there isn't a char* pointer in sight. The iter iterator is constructed using the AjPStr str. The ajStrIterDone returns ajTrue if there are no more characters left in the string. The ajStrIterGetK function returns the current character in the string object and the ajStrIterNext function moves the iterator on to the next character.

10.2.7 Too many to mention

The above shows the main features of using strings. There are loads of string functions left unmentioned but these are easily perused online. String functions exist for trimming, inserting, reversing, etc. Look at the function documentation and browse the EFUNC and EDATA databases!

11. LISTS

I'll say it again: arbitrary array sizes are forbidden in EMBOSS programs. Hopefully you'll be so sick of me saying that soon, or already, that it'll be burned into your long term memory. A typical situation with databases is that you don't know how many entries there are until you've reached the end; or that you cannot know how many hits you're going to get using a given query. How can you cater for that without using arbitrary arrays? The answer is, of course, lists. EMBOSS lists can be used to implement either FIFO or LIFO stacks. Also, like string objects, they have their own iterator.

Lists are kept intentionally general in EMBOSS, they have to deal with many different types of objects. That is why you'll see they use void* address pointers. It's the ultimate in non-specific pointers. There is one exception to this rule though. As string objects are so common they have their own list operations. I'll describe them first. Please read 11.1 before skipping to general lists as the principles are the same.

11.1 String Lists

The datatype for a string list object is the AjPList; this datatype is also used for all the other list operations. The constructor for string lists is the function ajListstrNew but there are two destructors, namely ajListstrFree and ajListstrDel. The difference is that the former will delete all the strings in the list plus the list itself, the latter will not delete the strings but will delete the list. A sample code segment would be:

AjPList list=NULL;

list = ajListstrNew();
ajListstrFree(&list);

11.1.1 LIFO lists

LIFO (last in first out) lists can be regarded as the default for EMBOSS as far as function naming is concerned. You put something onto a string list using the ajListstrPush function. You get something off the list by using the ajListstrPop function. Here is another rule to remember. It is only the address of an object that gets pushed onto a list. There is absolutely no copying of objects. Here is a common error.

AjPList list=NULL;
ajint i;
AjPStr str=NULL;

str = ajStrNew();
list = ajListstrNew();

for(i=0;i<10;++i)
{
    ajFmtPrintS(&str,"%d",i);
    ajListstrPush(list, str);
}

The ajFmtPrintS function will print 0->9 to the string object. It's like sprintf in C. So, what does that code segment do? Well, if you're lucky, when you pop things off the list you'll find all the strings contain "9"; if you're unlucky it'll crash (not very likely with this example, but if the print statement caused str to be reallocated then accessing what was returned would most likely cause a crash). Why is it doing that? Yes, because only the address of the string is being pushed. The expected behaviour can be obtained by creating a new string each time around the loop. This code is correct:

    AjPList list=NULL;
    ajint i;
    AjPStr str=NULL;

    list = ajListstrNew();

    for(i=0;i<10;++i)
    {
        str = ajStrNew();
        ajFmtPrintS(&str,"%d",i);
        ajListstrPush(list, str);
    }

To pop these items off the list we can use ajListstrPop. Something like the following would complete the program:

    AjPList list=NULL;
    ajint i;
    AjPStr str=NULL;
    AjPStr tmp=NULL;

    list = ajListstrNew();

    for(i=0;i<10;++i)
    {
        str = ajStrNew();
        ajFmtPrintS(&str,"%d",i);
        ajListstrPush(list, str);
    }

    for(i=0;i<10;++i)
    {
        ajListstrPop(list,&tmp);
        /* print out the string or do something else with it */
        ajStrDel(&tmp);
    }

    ajListstrDel(&list);

There are two points of interest in the above. One is that I needn't have used tmp, I could have reused str instead (why?) but it arguably made the code easier to read. The second is that I could replace the second for loop with a while loop since the ajListstrPop function returns ajTrue if there was something on the list. So:

    while(ajListstrPop(list,&str))
    {
         /* Print out the string or do something else with it */
         ajStrDel(&str);
    }

would do the same thing. Note the need to delete the strings when they're finished with.
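The "only the address is pushed" rule is nothing special to EMBOSS; the same trap exists in standard C. The sketch below uses an ordinary array of char* in place of a list, and the function names fill_shared and fill_fresh are made up for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define NITEMS 3

/* Wrong: one buffer is reused, so every slot holds the same address and
   every slot ends up showing the last value printed. */
void fill_shared(char *slots[NITEMS], char *buf, size_t bufsize)
{
    int i;

    for(i = 0; i < NITEMS; ++i)
    {
        snprintf(buf, bufsize, "%d", i);
        slots[i] = buf;                 /* address only, no copy */
    }
}

/* Right: a fresh buffer per item; the caller must free each one. */
void fill_fresh(char *slots[NITEMS])
{
    int i;

    for(i = 0; i < NITEMS; ++i)
    {
        slots[i] = malloc(16);
        assert(slots[i] != NULL);
        snprintf(slots[i], 16, "%d", i);
    }
}
```

After fill_shared all three slots point at the same buffer, which holds "2"; after fill_fresh they hold "0", "1" and "2" and each needs its own free, just as each popped string needs its own ajStrDel.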

11.1.2 FIFO Lists

FIFO (first in first out) lists are easy to implement. The code is precisely the same as that for LIFO lists with one exception. The function ajListstrPushApp is used instead of ajListstrPush. The former function appends an entry onto a list. This will mean the first item to be put on the list will be the first to be popped.
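If the difference between pushing and appending is not obvious, this stripped-down linked list in standard C may help. It is a sketch of the general idea only and says nothing about how AjPList is really implemented; the names List, list_push, list_push_app and list_pop are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Node { void *item; struct Node *next; } Node;
typedef struct List { Node *head; Node *tail; } List;

static Node *node_new(void *item)
{
    Node *n = malloc(sizeof *n);

    assert(n != NULL);
    n->item = item;                     /* address only, no copy */
    n->next = NULL;
    return n;
}

/* Push at the head: the last item pushed is the first popped (LIFO). */
void list_push(List *l, void *item)
{
    Node *n = node_new(item);

    n->next = l->head;
    l->head = n;
    if(!l->tail)
        l->tail = n;
}

/* Append at the tail: the first item appended is the first popped (FIFO). */
void list_push_app(List *l, void *item)
{
    Node *n = node_new(item);

    if(l->tail)
        l->tail->next = n;
    else
        l->head = n;
    l->tail = n;
}

/* Pop from the head; returns 1 and stores the item, or 0 if empty. */
int list_pop(List *l, void **item)
{
    Node *n = l->head;

    if(!n)
        return 0;
    *item = n->item;
    l->head = n->next;
    if(!l->head)
        l->tail = NULL;
    free(n);
    return 1;
}
```

Popping always takes from the head; it is only the end at which items are added that decides whether the list behaves as LIFO or FIFO.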

11.1.3 Reversing the order of a list

Easy. To reverse the order of entries in the list in section 11.1.1 you only have to type:

ajListReverse(list);

Note that pushing items onto a list and then reversing the list is another way to make a FIFO list.

11.1.4 Sorting a string list

Sorting a list is a bit trickier to understand if you haven't used routines like quicksort before. The prototype of the list sorting function is:

ajListSort(AjPList list, ajint (*compar) (const void *, const void *));

What the function needs is another function which will return 0 if two strings are the same and a positive number if the second should sort before the first or a negative number if the first should sort before the second. For string lists such a routine is already provided in the library and is called ajStrCmp. So, to sort the list in 11.1.1 (which is already sorted but what the heck) you'd use:

ajListSort(list,ajStrCmp);

11.1.5 Getting the number of entries in a list

Use the ajListLength function.

11.1.6 List Iteration

Lists can be iterated through much as string objects can. Iterating through a list is useful as the list can be scanned through more than once without having to pop values off all the time. The syntax is pretty much the same as for strings. The list iterator datatype is the AjIList. Here is a way that 11.1.1 could be rewritten using an iterator.

    AjPList list=NULL;
    AjIList iter=NULL;
    ajint i;
    AjPStr str=NULL;
    AjPStr tmp=NULL;

    list = ajListstrNew();

    for(i=0;i<10;++i)
    {
        str = ajStrNew();
        ajFmtPrintS(&str,"%d",i);
        ajListstrPush(list, str);
    }

    iter = ajListIter(list);

    while(ajListIterMore(iter))
    {
        tmp = (AjPStr) ajListIterNext(iter);
        /* print out the string or do something else with it */
    }

    ajListIterFree(iter);
    ajListstrFree(&list);

The ajListIterMore function tests to see if the iteration can continue. Note that the ajListIterNext function returns a void* pointer so a cast has to be used to tell the compiler that it's really an AjPStr. I've used the ajListstrFree function this time for a bit of variety. This means the ajStrDel function calls are unnecessary.

List iteration has other advantages. There are library functions, ajListstrRemove and ajListstrInsert, which can remove or add items in the middle of a list. They are effectively pops and pushes anywhere in the list. This can only be done using iterators.

11.1.7 Turning Lists into arrays

It is sometimes more convenient to work with an array than with a list, once the list has been created. EMBOSS provides functions for this. Suppose we apply the following code segment to the list from 11.1.1:

AjPStr *array;
ajint len;

len = ajListstrToArray(list,&array);

This will set len to 10 and make array an array of ten AjPStr objects which can be referred to as array[0] to array[9]. The list is left unchanged.

Please note the above cannot be written as:

AjPStr **array;

ajint len;

len = ajListstrToArray(list,array);

which will compile but crash when run, for rather obvious reasons given a little thought!

We would, of course, delete the array with code like:

for(i=0;i<len;++i)
    ajStrDel(&array[i]);
AJFREE(array);
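In case the "rather obvious reasons" are not so obvious: the function has to write the address of the new array somewhere, so it must be given the address of a real pointer variable. Here is the same pattern in standard C. The function make_array is a hypothetical stand-in invented for this sketch, not an AJAX call.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a "list to array" call: allocates an array of
   n ints, stores its address through 'out' and returns the length. */
int make_array(int n, int **out)
{
    int i;

    assert(n > 0 && out != NULL);

    *out = malloc((size_t) n * sizeof **out);   /* writes through 'out' */
    assert(*out != NULL);

    for(i = 0; i < n; ++i)
        (*out)[i] = i * i;

    return n;
}

/* Correct use:   int *array;   len = make_array(10, &array);
   Broken use:    int **array;  len = make_array(10, array);
   The broken form passes an uninitialised pointer, so the assignment to
   *out scribbles on whatever that pointer happens to hold. */
```

In the correct form &array is the address of a pointer that really exists, so the function has somewhere safe to store its result; the caller then owns the array and must free it.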

11.2 More General Lists

Some of the functions are precisely the same. The equivalents of the ones that aren't have the same function names without the three characters str; therefore ajListstrPush becomes just ajListPush. The second difference is that there is only one destructor, ajListDel, which only deletes the list and not the objects themselves. The other major difference, as I've said, is that you must use casts as the general functions use void* pointers. Let me illustrate this using the same examples but this time having lists of AjPCod codon objects.

11.2.1 The general LIFO list

The example from 11.1.1 becomes:

AjPList list=NULL;
ajint i;
AjPCod cod=NULL;
AjPCod tmp=NULL;

list = ajListNew();
for(i=0;i<10;++i)
{
    cod = ajCodNew();
    /* do something to the codon object and then */
    ajListPush(list, (void *)cod);
}
 
while(ajListPop(list,(void **)&tmp))
{
    /* do something with the codon  object then */
    ajCodDel(&tmp);
}

ajListDel(&list);

Note the void* and void** casts.

11.2.2 General FIFO lists

Just the same as for string lists except ajListPushApp is used with a (void *) cast.

11.2.3 General List Reverse

Exactly the same as for strings

11.2.4 General List Sorting

This uses the same ajListSort function; however, we will now most probably have to write our own comparison function and be very careful with casting, otherwise the program will not compile. Let us assume we had an object defined as:

typedef struct AjSWibble
{
    ajint granny;
    ajint weatherwax;
}
AjOWibble, *AjPWibble;

and we had a list of AjPWibble objects which we wanted to sort on the basis of the granny field. Our comparison function must return an ajint with 0 if both values are the same or a positive or negative value depending on whether one should sort before the other. Our comparison function would be written like:

ajint wibblecompare(const void *a, const void *b)
{
    return (*(const AjPWibble*)a)->granny - (*(const AjPWibble*)b)->granny;
}

depending on the preferred sort order. And the sort function becomes:

ajListSort(list,wibblecompare);
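The casting is easier to get right if you have seen it work with the standard C qsort function, which uses the same style of comparison function. In this sketch Wibble is a plain-C stand-in for the object above and wibble_cmp is a hypothetical name; the array being sorted holds pointers to objects, which is the assumption that makes the extra dereference necessary.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Wibble          /* plain-C stand-in for AjOWibble */
{
    int granny;
    int weatherwax;
} Wibble;

/* Each argument points at an array element, and each element is itself a
   pointer to a Wibble, hence the cast followed by a dereference. */
int wibble_cmp(const void *a, const void *b)
{
    const Wibble *w1 = *(const Wibble * const *) a;
    const Wibble *w2 = *(const Wibble * const *) b;

    /* Comparing rather than subtracting avoids integer overflow when the
       values are large and of opposite sign. */
    return (w1->granny > w2->granny) - (w1->granny < w2->granny);
}
```

It would be called as qsort(array, n, sizeof array[0], wibble_cmp); where array is a Wibble *array[n]. Swapping w1 and w2 in the return statement reverses the sort order.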

11.2.5 Getting the number of entries in a general list

Use the ajListLength function i.e. exactly the same as for strings.

11.2.6 List Iteration

The previous example using codon objects becomes:

AjPList list=NULL;
AjIList iter=NULL;
ajint i;
AjPCod cod=NULL;
AjPCod tmp=NULL;

list = ajListNew();

for(i=0;i<10;++i)
{
    cod = ajCodNew();
    /* do something */
    ajListPush(list, (void *)cod);
}

iter = ajListIter(list);
while(ajListIterMore(iter))
{
    tmp = (AjPCod) ajListIterNext(iter);
    /* print out the codon object or do something else with it */
}

ajListIterFree(iter);

while(ajListPop(list,(void **)&tmp))
    ajCodDel(&tmp);

ajListDel(&list);

Note the casts AND the extra step of deleting the objects and the list separately at the end.

11.2.7 Turning General Lists into arrays

Like string lists but with casts.

AjPCod *array;

ajint len;

len = ajListToArray(list,(void ***)&array);

This will set len to 10 and make array an array of ten AjPCod objects which can be referred to as array[0] to array[9]. The list is left unchanged.

11.3 An End To Lists

As with string objects there are lots more things lists can do but they use the same principles as given above. For example, the ajListMap function applies another function to every member of the list; the principle here is very similar to that of the sort function except that there is only one parameter in the map function. It can be useful for emptying lists.

12. DYNAMIC NUMERIC ARRAYS. Pints, Doubles and Shorts (and floats and longs)

Much as arbitrary length string arrays are a no-no, so are arbitrary numeric arrays. Sometimes you'll know that an integer array will never need to be larger than some function of another number, such as the length of a sequence. In that case it's acceptable to directly AJALLOC such memory. Sometimes you will not know, or don't want to know, or maybe it would be wasteful to allocate too much for most circumstances. This is where dynamic numeric arrays (DNAs) come in. AJAX allows you to use one, two or three dimensional arrays of ajints, ajshorts, floats, doubles and ajlongs.

12.1 One dimensional integer arrays

This is the most simple case. The datatype is the AjPInt, the constructor is ajIntNew and the destructor is ajIntDel. Here is a code segment:

AjPInt array=NULL;
ajint v;

array  = ajIntNew();
ajIntPut(&array, 27, 15);
v = ajIntGet(array, 27);

ajIntDel(&array);

This example stores the value 15 at position 27 in the equivalent of a one dimensional integer array (ajint*), and then retrieves it. As many values can be put into the array as you like at any position. The code will automatically take care of the memory allocation. N.B. It is an error to access an unfilled element. This can be avoided by conversion to a normal array (see below).

Hint: If you have at least some idea of the size of the array then you can use the ajIntNewL(n) constructor instead. This doesn't mean the array is restricted to n values but does mean that the library will have to worry less about memory reallocation. Alternatively, if your code calculates the maximum position first and fills that value then less memory reallocation will need to be done. In short, filling the array from the top is more efficient than filling from the bottom. This is only really a consideration for very large arrays.

At any time you can get the length of the array by using :

ajint n;

n = ajIntLen(array);

You may also find it more convenient, after filling an array, to convert the array to a standard C integer array. This can be done as follows:

ajint *a;

a = ajIntInt(array);

You can then access the elements using a[0] to a[n-1] as usual but remember that the DNA still exists so you may want to delete it.

N.B. All unfilled elements will have a zero value.
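The hint about filling from the top is easier to believe once you have counted the reallocations. The following standard C sketch is a deliberately naive growable array: IntArr and intarr_put are hypothetical names, and it grows to exactly the size needed each time, which is an assumption made for simplicity and almost certainly not what the library does internally.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct IntArr          /* hypothetical stand-in for AjPInt */
{
    int   *data;
    size_t len;                /* number of usable elements       */
    int    reallocs;           /* how many times we had to grow   */
} IntArr;

/* Store value at pos, growing (and zero-filling) the array if needed. */
void intarr_put(IntArr *a, size_t pos, int value)
{
    assert(a != NULL);

    if(pos >= a->len)
    {
        size_t newlen = pos + 1;
        int *p = realloc(a->data, newlen * sizeof *a->data);

        assert(p != NULL);
        /* new elements start at zero, matching the N.B. above */
        memset(p + a->len, 0, (newlen - a->len) * sizeof *p);
        a->data = p;
        a->len  = newlen;
        ++a->reallocs;
    }

    a->data[pos] = value;
}
```

Filling positions 0 to 27 in ascending order grows this array 28 times; filling from 27 downwards grows it once. A real implementation will grow in larger steps, so the gap is smaller in practice, but the direction of the effect is the same.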

12.2 Two dimensional integer arrays

These are very similar to one dimensional arrays. The datatype is the AjPInt2d and some example code is:

AjPInt2d array;
ajint v;

array = ajInt2dNew();
ajInt2dPut(&array, 27,8, 15);
v = ajInt2dGet(array, 27,8);
ajInt2dDel(&array);

This fills the 8th column of the 27th row with the value 15 and then retrieves it.

Getting the dimensions for multidimensional arrays has a slightly different syntax.

ajint rows;

ajint columns;

ajInt2dLen(array, &rows,&columns);

will return the row and column values. Converting to a normal array is just the same:

ajint **a;
a = ajInt2dInt(array);

12.3 Three dimensional arrays

These have the same syntax as two dimensional arrays. The datatype is AjPInt3d.

12.4 Floats, Doubles, Shorts and Longs

All these have their equivalent functions to the integer ones and all have the same syntax as for the integer ones. Note though, that although the datatypes of the values change the row/column/etc positions are still specified as integers.

13. TABLES

Tables in EMBOSS allow you to add key/value pairs and then retrieve an associated value for a given key. As with lists there are two types. One is where the key is a string object, the other where the key is more general. As it is nearly always the case that strings can be used for most keys I'll mainly discuss these. The advantage of using strings as keys is that the library calls are easier to understand. A full discussion of general tables may be given in future versions of this document. General tables have not needed to be used by either the library or any application so far.

For all types of table the datatype is the AjPTable. The tables are automatically hashed to provide speedy access to the keys. Like some other functions it is more efficient to have some idea of the size the table will need to be before creating it. It is not critical though and more memory will always be allocated if the table grows beyond the estimate.

Here is some example code using a table. It uses strings for both keys and values:

AjPTable table=NULL;
AjPStr key=NULL;
AjPStr value=NULL;
ajint i;
static char *keys[] =
{
   "Sheridan", "Garibaldi", "Londo","Vir", NULL
};

static char *values[]=
{
   "One", "Two", "Three", "Four",NULL
};

table = ajStrTableNew(10);
i=0;
while(keys[i])
{
    key = ajStrNewC(keys[i]);
    value = ajStrNewC(values[i]);
 
    ajTablePut(table,(const void *)key, (void *)value);

    ++i;
}

key = ajStrNewC("Garibaldi");   
if(!(value = ajTableGet(table, (const void *)key)))
     ajFatal("Cannot find key");

/* value is now a string object holding "Two" */

ajStrTableFree(&table);
ajStrDel(&key);

So you can see that ajStrTableNew is the constructor and ajStrTableFree is the destructor. If the string objects are real, and not just copies of other string pointers, then ajStrTableFree deletes the strings as well. If you just want the table itself deleted use ajTableFree instead. The value 10 is just a guess at the number of values; we know it is really 4, as it's hard coded into the character arrays, but even examples in this documentation are given in the spirit of the thing.

Just like lists, only the addresses are put into the table, so new key and value objects need allocating for each addition. The function that adds the pairs is ajTablePut. This function is also used for general tables so void casts need to be applied.

The last part of the example assigns one of the strings we know is there to the AjPStr key and then uses ajTableGet to return the associated value (in this case "Two"). The ajTableGet function is also used by general tables and so the void cast is required. If a key is given which doesn't exist in the table then NULL is returned.

You can get the length of a table with entries = ajTableLength(table); and you can remove key/value pairs. You can also apply a function to all the key/value pairs with ajTableMap, or convert a table to an array of 2N+1 elements (the N keys and N values plus a final delimiter, usually NULL). An example of the latter for the table above would be:

AjPStr *array;

array = (AjPStr *) ajTableToArray(table,NULL);
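
The point that only addresses go into a table is easier to see in a stripped-down model. What follows is a minimal sketch in plain C of a string-keyed hash table; StrTable, table_put, table_get and table_free are hypothetical names, the bucket count is fixed for brevity, and none of it is taken from the AJAX source.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 16

typedef struct Entry {
    const char *key;         /* address only; the text is not copied */
    void *value;             /* address only */
    struct Entry *next;      /* next entry in the same bucket */
} Entry;

typedef struct {
    Entry *buckets[NBUCKETS];
    size_t length;           /* number of key/value pairs held */
} StrTable;

/* Simple multiplicative string hash. */
unsigned long hash_str(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Add a pair, or replace the value of an existing key. Returns 1 on success. */
int table_put(StrTable *t, const char *key, void *value)
{
    Entry **slot;
    Entry *e;

    assert(t != NULL && key != NULL);
    slot = &t->buckets[hash_str(key) % NBUCKETS];
    for (e = *slot; e; e = e->next)
        if (strcmp(e->key, key) == 0) {
            e->value = value;
            return 1;
        }
    e = malloc(sizeof *e);
    if (!e)
        return 0;
    e->key = key;
    e->value = value;
    e->next = *slot;
    *slot = e;
    t->length++;
    return 1;
}

/* Return the value stored for key, or NULL if the key is absent. */
void *table_get(const StrTable *t, const char *key)
{
    const Entry *e;

    assert(t != NULL && key != NULL);
    for (e = t->buckets[hash_str(key) % NBUCKETS]; e; e = e->next)
        if (strcmp(e->key, key) == 0)
            return e->value;
    return NULL;
}

/* Free the table's own entries; the keys and values are left alone. */
void table_free(StrTable *t)
{
    size_t i;

    assert(t != NULL);
    for (i = 0; i < NBUCKETS; i++) {
        Entry *e = t->buckets[i];
        while (e) {
            Entry *next = e->next;
            free(e);
            e = next;
        }
        t->buckets[i] = NULL;
    }
    t->length = 0;
}
```

Because the table holds the address rather than a copy, changing a stored value after the put is visible through table_get, and table_free here plays the part of the "table only" destructor: whoever allocated the keys and values still owns them.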

14. FORMATTED OUTPUT

There are only a handful of routines in ajfmt.c that are used to any degree in EMBOSS applications but they are some of the most important. After all, it is occasionally useful to print out the odd result. I'll lump warnings and errors into this section.

14.1 ajFmtPrintF

This can be regarded as the equivalent of the C routine fprintf. It takes an AJAX file object which has previously been opened as an output stream to somewhere (local file, network etc) and then uses a varargs approach for printing data. File objects are discussed in section 15 but for now assume that we have an AjPFile object called outf opened where we want it.

An example is all that is required to explain this function:

ajint i=5;
float fl=6.7;

static char *fred="Hello";
/* assumes open file object outf */

ajFmtPrintF(outf,"%s %d %.1f\n",fred,i,fl);

will print out "Hello 5 6.7" followed by a newline (a plain %f would, as in C, normally give six decimal places).

This function can cope with all the format specifiers of the standard C library and one or two more that are specific to EMBOSS. Rather than having to specify %s and cast a string object to char* you can print a string object directly using %S i.e.

AjPStr str=NULL;

str = ajStrNewC("Hello world");
/* assumes open file object outf */
ajFmtPrintF(outf,"%S\n",str);

Even something like %60.60S will work. The other differences from the standard C specifiers are:

%S as above

%D print a date

%B print an AjBool value

%s will accept a NULL pointer and print "<null>"
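
As a reminder of what a width and precision do to a string, here is a small sketch in standard C using snprintf. It assumes that the AJAX %S specifier follows the same width.precision rules as C's %s; format_fixed is a hypothetical helper, and the "<null>" fallback merely imitates the behaviour described above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write s into buf as a field of exactly `width` characters:
 * longer strings are cut off, shorter ones are padded on the left.
 * A NULL pointer is shown as "<null>" (imitating the note above). */
int format_fixed(char *buf, size_t size, int width, const char *s)
{
    assert(buf != NULL && size > 0 && width >= 0);
    if (!s)
        s = "<null>";
    /* "%*.*s" takes the minimum width and the precision as arguments */
    return snprintf(buf, size, "%*.*s", width, width, s);
}
```

With a width of 8, "Hello world" comes out as "Hello wo" and "Hi" comes out right-justified as six spaces followed by "Hi".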

14.2 ajFmtPrintS

This is just like ajFmtPrintF only it prints to a string object hence:

AjPStr str=NULL;

str = ajStrNew();
ajFmtPrintS(&str,"%s %d %.1f","Hello",5,6.7);

will leave str holding "Hello 5 6.7".

14.3 ajFmtPrintAppS

Like ajFmtPrintS but appends to a string object instead of overwriting it.
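
The difference between printing to a string and appending to one can be sketched in plain C. The code below is not the AJAX implementation; Buf and buf_appendf are hypothetical names, and the approach (measure with vsnprintf, grow with realloc, then format in place) is just one common way of doing it.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable string; start it as {NULL, 0}. */
typedef struct {
    char *text;
    size_t len;              /* length excluding the terminating '\0' */
} Buf;

/* Append formatted text to b. Returns 1 on success, 0 on failure. */
int buf_appendf(Buf *b, const char *fmt, ...)
{
    va_list ap, ap2;
    int need;
    char *grown;

    assert(b != NULL && fmt != NULL);
    va_start(ap, fmt);
    va_copy(ap2, ap);
    need = vsnprintf(NULL, 0, fmt, ap);      /* first pass: measure only */
    va_end(ap);
    if (need < 0) {
        va_end(ap2);
        return 0;
    }
    grown = realloc(b->text, b->len + (size_t)need + 1);
    if (!grown) {
        va_end(ap2);
        return 0;
    }
    /* second pass: write after the existing text */
    vsnprintf(grown + b->len, (size_t)need + 1, fmt, ap2);
    va_end(ap2);
    b->text = grown;
    b->len += (size_t)need;
    return 1;
}
```

Calling buf_appendf(&b, "%s %d", "Hello", 5) and then buf_appendf(&b, " %.1f", 6.7) on an empty Buf leaves it holding "Hello 5 6.7"; a "print" rather than "append" variant would simply reset len to zero first.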

14.4 User messages, Warnings, Errors and Fatalities

Messages at various levels of severity can be sent to the user. There are five functions for this.

ajUser used to print messages (usually to do with correct program usage) to the user

ajWarn used for printing warnings

ajErr used for printing errors

ajDie prints non-recoverable errors and aborts the application

ajFatal prints fatal error messages and aborts the application

All these functions have varargs capability like the ajFmtPrint functions e.g.

ajFatal("Cannot open file %S",filename);

14.5 Debugging

The ajDebug function is like ajUser except that it will print messages to the debugging file if the user has specified -debug on the command line.

15. INPUT AND OUTPUT FILES

There are many variations on opening input and output files in EMBOSS and so I'll deal with the basics here; it is worth perusing the functions in ajfile.c after you've got the idea. The datatype AjPFile is the standard one (although there is another form for buffered files).

15.1 Opening a file for reading

The function ajFileNewIn is used. The filename is passed as an AjPStr. Here is a code segment:

AjPFile inf=NULL;

AjPStr fn=NULL;

fn = ajStrNewC("my.file");

if(!(inf = ajFileNewIn(fn)))
    ajFatal("Cannot open file %S for reading",fn);

15.2 Opening a file for writing

The function ajFileNewOut is used.

AjPFile outf=NULL;

AjPStr fn=NULL;

fn = ajStrNewC("my.file");

if(!(outf = ajFileNewOut(fn)))
    ajFatal("Cannot open file %S for writing",fn);

15.3 Closing input and output files

The same function, ajFileClose, is used for both e.g.

AjPFile outf=NULL;

/* assumes open file object */

ajFileClose(&outf);

15.4 Files in the EMBOSS data areas

If a program requires a data file (e.g. a BLOSUM matrix) then it first looks in the user's current directory, then in a few other directories as documented elsewhere, and finally in the EMBOSS data directory. The function to open such a data file for reading, which will look in all those locations, is ajFileDataNew. It has a rather non-standard calling method. Here is an example of its use:

AjPFile inf=NULL;

AjPStr str=NULL;

str = ajStrNewC("BLOSUM100");

ajFileDataNew(str,&inf);

if(!inf)
    ajFatal("Cannot open data file %S",str);

To open a file for writing in the top level EMBOSS data directories use the ajFileDataNewWrite function:

AjPFile outf=NULL;

AjPStr str=NULL;

str = ajStrNewC("BLOSUM666");
ajFileDataNewWrite(str,&outf);
if(!outf)
    ajFatal("Cannot open data file %S",str);

15.5 Input from other commands

EMBOSS can accept data from external applications. It does this by means of pipes. To use a pipe as input the ajFileNewInPipe function is used. The "filename" given to this function is the command whose output is to be read.

15.6 Reading from multiple files

There are several variations on this. For example ajFileNewInList will accept a list of filenames and create a file object with the first file open. After reading all of the file a call to ajFileNext will close the current file and open the next in the list. When the function returns ajFalse there are no more files left to read. An alternative is to use ajFileNewDW in combination with ajFileNext. You can give this a directory and a wildcard string; all matching files will be found and the reading then proceeds as for the previous example.
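
The kind of wildcard test involved can be sketched with the POSIX fnmatch function. This is only an illustration of shell-style matching against a list of names; how ajFileNewDW matches names internally is not described here, and count_matches is a hypothetical helper.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

/* Count how many of the n names match a shell-style wildcard pattern. */
size_t count_matches(const char *pattern, const char *const names[], size_t n)
{
    size_t i;
    size_t hits = 0;

    assert(pattern != NULL && names != NULL);
    for (i = 0; i < n; i++)
        if (fnmatch(pattern, names[i], 0) == 0)   /* 0 means "matches" */
            hits++;
    return hits;
}
```

Given the names "a.seq", "b.seq", "notes.txt" and "c.seq.bak", the pattern "*.seq" selects the first two only, because the whole name has to match the pattern.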

15.7 Buffered Input

Equivalent functions are available for buffered input using the datatype AjPFileBuff. To open a file for reading use ajFileBuffNew, to close it use ajFileBuffDel etc.

AjPFileBuff buff=NULL;

AjPStr str=NULL;

str = ajStrNewC("my.file");

if(!(buff = ajFileBuffNew(str)))
    ajFatal("Cannot open input file %S",str);

/* add something here */

ajFileBuffDel(&buff);

As the name suggests, reading from buffered files can be more efficient as more data is slurped into memory at any one time. You have to be a little careful that you clear buffers etc when you use them.

Not all the file functions can work with buffered files; for a lot of them it would be a logical impossibility. There is normally little overhead in using normal file objects, as the operating system usually buffers things anyway.

The function ajFileBuffLoad will read all of the file into the buffer. Not recommended for large databases and small computers.

15.8 Seeking, Raw Reading, Position and other C-ish equivalents

ajFileSeek is the equivalent of the C fseek function

ajFileTell is the equivalent of the C ftell function

ajFileStat is the equivalent of the C stat function
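
For those less familiar with the C calls these functions mirror, here is a short sketch in standard C that uses fseek and ftell to find the size of an open file without disturbing the current read/write position. file_size is a hypothetical helper and works on a plain FILE pointer, not on an AjPFile.

```c
#include <assert.h>
#include <stdio.h>

/* Return the size in bytes of an open file, or -1 on error.
 * The current position in the file is restored before returning. */
long file_size(FILE *fp)
{
    long here;
    long size;

    assert(fp != NULL);
    here = ftell(fp);                        /* remember where we are */
    if (here < 0 || fseek(fp, 0L, SEEK_END) != 0)
        return -1;
    size = ftell(fp);                        /* offset of the end = size */
    if (fseek(fp, here, SEEK_SET) != 0)      /* go back again */
        return -1;
    return size;
}
```

This is reliable for files opened in binary mode; for text-mode streams on some systems the value returned by ftell is not guaranteed to be a byte count.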

15.9 Handy File functions

ajFileName will return the filename of a file object as a char* pointer

ajFileStdin, ajFileStderr and ajFileStdout test whether a file object points to the corresponding standard stream and return ajTrue if it does.

16. READING FROM FILES

16.1 Reading a line from a standard file

Use the function ajFileReadLine. This has the advantage that it will strip trailing newlines.

Assuming an open input file object inf:

AjPStr str=NULL;

str = ajStrNew();

while(ajFileReadLine(inf,&str))
{
    /*  process the successive lines */
}

ajFileClose(&inf);
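
To show what stripping trailing newlines saves you from doing by hand, here is the same loop body sketched in standard C with a fixed-size buffer. read_line is a hypothetical helper, not an AJAX function; unlike a string object, the fixed buffer means that over-long lines arrive in pieces.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read one line into buf, removing a trailing "\n" or "\r\n".
 * Returns 1 if a line was read and 0 at end of file.
 * Lines longer than size-1 characters are returned in several pieces. */
int read_line(FILE *fp, char *buf, size_t size)
{
    assert(fp != NULL && buf != NULL && size > 1);
    if (!fgets(buf, (int)size, fp))
        return 0;
    buf[strcspn(buf, "\r\n")] = '\0';        /* cut at the first line ending */
    return 1;
}
```

It is used in the same way as the loop above: while(read_line(fp, buf, sizeof buf)) { ... }.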

16.2 Reading a line from a buffered file

Use the function ajFileBuffGet. Its use is the same.

AjPStr str=NULL;
AjPFileBuff inf=NULL;

str = ajStrNewC("my.file");
if(!(inf=ajFileBuffNew(str)))
    ajFatal("Cannot open file %S",str);

while(ajFileBuffGet(inf,&str))
    {
    /*  process the successive lines */
    }

ajFileBuffDel(&inf);

16.3 Reading in binary

Use a normal file object and the function ajFileRead. It is the equivalent of the C fread function.

17. SEQUENCES

You might expect this to be a very long section but you'd be wrong. Although sequences are the bread and butter of the molecular biologist most of the sequence functions in the library are concerned with reading them in transparently to the user. Once you've got the sequence it is effectively a string with a name, an accession number etc. Most of the work done by any application will first get the sequence into an AjPStr object and work on it there. The datatype is the AjPSeq. It does have a constructor called ajSeqNew but you will hardly ever need to use it unless you want to construct a sequence from scratch. Your sequences will mainly come from ACD and will therefore have been constructed for you.

17.1 Getting multiple sequences

You have seen that to get a single sequence you define a line in ACD like:

sequence: sequence [param: Y type: protein]

and then retrieve it with ajAcdGetSeq and the AjPSeq datatype. We haven't yet seen how to get multiple sequences. This is done with a line in ACD such as the following:

seqall: sequences [param:Y type: protein]

Then, in the code you use the AjPSeqall datatype. Here is an example:

    AjPSeqall seqs=NULL;
    
    AjPSeq seq = NULL;

    seqs = ajAcdGetSeqall("sequences");
    
    while(ajSeqallNext(seqs, &seq))
    {
    /* Do something with the sequence seq  */
    }

The ajSeqallNext function loads the next sequence into the AjPSeq object each time around the loop. If you want to associate sequences as a set then use seqset in ACD and use the ajSeqset family of functions. You'd probably do this for multiple alignments.

17.2 Getting information from a sequence

ajSeqGetName get the name. This is a pointer to the internal AjPStr

ajSeqName get the name. This is a pointer to the internal char*

ajSeqGetDesc get the description. This is a pointer to the internal AjPStr

ajSeqGetAcc get the accession number. This is a pointer to the internal AjPStr

17.3 Sequence begin and end points

If you specified a single sequence in ACD then use ajSeqBegin and ajSeqEnd. If you've used seqall in ACD then use ajSeqallBegin and ajSeqallEnd, or if seqset has been used then use ajSeqsetBegin and ajSeqsetEnd.

17.4 Getting a string object copy of a sequence

str = ajSeqStrCopy(seq);

17.5 Reversing and complementing

ajSeqReverse reverse and complement a sequence

ajSeqCompOnly just complement a sequence

ajSeqRevOnly just reverse a sequence

Note that you can reverse complement a string object using ajSeqReverseStr or just complement it using ajSeqCompOnlyStr. Simple reversal would be achieved by ajStrRev. With the first two of these functions be sure to pass a nucleic acid sequence string, as no checking can be done.
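
For anyone unsure what reverse complementing actually involves, here is a minimal sketch in plain C working on an ordinary char string. complement_base and reverse_complement are hypothetical names, only A, C, G, T and U are handled, and anything else (N, gaps, ambiguity codes) is passed through unchanged, which the library functions may well treat differently.

```c
#include <assert.h>
#include <string.h>

/* Return the complement of a single base, preserving case.
 * Characters other than A, C, G, T, U are returned unchanged. */
char complement_base(char b)
{
    switch (b) {
    case 'A': return 'T';
    case 'T': return 'A';
    case 'U': return 'A';
    case 'C': return 'G';
    case 'G': return 'C';
    case 'a': return 't';
    case 't': return 'a';
    case 'u': return 'a';
    case 'c': return 'g';
    case 'g': return 'c';
    default:  return b;
    }
}

/* Reverse complement a nucleotide string in place. */
void reverse_complement(char *seq)
{
    size_t i;
    size_t n;

    assert(seq != NULL);
    n = strlen(seq);
    /* swap the ends towards the middle, complementing as we go */
    for (i = 0; i < n / 2; i++) {
        char left = complement_base(seq[i]);
        seq[i] = complement_base(seq[n - 1 - i]);
        seq[n - 1 - i] = left;
    }
    if (n % 2)                               /* odd length: middle base */
        seq[n / 2] = complement_base(seq[n / 2]);
}
```

Applied to "ATGC" this gives "GCAT". Note that, just as warned above, nothing here checks that the string really is a nucleic acid sequence.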

17.6 Upper and Lower Case

ajSeqToUpper

ajSeqToLower

17.7 Sequence output

The sequence output datatype is the AjPSeqout. This datatype is mainly used after manual construction of a (set of) sequence object(s). This is rare. More frequently the string objects are used for output.

18. ACD RETRIEVAL FUNCTIONS

18.1 The Simple Functions

ajAcdGetBool

ajAcdGetCodon

ajAcdGetFloat

ajAcdGetInt

ajAcdGetInfile

ajAcdGetOutfile

ajAcdGetSeq

ajAcdGetSeqall

ajAcdGetSeqset

ajAcdGetSeqout

ajAcdGetSeqoutall

ajAcdGetSeqoutset

ajAcdGetString

18.2 List and Select

The ajAcdGetList and ajAcdGetSelect calls return an array of string objects. The value of the last element is a NULL pointer.

AjPStr *array=NULL;

array = ajAcdGetList("list");

19. REGULAR EXPRESSIONS

The regular expression routines used at the lowest level in EMBOSS are the Henry Spencer ones. These appear in the ajax directory as the hs* files. An AJAX interface has been made for them. The regular expression datatype is the AjPRegexp. A regular expression must first be compiled before being used to scan a target string object. Compilation is done using the ajRegComp or ajRegCompC functions, the scanning by the ajRegExec function. The compilation functions are constructors. The destructor for regular expressions is the ajRegFree function.

After a string has been scanned, any (sub)matches can be returned using the ajRegSubI function, and the remainder of the string following the match can be returned using the ajRegPost function.

Here is an example:


AjPRegexp exp=NULL;
AjPStr str=NULL;
AjPStr id=NULL;
AjPStr remain=NULL;


str = ajStrNewC(">KABAT wibble This is some more text");


id = ajStrNew();
remain = ajStrNew();


exp = ajRegCompC("^>[A-Za-z0-9_-]+[ \t]+([A-Za-z0-9_-]+)"); /* note the space before \t */


if(!ajRegExec(exp,str))
    ajFatal("No match found");


ajRegSubI(exp,1,&id);
ajRegPost(exp,&remain);
ajRegFree(&exp);

A regular expression is compiled by ajRegCompC. The compiled expression exp is used to scan the string str. A match is found and the first substring (the bracketed "()" bit of the regular expression) is returned into the id string by ajRegSubI, so the id string object now contains "wibble". After the ajRegPost function the remain string object contains everything that follows the match, here " This is some more text" (including the leading space, which the expression does not consume). The regular expression is destroyed by ajRegFree.
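
The same compile, execute, extract and free sequence exists in the POSIX regex interface, which may be a useful comparison if you already know it. The sketch below uses regcomp, regexec and regfree from regex.h rather than the AJAX calls; split_header is a hypothetical helper written for this illustration.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Match a ">NAME id rest..." line. On success copy the first bracketed
 * submatch into id and everything after the whole match into rest,
 * and return 1. Return 0 if the line does not match. */
int split_header(const char *line, char *id, size_t idsize,
                 char *rest, size_t restsize)
{
    regex_t re;
    regmatch_t m[2];         /* m[0] = whole match, m[1] = first submatch */
    size_t len;
    int found;

    assert(line != NULL && id != NULL && rest != NULL);
    assert(idsize > 0 && restsize > 0);

    if (regcomp(&re, "^>[A-Za-z0-9_-]+[ \t]+([A-Za-z0-9_-]+)",
                REG_EXTENDED) != 0)
        return 0;
    found = (regexec(&re, line, 2, m, 0) == 0);
    regfree(&re);            /* the destructor step */
    if (!found)
        return 0;

    /* first submatch, truncated if the caller's buffer is too small */
    len = (size_t)(m[1].rm_eo - m[1].rm_so);
    if (len >= idsize)
        len = idsize - 1;
    memcpy(id, line + m[1].rm_so, len);
    id[len] = '\0';

    /* whatever follows the end of the whole match */
    len = strlen(line + m[0].rm_eo);
    if (len >= restsize)
        len = restsize - 1;
    memcpy(rest, line + m[0].rm_eo, len);
    rest[len] = '\0';
    return 1;
}
```

For the line ">KABAT wibble This is some more text" this puts "wibble" in id and " This is some more text", leading space included, in rest.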

20. MEMORY ALLOCATION

Here are some useful memory allocation macros:

AJALLOC(nbytes) equivalent of malloc

AJALLOC0(nbytes) a calloc of nbytes

AJCALLOC(count,nbytes) a malloc of count lots of nbytes

AJCALLOC0(count,nbytes) a calloc of count lots of nbytes

AJNEW(p) a pointer to an object gets an object allocated for it using malloc

AJNEW0(p) a pointer to an object gets an object allocated for it using calloc

AJCNEW(p,c) a pointer to an object gets c objects allocated using malloc

AJCNEW0(p,c) a pointer to an object gets c objects allocated using calloc

For non-C programmers, "malloc" allocates memory but leaves the contents undefined, whereas "calloc" allocates memory and sets each location to zero.

In some ways a str=ajStrNew() call is almost the same as AJNEW0(str), but not quite; the AJAX library is just a little more clever than that! It's a close analogy though. The macro is very useful for allocating your own objects.
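
To give an idea of how macros of this sort can be built, here is a sketch in plain C. MY_NEW0 and MY_CNEW0 are hypothetical macros, deliberately named differently from the AJAX ones, and Hit is an invented example object; the real AJAX definitions may do more (error handling, for instance).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical look-alikes: allocate zeroed storage sized from the
 * pointer's own type, so the type never has to be spelled out twice. */
#define MY_NEW0(p)      ((p) = calloc(1, sizeof *(p)))
#define MY_CNEW0(p, c)  ((p) = calloc((c), sizeof *(p)))

/* Invented example object. */
typedef struct {
    int start;
    int end;
    double score;
} Hit;

/* Allocate an array of count zeroed Hit objects (NULL on failure). */
Hit *hit_array_new(size_t count)
{
    Hit *hits;

    assert(count > 0);
    MY_CNEW0(hits, count);
    return hits;
}
```

Because the storage comes from calloc, every field of every Hit starts at zero; a malloc-based version would leave them undefined until you assigned them.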

21. CLOSING REMARKS

I hope this guide is enough for most programmers to get to grips with EMBOSS. If you feel there is something you'd like to see added that is within the scope of the document then please contact me. A separate graphics guide is in the pipeline.