Bioinformatics Software Development Course
18th - 20th April, 2006

Hinxton Hall
Wellcome Trust Genome Campus
Hinxton, Cambridge
United Kingdom

Details of Talks



Talk 1 - Introducing EMBOSS

Background information for Practical 1

1. Introduction to EMBOSS

There is a short overview of EMBOSS and its key features which you should read now if you aren't already familiar with the project. EMBOSS is a big project and has much to offer to a diverse group including systems administrators, application programmers, bioinformaticians, training and education experts and, of course, the biologist end-user. This course is primarily for software developers.

You might need to use all the following pages to find background information, detailed information in a specific area or to solve a particular problem:

The user documentation is well worth a visit because it summarises some of the major themes in EMBOSS.

Practical 1 should give all the information on installing and compiling EMBOSS that you'll need for this course. There are, however, some notes below on how to keep your installation up to date and on configuration options, to read once you've done the practical. Further information is in the administrator documentation, especially our very comprehensive Administrators Guide.

2. Keeping your EMBOSS CVS copy up to date

You only need to check out EMBOSS once; after that you only need to update it. It is advisable to update your copy of EMBOSS regularly.

To update, first cd to the emboss directory containing the ajax and nucleus subdirectories. For example, if you checked-out EMBOSS in the directory /home/fred then you should "cd" into the directory /home/fred/emboss/emboss. Then type the following commands:

cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/emboss login
  [password "cvs"]
cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/emboss update -d
cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/emboss update -P
cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/emboss logout
The first ("login") command logs you on to the CVS server. The second ("update -d") ensures you'll pick up any new directories the core development team have added. The third ("update -P") prunes any directories left empty once the core development team have removed obsolete files from them. The last command ("logout") logs you off the server.
Note: The -d and -P flags are case-sensitive. The -d that follows "cvs" names the repository and is a different option from the -d given to "update".

3. Associated libraries and compilation

On Linux systems (including those used on this course) and many other systems, most support libraries are installed in /lib, /usr/lib, /usr/X11R6/lib etc. However, whereas Linux distributions include RPMs for libgd, libpng and libz (which are required for PNG support in EMBOSS), other operating systems do not.
If you are installing these libraries and include files somewhere other than /usr then you must specify their location when configuring. Assuming you have installed them in the /usr/local area (i.e. in /usr/local/lib and /usr/local/include) you would add the following switch to the configuration command line:

    --with-pngdriver=/usr/local


Talk 2 - Navigating EMBOSS

Background information for Practical 2

1. Meaning of AjSStr, AjOStr, AjPStr

AjSStr is the formal name of the string object, AjOStr is the datatype name for the object whereas AjPStr is the datatype name for the object pointer.

AjOStr could, in principle, be used in code to create an instance of the object in memory, but in practice AjPStr is used (memory is allocated to the pointer - more on this later). You'll notice in SRS that "AjPStr" is given after "Name" and that's because AjSStr and AjOStr are never really used in the code (other than in the object definition). For the sake of brevity, we often say "object" (to refer to an AjPStr for example) when what we really mean is "object pointer". Watch out for that in the course so that you're clear about what we're referring to.



Talk 3 - Your first EMBOSS application

Background information for Practical 3

1. Programming guide

For your future reference, there is a guide to programming EMBOSS applications. However, during this course (or the first day at least) you should stick to the material in the talks and practicals.

2. helloworld in C

Here's the source code of a C program that prints "Hello, World!" to the screen.

#include <stdio.h>
int main(void)
{
  printf("Hello, World!\n");
  return 0;
}
The first line is a preprocessor directive telling the compiler to include the 'header file' stdio.h. The program consists of a single function (main) which does not have any parameters and has an integer return type; in this case it returns 0 to the operating system after printing "Hello, World!" to the screen.
Let's say we saved the source code to a file called helloworld.c. To get an executable (runnable) version of the program you have to compile the source. Typing one of the following lines would do it:

Unix % gcc helloworld.c -o helloworld        (using gcc, an ANSI C compiler)

Unix % cc helloworld.c -o helloworld (using cc, the default C compiler)
Presuming there are no errors during compilation you will end up with an executable file called 'helloworld'. If you omitted '-o helloworld' the executable would be called 'a.out'. To run your program you simply type 'helloworld' at the Unix prompt (or './helloworld' if the current directory is not in your PATH):

Unix % helloworld

Hello, World!
Unix %

3. helloworld in EMBOSS

A few more steps are involved in EMBOSS. The first thing to understand is that, in addition to writing the source code, you must also write an ACD file for your new application. ACD files control all of the user input operations. All of the parameters required for an application are prompted for before the application proper begins. The input values are read and held in memory, files are opened as required and so forth, so that all the parameters are available when the application proper starts. An EMBOSS application cannot ask the user for more information after several hours of processing!

It's good practice to write your ACD file before the source code because this forces you to think closely about the application inputs and outputs and exactly what's required from the user. You should then test the ACD file by using an EMBOSS application called acdc (more on this shortly). Finally, the application is added to EMBOSS, compiled and run.

So, the basic steps involved in writing your first EMBOSS application are as follows:

  1. Decide on inputs and outputs
  2. Write ACD file
  3. Test ACD file
  4. Write source code
  5. Add application to EMBOSS
  6. Compile
There are additional steps essential to any software project:
  • Design (think about problem, design software)
  • Debugging (getting it working)
  • Testing (ensuring it works correctly under all conditions)
  • Documentation (describing how it works)
There'll be more on these basic elements of software engineering later in the course. For now, just remember that these steps are as essential as the programming itself.

1) Decide on inputs and outputs
To begin, the input and output is trivial. All the program has to do is print "Hello, World!" to the screen and so nothing is required from the user.

2) Write ACD file
It's no surprise then that the ACD file is pretty sparse:

application: helloworld
[
    documentation: "My first EMBOSS program"
]
Every ACD file must have the file extension .acd and it's sensible (but not mandatory) that the filename (without .acd extension) is identical to the application name.

Every ACD file must contain an application definition, and this should come first in the file. The definition consists of the application: token, followed by the application name and a block of attributes between square brackets. The definition above contains a single documentation: attribute. The text should be a succinct description of the program and will be printed to screen when the program is run.

If the documentation attribute is missing, a warning will be issued when you run the program. ACD files will be covered in great detail later on.

3) Test ACD file
Testing the ACD file is easy. You simply run acdc with your application name as a parameter on the command line:

Unix % acdc helloworld

My first EMBOSS program
Unix %
acdc reads helloworld.acd and reads in any required data just as if the application itself was running. It will also test anything you add on the command line. In this case there is no required data and nothing else on the command line, and all is well.

4) Write source code
Happy in the knowledge we have a working ACD file we can turn to the C source code itself, which looks something like this:

/* @source helloworld application
**
** @author: Copyright (C) Arthur Geek (ageek@ebi.ac.uk)
**                        
** @@
**
** This program is free software; you can redistribute it and/or
** modify it under the terms of the GNU General Public License
** as published by the Free Software Foundation; either version 2
** of the License, or (at your option) any later version.
** 
** This program is distributed in the hope that it will be useful,
** but WITHOUT ANY WARRANTY; without even the implied warranty of
** MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
** GNU General Public License for more details.
** 
** You should have received a copy of the GNU General Public License
** along with this program; if not, write to the Free Software
** Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
******************************************************************************/

#include "emboss.h"

/* @prog helloworld **********************************************************
**
** Prints "Hello, World!" to the screen.
**
******************************************************************************/
int main(int argc, char **argv)
{  
      embInit("helloworld", argc, argv);

      ajFmtPrint("Hello, World!\n");

      ajExit();
      return 0;
}

The source begins with a comment block including the copyright notice, disclaimer and author's contact details. EMBOSS applications are licensed under the GNU General Public License so it's essential that these comments are included in the source.
Next we have the preprocessor directive #include "emboss.h". In contrast to #include <stdio.h>, this imports the entire EMBOSS interface, i.e. it makes all the EMBOSS library calls available to you. This must be included at the start of every EMBOSS program. If we look inside the file we see eventually that stdio.h is itself included:

Unix % more nucleus/emboss.h 

#ifndef emboss_h
#define emboss_h

#include "ajax.h"
#include "ajgraph.h"
#include "embaln.h"
#include "embcom.h"
#include "embcons.h"
#include "embdbi.h"
.
.


Unix % more ajax/ajax.h

#ifdef __cplusplus
extern "C"
{
#endif
#ifndef ajax_h
#define ajax_h

#include "ajarch.h"
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "ajassert.h"
#include "ajdefine.h"
#include "ajstr.h"
#include "ajtime.h"
#include "ajfile.h"
.
.

Every function, including main(), in an EMBOSS application must be documented. Undocumented code often has little value, with the exception of code that is self-explanatory. Even then it's helpful to provide at least basic documentation.
EMBOSS uses a standard format for function documentation which you'll learn about later. For now, you need to know that the @prog token is used for documenting the main() function and @source is used for the application name in header documentation. These tokens are read by a program that parses the source code and automatically generates documentation that goes on the web and in SRS.

The source code proper begins with the main() function. The command line must be available to the program, so main() must be declared with parameters to receive it: int main(int argc, char **argv).

Three calls to the EMBOSS libraries are made. All EMBOSS applications must contain a call to embInit. This is the function that handles all of the user input processing and so must be called right at the start of the application. embInit reads in local database definitions, finds the right ACD file to use (the first argument is "helloworld" so it looks for "helloworld.acd" in the ACD directory) and processes the command line (it uses argc and argv from main). If our ACD file was more complicated, and required a sequence as input and a file as output for example, then by the time the call returned it would have read in the sequence and put it somewhere in memory and also opened the output file. ajFmtPrint is used to print text to the screen. ajExit calls some internal clean-up routines before calling exit with a successful code (zero).

5) Add application to EMBOSS
Once we have our C source code and an ACD file, we must add our application to EMBOSS before being able to compile it.
EMBOSS includes two files both called 'Makefile.am' which together contain information about each C source file and ACD file known to EMBOSS. To add helloworld to EMBOSS you must therefore edit these files. Assuming you checked EMBOSS out into /home/fred/emboss you have the following directories for the .c and .acd files respectively:

  /home/fred/emboss/emboss/emboss      /* The 'executables directory'  (for C source files and executables) */
  /home/fred/emboss/emboss/emboss/acd  /* The 'acd directory' (for ACD files)                               */

The files we have to edit are :

  /home/fred/emboss/emboss/emboss/Makefile.am
  /home/fred/emboss/emboss/emboss/acd/Makefile.am

The Makefile.am in the executables directory contains information about each C source file. First, you must add your program name to the "bin_PROGRAMS" list. This is usually done in alphabetical order. The before and after editing stages will look something like this:

i) Before editing

bin_PROGRAMS = aaindexextract abiview acdc antigenic \
...
garnier geecee getorf helixturnhelix hmoment \
...

ii) After editing

bin_PROGRAMS = aaindexextract abiview acdc antigenic \
...
garnier geecee getorf helixturnhelix helloworld hmoment \
...

N.B. Line continuation characters ('\') must be explicitly added (it's bad practice to have lines longer than 80 characters).

Secondly you must add the name of your source file to a SOURCES line. Again this is usually done alphabetically. The before and after appearance of the file would be as follows, given the above example.

iii) Before editing

...
geecee_SOURCES = geecee.c
getorf_SOURCES = getorf.c
helixturnhelix_SOURCES = helixturnhelix.c
hmoment_SOURCES = hmoment.c
iep_SOURCES = iep.c
infoalign_SOURCES = infoalign.c
...

iv) After editing

...
geecee_SOURCES = geecee.c
getorf_SOURCES = getorf.c
helixturnhelix_SOURCES = helixturnhelix.c
helloworld_SOURCES = helloworld.c
hmoment_SOURCES = hmoment.c
iep_SOURCES = iep.c
infoalign_SOURCES = infoalign.c

The Makefile.am in the ACD directory contains information about each ACD file. All that needs to be done for this file is to add the name of the new ACD file. Again, it is usual to do this alphabetically. Here are the before and after positions:

i) Before editing

pkgdata_DATA = codes.english \
        aaindexextract.acd abiview.acd ajbad.acd ajfeatest.acd ajtest.acd \
...
        garnier.acd geecee.acd getorf.acd helixturnhelix.acd hmoment.acd \
        histogramtest.acd iep.acd infoalign.acd infoseq.acd isochore.acd \
        lindna.acd listor.acd \
        marscan.acd maskfeat.acd maskseq.acd \
        matcher.acd

ii) After editing

pkgdata_DATA = codes.english \
        aaindexextract.acd abiview.acd ajbad.acd ajfeatest.acd ajtest.acd \
...
        garnier.acd geecee.acd getorf.acd helixturnhelix.acd helloworld.acd \
        hmoment.acd histogramtest.acd iep.acd infoalign.acd infoseq.acd \
        isochore.acd lindna.acd listor.acd \
        marscan.acd maskfeat.acd maskseq.acd \
        matcher.acd

Again, line continuation characters ('\') must be added explicitly.

6) Compile

After the two Makefile.am files have been correctly edited, you can compile the application by typing "make helloworld" from the executables directory (quickest way) or "make" from either the executables or second emboss directory (slower as this will compile everything).

The GNU tools will recognise whether the Makefile.am files have been edited and reconstruct the Makefile files when a "make" command is given. It is bad practice to edit the Makefile files themselves.

Unix % pwd 
/home/fred/emboss/emboss/emboss/

Unix % make helloworld
/bin/sh ../libtool --tag=CC --mode=link gcc  -O2 -Wall -fno-strict-aliasing   -o helloworld  helloworld.o 
../nucleus/libnucleus.la ../ajax/libajaxg.la ../ajax/libajax.la ../plplot/libplplot.la -L/usr/X11R6/lib 
-lX11  -lm -lgd -lpng -lz -lm

gcc -O2 -Wall -fno-strict-aliasing -o .libs/helloworld helloworld.o  ../nucleus/.libs/libnucleus.so 
../ajax/.libs/libajaxg.so ../ajax/.libs/libajax.so ../plplot/.libs/libplplot.so -L/usr/X11R6/lib -lX11 
-lgd -lpng -lz -lm -Wl,--rpath -Wl,/home/jison/emboss_test_installation_for_course/emboss/lib

creating helloworld

And at last:

Unix % helloworld
My first EMBOSS program
Hello, World!
Unix % 


Talk 4 - Introduction to Objects Using Strings

Background information for Practical 4

1. EMBOSS objects

EMBOSS borrows the concept of objects from C++. In C++, an object can be thought of as a 'black box' which takes a defined input and produces a defined output. An object stores its own data (member data) and knows how to perform certain actions (via member functions). From the perspective of the user of the object, it doesn't matter what is going on inside so long as the interface, i.e. the inputs and outputs, remains the same. The use of objects allows the programmer to model their program on the problem, breaking it down into smaller, easily managed pieces.

It's important to make the distinction between an object and an instance of an object. Strictly speaking, an object is a definition, or a template for instances of that object. The instance is the actual thing that can be manipulated. If you want to do anything you must create an instance, i.e. instantiate the object.

The objects in EMBOSS are C data structure definitions. Elements in the structure are the member data. There are no member functions as such. However, all the functions that use an object are documented above the structure definitions themselves, and are accessible from the web and SRS. EMBOSS therefore maintains the link between the data and the functions that act upon it. You can see that for yourself by inspecting a library header file, e.g. ajstr.h.

EMBOSS objects should be considered as black boxes. For instance, you should never access the elements of an instance directly; that is what the library functions are there for. The string object is one of the simplest, so let's have a look at it now:

typedef struct AjSStr 
{
  ajint Res;
  ajint Len;
  ajint Use;
  char *Ptr;
} AjOStr;
#define AjPStr AjOStr*
typedef AjPStr* AjPPStr;

C programmers will see that we have defined a structure called AjSStr of four elements (Res, Len, Use and Ptr) and have created three new datatype names: AjOStr for the object itself, AjPStr for the object pointer, and AjPPStr for a pointer to an AjPStr.

Don't worry about the elements of the structure for now; we'll come on to them in a bit.

2. Managing objects

In principle, it's possible to instantiate the object in this way:

#include "emboss.h"
int main(void)
{
    AjOStr  my_structure;
}
This approach is NOT taken in EMBOSS however because it does not give the programmer the freedom to manage the memory of the object. Even if you only need one structure, you should avoid using AjOStr because it would be inconsistent with the rest of EMBOSS.

The standard way to instantiate an object is to dynamically allocate memory to the object pointer. It's for this reason (and because AjOStr is almost never used) that we often say 'object' instead of the more cumbersome 'object pointer' when referring to an AjPStr etc - so make sure you understand the difference. All AJAX pointers start with "AjP" so it's easy to guess that a sequence object is an AjPSeq.

All objects should be allocated dynamically and freed once you're done with them. This is easy because a constructor function (for memory allocation) and destructor function (for freeing memory) are provided for every type of object. Here is a code snippet illustrating this:

#include "emboss.h"
int main(int argc, char **argv)
{
    AjPStr  my_pointer=NULL;
  
    embInit("helloworld", argc, argv);

    my_pointer = ajStrNew();

    ajStrAssC(&my_pointer, "Hello, World!\n");
    ajFmtPrint("%S", my_pointer);
  
    ajStrDel(&my_pointer);

    ajExit();
    return 0;
}
AjPStr my_pointer=NULL; declares the object pointer and initialises it to NULL. Pointers should always be set to NULL when they are declared because EMBOSS functions presume that non-NULL pointers have had memory allocated to them. If you do not set the pointer to NULL, it may receive some junk value when the program runs and any function that uses it might mistakenly think memory had been allocated for it - which might lead to a segmentation fault!
ajStrNew() is the constructor function. This allocates a block of memory for the object and returns the memory address of the allocated block. The memory address is held in the variable my_pointer. Disregard the calls to ajStrAssC and ajFmtPrint for the time being.
ajStrDel() is the destructor function; this must not only free the memory but also set the pointer back to NULL so that it is ready for re-use. You see we pass it the address of my_pointer. You may be wondering, as my_pointer is a pointer anyway, why do we need to pass a pointer to it to get the memory freed? The answer is simple if you remember that in C, function arguments are passed "by value". A temporary copy of each argument is created and passed to the function, rather than the originals. While a copy of the pointer would be enough to free the memory that is pointed to, we need a handle on (a pointer to) the original if we want to set the original to NULL, hence the requirement for &my_pointer.
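
The pass-by-value point can be tried out in plain C without EMBOSS. The following is a minimal sketch only: demoNew, demoFreeOnly and demoDel are hypothetical names invented for this illustration and are not part of AJAX, and a bare char * stands in for the object.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical constructor (not an AJAX function): returns a heap copy
   of text, or NULL if memory could not be allocated. */
char *demoNew(const char *text)
{
    char *thys = malloc(strlen(text) + 1);

    if (thys)
        strcpy(thys, text);
    return thys;
}

/* Receives a COPY of the caller's pointer. The memory is freed, but the
   caller's own variable still holds the old, now invalid, address. */
void demoFreeOnly(char *copy)
{
    free(copy);
    copy = NULL;        /* changes the local copy only */
}

/* Receives the ADDRESS of the caller's pointer, so it can free the
   memory and also set the caller's variable back to NULL. */
void demoDel(char **pthis)
{
    assert(pthis != NULL);
    free(*pthis);       /* free(NULL) is a harmless no-op */
    *pthis = NULL;
}
```

With char *p = demoNew("Hello"); a call to demoDel(&p) leaves p equal to NULL, so a second demoDel(&p) is harmless. After demoFreeOnly(p), by contrast, p would still hold the stale address and must not be used again.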

3. The golden rule

For consistency, all functions in EMBOSS use the following rule:

If the function is to change the object or the data pointed to in any way, the address of the object pointer must be passed.

If the function merely reads the data pointed to, just pass in the object pointer.

Armed with this rule, we can see that ajStrAssC, which copies text into an AjPStr, must receive the address of my_pointer, whereas ajFmtPrint which merely prints an AjPStr only requires my_pointer.

The above rule is one of the most important keys to coding in EMBOSS. There'll be more on pointers and objects later in the course, so don't worry if you don't fully understand what's going on yet - just remember and apply the rule and you'll be fine. The library functions are well documented as to whether an object or address of an object is required so you can always find out what's needed if you're not sure.
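
To see why a rule of this kind is useful, here is a minimal sketch in plain C. The ToySStr type and the toyStr functions are hypothetical stand-ins written for this illustration; they are not the AJAX implementation. The reading function takes the object pointer, while the modifying functions take its address because they may need to allocate the object or reset the caller's pointer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy string object (hypothetical, for illustration only). */
typedef struct ToySStr
{
    size_t Len;
    char *Ptr;
} ToyOStr;
typedef ToyOStr *ToyPStr;

/* Reads only: the object pointer itself is enough. */
size_t toyStrLen(const ToyOStr *thys)
{
    return thys ? thys->Len : 0;
}

/* Modifies: takes the address of the object pointer so that it can
   allocate the object when *pthis is NULL. Returns 1 on success and
   0 if memory could not be allocated. */
int toyStrAssignC(ToyPStr *pthis, const char *text)
{
    size_t len;
    char *copy;

    assert(pthis != NULL && text != NULL);
    len = strlen(text);
    copy = malloc(len + 1);
    if (copy == NULL)
        return 0;
    memcpy(copy, text, len + 1);

    if (*pthis == NULL)
    {
        *pthis = malloc(sizeof(ToyOStr));
        if (*pthis == NULL)
        {
            free(copy);
            return 0;
        }
        (*pthis)->Ptr = NULL;
    }
    free((*pthis)->Ptr);            /* discard any previous text */
    (*pthis)->Ptr = copy;
    (*pthis)->Len = len;
    return 1;
}

/* Destroys: needs the address so it can set the caller's pointer to NULL. */
void toyStrDel(ToyPStr *pthis)
{
    assert(pthis != NULL);
    if (*pthis)
    {
        free((*pthis)->Ptr);
        free(*pthis);
        *pthis = NULL;
    }
}
```

A caller would declare ToyPStr s = NULL; then call toyStrAssignC(&s, "Hello, World!") to change it, toyStrLen(s) to read it and toyStrDel(&s) to destroy it.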

4. Elements of the string object

Now back to the AjPStr:

typedef struct AjSStr {
  ajint Res;
  ajint Len;
  ajint Use;
  char *Ptr;
} AjOStr;
#define AjPStr AjOStr*
typedef AjPStr* AjPPStr;
The char *Ptr; is just a standard C pointer which holds a character string and the ajint Len; is its length. The character string may or may not be NUL-terminated; the library functions for printing AjPStr objects use the length field to decide how many characters to print, so they won't stop at the first NUL character ('\0') if there is one.
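
The difference between a length-driven routine and an ordinary C string routine can be sketched in plain C. RawSStr and both functions below are hypothetical and written only for this illustration; they are not the AJAX printing code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy length-counted string (hypothetical, for illustration only). */
typedef struct RawSStr
{
    size_t Len;
    const char *Ptr;
} RawOStr;

/* Length-driven output: writes exactly Len bytes to outf, whether or not
   some of them are '\0'. Returns the number of bytes written. */
size_t rawStrWrite(const RawOStr *thys, FILE *outf)
{
    assert(thys != NULL && outf != NULL);
    return fwrite(thys->Ptr, 1, thys->Len, outf);
}

/* What a NUL-driven C routine would see: the number of bytes before the
   first '\0', or Len if the data contains no '\0' at all. */
size_t rawStrCLen(const RawOStr *thys)
{
    const char *nul;

    assert(thys != NULL);
    nul = memchr(thys->Ptr, '\0', thys->Len);
    return nul ? (size_t) (nul - thys->Ptr) : thys->Len;
}
```

For the five bytes 'a', 'b', '\0', 'c', 'd', rawStrWrite emits all five whereas rawStrCLen reports only two, which is where a routine such as printf with %s would stop.
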
The ajint Res; element lets the library know internally how much reserved dynamic memory is associated with the object. This is always at least equal to Len but is often more. Res is, and should be, outside your direct control. If you use a library call to add anything to the string and the result fits within the memory given by Res, the operation is performed immediately; if the memory required is larger than Res, more memory is allocated and Res is updated. A little more memory than required is usually allocated.
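
A minimal sketch of this reserve-and-grow behaviour in plain C follows. GrowSStr, growStrAppendC and growStrDel are hypothetical names, and the choice to reserve twice what is needed is an assumption made for the example; the notes above only say that a little more than required is usually allocated.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy string with a reserved size (hypothetical, for illustration only). */
typedef struct GrowSStr
{
    size_t Res;     /* bytes reserved for Ptr               */
    size_t Len;     /* bytes in use, excluding the '\0'     */
    char *Ptr;
} GrowOStr;
typedef GrowOStr *GrowPStr;

/* Appends text, enlarging the reserved block only when the result would
   not fit. Returns 1 on success, 0 if memory could not be allocated. */
int growStrAppendC(GrowPStr *pthis, const char *text)
{
    GrowPStr thys;
    size_t add;
    size_t need;

    assert(pthis != NULL && text != NULL);
    if (*pthis == NULL)
    {
        *pthis = malloc(sizeof(GrowOStr));
        if (*pthis == NULL)
            return 0;
        (*pthis)->Res = 0;
        (*pthis)->Len = 0;
        (*pthis)->Ptr = NULL;
    }
    thys = *pthis;

    add = strlen(text);
    need = thys->Len + add + 1;         /* +1 for the trailing '\0' */
    if (need > thys->Res)
    {
        size_t newres = need * 2;       /* assumed policy: reserve extra */
        char *bigger = realloc(thys->Ptr, newres);

        if (bigger == NULL)
            return 0;
        thys->Ptr = bigger;
        thys->Res = newres;
    }
    memcpy(thys->Ptr + thys->Len, text, add + 1);
    thys->Len += add;
    return 1;
}

/* Frees the object and resets the caller's pointer to NULL. */
void growStrDel(GrowPStr *pthis)
{
    assert(pthis != NULL);
    if (*pthis)
    {
        free((*pthis)->Ptr);
        free(*pthis);
        *pthis = NULL;
    }
}
```

Appending "Hello" to an empty string reserves 12 bytes under this policy, so a following append of five more characters fits without any further allocation.
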
ajint Use; is the string usage counter. Sometimes you'll want two or more references to a single string rather than making a genuine copy. EMBOSS functions that do this increment the string's usage counter. The usage counter is decremented when a call is made to destroy the string itself or a reference to it. When the usage counter reaches zero the object is deleted. All of this, of course, is inside the "black box", which means you don't need to worry about it so long as you don't play with the object internals directly. The important message is:

If you intend altering the contents of an object then safety is guaranteed if you use the available library functions.
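
For the curious, the usage-counter idea can be sketched in a few lines of plain C. RefSStr and the refStr functions are hypothetical and written for this illustration; they show the general technique of reference counting, not the AJAX source.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy reference-counted string (hypothetical, for illustration only). */
typedef struct RefSStr
{
    int Use;        /* number of references to this object */
    char *Ptr;
} RefOStr;
typedef RefOStr *RefPStr;

/* Constructor: a new string held by exactly one reference. */
RefPStr refStrNewC(const char *text)
{
    RefPStr thys;

    assert(text != NULL);
    thys = malloc(sizeof(RefOStr));
    if (thys == NULL)
        return NULL;
    thys->Ptr = malloc(strlen(text) + 1);
    if (thys->Ptr == NULL)
    {
        free(thys);
        return NULL;
    }
    strcpy(thys->Ptr, text);
    thys->Use = 1;
    return thys;
}

/* Shares rather than copies: increments the counter and returns the
   same object. */
RefPStr refStrRef(RefPStr thys)
{
    assert(thys != NULL);
    thys->Use++;
    return thys;
}

/* Drops one reference. The memory is freed only when no references
   remain; the caller's pointer is always reset to NULL. */
void refStrDel(RefPStr *pthis)
{
    RefPStr thys;

    assert(pthis != NULL);
    thys = *pthis;
    if (thys != NULL && --thys->Use == 0)
    {
        free(thys->Ptr);
        free(thys);
    }
    *pthis = NULL;
}
```

If a is created with refStrNewC and b = refStrRef(a), then refStrDel(&b) only lowers the counter; the text stays valid through a until refStrDel(&a) is called as well.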



Talk 5 - ACD Files : Basic Skills

Background information for Practical 5

The notes below should describe everything you need to do this practical. If you need further information on ACD files, more comprehensive documentation is available, and will serve as a reference for the rest of the course. Please note there might be some slight differences in the nomenclature used.

1. Introduction to ACD files


The ACD (Ajax Command Definitions) file describes the data that the program needs to run: all the program parameters, including things such as input and output files. It indicates whether data items are mandatory or not and whether certain data have to be within limits; a gap penalty for an alignment must be higher than 0 for instance. It can also indicate whether the value of one data item is dependent on the value or the presence of another. For example, if the input sequence for an alignment program is DNA, it should not accept a protein comparison matrix. We'll cover some of these issues in this session and others later in the course.

2. General ACD syntax

All ACD definitions ("data items") have the same general syntax which is:

datatype: label 
[ 
attributes 
] 

The available datatypes are pre-defined and include sequence, integer, float and many other types. There is a complete list of supported datatypes.

The label can be anything you like within reason. This is the name by which the data item is referred to both from within your C source code and on the command line. The programmer must have a handle on each ACD data item from within the C source code and the label is used for this purpose, as you'll see later. On the command line, the value of a data item can be set if the label is specified. For this reason the label can be referred to as the qualifier name.

The attributes allow you to specify such things as informative text, default values, maxima and minima and so on. Global attributes apply to all datatypes, whereas others are datatype-specific and are assigned values (calculated in some cases) after the data item is validated, for example once a sequence has been read in from file. Here is a complete list of the supported global and datatype-specific attributes.

Comment lines can be added to an ACD file; each comment line begins with '#'. Any whitespace in an ACD file is ignored.

3. Application group

The application definition was mentioned in Practical 3 and supports various attributes. Today, we're interested in the groups attribute which associates the application with programs which do similar things or different things in the same general area. The groups attribute is used by the seealso application which takes the name of an existing program in EMBOSS and gives a list of the programs which share some functionality with it. Please refer to the list of valid group names. The group names there can be preceded by 'DNA:' or 'PROTEIN:' if appropriate; 'PROTEIN: Alignment consensus' is valid for example.

4. Retrieving ACD values

Let's reiterate that for EMBOSS applications, all input is read and held in memory before the application proper starts. What this means in terms of C source code is that you've made a call to embInit which, amongst other things, will have read all the data items and placed them in memory by the time it returns.

The ajAcdGet family of functions is used to retrieve values for ACD data items. These functions are in the library file ajacd.c that is summarised here.

5. Qualifiers and parameters

As mentioned above, the value of a data item can be specified by the user on the command line and the label (or qualifier name) is used for this purpose. In fact, each ACD data item can be specified to be one of the following:
  • Parameter: "Y"
  • Standard: "Y" (this used to be 'required')
  • Additional: "Y" (this used to be 'optional')

Parameter: "Y" means that the data item is a parameter, i.e. you do not have to use the data label to specify a value for it on the command line. e.g. "myprog 10".

Standard: "Y" and Additional: "Y" mean that the data item is a qualifier, i.e. you DO have to use the data label to specify a value for it on the command line. e.g. "myprog -somevalue 10".

Values for parameters and standard qualifiers are always prompted for (with their default value) if not specified on the command line.
Values for additional qualifiers are not prompted for (a default value will be used) unless '-options' is given on the command line. A default value for additional qualifiers should always be given in the ACD file!

The standard and additional qualifiers also differ in where the information from the built-in "-help" qualifier is shown. Help information for 'standard' qualifiers appears in the 'required' section of the help display. Help information for 'additional' qualifiers appears in the 'optional' section of the help display.

If none of Parameter: "Y", Standard: "Y" or Additional: "Y" are specified then the data item defaults to an Advanced qualifier. An advanced qualifier is never prompted for and would appear in the "advanced" section of the documentation if the program was run with the '-help' qualifier. To specify a value for an advanced qualifier on the command line you must use the data label.

You wouldn't normally specify Parameter: "N", Standard: "N" or Additional: "N". The "Y" in the previous definitions is given for consistency because every ACD attribute, being a label:value pair, has to have a value. You'll see later, however, that the "N" can in exceptional circumstances be used to override the default behaviour of these attributes.
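
The difference between "myprog 10" and "myprog -somevalue 10" can be made concrete with a toy command-line scanner in plain C. This is a sketch only: in a real application embInit does all of this for you, demoQualifierValue and demoParameter are hypothetical names, and the sketch assumes every qualifier is written as "-label" followed by a separate value.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Qualifier: the value is found by its label. Returns the argument that
   follows "-label", or NULL if the label is not on the command line. */
const char *demoQualifierValue(int argc, char **argv, const char *label)
{
    int i;

    assert(argv != NULL && label != NULL);
    for (i = 1; i + 1 < argc; i++)
        if (argv[i][0] == '-' && strcmp(argv[i] + 1, label) == 0)
            return argv[i + 1];
    return NULL;
}

/* Parameter: the value is found by its position. Returns the n-th
   (counting from 0) argument that is neither a "-label" nor the value
   belonging to one, or NULL if there is no such argument. */
const char *demoParameter(int argc, char **argv, int n)
{
    int i;

    assert(argv != NULL);
    for (i = 1; i < argc; i++)
    {
        if (argv[i][0] == '-')
        {
            i++;                /* skip the value that follows the label */
            continue;
        }
        if (n-- == 0)
            return argv[i];
    }
    return NULL;
}
```

For "myprog 10" the value 10 is returned by demoParameter(argc, argv, 0), whereas for "myprog -somevalue 10" it is returned by demoQualifierValue(argc, argv, "somevalue") and there is no positional parameter at all.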



Talk 6 - EMBOSS String Handling

Background information for Practical 6

All the information you need is given in the practical.



Talk 7 (Part I) - The AJAX Library & The Software Cycle

Background information for Practical 7 (Part I)

1. Overview of the AJAX library

By now you should be reasonably familiar with the string library. There is however much, much more to AJAX than just strings. AJAX is the core library used by all EMBOSS applications and covers standard data structures, including strings, sequences, file handles, queues, hashes, heaps, lists, dictionaries, trees and dynamic arrays. It also covers standard algorithms including comparisons, pattern matching, sorting, and iterators.

You can't hope to cover everything, but you'll need to become familiar with several other library files to do the practicals today. The main libraries you'll need are:

  • ajacd - controls all aspects of processing the AJAX command definition syntax (ACD files), command line handling and prompting of the user.
  • ajfile - input and output data types are provided for handling files and directories of files. Includes convenience functions for reading data files (which are typically kept in specific directories), functions for parsing file names, paths and extensions from a full file name and buffering of lines so that files can be read again from memory; useful for files which cannot be rewound like stdin.
  • ajarr - these functions control all aspects of AJAX array handling. This includes dynamic 1, 2 and 3-dimensional arrays of long, double, float, int and short datatypes. There are convenience functions for reading arrays from formatted strings.
  • ajfmt - string formatting functions. These are similar to the C functions printf, fprintf, vprintf etc. AJAX formatting is a superset of ANSI C, with additional types for strings, booleans and dates. Output can be to a string with automatic adjustment of memory allocation. Null pointers are checked for and reported rather than causing failures (for example with %s).
  • ajlist - These functions create and control linked lists, which can be tricky to manage because of the ownership of the data items. Functions are provided for all manner of list manipulations: creating, destroying, appending to, iterating through, popping, peeking, sorting etc.
  • ajseq - Functions for managing individual sequences and databases of sequences (which are processed a sequence at a time). A comprehensive library that provides many useful utilities, e.g. for probing sequence attributes and for handling substitution tables.
  • ajseqread - More sequence handling functions - for reading sequences from formatted input files.
  • ajseqwrite - Yet more sequence handling functions - for formatted output.

2. The Software Development Cycle

"The first step toward the management of disease was replacement of demon theories and humours theories by the germ theory. That very step, the beginning of hope, in itself dashed all hopes of magical solutions. It told workers that progress would be made stepwise, at great effort, and that a persistent, unremitting care would have to be paid to a discipline of cleanliness. So it is with software engineering today." - Fred Brooks, No Silver Bullet.

1) Suggested implementation steps under EMBOSS

There's much more to software than just coding, especially when programming for EMBOSS, which is very widely distributed and deployed in production environments. You should familiarise yourself with the basic steps now, so that you know what's involved before starting a project and so can plan your work. This should help you develop your software efficiently and deliver your projects on time.

As mentioned before, the basic steps to write an EMBOSS application are:

  1. Design (think about problem, design software)
  2. Decide on inputs and outputs
  3. Write application
  4. Write ACD file
  5. Test ACD file
  6. Write source code
  7. Add application to EMBOSS
  8. Compile
  9. Debugging (get it working)
  10. Testing (ensure it works correctly under all conditions)
  11. Documentation (describe how it works)

Each of these steps is essential and takes time, so the first thing that should be obvious is that a software project will take much more time than that required to simply write the code.

2) Consider your users

The steps above would lead to a finished product, but that's only the start of the lifetime of your software. In practice, you must consider your users and there are four fundamental processes in this regard:

  1. Software specification : What are the requirements? What functionality is required? What are the constraints?
  2. Software development : Design the software and implement it, both in consultation with users where necessary.
  3. Software validation : Ensure that the software meets the users' needs.
  4. Software evolution : Your software must evolve to keep pace with your users, whose needs are often evolving.

You should now appreciate that writing the first version of the code is the beginning rather than the end of the project. The four steps above have many implications, especially in terms of additional requirements (e.g. how to survey user feedback) and strategy (e.g. how best to implement for evolving requirements). These issues are beyond the scope of this course, but the take home message is:

Stay in touch with your users (even if you're the user:) at all stages in the software development cycle.

3) Models for developing and releasing software

In terms of developing and releasing software, there is no universal model but the following stages are typical:

  1. Planning
    • Specification of requirements.
    • Design
  2. Pre-Alpha : Software still under development or design, e.g. prototype software. Useful to get a gist of what's to come and for early feedback from users.
  3. Alpha : Software that's quite well developed with major features implemented and major bugs removed. Possibly redesigned from the initial prototype. Usually with very limited deployment, if any.
  4. Beta : The first significant release of the software, used for testing by users and to get feedback. Beta software might not be perfect, e.g. it might not work under all circumstances, but it should be relatively bug-free - the majority (if not all) of the known bugs should be fixed.
  5. Release Candidate : An early release if no significant bugs were reported from the Beta version. Usually features are "frozen" and only bugfixes are allowed from this point.
  6. Release (e.g. Version 1.0) : Fully tested software that's ready to be used in anger by your users.

Of course in practice early releases often contain many bugs and for this reason people are usually wary of software until it has matured over a period of months or even years.

In some cases, usually to adapt the software for evolving requirements, it may be necessary to move the software back into beta or even to start this entire cycle from scratch in cases where a complete redesign is necessary.

4) Software engineering and project management

There is a well-defined standard (IEEE 1074) developed by the Institute of Electrical and Electronics Engineers for creating a software life cycle process. The standard is intended for use by process architects (e.g. project managers) but should be useful for managing and performing software projects. You'll certainly not need it for this course but you might benefit from a quick look now:

IEEE 1074: Standard for Developing Life Cycle Processes

Process group        Processes                          Clause  Activities
Life Cycle Modeling  Selection of a Life Cycle Model
Project Management   Project Initiation                 3.1.3   Map Activities to Software Life Cycle Model
                                                        3.1.4   Allocate Project Resources
                                                        3.1.5   Establish Project Environment
                                                        3.1.6   Plan Project Management
                     Project Monitoring and Control     3.2.3   Analyze Risks
                                                        3.2.4   Perform Contingency Planning
                                                        3.2.5   Manage the Project
                                                        3.2.6   Retain Records
                                                        3.2.7   Implement Problem Reporting Model
                     Software Quality Management        3.3.3   Plan Software Quality Management
                                                        3.3.4   Define Metrics
                                                        3.3.5   Manage Software Quality
                                                        3.3.6   Identify Quality Improvement Needs
Pre-development      Concept Exploration                4.1.3   Identify Ideas or Needs
                                                        4.1.4   Formulate Potential Approaches
                                                        4.1.5   Conduct Feasibility Studies
                                                        4.1.6   Plan System Transition (If Applicable)
                                                        4.1.7   Refine and Finalize the Idea or Need
                     System Allocation                  4.2.3   Analyze Functions
                                                        4.2.4   Develop System Architecture
                                                        4.2.5   Decompose System Requirements
Development          Requirements                       5.1.3   Define and Develop Software Requirements
                                                        5.1.4   Define Interface Requirements
                                                        5.1.5   Prioritize and Integrate Software Requirements
                     Design                             5.2.3   Perform Architectural Design
                                                        5.2.4   Design Data Base (If Applicable)
                                                        5.2.5   Design Interfaces
                                                        5.2.6   Select or Develop Algorithms (If Applicable)
                                                        5.2.7   Perform Detailed Design
                     Implementation                     5.3.3   Create Test Data
                                                        5.3.4   Create Source
                                                        5.3.5   Generate Object Code
                                                        5.3.6   Create Operating Documentation
                                                        5.3.7   Plan Integration
                                                        5.3.8   Perform Integration
Post-development     Installation                       6.1.3   Plan Installation
                                                        6.1.4   Distribution of Software
                                                        6.1.5   Installation of Software
                                                        6.1.7   Accept Software in Operational Environment
                     Operation and Support              6.2.3   Operate the System
                                                        6.2.4   Provide Technical Assistance and Consulting
                                                        6.2.5   Maintain Support Request Log
                     Maintenance                        6.3.3   Reapply Software Life Cycle
                     Retirement                         6.4.3   Notify Users
                                                        6.4.4   Conduct Parallel Operations (If Applicable)
                                                        6.4.5   Retire System
Integral Processes   Verification and Validation        7.1.3   Plan Verification and Validation
                                                        7.1.4   Execute Verification and Validation Tasks
                                                        7.1.5   Collect and Analyze Metric Data
                                                        7.1.6   Plan Testing
                                                        7.1.7   Develop Test Requirements
                                                        7.1.8   Execute Tests
                     Software Configuration Management  7.2.3   Plan Configuration Management
                                                        7.2.4   Develop Configuration Identification
                                                        7.2.5   Perform Configuration Control
                                                        7.2.6   Perform Status Accounting
                     Document Development               7.3.3   Plan Documentation
                                                        7.3.4   Implement Documentation
                                                        7.3.5   Produce and Distribute Documentation
                     Training                           7.4.3   Plan the Training Program
                                                        7.4.4   Develop Training Materials
                                                        7.4.5   Validate the Training Programs
                                                        7.4.6   Implement the Training Program

If you're interested in learning more, there's a nice introduction to software engineering. Here are a few useful definitions of the term:

  • As the usual contemporary term for the broad range of activities that was formerly called programming and systems analysis.
  • As the broad term for the technical analysis of all aspects of the practice, as opposed to the theory of computer programming.
  • As the term embodying the advocacy of a specific approach to computer programming, one that urges that it be treated as an engineering profession rather than an art or a craft, and advocates the codification of recommended practices in the form of software engineering methodologies.
  • Software engineering is "(1) the application of a systematic, disciplined, quantifiable approach to the development, operation, and maintenance of software, that is, the application of engineering to software," and "(2) the study of approaches as in (1)." -- IEEE Standard 610.12.
  • Software engineers model parts of the real world in the software. As the real world changes, software must also change, and software engineering is concerned with the evolution of these models and how they meet changing requirements.

But really the take-home message for today is ...

Take a disciplined, organised step-wise approach to implementing your software.



Talk 7 (Part II) - ACD files : Intermediate Skills

Further background information for Practical 7 (Part II)

1. Parameters and Qualifiers

Within ACD, all application parameters ("parameter" here is used in the computer science sense of the word, i.e. any value that's passed to an application) correspond to an ACD data item and are defined via the appropriate ACD attribute to be one of "parameter", "standard" or "additional" with the default of "advanced", using the following syntax:
Parameter:  "Y"
Standard:   "Y"   (used to be 'required')
Additional: "Y"   (used to be 'optional')

Their behaviour is as follows:

Type                Prompt                                  Flag  Help info.
parameter           Yes                                     No    Required section
standard            Yes                                     Yes   Required section
additional          Yes (-option)* or No (default needed)   Yes   Advanced section
advanced (default)  No                                      Yes   Advanced section


Prompt - whether a value will be prompted for if one is not specified on the command line.
Flag - whether the qualifier name must be specified on the command line.
Help info - where the information from the built-in "-help" qualifier is shown.
* Additional qualifiers will only be prompted for if -option is given on the command line.

2. Summary

Parameter: "Y" means that the data item is a parameter, i.e. you do not have to use the data label to specify a value for it on the command line. e.g. myprog 10
Standard: "Y" and Additional: "Y" mean that the data item is a qualifier, i.e. you DO have to use the data label to specify a value for it on the command line. e.g. myprog -somevalue 10.

Values for parameters and standard qualifiers are always prompted for (with their default value) if not specified on the command line.
Values for additional qualifiers are not prompted for (a default value will be used) unless '-option' is given on the command line. A default value should always be given in the ACD file.

The standard and additional qualifiers also differ in where the information from the built-in "-help" qualifier is shown. Help info. from 'standard' attributes appears in the 'required' section of the help display. Help info. from the 'additional' attribute appears in the 'advanced' section of the help display.

If none of Parameter: "Y", Standard: "Y" or Additional: "Y" are specified then the data item defaults to an Advanced qualifier.
An advanced qualifier is never prompted for and would appear in the "advanced" section of the documentation if the program was run with the '-help' qualifier. To specify a value for an advanced qualifier on the command line you must use the data label.

You'd never normally specify Parameter: "N", Standard: "N" or Additional: "N".
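
The rules in this summary can be condensed into a small decision table. What follows is a minimal sketch in plain C and not code from the AJAX library: the enum and both function names are made up for illustration, and option_given simply stands for "was '-option' on the command line?".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the four ACD data item types (not an AJAX type) */
typedef enum
{
    ITEM_PARAMETER,
    ITEM_STANDARD,
    ITEM_ADDITIONAL,
    ITEM_ADVANCED            /* the default when no attribute is set */
} ItemType;

/* True if the data label (e.g. -somevalue) must be given to set a value */
static bool item_needs_label(ItemType type)
{
    return type != ITEM_PARAMETER;
}

/* True if a value is prompted for when none is on the command line */
static bool item_is_prompted(ItemType type, bool option_given)
{
    switch(type)
    {
        case ITEM_PARAMETER:
        case ITEM_STANDARD:
            return true;
        case ITEM_ADDITIONAL:
            return option_given;
        case ITEM_ADVANCED:
            return false;
    }

    assert(!"unknown item type");

    return false;
}
```

Read this way, only parameters can be given without their label, and '-option' changes the prompting behaviour of additional qualifiers only.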



Talk 7 (Part III) - Sequence Handling

Further background information for Practical 7 (Part III)

1. Uniform Sequence Addresses

The Uniform Sequence Address, or USA, is a standard sequence naming scheme used by all EMBOSS applications.

The USA syntax has the following types:

  • "file"
  • "format::file"
  • "format::file:entry"
  • "dbname:entry"
  • "@listfile"

Where "format" is the database format of a file ("file") you have provided and "entry" is the database entry code. Alternatively an entry can be retrieved from an installed database called "dbname". "listfile" is the name of a file which itself contains a list of file names.

The "::" and ":" syntax is to allow, for example, "embl" and "pir" to be both database names and sequence formats.
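
To make the role of "::" and ":" concrete, here is a rough sketch of how the five USA forms listed above could be told apart. It is written in plain C for illustration only and is not the parser EMBOSS uses: the structure, the function names and the fixed buffer size are all assumptions, and real USAs have further forms (and awkward cases such as file names containing ':') that this ignores.

```c
#include <assert.h>
#include <string.h>

#define USA_PART_MAX 128

/* Hypothetical holder for the pieces of a USA */
typedef struct UsaSParts
{
    char format[USA_PART_MAX];    /* "" if no format was given        */
    char source[USA_PART_MAX];    /* file, list file or database name */
    char entry[USA_PART_MAX];     /* "" if no entry was given         */
    int  is_list;                 /* 1 for "@listfile"                */
    int  is_database;             /* 1 for "dbname:entry"             */
} UsaOParts;

/* Copy at most USA_PART_MAX-1 characters and terminate the result */
static void usa_copy(char *dest, const char *src, size_t len)
{
    if(len >= USA_PART_MAX)
        len = USA_PART_MAX - 1;

    memcpy(dest, src, len);
    dest[len] = '\0';

    return;
}

/* Split a USA string into format, source and entry */
static void usa_split(const char *usa, UsaOParts *parts)
{
    const char *rest  = usa;
    const char *colon = NULL;

    assert(usa != NULL);
    assert(parts != NULL);

    memset(parts, 0, sizeof(*parts));

    if(usa[0] == '@')
    {
        parts->is_list = 1;
        usa_copy(parts->source, usa + 1, strlen(usa + 1));
        return;
    }

    colon = strstr(usa, "::");          /* "format::..." */

    if(colon)
    {
        usa_copy(parts->format, usa, (size_t) (colon - usa));
        rest = colon + 2;
    }

    colon = strchr(rest, ':');          /* "...:entry" */

    if(!colon)
    {
        usa_copy(parts->source, rest, strlen(rest));
        return;
    }

    usa_copy(parts->source, rest, (size_t) (colon - rest));
    usa_copy(parts->entry, colon + 1, strlen(colon + 1));

    /* a single ':' with no format means "dbname:entry" */
    parts->is_database = (parts->format[0] == '\0');

    return;
}
```

Because the format is always followed by "::" and a database name by a single ":", the sketch can treat "embl::myfile.seq" as a file in EMBL format and "embl:X65923" as an entry in a database called embl without any ambiguity.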

2. Sequence formats

Many different sequence formats are supported. You can specify the format of your input file by adding "-sformat format" to the command line or by giving it in the USA (Uniform Sequence Address) of the input filename, e.g.

embl::myfile.seq

The format is not required, however. When reading in a sequence, EMBOSS will guess the sequence format by trying all known formats until one succeeds.
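
The "try each format until one succeeds" idea can be pictured as a table of format names and test functions that is walked in order. The sketch below is plain C written for this page, not the AJAX implementation: the two one-line tests are crude stand-ins (EMBOSS attempts a real read with each format), and every name in it is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A format test returns non-zero if the text looks like that format */
typedef int (*FormatTest)(const char *text);

typedef struct GuessSFormat
{
    const char *name;
    FormatTest  test;
} GuessOFormat;

/* Stand-in test: fasta entries start with '>' */
static int guess_is_fasta(const char *text)
{
    return text[0] == '>';
}

/* Stand-in test: EMBL entries start with an "ID" line */
static int guess_is_embl(const char *text)
{
    return strncmp(text, "ID   ", 5) == 0;
}

/* Formats are tried in the order listed here */
static const GuessOFormat guess_formats[] =
{
    { "fasta", guess_is_fasta },
    { "embl",  guess_is_embl  }
};

/* Return the name of the first format whose test succeeds, or NULL */
static const char *guess_format(const char *text)
{
    size_t nformats = sizeof(guess_formats) / sizeof(guess_formats[0]);
    size_t i        = 0;

    assert(text != NULL);

    for(i = 0; i < nformats; i++)
    {
        if(guess_formats[i].test(text))
            return guess_formats[i].name;
    }

    return NULL;
}
```

One consequence of this design is that the order of the table matters when a file could pass more than one test, which is a good reason to state the format explicitly when you know it.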

When writing out a sequence, EMBOSS will use fasta format by default. The output format can be changed by adding "-osformat format" on the command line or by giving it in the USA (Uniform Sequence Address) of the output filename, e.g.

gcg::myresults.seq

3. Input sequence command-line qualifiers

There are other built-in command-line qualifiers that change the behaviour of the sequence input.

  -sbegin	integer		first base used
  -send		integer		last base used, default=seq length
  -sreverse	boolean		reverse (if DNA)
  -sask		boolean		ask for begin/end/reverse
  -snucleotide	boolean		sequence is nucleotide
  -sprotein	boolean		sequence is protein
  -slower	boolean		make lower case
  -supper	boolean		make upper case
  -sformat	string		input sequence format
  -sopenfile	string		input filename
  -sdbname	string		database name
  -sid		string		entryname
  -ufo		string		UFO features
  -fformat	string		features format
  -fopenfile	string		features file name

4. Output sequence command-line qualifiers

There are other command-line qualifiers that change the behaviour of the sequence output.

  -osformat           string     output sequence file format
  -osextension        string     file name extension
  -osname             string     base file name
  -osdirectory        string     output sequence file directory
  -osdbname           string     database name to add
  -ossingle           bool       create a separate output file for each entry
  -oufo               string     feature file to create
  -offormat           string     features format
  -ofname             string     features file name
  -ofdirectory        string     features output directory

5. Supported output formats

By default, fasta format will be used for output. The output format can be changed by adding "-osformat format" on the command line or by giving it in the USA of the output filename e.g. "embl::myfile.seq". There are many possible output formats which include:
  • "gcg"
  • "gcg8"
  • "embl"
  • "em"
  • "swiss"
  • "sw"
  • "fasta"
  • "pearson"
  • "ncbi"
  • "nbrf"
  • "pir"
  • "genbank"
  • "gb"
  • "gff"
  • "ig"
  • "codata"
  • "strider"
  • "acedb"
  • "experiment"
  • "staden"
  • "text"
  • "plain"
  • "raw"
  • "fitch"
  • "msf"
  • "clustal"
  • "selex"
  • "aln"
  • "phylip"
  • "phylip3"
  • "asn1"
  • "hennig86"
  • "mega"
  • "meganon"
  • "nexus"
  • "nexusnon"
  • "paup"
  • "paupnon"
  • "jackknifer"
  • "jackknifernon"
  • "treecon"
Some of these are obviously synonyms of one another, e.g. "embl" and "em".

6. AJAX library files for sequence handling

Functions    Data types   Description
ajseq        ajseq        General sequence handling
             ajseqdata    Sequence data types
             ajseqabi     Sequence ABI trace data
ajseqdb                   Sequence database access
ajseqread    ajseqread    Sequence reading
ajseqtype                 Sequence types
ajseqwrite   ajseqwrite   Sequence writing


Talk 7 (Part IV) - ACD Files : Advanced Skills

Background information for Practical 7 (Part IV)

If you need further information on ACD files, very comprehensive documentation is available.



Talk 7 (Part V) - Software Consolidation

Background information for Practical 7 (Part V)

1. Coding standards

To ensure consistency in the EMBOSS code, all application and library code that you write should conform to basic standards. These are formally defined in the EMBOSS coding standards document.

Currently, you will probably find exceptions to these standards but in the future they will be enforced (or at least exceptions raised) by software which automatically checks submitted code against the standards.

You should at least familiarise yourself with the standards, most of which concern the layout of code and are summarised below.

Ease of reading

Your code should be easy to read. This is perhaps more important than the code actually working (if it's easy to read then someone else stands a chance of fixing it).

a) Line length limit

No line should be longer than 79/80 characters. Line-wrap is ugly on the screen and sometimes even disappears on printouts.

Style

a) Braces

Matching braces should appear in the same column. Do not use braces unnecessarily.

Indentation

Indentation of 4 characters is recommended. If you find that indentation within nested loops results in many of the lines wrapping then you should check whether the code structure can be improved.

Position of main()

Declare your main() function as the first function after the preamble. This saves people from having to wade through countless functions before they find it. This is also a help in preventing implicit declarations cropping up.

Implicit declarations

Examples of these would be:

extern fubar(int x);
main(int argc, char **argv)
Do not use them. Always specify what a function should return. Having main() as your first function can go some way towards alleviating the problem.

Use of 'int', 'ajint', 'long' and 'ajlong'

EMBOSS assumes an ajint is at least 32 bits. Use ajint instead of 'int' if 32 bits is enough.

EMBOSS assumes an ajlong is at least 64 bits. Use this instead of 'long'. This circumvents any 'long' or 'long long' problems. Of course, if you are using an Alpha box then both your ints and longs will be 64 bits. In this case don't just use 'ajlong' out of laziness as your code will run more slowly on other platforms. Match your datatype to what you need.

Cases where int and long should be used are (e.g.) as parameters to C system library functions.
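
As an illustration of matching the datatype to the need, the sketch below keeps individual sequence lengths in a 32-bit type but accumulates their total in a 64-bit one. It deliberately does not include the AJAX headers: SketchInt and SketchLong are stand-ins built on <stdint.h> so that the example compiles on its own, and in EMBOSS code you would write ajint and ajlong instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for this sketch only; in EMBOSS code use ajint and ajlong */
typedef int_least32_t SketchInt;     /* at least 32 bits */
typedef int_least64_t SketchLong;    /* at least 64 bits */

/* Sum sequence lengths. Each length fits in 32 bits but the total of
** many sequences may not, so only the total uses the wider type. */
static SketchLong sketch_total_length(const SketchInt *lengths, size_t n)
{
    SketchLong total = 0;
    size_t     i     = 0;

    assert(lengths != NULL || n == 0);

    for(i = 0; i < n; i++)
        total += lengths[i];

    return total;
}
```

Two lengths of 2,000,000,000 each fit comfortably in 32 bits, but their sum of 4,000,000,000 does not fit in a signed 32-bit integer, which is exactly the case the wider total is there for.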

Global variables

Do not use them.

Static variables in functions

Functions that contain static variables are not re-entrant, so do not use static variables at all if you'd like the code to work in multi-threaded contexts. Otherwise, do not use them without very strong reason.

Variable declarations

Declare all variables at the top of each function. Don't be lazy, declare one variable per line.

Note that it is easier to read if both datatypes and variables line up.

Always initialise Object pointer variables to NULL. Initialise other datatypes as appropriate in the code. Align initialisations for easy reading.

Consider documenting key variables but do not document individual house-keeping variables, for example, loop counters.
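
One possible reading of these layout rules is sketched below in plain C. The function and its name are invented for the example; the points to look at are the one-per-line declarations at the top, the aligned types, names and initialisations, and the pointer that starts out as NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the G and C characters (either case) in a sequence string */
static size_t layout_count_gc(const char *seq)
{
    const char *cp = NULL;
    size_t      count;
    size_t      len;
    size_t      i;

    assert(seq != NULL);

    cp    = seq;
    count = 0;
    len   = strlen(seq);

    for(i = 0; i < len; i++)
    {
        if(strchr("GCgc", cp[i]))
            count++;
    }

    return count;
}
```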

Precedence of operators

Avoid confusion introduced by using operator precedence.

Organisation

a) Source File

Use the following organisation for source files:

  • includes of system headers
  • includes of local headers
  • type and constant definitions
  • global variables (there should not be any)
  • prototypes
  • functions

b) Header File

In header files, use the following organisation:

  • type and constant definitions
  • external object declarations
  • external function declarations

It is not necessary to make the declarations 'extern', although it is arguably safer to do so. (Current EMBOSS library code doesn't do this.)

Never use nested includes! (Look at Solaris header files to see how not to do things).

Avoid exporting names outside individual C source files; i.e., declare as static every function that you possibly can.

Always use full ANSI C prototypes.

Structures and unions

Always use the EMBOSS method of declaring structures and unions i.e. use typedefs. They must contain a structure name, object name and pointer name even if these are not used. You should only ever need to use the pointer declaration within your code.

Use AjS, AjO, AjP or EmbS, EmbO and EmbP as appropriate.

Use of this convention i.e. 'P' for pointers avoids problems with these abstract datatypes where a pointer could appear to be an object in its own right.
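
For an application-level object the same pattern might look like the sketch below. The "Myprog" prefix, the member names and the helper function are hypothetical and chosen only to show the structure (S), object (O) and pointer (P) names side by side; library code would use the Aj or Emb prefixes as described above.

```c
#include <assert.h>
#include <stddef.h>

/* Structure name (S) and object name (O) in one typedef */
typedef struct MyprogSHit
{
    int    start;    /* first matching position */
    int    end;      /* last matching position  */
    double score;
} MyprogOHit;

/* Pointer name (P): the only one most code needs to mention */
typedef MyprogOHit *MyprogPHit;

/* Functions take and return the pointer type */
static int myprog_hit_length(MyprogPHit hit)
{
    assert(hit != NULL);

    return hit->end - hit->start + 1;
}
```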

Use of the preprocessor

For constants, consider using:

enum { Red = 0xF00, Blue = 0x0F0, Green = 0x00F };
static const float pi = 3.14159265358;

instead of #defines, which are rarely visible in debuggers.

Macros should avoid side effects. If possible, mention each argument exactly once. Put parentheses around all arguments. When the macro is an expression, put parentheses around the whole macro body. If the macro is the inline expansion of some function then capitalise the name and precede it with 'M'.

Try to write macros so that they are syntactically expressions i.e. so you can put a semi-colon at the end of their use.
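
A small example of a macro written to these rules is sketched below. The temperature conversion is an arbitrary choice and both names are made up; the 'M' prefix on a capitalised name is one reading of the naming rule above, so check the coding standards document for the exact form.

```c
#include <assert.h>

/* Function form, with a sanity check the macro does not have */
static double temp_to_celsius(double fahrenheit)
{
    assert(fahrenheit >= -459.67);    /* absolute zero */

    return (fahrenheit - 32.0) * 5.0 / 9.0;
}

/* Inline expansion of the function above. The argument is mentioned
** exactly once and is parenthesised, and the whole body is
** parenthesised because the macro is an expression. */
#define MTEMP_TO_CELSIUS(f) (((f) - 32.0) * 5.0 / 9.0)
```

The parentheses around the argument are what make MTEMP_TO_CELSIUS(30.0 + 2.0) behave like the function call; without them the subtraction would apply to 2.0 only.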

The preprocessor can be used for:

  • System dependent code (but see the ajsys corollary above.)
  • Commenting out code with #if 0
  • Conditionals arising from system limits

Function names

  • All function names should start with a lower case character.
  • All AJAX exported functions should begin with 'aj'.
  • All NUCLEUS exported functions should begin with 'emb'
  • All application functions should have the application name (followed by an underscore) prepended.
  • Once the current code refactoring exercise is complete, names for AJAX and NUCLEUS functions must adhere to strict rules that are defined in the header files.

Argument names

  • Function arguments should have meaningful or at least intuitive names but should not be too long.
  • Once the current code refactoring exercise is complete, argument names must adhere to strict rules that are defined in the header files.

Loops

Try to avoid the use of 'continue'.

Goto

Do not use it.

Memory and functions

Do not define large arrays as declarations within functions. They will go on the stack and can cause many problems.

Do not explicitly use malloc() or calloc(), instead use the AJAX macros e.g. AJALLOC, AJNEW, AJCNEW, AJNEW0, AJCNEW0 etc.

Use constructor functions explicitly, i.e.

    AjPStr  tmpstr = NULL;

    tmpstr = ajStrNew();
    ajStrAssC(&tmpstr,"Hello");

and not:

    AjPStr  tmpstr = NULL;

    ajStrAssC(&tmpstr,"Hello");
If at all possible put constructors at the start of the function. They will act as a reminder to put destructor functions at the bottom of the function. Doing the above will solve the single biggest cause of memory leaks. A good function has the following structure:

  • function declaration
  • variable declarations
  • constructors
  • body of function
  • destructors
Always explicitly state a 'return' at the end of every function, even for void functions.
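
The recommended function structure is sketched below with a tiny stand-in object so that the example compiles without the AJAX library. Everything prefixed "demo" or "Demo" is hypothetical. The stand-in constructor calls calloc() and malloc() directly only because the AJAX macros are not available here; in EMBOSS code you would use the AJAX constructors, destructors and allocation macros instead.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an AJAX object */
typedef struct DemoSBuf
{
    char  *text;
    size_t len;
} DemoOBuf;

typedef DemoOBuf *DemoPBuf;

/* Constructor: returns a new object or NULL if memory runs out */
static DemoPBuf demo_buf_new(const char *text)
{
    DemoPBuf buf = NULL;

    assert(text != NULL);

    buf = calloc(1, sizeof(*buf));

    if(!buf)
        return NULL;

    buf->len  = strlen(text);
    buf->text = malloc(buf->len + 1);

    if(!buf->text)
    {
        free(buf);
        return NULL;
    }

    memcpy(buf->text, text, buf->len + 1);

    return buf;
}

/* Destructor: frees the object and resets the caller's pointer */
static void demo_buf_del(DemoPBuf *pbuf)
{
    if(!pbuf || !*pbuf)
        return;

    free((*pbuf)->text);
    free(*pbuf);
    *pbuf = NULL;

    return;
}

/* Declarations, constructors, body, destructors, explicit return */
static size_t demo_joined_length(const char *a, const char *b)
{
    DemoPBuf first  = NULL;
    DemoPBuf second = NULL;
    size_t   total  = 0;

    first  = demo_buf_new(a);
    second = demo_buf_new(b);

    if(first && second)
        total = first->len + second->len;

    demo_buf_del(&first);
    demo_buf_del(&second);

    return total;
}
```

With the constructors grouped at the top and the destructors at the bottom, it is easy to check by eye that every object created in the function is also destroyed.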

Separate functions by suitable whitespace (4 newlines are recommended).

Applications

A general rule is:

A separate application should only be written if it differs by more than one extra major parameter/qualifier to an existing program.

New applications should be put into the 'make check' area of the emboss/Makefile.am until full documentation has been submitted. See the documentation standards document for details.

Code should be tested for memory leakage before committing it to CVS. If you are unclear how to do this then ask.

General guidelines

Duplicated code

Duplicated code is error-prone and difficult to maintain. Do not duplicate blocks of code, write a function instead. Where two functions do essentially the same thing but have different arguments, make one function simply call the other.
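
For instance, two counting functions that differ only in how the length is supplied need only one loop between them. This is a plain C sketch with invented names, not code from AJAX:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count occurrences of base in the first len characters of seq */
static size_t dup_count_base_len(const char *seq, size_t len, char base)
{
    size_t count = 0;
    size_t i     = 0;

    assert(seq != NULL);

    for(i = 0; i < len; i++)
    {
        if(seq[i] == base)
            count++;
    }

    return count;
}

/* Same operation on a whole C string: delegate, do not duplicate */
static size_t dup_count_base(const char *seq, char base)
{
    assert(seq != NULL);

    return dup_count_base_len(seq, strlen(seq), base);
}
```

Any later fix to the counting loop then reaches both functions automatically.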

Long functions

Big functions are difficult to understand. Smaller functions are easier to document and therefore easier for the programmer to identify. Functionality split into smaller functions is more likely to be re-used. Consider breaking big functions down into smaller ones. If necessary, retain the function with the original name which can call the new, smaller functions. Avoid too many levels of function calls though (see "Nesting of functions" below).

Long parameter lists

Functions with many parameters are difficult to understand, use and maintain. Where possible, consider passing an object pointer rather than the individual elements of an object. If the parameters do not belong to an object, consider defining a new object to encapsulate them and pass a pointer to that instead.
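
A sketch of the second suggestion follows: three values that would otherwise be passed separately to every alignment helper are grouped into one object. The object, its members and the function are hypothetical, and the penalty formula (opening penalty plus an extension penalty for each further position) is just one common convention used here to give the function something to compute.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical object grouping values that always travel together */
typedef struct AlignSParams
{
    double gapopen;      /* penalty for opening a gap              */
    double gapextend;    /* penalty for each further gap position  */
    int    endweight;    /* non-zero if end gaps are penalised     */
} AlignOParams;

typedef AlignOParams *AlignPParams;

/* One object pointer instead of three separate arguments */
static double align_gap_penalty(const AlignOParams *params,
                                int gaplen, int at_end)
{
    assert(params != NULL);

    if(gaplen <= 0)
        return 0.0;

    if(at_end && !params->endweight)
        return 0.0;

    return params->gapopen + params->gapextend * (gaplen - 1);
}
```

If a fourth setting is needed later, it becomes a new member of the object and the function signatures stay as they are.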

Managing change to code

Your code should be easy to modify for new functionality. Where you find yourself modifying multiple objects or functions to implement a single change it's likely your data model or program structure is not ideal. Consider defining a new object containing the elements you need or new functions as appropriate.

Managing variables

Functions with long lists of variables are difficult to understand and maintain. Where a group of variables are always used together, consider encapsulating them in a new object, especially where the group reoccurs elsewhere in your code.

Switch statements

Consider using "switch" statements to improve the readability of code where you have excessively long chains of "if else" statements. Where the same switch statement is duplicated throughout the code, however, this will be difficult to maintain. Consider changing your code (probably the data model) so that the switch is not needed.

Over-engineered code

A common mistake is to waste time implementing functionality that you think you'll need one day, but never actually do. Over-engineered code is confusing and difficult to maintain. Only program what you need today, but design your code so that it can, if necessary, be extended in the future.

Keep objects clean

The purpose of each element in an object should be obvious. Objects containing variables that are only rarely used, for instance for house-keeping or to hold temporary variables, are difficult to understand. Review your code and establish whether the variable really needs to be in the object or whether it can be moved somewhere else.

Nesting of functions

Code which uses deeply nested chains of functions is extremely difficult to understand. Review your code and simplify it if necessary.

Object overlap

Where two or more different objects share common elements there is likely scope for removing redundancy throughout your code. Consider whether a new object encapsulating the common elements would make your code easier to understand and maintain.

Use of libraries

It is very wasteful to write code unnecessarily: often as not the functionality you seek will be available in the AJAX or NUCLEUS library. Check the libraries before implementing new functionality and contribute any new code so that it can be incorporated into the libraries.

2. Code documentation

a) Code documentation standards

All EMBOSS applications should adhere to formally defined code documentation standards which currently cover:
  • Comments before each function
  • Function documentation tag descriptions
  • Parameter codes
  • Comments before each datatype
  • Datatype documentation tag descriptions
  • Comments at the head of each library source file
  • Comments at the head of each application source file
You should review the standards now. Be aware that the standards will grow as part of a code refactoring exercise which will improve the quality, consistency and documentation of the EMBOSS code. Further code documentation information is below.

b) Comments

Comments can add immensely to the readability of a program, but used heavily or poorly placed they can render good code completely incomprehensible. It is better to err on the side of too few comments rather than far too many - at least then people can find the code! Also, if your code needs a comment to be understood, then you should look for ways to rewrite the code to be clearer. Do not write comments that might get out of date. An inaccurate or misleading comment hurts more than a good comment helps. Be sure that your comments stay correct.

Good places to put comments are:

  • a broad overview at the beginning of an application
  • data structure definitions
  • at the beginning of a function
  • tricky steps within the program or functions (see later with regard to function definitions)
  • major logical steps within the program or functions
If you do something out of the ordinary then comment it. This will tip off others as to where to look first for bugs! If you do something clever then comment it. The technique may be useful for others.

Avoid fancy layout or decoration.

c) GPL licence

This should be used where possible. The appropriate header (see any EMBOSS application) should be placed at the top of each program.

d) main() function

This should be preceded by a header block matching the following format.

/* @prog water ****************************************************************
**
** Smith-Waterman local alignment
**
******************************************************************************/

e) Application functions

Documentation blocks should appear before the function. Functions should have a name beginning with "applicationname_" and adhere to the following format.

/* @funcstatic tcode_readdata ********************************************
**
** Read Etcode.dat data file
**
** @param [w] table1 [AjPFTestcode*] data object
** @param [r] datafile [AjPFile] data file object 
** @return [AjBool] true if successful read
** @@
******************************************************************************/
All function parameters should be marked read [r] or write [w], give the parameter variable name (e.g. datafile), the datatype (e.g. AjPFile) and a short description. Return values should be stated and described. Functions that return void use:
** @return [void]

See the code documentation standards for a list of supported tags (@param etc).
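
Putting the pieces together, a documented application function might look like the sketch below. The application name "myprog" and the function are invented, and plain C types are used so that the example stands alone; only the tags already shown above (@funcstatic, @param, @return, @@) are used.

```c
#include <assert.h>
#include <stddef.h>

/* @funcstatic myprog_countchar ***********************************************
**
** Count the occurrences of a character in a text string
**
** @param [r] text [const char*] text to search
** @param [r] ch [char] character to count
** @param [w] count [size_t*] number of occurrences found
** @return [void]
** @@
******************************************************************************/

static void myprog_countchar(const char *text, char ch, size_t *count)
{
    const char *cp = NULL;

    assert(text != NULL);
    assert(count != NULL);

    *count = 0;

    for(cp = text; *cp; cp++)
    {
        if(*cp == ch)
            (*count)++;
    }

    return;
}
```

Note that the argument written to by the function is marked [w], the others [r], and that the void return is still stated both in the documentation block and in the code.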

f) Library functions

These should be documented in the same way as application functions. Static functions are labelled "@funcstatic" whereas exported functions are labelled "@func". For exported functions a prototype is declared in an associated header file (see library header files for examples of header file documentation)

3. Application documentation

Documentation of the code itself can be invaluable to other developers but is of little value to the biologist end-user. For this audience an entirely different sort of documentation is required.

Full application documentation in a format suitable for end-user biologists is available on-line.

Every EMBOSS application should be well documented and should adhere to the EMBOSS style; see, for example, the documentation for the seqret application.

a) Application documentation process

To document a new program, ensure you have an up-to-date set of programs compiled, and that any programs you've written have had their executable deleted, otherwise references to them might occur in the automatically-generated "See Also" sections (see below).

To generate the documentation, run the script autodoc.pl on each application you wish to document in turn:

/home/fred/emboss/emboss/scripts/autodoc.pl application_name (for EMBOSS applications)

/home/fred/emboss/emboss/scripts/autodoc.pl -embassy=embassy_package_name application_name (for EMBASSY applications)

You should replace embassy_package_name and application_name with something sensible. The following instructions presume you are working in the EMBASSY package "myemboss" and are writing a program called myprogram.

  • Write the program and ACD file
  • cd to embassy/myemboss/emboss_doc/master
  • cp template.html.save myprogram.html
  • edit myprogram.html
    • Change ProgramNameToBeReplaced to myprogram
    • Write some documentation text where indicated
  • Run autodoc.pl, e.g. ../../../../scripts/autodoc.pl -embassy=myemboss myprogram
The output might look something like:
mytest 'Demonstration of sequence reading'
Doing test mytest-ex
/homes/pmr/cvsemboss/embassy/myemboss/emboss_doc/html/mytest.html *created*
/homes/pmr/cvsemboss/embassy/myemboss/emboss_doc/text/mytest.txt *replaced*

The script will run wossname to check that application_name really exists, then generate a template documentation file (for you to fill in) with include directives, plus include files for:

  1. One-line description
  2. Help table
  3. Documentation table
  4. "See also" list
The include files for the three parts below, however, are generated from the QA test (see 4. below). You should really write the QA tests before documenting your application.
  1. Usage example; the QA test as if run by a user (mytest.usage)
  2. Application input files (mytest.input)
  3. Application output files (mytest.output)
The following include files:
  1. Comments
  2. History
are created blank, and you should either complete them manually or leave them blank.

Use an editor of your choice to edit the template, adding documentation text. The template should be adequately commented for you to see how to fill it out. You should see the directives to read the include files, which are created for you by autodoc.pl and by the QA test procedure. Once you complete the template and save it, the application index file (to appear on the web - you don't need to worry about this file) will need to be updated. The entry for the new application will be inserted into the correct (alphabetic) position in the index file.

In brief, the QA test is performed as follows:

  • Edit test/qatest.dat
  • Find the tests for the EMBASSY package myemboss which all have the line "AB myemboss"
  • Add a new test for your application (see below).
The test should include the following types of lines. IN lines give responses to any requests for input from the program; there is one IN line (blank if the default response is accepted) for each prompt. The FI lines give the names of any files created, with an FC test for line count (or FZ test for file size) and one or more FP tests for file contents. A CL line can give any options on the command line for the test.

Example of entry in qatest.dat

ID myprogram-ex
AB myemboss
AA myprogram
IN
FI stderr
FC = 2
FP 0 /Warning: /
FP 0 /Error: /
FP 0 /Died: /
FI paamir.myprogram
FC = 5
FP /^Usa: tembl-id:PAAMIR\n/
FP /^Length: 2167\n/
//
To run the test type:
  • cd test/qa
  • ../../scripts/qatest.pl myprogram-ex

If it fails, check directory myprogram-ex which contains all the files. Update the test definition and try again.
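
The record structure of a test entry (a two-letter code at the start of each line, with "//" ending the entry) can be made concrete with a short C sketch. The helper qa_count_records is hypothetical and is not part of EMBOSS; qatest.pl is a Perl script with its own parsing. The sketch only counts how many records of a given type an entry contains, which can be a quick way to check a definition before running the test.

```c
#include <stddef.h>
#include <string.h>

/* Count the records in a qatest.dat-style entry whose two-letter code
** matches 'code'. Scanning stops at the "//" line that ends the entry.
** Hypothetical helper for illustration only. */
static int qa_count_records(const char *entry, const char *code)
{
    int count = 0;
    const char *line = entry;

    /* Record codes are always two letters */
    if(strlen(code) != 2)
        return 0;

    while(line && *line)
    {
        /* "//" marks the end of the entry */
        if(strncmp(line, "//", 2) == 0)
            break;

        /* A record matches if it starts with the code, followed by a
        ** space or the end of the line (e.g. a blank "IN" line) */
        if(strncmp(line, code, 2) == 0 &&
           (line[2] == ' ' || line[2] == '\n' || line[2] == '\0'))
            count++;

        /* Move to the start of the next line, if there is one */
        line = strchr(line, '\n');
        if(line)
            line++;
    }

    return count;
}
```

Applied to the myprogram-ex entry shown earlier, this would count two FI records (stderr and paamir.myprogram) and five FP records.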

When it works, rerun autodoc.pl as above and it will create the remaining 3 include files (usage example, input and output files).

When all is done, the complete HTML documentation is created in embassy/myemboss/emboss_doc/html/myprogram.html

For EMBOSS applications, work in the doc/programs/master/emboss/apps/ directory instead. Leave out the -embassy=myemboss qualifier from the autodoc.pl commandline. The test definition does not have the "AB myemboss" line, and has "AP myprogram" instead of "AA myprogram". Final documentation goes to doc/programs/html/myprogram.html

There is a little work to do to update the index.html file and the Makefile.am files in the html and text directories - we hope to automate this in time for release 4.0.0.

b) DOCUMENTATION FILES

The example below is for seqret. All paths are relative to the doc directory, e.g. /home/fred/emboss/emboss/doc/.

i) Template html file

./programs/master/emboss/apps/seqret.html

This is the template mentioned above with include directives. This should be copied along with the include files to the EMBOSS website on sourceforge (the EMBOSS team have a script for doing this).

ii) Include files

These files (and in most cases their contents) are generated automatically by autodoc.pl.
./programs/master/inc/seqret.ione      One-line description.    Taken from ACD file
./programs/master/inc/seqret.ihelp     Help table*              Generated by running the application with "-help".
./programs/master/inc/seqret.itable    Documentation table**    Generated by running acdtable.
./programs/master/inc/seqret.isee      "See also" list          Generated by running seealso.
./programs/master/inc/seqret.usage     Usage example            Generated via QA tests.
./programs/master/inc/seqret.input     Application input files  Generated from QA tests.
./programs/master/inc/seqret.output    Application output files Generated from QA tests.
./programs/master/inc/seqret.comment   Comments                 Written manually - usually blank.
./programs/master/inc/seqret.history   History                  Written manually - usually blank.

(*  Application parameters and appropriate "in-built" qualifiers)
(** Application parameters)

iii) Raw documentation text

./programs/text/seqret.txt

Documentation (with included text) in plain text format and organised into sections. Used for manual pages and displayed when running application_name -help. This is generated by autodoc.pl.

./programs/html/seqret.html

Final html file, with all data included. Documentation for the web. These might be moved to ./doc/programs/html/apps/ or ./doc/programs/html/embassy/* in the future.

4. Application quality assurance tests

Each EMBOSS application is run on test data to ensure that it works as advertised. These tests are performed nightly to ensure that the applications are not broken, e.g. by recent changes to the library code.

A set of test data consists of input files, application parameters and the corresponding output files. As many sets of test data should be provided as possible, especially for unusual input conditions, to provide as robust a test as possible.

The tests are defined in the file /home/fred/emboss/emboss/test/qatest.dat. There is documentation at the start of that file which describes the records used to define a test.

Each test you define should write its output files to the same directory as the application is run in. It is possible to create sub-directories and write files to them though.

To run the test for a specific application, from the /home/fred/emboss/emboss/test/qa directory:

../../scripts/qatest.pl test_name  
where test_name is the name of the test given on the ID line of the appropriate entry in qatest.dat.

If qatest.pl is run on something not defined in qatest.dat it will report "Tests total: 0".
If it succeeds, all files are deleted (unless the test entry included a "DL keep" line and "-kk" was specified on the command line).
If it fails, it will say why and all results are written to an appropriate subdirectory (/test/qa/test_name).

To perform tests, you must edit the .embossrc file located in the test directory. Make sure to set "emboss_qadata" to the appropriate test directory (e.g. /home/fred/emboss/emboss/test).



Talk 8 - Data Input : Using Features

Background information for Practical 8

The notes below are a summary of the on-line feature documentation.

What is a feature?

A feature is a region of interest in a nucleic or protein sequence and consists of:
  • a start and end position.
  • a name describing the feature.
  • the name of the program or database from which it was derived.
  • the sense (in a nucleic sequence).
  • a score.

A feature table is a group of features.
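
The components listed above can be pictured as a pair of C structures. This is only a sketch: the struct and function names are hypothetical, and the real AJAX objects (AjPFeature and AjPFeattable) are considerably richer than this.

```c
#include <stddef.h>

/* Illustrative stand-ins only, not the AJAX definitions */
typedef struct Feature
{
    int start;            /* start position */
    int end;              /* end position */
    const char *name;     /* name describing the feature */
    const char *source;   /* program or database it was derived from */
    char sense;           /* '+' or '-' (nucleic sequences only) */
    float score;          /* score */
} Feature;

typedef struct FeatureTable
{
    const Feature *features;   /* the group of features */
    size_t count;              /* number of features in the group */
} FeatureTable;

/* Return the number of features in the table that overlap the
** inclusive region from..to */
static size_t feattable_count_in_region(const FeatureTable *table,
                                        int from, int to)
{
    size_t i;
    size_t n = 0;

    for(i = 0; i < table->count; i++)
        if(table->features[i].start <= to && table->features[i].end >= from)
            n++;

    return n;
}
```

Keeping the start and end positions on every feature is what makes region queries of this kind straightforward.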

Examples of biological data corresponding to features include restriction enzyme cut sites, probabilities of the three states of a protein secondary structure prediction and tables of the start and end positions of things like predicted exons or motif matches. As most sequence analysis programs generate interesting regions in one form or another, there are a huge number of diverse file formats corresponding to features.

Feature formats

In EMBOSS, features are represented in a variety of standard file formats.

The standard formats provide a consistent look and feel to features, helping the user compare the features from different programs more easily. Standardisation also facilitates application interoperability (applications can more easily share their input and output).

The standard file formats will become the default way of reporting sequence features as the EMBOSS project matures.

What are the formats?

EMBOSS uses the well-defined and flexible feature formats that were developed for the major sequence databases (EMBL, Genbank, SwissProt, PIR) and for the input of features into the genome databases (GFF, acedb).

Feature tables are stored in one of three ways:

  • As part of a sequence file
  • As part of a database entry
  • As a raw feature table: a file that does not contain the sequence the features refer to.

In all cases, the feature format is identical to that used in the sequence database format of the same name, e.g. EMBL feature format is the same as the (subset of the) EMBL sequence format. This holds for when a raw feature table is output too. The following feature formats are understood by EMBOSS.

Name                   Comments / Documentation
embl, em               The format used by the EMBL nucleic database.
gff                    The General Feature Format defined by the Sanger Centre.
swissprot, swiss, sw   The format used by the SWISSPROT protein database. The feature table keys are also defined.
pir                    The format used by the PIR protein database.
nbrf                   Only available for input - the same as PIR format.

Uniform Feature Object

EMBOSS defines a 'UFO' (Uniform Feature Object) as a standard way to refer to a feature file. The UFO specifies:
  1. The format of the features in that file.
  2. The name of that file.
Similar to USAs, the UFO specifies the feature format, then a ':' and then the name of the file. e.g. embl:results.dat

UFOs can be used for input and output. If no format is specified, the default 'GFF' format is used. You can override the default with a different format when you run the program (see below).
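
The "format:filename" convention and the GFF default can be sketched in a few lines of standard C. The function ufo_split is hypothetical and is not how AJAX parses a UFO internally; it simply splits at the first ':' and falls back to "gff" when no format is given.

```c
#include <stddef.h>
#include <string.h>

/* Split a UFO of the form "format:filename" into its two parts.
** If there is no ':' the whole string is taken as the file name and
** the format defaults to "gff". Hypothetical helper, not an AJAX call.
** Returns 1 on success, 0 if a part does not fit its buffer. */
static int ufo_split(const char *ufo,
                     char *format, size_t formatsize,
                     char *filename, size_t filesize)
{
    const char *colon = strchr(ufo, ':');
    size_t flen;

    if(!colon)
    {
        /* No format given: use the default */
        if(strlen(ufo) >= filesize || strlen("gff") >= formatsize)
            return 0;

        strcpy(format, "gff");
        strcpy(filename, ufo);

        return 1;
    }

    /* Everything before the ':' is the format */
    flen = (size_t)(colon - ufo);
    if(flen >= formatsize || strlen(colon + 1) >= filesize)
        return 0;

    memcpy(format, ufo, flen);
    format[flen] = '\0';
    strcpy(filename, colon + 1);

    return 1;
}
```

With this sketch, "embl:results.dat" yields the format "embl" and the file name "results.dat", while a bare "results.dat" yields the format "gff".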

ACD datatypes and built-in command line qualifiers for handling features

ACD provides two datatypes for handling features:

  • features for feature input
  • featout for feature output

These command-line qualifiers change the behaviour of a features ACD datatype:

  -fformat            string     features format                  Default: ""
  -fopenfile          string     features file name               Default: ""
  -fask               bool       prompt for begin/end/reverse     Default: N
  -fbegin             integer    first base used                  Default: 0
  -fend               integer    last base used, def=max length   Default: 0
  -freverse           bool       reverse (if DNA)                 Default: N

These command-line qualifiers change the behaviour of a featout ACD datatype:

  -offormat           string     output feature format            Default: "" 
  -ofopenfile         string     features file name               Default: "" 
  -ofextension        string     file name extension              Default: ""
  -ofname             string     base file name                   Default: ""
  -ofsingle           bool       separate file for each entry     Default: N
  -ofdirectory        bool       Output feature file directory    Default: ""
Their use is described below.

Note that the sequence, seqall, seqset & seqsetall datatypes (for sequence input) and seqout, seqoutall & seqoutset datatypes (for sequence output) can also read / write features if their features ACD attribute is set. If set, the sequence output will include feature information either in the same file (if the sequence format supports it) or in a separate file (by default in GFF format).

Reading features

If the feature table is included in the sequence input file (as is generally the case when you are reading the sequence from a database), then the feature table will be read with no problem.

To read a raw feature table from file, you must specify the -ufo in-built qualifier on the command line, e.g. '-ufo gff:results.dat'.

Alternatively, the '-fformat' and '-fopenfile' qualifiers can be used to specify the feature format and the file name individually instead of as part of a UFO.

Using '-ufo' or '-fopenfile' to read in a feature table will cause the new feature table to replace any existing feature table that is part of the sequence data.

In-built command line qualifiers for feature input

  -ufo                string     UFO features
  -fformat            string     features format
  -fopenfile          string     features file name

If you wish to combine feature table files from various sources, then the easiest way is to concatenate the feature files (must be in the same format!) into one file and to specify that file using '-ufo'.

Currently, programs that read and use the feature table of an input sequence include diffseq, extractfeat, maskfeat, seqret and showfeat. At present there are no programs that read in a raw feature table (i.e. one that is not part of a sequence).

Writing features

If a program is capable of writing out sequences with features (for example run "seqret -feature"), then the feature table will be written out as part of the output sequence file, if the format of the sequence file is one of embl, gff, swissprot or pir (i.e. if the sequence format was designed to hold a feature table).

If the sequence format cannot hold a feature table (e.g. 'fasta'), then a file ('unknown.gff') is written with the raw feature table in GFF format.

This behaviour can be overridden by using the command-line qualifiers below. Even if a sequence format that is capable of holding a feature table has been specified, these will enable you to specify a name and format for output to a raw feature table file.

Output sequence command-line qualifiers

  -oufo               string     UFO features
  -offormat           string     features format
  -ofname             string     features file name

Many programs are capable of writing raw feature tables. The default output format for raw features tables is 'gff', but this can be changed by specifying '-offormat' followed by the format name.
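
The default rule described in this section (features go into the sequence file when the format can hold them, otherwise into a separate GFF file) can be written as a small decision function. This is a sketch under the assumption that the four formats named above are the complete list; the function names are hypothetical and the strings returned are descriptive labels, not file names.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if the named sequence format can hold a feature table,
** using the list given in the text (embl, gff, swissprot, pir) */
static int seqformat_holds_features(const char *format)
{
    static const char *const formats[] = {"embl", "gff", "swissprot", "pir"};
    size_t i;

    for(i = 0; i < sizeof formats / sizeof formats[0]; i++)
        if(strcmp(format, formats[i]) == 0)
            return 1;

    return 0;
}

/* Describe where features are written by default for a given
** output sequence format */
static const char *feature_destination(const char *seqformat)
{
    return seqformat_holds_features(seqformat)
        ? "sequence file" : "separate GFF file";
}
```

Specifying -oufo, -offormat or -ofname on the command line overrides this default, as described above.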

Calculated attributes

The features ACD datatype has the following calculated attributes (these are "properties" of an input feature that can be queried within ACD):
fbegin (integer)   - start of the features to be used.   
fend (integer)     - end of the features to be used.     
flength (integer)  - total length of sequence            
fprotein (boolean) - feature table is protein            
fnucleic (boolean) - feature table is nucleotide         
fname (string)     - the name of the feature table       
fsize (string)     - number of features                  
and the following specific attributes:
type:     (string)  - defines whether the feature is "protein" or "nucleotide". 
                      There is a default based on the type of an input sequence,
                      but a value should always be specified.

nullok:   (boolean) - allows a default name for a feature to be replaced by an empty 
                      string or by -noxxx (where "xxx" is the ACD label for the feature) 
                      on the command line. The application must be able to run without
                      feature input. See below.
The featout ACD datatype has the following specific attributes:
format:      (string) Default feature format. 
name:        (string) Default base file name (use of -ofname is preferred).Default: ""
extension:   (string) Default file extension (use of -offormat preferred) 

type:        (string) Defines whether the feature output is "protein" or "nucleotide". 
                      There is a default based on the type of any input sequence, but 
                      a value should always be specified. 
multiple:    (Y/N)    Features for multiple sequences                      Default: N
nullok:      (Y/N)    Allows a default name for a feature to be replaced by an empty string 
                      or by -noxxx (where "xxx" is the ACD label for the feature) on the 
                      command line. The application must be able to run without feature 
                      output.  See below.
nulldefault: (Y/N)    Defaults to 'no file'                                Default: N

The output filename is constructed from the name: and extension: attributes in a $(name).$(extension) format. If the name: attribute is not defined in the ACD file, it will default to the calculated attribute name: of the FIRST sequence that is read in (or $(asequence.name), for a sequence parameter named "asequence").
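
The naming rule can be sketched in standard C as follows. The function featout_filename is hypothetical (ACD does this work for you); it joins a base name and an extension with a '.', and falls back to the name of the first sequence when no base name is supplied.

```c
#include <stdio.h>

/* Build "name.extension" in 'out'. If 'name' is NULL or empty, the
** name of the first input sequence ('seqname') is used instead.
** Hypothetical helper. Returns 1 on success, 0 if the result did not
** fit in the buffer. */
static int featout_filename(const char *name, const char *seqname,
                            const char *extension,
                            char *out, size_t outsize)
{
    const char *base = (name && *name) ? name : seqname;
    int n = snprintf(out, outsize, "%s.%s", base, extension);

    return n >= 0 && (size_t)n < outsize;
}
```

For a sequence named "paamir" and the extension "gff", with no name: attribute defined, the sketch produces "paamir.gff".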

The nulldefault: attribute overrides the default name generation, and uses an empty string (no feature output) as the default for programs where feature output is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.



Talk 9 - Data Output : Using Reports

Background information for Practical 9

1. Too many formats
A problem with output from computer programs is that everybody invents their own output format, generally one format per program. While this displays commendable ingenuity and inventiveness it does little for standard parsing. For this reason EMBOSS is moving towards using a set of standard report formats. These should be used whenever appropriate.

Report formats take their origin from the need to deal with EMBL, GenBank and PIR feature tables. It was therefore a natural choice to extend these to cope with other output data. The first thing to consider is that all of the standard sequence feature tables (see Talk 8) are also report formats. These include:

  • embl
  • genbank
  • gff
  • pir
  • swiss
There are other formats to cater for more than standard sequence information:
  • dbmotif
  • diffseq
  • excel
  • motif
  • nametable
  • regions
  • seqtable
  • simple
  • srs
  • table
  • tagseq
As you can see, the output options are very flexible. The problem with flexible software is that it seems a little complex to code at first.

2. Report format theory
You will probably know that feature table information is held as tag/value pairs. To cope with non-standard feature values the AJAX library uses the default '/note' tag. This is transparent to the user/programmer (unless you select EMBL or another output format in which the '/note' tag is visible). Internally, all reports are held, as near as dammit, in GFF (the General Feature Format). It was suggested that this be called the Flexible Open Feature Format (FOFF, pronounced eff off) but this name was sadly rejected. As the features are held in such a flexible format they can easily be converted to any of the report formats given above. This can be selected by using the -rformat built-in qualifier.

GFF has some fields that must exist. These include the sequence itself (this may or may not be output depending on the output report format used), the start and end positions of the features, a score (in case the program reports a score for the sequence using whatever algorithm) and a few others. The score and most of the rest of the standard GFF tags can be, and often are, set to zero or NULL automatically by the library. You have to explicitly use them (assign values to them) if you need them, otherwise they'll be ignored.

3. Method to output reports

Each report has a header, body and tail information. The process of outputting using reports is as follows.

  1. Define a suitable report datatype in ACD.
  2. Collect the ACD report information within the application in a report object (AjPReport).
  3. Create a feature table object (AjPFeattable) using ajFeattableNewSeq.
  4. For each 'line' of information in the body of the report, create a feature object (AjPFeature) and load the feature object with your data values (by using ajFeatNewII and ajFeatTagAdd).
  5. Set any data for the header and tail of the report.
  6. Pass the report (AjPReport) and feature table (AjPFeattable) objects to the general report writing function (ajReportWrite).
  7. Clean-up memory as normal and exit.

4. Example application

This looks hard but isn't. As an example we develop below a simple application that will take a sequence and produce a tabular report format showing its length and molecular weight.

4.1 ACD for a molecular weight application.
Here is an ACD file which will do the job:

application: wreport 
[
  documentation: "Example report program"
]

section: input [ info: "input Section" type: page ]
sequence: sequence  
[
  parameter: "Y"
  type: "Protein"
]

endsection: input


section: advanced [ info: "advanced Section" type: page ]

datafile: aadata  
[
  information: "Amino acid data file"
  help: "Molecular weight data for amino acids"
  default: "Eamino.dat"
]

endsection: advanced


section: output [ info: "output Section" type: page ]

report: outfile  
[
  parameter: "Y"
  rformat: "table"
  multiple: "N"
  precision: "1"
  taglist: "float:molwt int:len"
]
endsection: output
The section and endsection definitions provide a means by which GUIs can be instructed to organize the ACD information on the screen. Each section must always have a corresponding endsection. It is standard practice to have at least an input and an output section definition, adding others as appropriate.

The sequence input has been met before. The datafile datatype is just there so a file of molecular weight data in the EMBOSS data area can be read. It is the report datatype we are interested in ...

The rformat attribute is the equivalent of the "default" attribute in other datatypes. It is the default report format that will be printed if the user doesn't change it with -rformat on the command line.

The multiple attribute says whether multiple reports will be given in the output. This will generally have a value of "N" if you are using the "sequence" datatype and "Y" if you're using the "seqall" datatype.

The precision attribute is for floating point numbers. It gives how many decimal places will be printed in the output.

The most interesting attribute is taglist. This shows, in order, the datatype/column name pairs that will be used in the report. float:molwt therefore means that one of the columns is called 'molwt' and it will contain floating point values. Typical taglist datatypes are:

  • "float"
  • "int"
  • "str"
The name is vitally important as it will be used in the C program.
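
To make the structure of a taglist concrete, here is a minimal sketch in standard C that splits a string such as "float:molwt int:len" into its datatype/column name pairs. The TagColumn struct and taglist_parse function are hypothetical; the AJAX library does its own parsing of the ACD attribute, and this sketch assumes items are separated by single spaces.

```c
#include <stdio.h>
#include <string.h>

typedef struct TagColumn
{
    char type[16];   /* datatype, e.g. "float", "int" or "str" */
    char name[32];   /* column name, e.g. "molwt" */
} TagColumn;

/* Parse a taglist such as "float:molwt int:len" into at most 'max'
** columns. Returns the number of columns parsed, or -1 if an item
** is not of the form type:name. Hypothetical helper. */
static int taglist_parse(const char *taglist, TagColumn *cols, int max)
{
    int n = 0;
    int used = 0;
    const char *p = taglist;

    while(n < max)
    {
        /* Skip the spaces between items */
        while(*p == ' ')
            p++;

        if(*p == '\0')
            break;

        /* Read "type:name"; %n records how many characters were used */
        if(sscanf(p, "%15[^: ]:%31[^ ]%n",
                  cols[n].type, cols[n].name, &used) != 2)
            return -1;

        p += used;
        n++;
    }

    return n;
}
```

The order of the parsed pairs matches the order of the columns in the report.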

4.2 C source code
Here it is:

/* @source wreport application
**
** Show sequence length and molwt as a report
**
** @author: Copyright (C) Alan Bleasby (ableasby@embnet.org)
** @@
**
** This program is free software; you can redistribute it and/or
** modify it under the terms of the GNU General Public License
** as published by the Free Software Foundation; either version 2
** of the License, or (at your option) any later version.
**
** This program is distributed in the hope that it will be useful,
** but WITHOUT ANY WARRANTY; without even the implied warranty of
** MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
** GNU General Public License for more details.
**
** You should have received a copy of the GNU General Public License
** along with this program; if not, write to the Free Software
** Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
******************************************************************************/

#include "emboss.h"




/* @prog wreport **************************************************************
**
** Show sequence length and molwt as a report
**
******************************************************************************/

int main(int argc, char **argv)

{
    AjPSeq       seq    = NULL;
    AjPReport    report = NULL;
    AjPFeattable ftable = NULL;
    AjPFeature   feat   = NULL;
    AjPStr       tmpstr = NULL;

    double       molwt;
    int          len;

    AjPStr       datafn = NULL;
    AjPFile      mfptr  = NULL;


    embInit ("wreport", argc, argv);

    seq       = ajAcdGetSeq ("sequence");
    report    = ajAcdGetReport("outfile");

    /* This bit just reads in an EMBOSS data table of molwt info */
    mfptr     = ajAcdGetDatafile("aadata");


    embPropAminoRead(mfptr);
    /* End of data file reading */

    /* Calculate the values to output */
    len   = ajSeqLen(seq);
    molwt = embPropCalcMolwt(ajSeqChar(seq),0,len-1);


    /* Create a feature table */
    ftable = ajFeattableNewSeq(seq);


    tmpstr = ajStrNew();


    /* Fill head and tail information for the report */
    ajFmtPrintS(&tmpstr,"This is some Header Text");
    ajReportSetHeader(report, tmpstr);

    ajFmtPrintS(&tmpstr,"This is some Tail Text");
    ajReportSetTail(report, tmpstr);


    /* Create feature object and load with the output values */
    feat = ajFeatNewII(ftable,1,len);

    ajFmtPrintS(&tmpstr,"*molwt %.1f", (float)molwt);
    ajFeatTagAdd(feat,NULL,tmpstr);

    ajFmtPrintS(&tmpstr,"*len %d", len);
    ajFeatTagAdd(feat,NULL,tmpstr);


    /* Write report and clean up */
    ajReportWrite(report,ftable,seq);


    ajFeattableDel(&ftable);
    ajStrDel(&tmpstr);
    ajFileClose(&mfptr);

    ajExit ();
    return 0;
}

This is a minimalistic program for clarity (e.g. no notice is taken of any -sbegin or -send values). Taking it step by step, here are the declarations:

    AjPSeq       seq    = NULL;
    AjPReport    report = NULL;
    AjPFeattable ftable = NULL;
    AjPFeature   feat   = NULL;
    AjPStr       tmpstr = NULL;

A sequence (AjPSeq) object is obviously needed. A report object (AjPReport) is declared which is used to pick up the ACD information using ajAcdGetReport. The feature table object (AjPFeattable) will be used to hold the feature objects (AjPFeature) containing the column values. A temporary string object is declared into which the column values will be printed. This string will then be used to load the feature object.

After a bit of code to load in a data file containing amino acid molecular weight data, the molecular weight and length values to be reported are calculated by the lines:

    len   = ajSeqLen(seq);
    molwt = embPropCalcMolwt(ajSeqChar(seq),0,len-1);

Note that the code uses len and molwt for both the variable names and for the column names. This is not necessary but does make the code clearer. The feature table object is then instantiated. Only one feature table is required per report:

    ftable = ajFeattableNewSeq(seq);
The sequence object is passed as a parameter so that the sequence name can be automatically loaded into the internal GFF format.

Next, some head and tail information is loaded into the report object. This could have been done any time after the ajAcdGetReport but before the report is printed. They are optional but recommended. The temporary string object is used for this.

    ajFmtPrintS(&tmpstr,"This is some Header Text");
    ajReportSetHeader(report, tmpstr);

    ajFmtPrintS(&tmpstr,"This is some Tail Text");
    ajReportSetTail(report, tmpstr);

A feature object can now be created into which the column values can be loaded

    feat = ajFeatNewII(ftable,1,len);

This requires three parameters. The first is the feature table object, the last two are the sequence start and end positions. Rather naughtily they are hard coded in this example (as mentioned above). A feature object has to be created per line of output in the final report.

Now, the molecular weight and length values are loaded into the feature object:

    ajFmtPrintS(&tmpstr,"*molwt %.1f", (float)molwt);
    ajFeatTagAdd(feat,NULL,tmpstr);

    ajFmtPrintS(&tmpstr,"*len %d", len);
    ajFeatTagAdd(feat,NULL,tmpstr);
This is the bit where the ACD column names (molwt and len) must match the ones in the ajFmtPrintS calls. Within the C program these names must be preceded by an asterisk! The datatypes specified in the ACD file (float and int) must also match what's given in the C code. The NULL parameter just means that only a value is being added (the library will add the /note tag automatically).
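
The shape of the strings passed to ajFeatTagAdd can be checked outside EMBOSS with a short sketch. Here the standard C function snprintf stands in for ajFmtPrintS, and the helper names tag_format_float and tag_format_int are hypothetical; the point is only to show the asterisk, the column name and the value being assembled in that order.

```c
#include <stdio.h>

/* Build "*name value" for a floating point column, printing the value
** to 'precision' decimal places. snprintf is a standard C stand-in
** for ajFmtPrintS. Returns 1 on success, 0 if the buffer is too small. */
static int tag_format_float(char *out, size_t outsize,
                            const char *name, double value, int precision)
{
    int n = snprintf(out, outsize, "*%s %.*f", name, precision, value);

    return n >= 0 && (size_t)n < outsize;
}

/* Build "*name value" for an integer column */
static int tag_format_int(char *out, size_t outsize,
                          const char *name, int value)
{
    int n = snprintf(out, outsize, "*%s %d", name, value);

    return n >= 0 && (size_t)n < outsize;
}
```

The names passed in should be the column names from the taglist attribute (molwt and len in this example), with the matching datatype used for each.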

All that remains is to print out the report.

    ajReportWrite(report,ftable,seq);
Three objects are passed. The reason for the sequence object being passed is in case you choose a report format that prints out the (sub)sequence used.

Finally, the dynamic memory is recovered in a clean-up.

    ajFeattableDel(&ftable);


Talk 10 - Objects, Pointers and Memory Management

Background information for Practical 10

No course in EMBOSS programming would be complete without a treatment of pointers. Here you will get a clear explanation of the use of pointers with particular reference to the management of memory for objects in EMBOSS. This coverage of pointers was inspired by Chuck Allison's 'Code Capsules' (find it on the web!)

Pointer basics

Pointers are one of the most feared aspects of C and their misuse leads to more problems than any other part of the language. That is not to say that pointers are the problem, it's just that many programmers aren't ready for them yet. With a proper understanding of the underlying principles, it is easier than you imagine to get to grips with all aspects of using pointers and their specific implementation in EMBOSS. To become good at EMBOSS programming you have to master at least the basics of pointers. The trick with pointers is, they're easy to use, so long as you understand the principles, so ...

The very first thing to understand is that, with the exception of register variables, every variable you declare in your program resides somewhere in memory; that "somewhere" is the memory address of the variable. So, when this line of the program is executed ...
   ajint x=0;
... sufficient memory to hold an integer (usually 4 bytes) will be reserved for use by our program. The value of those 4 bytes is set to zero. To find the memory address of our variable, we use the & (address) operator. And to get to the value held at a particular memory address you use the * (pointer) operator (this is called dereferencing the pointer or getting a value by indirection).

If you're O.K. with the idea that 'x is an integer' then in the same way understand that 'a pointer is a memory address'. Spelling that out ...

A POINTER IS MERELY A VARIABLE WHICH HOLDS A MEMORY ADDRESS

If you don't overcomplicate the above idea, you've already gone a long way to understanding pointers.

Example pointer code

Consider the following:
#include <stdio.h>
#include "emboss.h"   /* defines ajint */

int main(void)
{
   ajint x=0;
   /*1*/   printf("Value of x : %d\n", x);
   /*2*/   printf("Memory address of x : %p\n", (void *) &x);
   /*3*/   printf("Value of x by indirection : %d\n", *(&x));
   return 0;
}

/* Output:

Value of x : 0
Memory address of x : #1 
Value of x by indirection : 0

(In reality, a hexadecimal number would be printed instead of '#1', but '#1' is easier to follow).
*/

The variable name x is our handle on the reserved memory; it is used to refer to an integer value that happens to live at memory address #1. We usually say that "x is an integer" or "x holds an integer" rather than the more accurate and cumbersome "x is a variable name referring to a reserved area of memory of sufficient size to hold an integer".

In the code:

  • /*1*/ First we are printing the value of x.
  • /*2*/ Then we are using & to get the memory address of where x is stored and printing it.
  • /*3*/ Finally we are using the * operator to print the value stored at this address by indirection.

In practice, a pointer holds the memory address of a specific data object such as an integer, a C data structure or even another pointer. You have to specify the type of data pointed at when you declare your pointer. This is not because the memory address of an integer is any different to that of a float; it's so that the compiler knows how the pointer can be used in the source code. For example, the compiler must know the type of data pointed at to be able to generate code that prints a value by indirection.
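
To see why the pointed-at type matters, here is a minimal sketch in standard C. It uses plain int and double rather than the EMBOSS types so that it compiles without the EMBOSS headers, and the function names (read_int, read_double, int_step, char_step) are made up for illustration. A bare address (a void *) cannot be dereferenced until it is given a type, and the type also decides how far "the next element" is.

```c
#include <assert.h>
#include <stddef.h>

/* A void * is just an address: it must be given a type before it can
   be dereferenced. The caller must pass the address of a real int. */
int read_int(const void *addr)
{
    assert(addr != NULL);
    return *(const int *) addr;
}

/* Same idea, but the bytes at the address are read as a double. */
double read_double(const void *addr)
{
    assert(addr != NULL);
    return *(const double *) addr;
}

/* Distance in bytes between ip and ip+1: the size of one int. */
size_t int_step(void)
{
    int arr[2] = {0, 0};
    int *ip = arr;

    return (size_t) ((char *) (ip + 1) - (char *) ip);
}

/* Distance in bytes between cp and cp+1: always one byte. */
size_t char_step(void)
{
    char arr[2] = {0, 0};
    char *cp = arr;

    return (size_t) ((cp + 1) - cp);
}
```

The addresses passed to read_int and read_double look the same to the machine; only the declared type tells the compiler how many bytes to fetch and how to interpret them.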

Pointers to pointers

So how do we declare a pointer variable? Easy ...

   ajint *ptr=NULL;

The * means that ptr is a memory address and the ajint tells us that it's the address of an integer.

When that line of the program is executed, sufficient memory to hold a memory address will be reserved for use by our program. This is normally 4 bytes on a 32-bit system and 8 bytes on a 64-bit system. The value of these bytes is set to NULL.

It's important to know that it's only in the context of a variable declaration that *ptr=NULL means set the value of the pointer to NULL. If *ptr=NULL was found elsewhere in the program it would mean, "set the value held at memory address ptr to NULL". The final thing to mention is that we've used NULL for the pointer and 0 for the integer in the declarations. Both initialise the variable to a "zero" value, but NULL is a pointer value and 0 is an integer, so use each with its own type; initialising an integer with NULL, for instance, will normally draw a warning from the compiler.

You can see that in the code below:

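#include <stdio.h>
#include "emboss.h"   /* EMBOSS header, provides the ajint type */

int main(void)
{
/*1*/  ajint x=0;
/*1*/  ajint *ptr=NULL;
/*2*/  printf("Value of x : %d\n", x);
/*3*/  ptr = &x;
/*4*/ *ptr=5;
/*2*/  printf("Value of x : %d\n", x);
       return 0;
}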

/* Output:
Value of x : 0
Value of x : 5
*/

In the code:

  • /*1*/ Declare an integer and a pointer to an integer.
  • /*2*/ Print the value of x.
  • /*3*/ Give ptr the value of the address of x.
  • /*4*/ Set the value of x to 5 by indirection.

In the above example, you would normally say that "ptr holds the address of x" or simply "ptr points to x".

It was mentioned above that a pointer can hold the memory address of another pointer. This is obvious when you think that a pointer, like any variable, resides somewhere in memory. So if a pointer that holds the memory address of an integer is a 'pointer to an int', then a pointer that holds the memory address of another pointer is, of course, 'a pointer to a pointer'. This bit of code shows how we declare a pointer to a pointer-to-an-int:

    ajint **ptrto=NULL;

The second * means that ptrto is a memory address. The ajint * tells us that it's the address of a pointer-to-an-integer. When the code is executed, enough memory to hold an address is reserved for our use and the value of the bytes is set to NULL.

Of course, the & (address) and the * (pointer) operators still work with pointers to pointers. Where you have multiple levels of pointers you can use multiple * (pointer) operators for dereferencing. *ptrto would dereference once and retrieve an address (a pointer to an integer). **ptrto would dereference twice and retrieve an integer. You can see this in the code below:


#include <stdio.h>
#include "emboss.h"   /* EMBOSS header, provides the ajint type */

int main(void)
{
/*1*/ ajint x=0;          /* an integer */
/*1*/ ajint *ptr=NULL;    /* a pointer to an integer */
/*1*/ ajint **ptrto=NULL; /* a pointer to a pointer-to-an-integer */

/*2*/ printf("Address of x : %p\n", (void *) &x);
/*2*/ printf("Address of ptr : %p\n", (void *) &ptr);
/*2*/ printf("Address of ptrto : %p\n", (void *) &ptrto);

/*3*/ ptr = &x;
/*3*/ ptrto = &ptr;

/*4*/ printf("Value of x : %d\n", x);
/*4*/ printf("Value of ptr : %p\n", (void *) ptr);
/*4*/ printf("Value of ptrto : %p\n", (void *) ptrto);

/*5*/ printf("Value of x by dereferencing ptr : %d\n", *ptr);
/*5*/ printf("Value of x by dereferencing ptrto : %d\n", **ptrto);
      return 0;
}

/* Output: 
Address of x : #1 
Address of ptr : #2 
Address of ptrto : #3 

Value of x : 0 
Value of ptr : #1     (i.e. the address of x)
Value of ptrto : #2   (i.e. the address of ptr)

Value of x by dereferencing ptr : 0 
Value of x by dereferencing ptrto : 0 
*/ 

There are no new concepts in the above code; it's merely an extension of what you already know about pointers. In the code:

  • /*1*/ We start by declaring 3 variables called x, ptr and ptrto. x is the integer, ptr is a pointer-to-an-integer and ptrto is a pointer to a pointer-to-an-integer.
  • /*2*/ We then print the address of each variable; x lives at #1, ptr at #2 and ptrto at #3.
  • /*3*/ We assign the address of x to ptr. The address of ptr is assigned to ptrto.
  • /*4*/ We print the value of each variable. x has a value of 0, ptr has a value of the address of x, i.e. #1, and ptrto has the value of the address of ptr, i.e. #2.
  • /*5*/ We print x out by indirection.

You already know what *ptr means. Further on, you see we dereference ptrto twice, which is what you've got to do if you want to get to the integer from it. The first time you dereference ptrto you get to ptr; the second time you are effectively dereferencing ptr, which takes you to x. Simple!

This and in fact all problems in pointers are very easily understood if you sketch what's happening on a piece of paper.

Always draw a diagram if you're not sure what's happening with your pointers.

It gets a bit of a mouthful when put into words, but here is the explanation of exactly what is happening in **ptrto:
ptrto holds the address of ptr (#2). So dereferencing ptrto once will give you the value of ptr (held at #2), i.e. the address of x (#1). By dereferencing ptrto a second time, you effectively dereference ptr, which will give you the value that is held at #1, i.e. the integer value 0.
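
The chain just described can also be checked in code, and it works for writing as well as reading. This is only a sketch: it uses plain int in place of ajint so that it compiles without the EMBOSS headers, and the function names are invented for illustration.

```c
#include <assert.h>

/* Write to the integer that sits two levels of indirection away. */
void set_via_double_pointer(int **ptrto, int value)
{
    **ptrto = value;
}

/* Build the x <- ptr <- ptrto chain and change x through ptrto. */
int demo_double_indirection(void)
{
    int x = 0;
    int *ptr = &x;        /* ptr holds the address of x     */
    int **ptrto = &ptr;   /* ptrto holds the address of ptr */

    assert(*ptrto == ptr);    /* one dereference: the address of x */
    assert(**ptrto == x);     /* two dereferences: the value of x  */

    set_via_double_pointer(ptrto, 5);

    return x;                 /* x itself has been changed to 5 */
}
```

Nothing was copied along the way: there is only one integer, and **ptrto is simply another route to it.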

If you've got this far with your head intact then you're closer than you think to having the basics of pointers licked, and you've got more or less all you need to master objects, pointers and memory management in EMBOSS.

Object definition

Consider the following object definition:

/* @data AjPPdbtosp *******************************************************
**
** Ajax Pdbtosp object.
** 
** Holds swissprot codes and accession numbers for a PDB code.
** 
** Pdb is the pdb code.
** n is the number of Acc / Spr pairs for this pdb code.
** Acc is the accession number
** Spr is the swissprot code
**
** AjPPdbtosp is implemented as a pointer to a C data structure.
**
** @alias AjSPdbtosp
** @alias AjOPdbtosp
**
** @@
******************************************************************************/
typedef struct AjSPdbtosp
{   	
    AjPStr     Pdb;    
    ajint      n;      
    AjPStr    *Acc;    
    AjPStr    *Spr;    
} AjOPdbtosp, *AjPPdbtosp;

Note how the structure is nicely documented - your object definitions should do the same! There is nothing new here other than Acc and Spr which are both pointers to AjPStr objects. As an AjPStr is itself a pointer (to the AjOStr object proper) you can see that we're dealing with pointers to pointers. In this case, Acc and Spr are going to be used to create two arrays of strings as we can see in the constructor function below:

Constructor function

/* @func ajXyzPdbtospNew ***********************************************************
 **
 ** Pdbtosp object constructor. Fore-knowledge of the number of entries is 
 ** required. This is normally called by the ajXyzPdbtospReadC / ajXyzPdbtospRead 
 ** functions.
 **
 ** @param [r] n [ajint] Number of entries
 **
 ** @return [AjPPdbtosp] Pointer to a Pdbtosp object
 ** @@
 ******************************************************************************/
AjPPdbtosp ajXyzPdbtospNew(ajint n) /*1*/
{ 
 
   AjPPdbtosp ret = NULL;           /*2*/ 
   ajint i=0;
 
   AJNEW0(ret);                     /*3*/
 
   ret->Pdb = ajStrNew();           /*5*/ 
 
   if(n)
   {
      AJCNEW0(ret->Acc,n);          /*4*/ 
      AJCNEW0(ret->Spr,n);          /*4*/ 
      for(i=0; i<n; i++)
      {
           ret->Acc[i]=ajStrNew();  /*6*/
           ret->Spr[i]=ajStrNew();  /*6*/
      }
   }

/*7*/
     ret->n = n;
 
     return ret;
 }
 

We'll go through this line by line.

  • /*1*/ The first line declares that the function returns an object pointer of type AjPPdbtosp. The parameter ajint n is the size of the Acc and Spr arrays, i.e. the number of pairs of Acc / Spr values that the object will hold.
  • /*2*/ The next line declares a variable called ret. This is the object pointer that is going to have memory allocated to it, and will then be returned to the calling function.
  • /*3*/ AJNEW0(ret); is the line that allocates an object proper to the pointer ret; ret will now point to an AjOPdbtosp object in memory. By the time AJNEW0(ret); returns, memory space for an AjOPdbtosp object will be reserved. This means enough space for an AjPStr, an ajint and two pointers. Note that we have not got our two arrays or any string objects proper yet!
  • AJNEW0 sets all the structure elements to 0. This means n is set to 0 and the pointers (including the AjPStr) are set to NULL.
  • AJNEW0 is a macro, which is why this line looks a bit odd. It will allocate a single object of the correct type to any pointer that is passed to it - it can be used with any object!
  • /*4*/ Compare AJNEW0 to the two AJCNEW0 lines. These allocate an array of n AjPStr object pointers to each of Acc and Spr. In other words, ret->Acc and ret->Spr will each point to a block of memory holding n AjPStr object pointers.
    Like AJNEW0, AJCNEW0 will work with objects of any type and also initialises the new variables to 0 or NULL.
  • /*5*/ So, we now have our arrays but still no strings yet. ret->Pdb = ajStrNew(); allocates memory for a string object to the pointer Pdb in our new object. Notice that -> is used to dereference the pointer and get to the Pdb element. This is the standard way in C of accessing elements in a data structure when we have a pointer to that data structure.
  • /*6*/ The lines ret->Acc[i]=ajStrNew(); and ret->Spr[i]=ajStrNew(); allocate memory for n string objects for each array. They also illustrate how pointer and array notation can be used together. In this case, we are accessing the i'th element of the arrays that ret->Acc and ret->Spr point to. The elements in these arrays are AjPStr (object pointers) and we are allocating a string object to each of them.
  • /*7*/ The rest is obvious. The integer in the object is set to the size of the arrays and the pointer to the new object, complete with an allocated string and two arrays of strings, is returned to the calling function by return ret;.
  • It is the job of the calling function to free the object itself and all the other allocated memory the elements of the object point to. Which brings us neatly onto the destructor function:
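
Before moving on, it may help to see the same allocation steps written in standard C. The sketch below is an analogy, not EMBOSS code: DemoPdbtosp, demo_calloc and demo_pdbtosp_new are invented names, char * stands in for AjPStr, an empty C string stands in for the result of ajStrNew(), and calloc stands in for AJNEW0 / AJCNEW0 (the macro summary later in these notes describes them as calloc-based). Aborting when memory runs out is a choice made for this sketch only.

```c
#include <assert.h>
#include <stdlib.h>

/* Plain-C stand-in for the Pdbtosp object (hypothetical). */
typedef struct DemoPdbtosp
{
    char  *Pdb;    /* stands in for AjPStr  Pdb */
    int    n;
    char **Acc;    /* stands in for AjPStr *Acc */
    char **Spr;    /* stands in for AjPStr *Spr */
} DemoPdbtosp;

/* calloc that stops the program if memory runs out (sketch only). */
static void *demo_calloc(size_t count, size_t size)
{
    void *p = calloc(count, size);

    if(!p)
        abort();

    return p;
}

DemoPdbtosp *demo_pdbtosp_new(int n)
{
    DemoPdbtosp *ret = NULL;
    int i = 0;

    assert(n >= 0);

    ret = demo_calloc(1, sizeof *ret);    /* the object proper, zeroed   */
    ret->Pdb = demo_calloc(1, 1);         /* an empty string             */

    if(n)
    {
        /* two arrays of n string pointers, all NULL to begin with */
        ret->Acc = demo_calloc((size_t) n, sizeof *ret->Acc);
        ret->Spr = demo_calloc((size_t) n, sizeof *ret->Spr);

        for(i = 0; i < n; i++)
        {
            ret->Acc[i] = demo_calloc(1, 1);   /* one string per element */
            ret->Spr[i] = demo_calloc(1, 1);
        }
    }

    ret->n = n;

    return ret;
}
```

There are three layers of allocation here: the object, the two arrays of pointers, and the strings the array elements point to. Each layer needs its own call, and each will need its own release when the object is destroyed.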

Destructor function

 /* @func ajXyzPdbtospDel ***********************************************************
 **
 ** Destructor for Pdbtosp object.
 **
 ** @param [w] thys [AjPPdbtosp*] Pdbtosp object pointer
 **
 ** @return [void]
 ** @@
 ******************************************************************************/
 
 void ajXyzPdbtospDel(AjPPdbtosp *thys)
 {
     AjPPdbtosp pthis = NULL;
     ajint i;

     if(!thys)	return;
     pthis = *thys;
     if(!pthis)     return;
 
     ajStrDel(&pthis->Pdb);
 
     if(pthis->n)
     {
 	for(i=0; i< pthis->n; i++)
 	{
 	    ajStrDel(&pthis->Acc[i]);
 	    ajStrDel(&pthis->Spr[i]);
 	}
 	AJFREE(pthis->Acc);
 	AJFREE(pthis->Spr);
     }
 
     AJFREE(pthis);
     (*thys)=NULL;
 
     return;
 }

It is your task in Practical 10 to figure out exactly what is going on in this destructor function. If you can do that, then you can be happy that you are on your way to becoming adept at objects, pointers and memory management in EMBOSS.

Suffice it to say that this function safely clears up all of the memory that was allocated by the constructor. This is achieved by calling appropriate destructor functions and by using AJFREE. AJFREE will free the memory pointed to by its argument. It's used in 3 places: twice to free the arrays and once to free the Pdbtosp object itself. Note! AJFREE as used here will free the arrays but will not free the string objects proper that are pointed to (this is the job of the ajStrDel calls in the preceding code).

The function also sets the object that was passed in to NULL. This is a requirement of all destructor functions for reasons explained in Talk 4.
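
The reason the destructor takes the address of the object pointer (a pointer to a pointer) is that this is the only way it can change the caller's variable. Here is a minimal sketch of the same pattern in standard C. DemoObj and the demo_obj_* functions are invented for illustration and are not part of EMBOSS; free() stands in for AJFREE.

```c
#include <assert.h>
#include <stdlib.h>

/* A tiny object holding one allocated string (hypothetical). */
typedef struct DemoObj
{
    char *name;
} DemoObj;

DemoObj *demo_obj_new(void)
{
    DemoObj *ret = calloc(1, sizeof *ret);

    if(ret)
        ret->name = calloc(1, 1);    /* an empty string */

    return ret;
}

/* Takes the ADDRESS of the caller's pointer so that it can reset it. */
void demo_obj_del(DemoObj **thys)
{
    DemoObj *pthis = NULL;

    if(!thys)
        return;

    pthis = *thys;

    if(!pthis)
        return;

    free(pthis->name);    /* free(NULL) is harmless */
    free(pthis);
    *thys = NULL;         /* the caller's pointer no longer dangles */
}

/* Create, destroy, then try to destroy again. Returns 1 on success. */
int demo_obj_lifecycle(void)
{
    DemoObj *obj = demo_obj_new();

    if(!obj)
        return 0;

    demo_obj_del(&obj);
    assert(obj == NULL);    /* reset by the destructor         */

    demo_obj_del(&obj);     /* second call sees NULL, returns  */
    demo_obj_del(NULL);     /* a NULL argument is also ignored */

    return obj == NULL;
}
```

One visible effect of the reset is that an accidental second call does nothing instead of freeing the same memory twice.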

Calling constructor and destructor functions

The following code snippet illustrates how the object and its constructor and destructor functions could be used. You'll notice they're used in just the same way as you've been managing memory for strings.
  int main(void)
  {
      AjPPdbtosp ptr=NULL;

      ptr = ajXyzPdbtospNew(10);
      ajXyzPdbtospDel(&ptr);

      /* ptr will have been reset to NULL now, and is ready for reuse */
      ptr = ajXyzPdbtospNew(10);
      ajXyzPdbtospDel(&ptr);

      return 0;
  }

EMBOSS memory allocation macros

The final thing is to give a summary of the EMBOSS memory allocation macros:
  • AJALLOC(nbytes) equivalent of malloc
  • AJALLOC0(nbytes) a calloc of nbytes
  • AJCALLOC(count,nbytes) a malloc of count lots of nbytes
  • AJCALLOC0(count,nbytes) a calloc of count lots of nbytes
  • AJNEW(p) a pointer to an object gets an object allocated for it using malloc
  • AJNEW0(p) a pointer to an object gets an object allocated for it using calloc
  • AJCNEW(p,c) a pointer to an object gets c objects allocated using malloc
  • AJCNEW0(p,c) a pointer to an object gets c objects allocated using calloc

For non-C programmers: "malloc" allocates memory but the contents are undefined, whereas "calloc" allocates memory and sets each location to zero.
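
To make the idea of a type-aware allocation macro concrete, here is a sketch in standard C. DEMO_NEW0 and DEMO_CNEW0 are hypothetical macros written for this example; they are NOT the EMBOSS definitions of AJNEW0 and AJCNEW0, only an illustration of how a macro can work out the right size from whatever pointer it is given.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical macros for illustration only (not the EMBOSS ones).
   sizeof *(p) is the size of whatever type p points to. */
#define DEMO_NEW0(p)      ((p) = calloc(1, sizeof *(p)))
#define DEMO_CNEW0(p, c)  ((p) = calloc((size_t) (c), sizeof *(p)))

typedef struct DemoPair
{
    int   n;
    char *label;
} DemoPair;

/* Allocate one object and an array of 4 ints, then check that calloc
   zeroed them. Returns 1 on success, 0 if memory could not be had. */
int demo_zeroed_allocation(void)
{
    DemoPair *one = NULL;
    int *many = NULL;
    int i = 0;

    DEMO_NEW0(one);         /* one DemoPair */
    DEMO_CNEW0(many, 4);    /* four ints    */

    if(!one || !many)
    {
        free(one);
        free(many);
        return 0;
    }

    assert(one->n == 0);
    /* all-zero bytes read back as NULL on mainstream platforms */
    assert(one->label == NULL);

    for(i = 0; i < 4; i++)
        assert(many[i] == 0);

    free(one);
    free(many);

    return 1;
}
```

The same two macros work for a structure pointer and for an int pointer because the size is taken from the pointer's own type, which is why such macros "can be used with any object".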



Talk 11 - The NUCLEUS Library

Background information for Practical 11

The NUCLEUS library provides higher-level functions and algorithms, mostly for molecular sequence analysis, including sequence comparisons, translation, codon usage and annotation. See the NUCLEUS Library Documentation for the documentation of individual functions and datatypes.

The available NUCLEUS libraries are listed below.

  • Sequence Alignment (embaln) - sequence alignment.
  • Database Indexing (embdbi) - B+ Tree Indexing plus Disc Cache.
  • Database Indexing (embindex) - general database indexing.
  • EST Methods (embest) - EST alignment functions.
  • Protein properties (functions) - amino acid residue/sequence properties.
  • Molecular fragments (functions) - Routines for molecular weight matching.
  • Pattern Matching Methods (embpat) - General routines for pattern matching.
  • Protein data bank (embpdb) - To handle protein structural data.
  • Sequence Properties (embprop) - Amino acid residue/sequence properties.
  • Show (Display) Methods (embshow) - general routines for sequence display.
  • Word Methods (embword) - wordmatch routines
  • Comparison Matrices (functions) - General sequence match routines.
  • Domain Methods (embdomain) - data structures and algorithms for handling protein domain data.
  • Word (n-mer) Methods (embnmer) - General word match routines.
  • Data Tables (embdata) - general routines for data tables.
  • Signature Methods (embsig) - data structures and algorithms for use with sparse sequence signatures.
  • Comments (embcom) - general routines for program complex
  • Consensus (embcons) - general routines for program consensus
  • Domainatrix Methods (embdmx) - data structures and algorithms used by the protein structure EMBASSY applications.
  • Xyz Coordinate Methods (embxyz) - miscellaneous stuff for protein structure.
  • Groups (functions) - functions for parsing ACD files.
  • Initialization (embinit) - functions for application initialisation.
  • Exit Methods (embexit) - functions for exiting cleanly.
  • Miscellaneous Methods (embmisc) - miscellaneous routines.
  • Reading data files (embread) - data file reading routines.

NUCLEUS is not as well developed or documented as the applications and the AJAX library. In future code refactoring, some of the libraries may be merged together.

In several cases, algorithms and data structures that you would expect to find in NUCLEUS are in fact kept in AJAX. An example of this is the domain handling code, most of which is kept in ajdomain.c/h rather than embdomain.c/h. The reason is one of compilation: the ACD file-handling code (which is part of AJAX) must call these functions, which therefore must live in AJAX. This may change in a future code refactoring exercise.



Last modified on 2005 by Jon Ison.