18th - 20th April, 2006

Hinxton Hall
Wellcome Trust Genome Campus
Hinxton, Cambridge
United Kingdom

Bioinformatics Software Development Course

Details of Practicals

DAY 1: The main objective is to get an overview of EMBOSS programming and become familiar with string-handling. Your main task is to implement a program to perform some simple string manipulation tasks.

Practical 1 - Introducing EMBOSS

For background information, see Talk 1

To begin, you'll learn how to install and compile EMBOSS on a PC running the Linux operating system.

1. The download area

Go to the EMBOSS homepage and click on the "Downloads" link and briefly review the Downloads page. You will notice there are two flavours of download:

The "Stable release" (e.g. EMBOSS-x.y.z.tar.gz) is a quality assurance-tested "freeze" of stable code, suitable for environments where stability is of paramount importance.
The "Developers (CVS) release" is the up-to-date code used by the EMBOSS Developers.

As a software developer, you should download the CVS version of EMBOSS.

2. First time installation

Before you start make sure you're in your home directory.

From the Downloads page follow the link to "Developers (CVS) release". Type, or cut and paste, the following lines:

 
    cd 
    cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/emboss login

The login command will ask you for a password, which is "cvs". Once you've entered it you are logged on to the CVS server. Now check out (download) EMBOSS by typing the following command:

    cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/emboss checkout emboss

This will take some time as it's downloading several megabytes of source code and data from the USA (the EMBOSS code repository lives on the same machine as the BIO* projects). Once the download is complete, issue the following command to terminate your CVS session:

    cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/emboss logout

You only need to check-out EMBOSS once for each machine you install a copy on. After this you'll just need to keep it updated. Read the notes from Talk 1 on how to do this.

3. Examine the checked-out directory structure

Let's assume your HOME directory is /home/fred. After having checked out the source code you'll have a /home/fred/emboss directory. To see what's there type:

    cd /home/fred/emboss
    ls -l

You'll see two directories, one is called "CVS" and the other is "emboss".
cd into the emboss directory i.e. to the directory /home/fred/emboss/emboss. Listing this directory reveals the contents of the EMBOSS source tree. You should see directories called "ajax" and "nucleus" (programming libraries), "emboss" (application code) and "embassy" (EMBASSY application code), amongst other stuff.
The reason why there is an "extra" emboss directory (/home/fred/emboss/) is that it provides a convenient place to install the bin, lib and share directories when a "make install" command is issued (more on this shortly).

4. Prerequisites for compilation

To compile EMBOSS the following GNU configuration tools must be installed on your system:

autoconf
automake
libtool
make
m4

The training room machines run RedHat Linux which provides all these tools (or at least bundles the RPMs) so nothing needs to be done for this course. Should you require them in the future (and it is important to keep up-to-date versions of these tools) they can be downloaded from ftp.gnu.org/pub/gnu.

5. Configuration and compilation

The first thing to do is configure the package, but you must first build a script to do that. From the second EMBOSS directory (/home/fred/emboss/emboss) type:

    aclocal -I m4       [note: aclocal is part of the automake package]
    autoconf
    automake -a

This will build the "configure" script from the files "Makefile.am" and "configure.in" (in the EMBOSS checkout) and "aclocal.m4". Specifically:

aclocal creates aclocal.m4 containing m4 macros used by the auto* tools.
autoconf reads configure.in and creates the "configure" script.
automake reads Makefile.am and creates Makefile.in

Running "./configure" will:

Check whether your system has the necessary functionality and libraries to compile EMBOSS.
Read Makefile.in and generate platform-specific Makefiles (used later).
Configure the build, e.g. set the install path, set #defined variables and decide which source files to compile.

"./configure" is controlled by command-line arguments (generally used to switch on features autoconf was unable to detect) and by environment variables (generally used to set build information such as compiler options).

If you intend to compile using "make install" (see below) you must specify an installation area for the executables and supporting files. It's good practice to specify this even if you intend to compile with a plain "make". Further, as you'll be using the gcc compiler, it's a good idea to turn warnings on (in EMBOSS most warnings point to genuine errors). To do this, type:

    ./configure --prefix=/home/fred/emboss --enable-warnings

Note that the '--enable-warnings' switch is for GCC compilers only. Finally, type:

     
    make

This will, by using the Makefiles, compile all the source files into executable binaries within e.g. /home/fred/emboss/emboss/emboss. Alternatively, compiling with "make install" will install the binaries and supporting files into bin, lib and share directories under the directory specified after "--prefix" on the configure line; in this case, the "extra" (first) emboss directory level from the CVS checkout (/home/fred/emboss). Had you not specified "--prefix=/home/fred/emboss" they'd be installed under "/usr/local" by default instead.

6. Environment variables and PATH

Assuming you are using a csh style shell then the following commands will set up the path and the current graphics library location:

    set path=(/home/fred/emboss/emboss/emboss $path)
    rehash
    setenv PLPLOT_LIB /home/fred/emboss/emboss/plplot/lib

Or, if you'd done a "make install" rather than just a "make":

    set path=(/home/fred/emboss/bin $path)
    rehash
    setenv PLPLOT_LIB /home/fred/emboss/share/EMBOSS/

7. Adding databases to the development installation

Database locations are defined in the files emboss.default or .embossrc. The emboss.default file should be kept under the third emboss directory and you'll find a template file in the distribution called e.g. /home/fred/emboss/emboss/emboss.default.template. For this course, however, we recommend using a .embossrc file, which should live in your home directory. For the moment this need only provide access to the EMBL and swissprot databases via SRS. Your .embossrc file therefore needs to look like this:

DB swissprot 
[ 
    type: P method: url format: swiss
    url: "http://srs.sanger.ac.uk/srs7bin/cgi-bin/wgetz?-ascii+-e+[swissprot:'%s']"
    comment: "Swissprot via Sanger SRS" 
]

DB embl 
[ 
    type: N method: url format: embl
    url: "http://srs.sanger.ac.uk/srs7bin/cgi-bin/wgetz?-ascii+-e+[embl:'%s']"
    comment: "EMBL via Sanger SRS" 
]

8. Testing

EMBOSS makes use of the GNU libtool, therefore the executables can be run from either the directory they're compiled in or from their 'make install' location. You should now be able to type:

    seqret embl:hsfau    
    seqret swissprot:mel_apido

and get a sequence retrieved in both cases.

Practical 2 - Navigating EMBOSS

For background information, see Talk 2

EMBOSS is big. You will learn how to efficiently navigate the package and its libraries to find stuff you need, by using SRS, documentation on the web and the source code itself.

1. Introduction

EMBOSS contains hundreds of library calls within the AJAX (low-level) and NUCLEUS (algorithm) libraries; navigating it can be daunting at first. Three methods are provided for this purpose:

SRS
Web documentation
The source code itself

2. SRS

Getting started
The source code is divided into two SRS databases, "EMBOSS Data Structures" (EDATA), and "EMBOSS Functions" (EFUNC). EDATA contains information about the objects, EFUNC about the functions. Both EDATA and EFUNC are available for the latest stable release and for the developers (CVS) release. To access these databases go to http://www.ebi.ac.uk/srs/. From there ...

Click on the "Library Page" tab at the top of the screen.
Expand the "Other databases" section by clicking on the "+" to the left of "Other databases". You will see EDATA and EFUNC listed.
Highlight the check-box next to "EMBOSS Data Structures (CVS)" and then click on the "Query Form" tab.
Change one of the "AllText" options to "ID" and type a "*" character in its associated box, then click on "Search".

You will see a list of every available datatype. Let's try a more specific search:

Return to the query form and replace the "*" by "ajpstr" (the AJAX string object).
Click on "Search".

You'll see that two entries are returned, AjPStr and AjPStrTok. They will be explained later in the course but for now click on the link for AjPStr.

The documentation here is broken into several sections. The first 3 give the name, description and "aliases" of the object. AjSStr is the formal name of the object, AjOStr is the datatype name for the object whereas AjPStr is the datatype name for the object pointer. A further explanation is available in Talk 2.

After the "Aliase(s)" section you'll see several more blocks which correspond to groups of functions. Each block contains a list of available functions within that group. The groups you see will depend upon the library file and are subject to change from the current code refactoring exercise, but might include:

Iterators - iteration, e.g. over individual characters in a string.
Constructors - create new instances of an object (allocate memory).
Destructors - destroy instances of an object (free memory).
Assignments - initialise an object, replace contents if necessary.
Modifiers - change or replace the contents of an object.
Operators - use, but do not change, the contents of an object.
Outputs - write the contents of an object to an external file.
Casts - convert an object into an object or data of another type.

All the listed functions link to the EFUNC database in SRS.

At the bottom of the page you'll see the "Related Data", "Attributes" and "Body" sections. "Related Data" lists objects that are related to the AjSStr object, "Attributes" lists the elements of the object (C data structure) whereas "Body" gives you the C code for the object definition.

Try clicking on an entry for a function e.g. ajStrAppendS. You will see the marked-up source code for the function.

Searching EFUNC directly
The EFUNC database can be searched directly. This is useful if you know the kind of function you want but don't know the name. The function names and documentation are being standardised to be as intuitive and consistent as possible as part of a code refactoring exercise, so finding stuff you need shouldn't be difficult. Let's assume you want to search for a function that appends one string to another.

Return to the SRS databases page, uncheck the EDATA database and check the check-box for the EFUNC database.
Then select the query form.
It's often best to limit the search to the description field so, change "AllText" to "Description".
Type "append & string" into the associated box, then click on "Search".

A list of approximately 10 functions will appear. You can only use those functions that begin with "aj" or "emb", being those that are in the AJAX and NUCLEUS libraries respectively. The others are hidden functions, used for handling the internals of EMBOSS and not for general use.

It should be obvious from looking at the names that the likely candidates for what you want are those in the ajStrAppend family. Have a look at those functions (and some of the others) to see why the search picked them up. You'll see that some of the functions accept other string objects, character strings or just single characters. This method is of course limited by the vocabulary used in the descriptions of the functions. For instance, we use the term "append" rather than "catenate". Prove this to yourself by repeating the above search using "catenate & string".

To show the advantage of limiting the search, change the "Description" field back to "AllText" and repeat the "append & string" query. You'll see that there is a significant amount of noise in the results list.

Viewing the source code
Of course you can use SRS if you know the name of a function and need to examine the source code.

Return to the EFUNC page and change "AllText" to "ID".
Now use "ajstrappend" as the search term. Perform the search and then click on EFUNC:ajStrAppendS.

You should see the source code for ajStrAppendS on screen. Like the datatype display the output is in several sections. The name of the function indicates the source library file in which it is to be found; the "str" of ajStrAppend indicates the ajstr.h/c library. The description field gives the text you search with a "Description" search.

The most useful information for a user of the library is in the Input, Returns and Prototype fields.

The "Input" field shows that this function takes the address of a string object pointer as its first parameter and a string object pointer per se as its second parameter. The "Returns" field shows, as expected, the return value (AjBool; a boolean value) of the function. All this information is given at-a-glance in the "Prototype" field for the function (the prototypes are built into the library and you don't declare them in your programs). A prototype tells the compiler what a function is expecting and what it will return.

Below the prototype is the body of the function. This patently contains the source code of the function. C language reserved words are highlighted in red. The source code is marked-up with any calls to other EMBOSS functions; unhighlighted function calls are standard C library calls. You could click on, for example, ajFatal and see the code for that function.

Clicking on the red arrow on the prototype line will show all the EMBOSS functions that use this particular function. Clicking on the blue arrow will show all the EMBOSS functions that are called by this particular function.

As an EMBOSS application programmer you really don't need to know most of the detailed information above, just the inputs and returns. As a library developer, all the information is useful.

3. Web documentation

Another useful route to the function documentation is via the EMBOSS website. Details are subject to change as the on-line documentation is revised.

From the EMBOSS homepage, click on "Developer docs". The relevant links in the contents section are "Ajax Library documentation (CVS & stable releases)" and "Nucleus Library documentation (CVS & stable releases)".
Click on the "AJAX library documentation (CVS & stable releases)" link in the contents. You'll see that documentation is available for the CVS (developers) and major releases of the stable version of the source code. Follow the "CVS (developers) release" link.

You'll see a table with each row corresponding to an individual library file, e.g. for Alignments, Array handling, Assert Functions etc. The links to "Datatypes" and "Functions" bring up information on the available datatypes and functions for that library file, whereas "Notes" links to general documentation for that library file.

Find "String manipulation" in the table and follow the link to "Functions".

You'll see details for all the available functions in the string handling (ajax/ajstr.c) library file. All the critical information is given including the name, description, prototype, parameters and return type. You'll notice each parameter is marked as "Input", "Output" or "Update", which reflects the relationship between the function and the parameter. So, "Input" parameters are read by the function, "Output" parameters are written by it and "Update" parameters may be both read and written. There may well be several fields which are blank; these will be completed as we progress in documenting the software libraries.

Go back a page, then follow the link to "Datatypes".

The page has a section for each datatype in the library file. Each section contains a table, corresponding to the classes of functions described above, that gives a link to and description of each function that uses that datatype. There are also tables for "Other related data structures" and for "Attributes"; the latter describes the elements of the object (C data structure) itself.

4. The source code

The source code itself is a vital reference, especially once you're more familiar with the libraries.

Open in a text editor ajstr.c and ajstr.h (found in e.g. /home/fred/emboss/emboss/ajax/ajstr.h).
Open the source for an application, e.g. cusp or sigcleave (found in e.g. /home/fred/emboss/emboss/emboss/cusp.c)

Get a feel for how the source code is laid out in the header file and the application code. You might notice that the style of some of the code differs as EMBOSS has multiple authors. Increasingly, though, the code is conforming to defined standards (more on that later in the course).

Searching for keywords in the .c files is a direct method to find example code for the task at hand. If you are unsure how to do a particular task, e.g. read in a data file, then find a program that does what you need and look at the source code to see an example of how it can be done. Bear in mind there are many ways to solve a problem and the example you see might not necessarily be the best.

Practical 3 - Your first EMBOSS application

For background information, see Talk 3

Learn the basic steps needed to develop any EMBOSS application by implementing helloworld! under EMBOSS. To develop this application, which prints "Hello, World!" to the screen, you will write an ACD file and a file of C source code that uses the EMBOSS libraries.

Following the steps below, write an EMBOSS application to print "Hello, World!" to the screen. Only limited information is given below; if you get stuck, refer to Talk 3. For this practical and throughout the course you will need to look up appropriate functions for the task at hand. Background information on how to navigate EMBOSS is given in Practical 2.

Think about the task at hand and design your software
This is a simple program but you should still design it before coding. Think about any inputs and outputs and the major logical steps in the source code.
Write ACD file
The ACD file should contain the application definition with a single documentation: attribute only. Save it in your ACD directory (e.g. /home/fred/emboss/emboss/emboss/acd).
Test ACD file
Make sure your ACD file is ok by using acdc.
Write source code
Remember to include copyright notice and disclaimer, import the EMBOSS interface, process any user input before the application proper starts and exit cleanly.
Add application to EMBOSS
Remember to edit the Makefile.am in the executables directory (two changes to make) and the Makefile.am in the ACD directory.
Compile
The GNU tools will recognise whether the Makefile.am files have been edited and reconstruct the Makefile files when a "make" command is given. Don't edit the Makefile files themselves!
Debugging
Testing
Documentation
Ensure that the main() function is appropriately documented and the .c file includes the header documentation - all good practice for later.

If you finish in good time, consider the following:

There are tidy ways of reporting various different types of message (e.g. information, warnings and errors) to the user - can you find the appropriate functions?

Practical 4 - Introduction to Objects Using Strings

For background information, see Talk 4

EMBOSS programming objects (C data structures and functions) form the core of the EMBOSS libraries. You'll get a gentle introduction to objects by adapting helloworld.c to perform some simple string manipulations. You'll learn about two special types of functions called constructors and destructors that are used to allocate and free memory for objects.

Following the steps below, modify helloworld.c to perform various string manipulation tasks. For each task you'll need to find the appropriate function(s) using SRS etc, as you did in the previous practical. When using the functions be very careful to pass the object pointer or its address as required. If you get stuck on the use of pointers, refer to Talk 4.

1. Modify helloworld.c to use an AjPStr (call it string1); don't do anything with it yet other than correctly declaring and initialising the object, allocating it (use the default constructor function) and freeing it (use the destructor function).

2. Comment out the line that prints "Hello, World!" to the screen. Now, using the appropriate functions, copy the text "Hello, World!" into string1. Print string1 to the screen; you'll have to use %S with the print function to print an AjPStr, rather than %s which is used for char * strings.

3. Can you find a different constructor function that does the job of both the default constructor and your assignment function? Implement the change!

4. Now add two new AjPStr's to your program to make a total of 3. You'll now copy your string1 to both of the new strings. For one of them (call this string2), use a constructor function that uses the reference count, i.e. makes a reference to (not a real copy of) string1. For the other (call this string3) make a genuine copy instead. Keep your memory management clean, which means calling a destructor for all of your strings, including the string you construct by reference, so that the reference count and ultimately the memory are managed correctly (see Talk 3 for more information). Print all 3 strings out to screen and print the usage count of string1 after each copy and after calling the destructor functions.

What effect on the usage count do the copy functions and destructor functions have?
How many objects do you have in memory at each stage?
And how many references to these objects?

You might notice that in certain circumstances you can get away without explicitly calling the constructor and destructor functions. This is because most of the EMBOSS functions will allocate memory for you as required if the application code hasn't done so. This should be considered a safety mechanism which you should not rely on; instead, strive to write clean code which calls constructor and destructor functions as logic dictates. In principle, the operating system should free any memory allocated to your program once it terminates. However, it's bad programming practice to rely on that, so you should clean up as described before exiting.

If you used the ajStrNew constructor function you might be surprised at the value of the usage count, which is higher than you might have expected. The explanation is that a call to ajStrNew doesn't immediately instantiate an AjSStr object; it just returns the address of the "NULL String" (an object which is defined globally in AJAX), whose reference count may well be in the hundreds owing to the call to embInit (which itself makes, indirectly, many calls to ajStrNew). It's only when the char * string is given a non-NULL value (by whatever means) that memory for a string object proper is allocated. AJAX is programmed in this way for maximum speed and efficiency of string handling. You can illustrate this idea nicely:

Replace any calls to ajStrNew with ajStrNewRes (you should look up what this function does).
Now how does the usage count respond?

5. Note that if you do anything (e.g. via an EMBOSS library call) to the original string1 such that its length would exceed the reserved string length forcing it to be reallocated (done by AJAX for you automatically), string2 will still point to the original string and will not be a reference to the new string1.

The use of pointers is a common source of error in C programming and EMBOSS too. The important point is this: always ask yourself whether anything is really gained from referencing rather than true copying; occasionally it has a place but not very often. Things are much more intuitive (and so your code is cleaner and more readable) if you make real copies. So now is a good time to convert the code so that string2 is a genuine copy of string1.

6. It's easy to find out whether a function reallocated a string that was passed to it. Most of the functions in ajstr.c that have an AjPStr* parameter return an AjBool. The '*' indicates an address of the string object pointer and that the function might change the object or the data pointed to, i.e. the string might be reallocated. The AjBool (boolean value) may have a value of ajTrue or ajFalse. In this case, ajTrue means that the string was reallocated.

To illustrate this:

Implement a loop that runs 100 times, each time appending string2 onto string1 and printing out the return value of the appropriate append function.
You will probably use an AjBool variable in doing this; tip - use %B to print out a boolean value.

You should see that it doesn't take long for the string to be reallocated.

Automatic dynamic memory allocation of strings is one of the great features of EMBOSS: you don't have to worry about resizing memory for your string, as EMBOSS does it all for you. This is also the basis for the safety mechanism mentioned above. EMBOSS also includes dynamic arrays and you might get to use them later.

7. Now we'll look at functions for copying substrings to string objects and how EMBOSS strings (AjPStr) can be used with normal C strings (char *). Your task is to copy the substring "Hello" from string1 to string2, and copy the substring "World" from string1 to string3. There are two functions (you'll need to find them) to copy a substring, one that works with an EMBOSS string (AjPStr) and another that uses a C-type string (char *). Experiment with both. For the latter, you'll need to get to the char * element within string1: your code shouldn't probe the structure directly, so use the library function ajStrGetPtr instead.

Can you find something in the library that achieves the same ends as ajStrGetPtr without making a function call? (Clue - look in ajstr.h).
What effect does using -1 as the value of the last argument to ajStrAssignSubS have?

8. That completes our first treatment of string handling which is probably the most developed of all the EMBOSS library files. We've only skimmed the surface of ajstr.c but later today in Practical 6 you'll review the entire string library by way of a worked example. There are some further exercises there too (at the end of the practical): you can start that practical now if you've completed this one early.

Practical 5 - ACD Files : Basic Skills

For background information, see Talk 5

EMBOSS allows very flexible definition of application interfaces via the ACD language. You'll get a gentle introduction to ACD programming by adapting your application for user-defined input and output and file-handling. This is a good treatment of the basics of writing ACD files and the code required to process them. You'll learn about various ACD datatypes and attributes, and about file handling.

1. Getting started

Your first task is to modify helloworld.c to print any user-defined string a user-defined number of times to the screen. Then modify your program to read the string from an input file instead, and write it to an output file.

First of all, read Talk 5 to familiarise yourself with ACD files.
Open the ACD file for helloworld in a text editor. It should look something like this:

application: helloworld
[
    documentation: "My first EMBOSS program"
]

2. Application groups

Now add an appropriate groups attribute to the application definition.

A list of valid group names is available here. You've already begun to turn helloworld into a general sequence manipulation program so pick a group that is appropriate.
Add a comment line indicating the date your ACD file was last edited.
After editing your file, use the seealso application to produce a list of the programs which share some functionality with it.

3. Getting started

Now modify your ACD file to support the replacement of the hard-coded string "Hello, World!" with a user-defined string.

You'll need to add a new definition to your ACD file. Pick the appropriate datatype from the list given here.
Give your definition an appropriate label.
Add attributes to the definition that give it a default value (of "Hello, World!" to begin with) and which provide textual information about the data item. Pick the appropriate attributes from the list given here.

Now modify your C source code to support the user-defined string.

Find the appropriate function for retrieving the value of your new data item. You'll need to look in the ajacd library file. "Text token name" is mentioned in the documentation and this refers to the label of the data item in the ACD file.
Then make appropriate changes to your program.

Now modify helloworld so that the only string it prints out is the one specified in the ACD file; comment out all hard-coded print statements that print "Hello, World!" or some derivative. When you run the program, you should get output that looks like this:

Unix % 
Unix % helloworld
My first EMBOSS program
Hello, World!

It should be obvious that it's using the default value. The string can be user-defined though; you just have to specify the appropriate qualifier (the label of the data item, "text" in the example below) on the command line when you invoke the program:

Unix % 
Unix % helloworld -text "Hello, Sailor!"
My first EMBOSS program
Hello, Sailor!

4. Watch out for memory leaks!

These can occur when, in your source code, you lose a reference to an allocated block of memory, usually by accidentally making a pointer that holds the address of that memory point somewhere else, without first freeing the memory or keeping track of where it's held. Bear in mind that the call to embInit will have allocated all the memory required for holding the ACD data items, which means you don't have to. Manage your memory carefully!

When retrieving for example a string (AjPStr) data item by using a call to ajAcdGetString, the function will return a pointer to the string created by embInit. This means that you do not have to allocate memory for the string first. In fact, if you do, you will have caused a memory leak; your call to the string constructor function would set aside some memory, but you'd then lose your handle on (reference to) this memory (creating the memory leak) when you catch the value returned by ajAcdGetString. You do have to free the string afterwards though to clear the memory allocated by embInit (although in the future a call to ajExit will do this). The following pseudocode snippet illustrates this:

AjPStr  mystring = NULL;

embInit(...);                   /* This allocates memory for all ACD data items and assigns their values. */
mystring = ajStrNewRes(...);    /* Allocating some memory ... unnecessarily. */
mystring = ajAcdGetXXX(...);    /* Now mystring points somewhere else, losing the handle on memory allocated
                                   by ajStrNewRes - this is a memory leak! */
ajStrDel(&mystring);            /* This must free the memory allocated by embInit.  Omit this and you'll
                                   have another leak. */

5. Qualifiers and parameters

Having to use the '-text' option is a bit cumbersome; it would be nice if we could specify the text to print without having to know the label name. For this we use the 'parameter' ACD attribute. If a data item is defined with the parameter attribute (parameter:), using the name of the parameter on the command line is not mandatory, i.e. you can just type helloworld "Hello, Sailor!".

Add the attribute 'parameter: Y' to your ACD file now.
Run helloworld, first by typing helloworld. You should notice that you are prompted for the value; a data item that is a parameter is always prompted for if not specified on the command line.
Now run helloworld by typing helloworld "Hello, Sailor!". Hey presto, it accepts the parameter and prints it.

Your next task is to add an integer data item to your ACD file; this integer should control how many times the user-defined string is printed to the screen.

To add the integer, just repeat the steps you went through for adding the string.
Do not make this data item a parameter just yet - experiment by calling helloworld with different command lines. You should notice that it doesn't matter whether the parameter comes before or after the qualifier for your new integer.
Now make the integer a parameter in your ACD file. Experiment again with calling helloworld, and make sure you find out for yourself that the order of parameters on the command line is important - they have to match the order used in the ACD file!

6. File input and output

You'll now learn about basic file input and output. All the functions you need for this are in ajfile.c. We're going straight in at the deep end: your task is to adapt helloworld so that it reads text from an input file, and writes this same text a user-defined number of times to an output file. Use the following steps as a guide, but you'll have to fill in the detail from what you've learned so far:

Add two new data items to your ACD file, one for an input file and one for an output file.
Now, by the time embInit has returned you will have two open file streams, one to the input file and one to a new output file that will have been created for you. You'll need two AjPFile objects in main(), one each for the input and output files.
Add code to retrieve the opened file streams from ACD: make your AjPFile objects catch the return value of the appropriate calls to ajAcdGetXXX.
Remember to close your files further on in your code; you'll be leaking memory otherwise.
Use the ajFileReadLine function to read a line of text from the input file. Create a new string to hold this, and watch your memory management!
Use the ajFmtPrintF function to write the text to your output file. This function is described in ajfmt.c.

Your last task is to think how the previous steps could be done in a different (i.e. hard) way. Look up ajFileNewIn and ajFileNewOut and think about what the code would look like if you used those functions to create the files for you, rather than relying on file objects in your ACD file. What would you need in your ACD file instead of your infile and outfile data items? This is a worthwhile task because it is not always suitable to have embInit create files for you; for example, you may need to create temporary files "on the fly".

When you're done, make a safe copy of helloworld.c and helloworld.acd. It'll be handy to have a basic, working program which you can experiment with when learning about ACD files and EMBOSS programming. It's possible to achieve the same ends as your helloworld program without even having a file of C source code. Can you think how?

Practical 6 - EMBOSS String Handling

For background information, see Talk 6

Efficient string handling is fundamental to molecular sequence analysis. You'll deepen your familiarity with the powerful AJAX string library by way of a worked example.

EMBOSS will soon include, for each library file, an application which illustrates the correct usage of each function in that library. Currently, these demo applications are kept in the applications directory and have the prefix "demo". For example, /home/fred/emboss/emboss/emboss/demostring.c illustrates the use of the string library. Bear in mind this is a work in progress: you'll notice demo applications are only available for a few of the library files, and of those, only demostring.c is guaranteed to be up-to-date and reliable.

Of course, there is an ACD file for each demo* application, and you could compile them as you would any other EMBOSS application. However, the Makefile.am files are already set up for you:

Open the two Makefile.am files (in the executables and ACD directories)
Find the entries for demostring

You'll notice it occurs in the check_PROGRAMS section of the "executables" Makefile.am file. This section is reserved for programs that are undergoing active development or have yet to be fully tested, or which should be considered incomplete for some other reason, e.g. they are undocumented or quality-assurance test data are not yet available for them. The demo* applications are in there, reflecting their "work in progress" status.

To attempt a compilation of the "check_PROGRAMS" applications, type "make check" from the applications directory. This will compile all the "check_PROGRAMS" applications.
If that fails, edit the Makefile.am files accordingly so that demostring.c is treated as a normal application, and compile it as usual.

You should now have a working demostring application. Your objective is to get a feel for the scope of the string library. To do that you'll need to

Run demostring
Inspect the output and the corresponding source code for each function call in turn.
If it's not obvious what each function is doing, or to test your understanding, edit the source code, recompile and run demostring again.

Once you've had enough of that, apply your knowledge of ACD files and your deeper knowledge of the string library to implement various simple string manipulations of your own choosing in your own program. Choose manipulations which involve some input from the user, such as an integer or a string value and which thus require you to extend your ACD file and the C code to process it.

Here are a few further exercises to test your knowledge of strings, which build on what you did for helloworld:

Declare a char variable in your program and copy the '!' from one of your strings into it. Do not probe the string internals yourself; there is a function to do this!
Use a single function call to replace all occurrences of "World" with "Sailor" in string1 (the long version that was appended to 100 times).
Write "Hello, World!" back into string2 and string3 any way you like. What is the effect of calling ajStrFmtUpper on one of the strings?
Compare string2 and string3 using ajStrMatchS; print out the return value of the function. Now do the comparison with ajStrMatchCaseS; what's the return value now?

DAY 2: The main objective is to deepen your knowledge of programming and software development under EMBOSS. Your main task is to adapt your program to incorporate the functions of several existing EMBOSS applications.

Practical 7 (Part I) - The AJAX Library & The Software Cycle

For background information, see Talk 7 (Part I)

Here you'll get an overview of the AJAX low-level library functions. But there's much more to software than just coding; you'll plan your day's work considering your design and the essential steps for efficient development.

1. Planning your work

Before you start programming today, make sure you read the whole of the text below.

Your main task for today is to develop your application into a general purpose sequence manipulation program which combines several existing EMBOSS applications into one. This will deepen your understanding of ACD files. You will have to apply much of what you have learnt so far, as well as methods for processing biological sequences that you'll cover today.

You should aim to incorporate functionality from at least two applications. It's up to you what to incorporate, but here are some suggestions:

Incorporate the functionality of pasteseq. This simple editing program allows you to insert one sequence into another sequence after a specified position and to then write out the results to a formatted (sequence) file. (We recommend you start with this one.)
Incorporate the functionality of cutseq. This simple editing program allows you to cut out a region from your sequence by specifying the begin and end positions of the sequence to remove. It removes the sequence from the specified start to the end positions (inclusive) and returns the rest of the sequence in the output file.
Incorporate the functionality of maskseq. This simple editing program allows you to mask off regions of a sequence with a specified letter. Why would you wish to do this? It is common for database searches to mask out low-complexity or biased composition regions of a sequence so that spurious matches do not occur. It is just possible that you have a program that has reported such biased regions but which has not masked the sequence itself. In that case, you can use this program to do the masking.
Incorporate the functionality of infoseq. This is a small utility to list the sequences' USA (Uniform Sequence Address), name, accession number, type (nucleic or protein), length, percentage C+G, and/or description. Any combination of these types of information can be easily selected or unselected. By default, the output file starts each line with the USA of the sequence being described, so the output file is a list file that can be manually edited and read in by any other EMBOSS program that reads sequence(s) to be analysed.
Incorporate the functionality of degapseq. degapseq reads in one or more sequences and writes them out again minus any gap characters. In effect it removes gaps from aligned sequences. In fact, it does more than just this as it removes ANY non-alphabetic character from the input sequence, so as well as removing the gap-characters, it will remove such things as the '*' in protein sequences that indicates the position of a 'translated' STOP codon.

The first thing to do is to decide which applications take your fancy and will be incorporated. Feel free to not restrict yourself to the list above.

2. The Software Development Cycle

Having decided on the basic functions of your program you are probably eager to get coding. There is however much more to software development than coding and you should first work up a design for your application and consider the basic steps to implementation.

First of all, read the relevant notes from Talk 7 (Part I). You'll not need to apply all of it today, but it should inform you of the scale of the task and the steps involved. After reading the notes, you should work to your own plan for the rest of the day, or follow (some or all of) the steps below:

Think about the problem: what exactly are you trying to achieve?
Think again. Are you certain you've considered everything? Once you start coding it'll be much more difficult to incorporate new ideas than had you prepared in advance. The more detail the better here!
Design your software. Use a flowchart, logical steps in English, explain it to someone else etc; whatever helps you get a crystal clear idea of what you intend to do.
Document the inputs and outputs (this can go straight into your ACD file). Document what your application will do in a nutshell (a couple of sentences will do!)
Consider whether you need a new data structure. There'll be more on writing your own data structures and constructor and destructor functions later in the course. But if you're brave, you can implement one based on what you know about the AjSStr object.
Write numbered comments for all the major logical steps to your program. These can go straight into your .c file.
If it'll help, write pseudo-code for each major step to capture the program logic, loop structures etc.
Insert code for the basics of the application, including copyright notice, #include statement, main() function and function prototypes (from your design!)
Write the ACD file (this should be easy as you have your design so already know your inputs and outputs).
Test ACD file (using acdc).
Implement C code for reading any ACD data items and for cleaning up memory at the end of the application. It's sensible to put this in now given that you've just completed your ACD file.
Research the functions you'll need for the application (more on this below).
Write the remaining C source code for the application.
Add the application to EMBOSS (edit Makefile.am files)
Compile. Fix any compilation errors.
Preliminary debugging : get it to run without crashing.
Further debugging : get it to run correctly on some sample data.
Testing : ensure it works correctly under different conditions (inputs).
Documentation: there's more on this later today, but you should, at a minimum, describe the basics of the inputs and outputs and what your application does. Insert the documentation into an appropriate place in the .acd and .c files. A few sentences will do at this stage.

3. Overview of the AJAX library

For this practical and throughout the rest of the course you'll need to use many functions other than those in the string library. You should read the relevant notes from Talk 7 (Part I) to get an overview of what's available.

Before you start coding, ensure you have an appreciation of what functions you'll need (see implementation steps above). It's sensible to research these functions before you start coding, so that everything you need is available to you. Conceptually, this is just the same as collecting all your tools together before starting a DIY project.

If you start coding before you know what you need and what's available, your progress will be slower as you'll have to repeatedly break off to research functions, when what you should be doing is concentrating on the program logic.

4. Getting started

We recommend you make a start by incorporating the functionality of the pasteseq program.

You can adapt helloworld.c or start from scratch if you think that'd be cleaner. Your application will use string handling, so you might want to look back over previous talks. It will also require sequence handling and ACD topics that are covered in later practicals and talks. Therefore, you might want to look at Talk 7 (Part I) and Talk 7 (Part II) now, which are really just a continuation from what you've covered so far.

Of course it's possible to cheat by simply pasting in the code from pasteseq.c, but you'll learn much more if you have a go from first principles and ask questions if you get stuck!

To help you out here, in no particular order, are some functions which may be of use (remember though there are many different ways to solve a problem):

ajSeqReplace
ajSeqStr
ajStrAssignS
ajStrInsertS

Practical 7 (Part II) - ACD Files : Intermediate Skills

For background information, see Talk 7 Part II

It's easy to create a friendly and intuitive application interface using the ACD syntax. This practical shows you how.

Work through the examples below. You can just read the examples but it's best if you edit your ACD file to correspond with what's shown, and type in the example commands.

So far you've used the application definition and the string, infile and outfile specifications in ACD. You'll have written an ACD file that contains something like:

application: helloworld 
[
  documentation: "Prints something arguably uninteresting"
]

string: message 
[
  parameter: "Y"
]

outfile: outfile 
[
  parameter: "Y"
]

Both message and outfile are defined to be parameters.

1. Parameters

Values for parameter data items must be specified on the command line in the order they appear in the ACD file. So to operate the above from the command line you'd have to type:

   % helloworld "Hello World!" message.dat

That would cause Hello World! to be printed to the output file message.dat. But what if we didn't want to force the user to specify a message, rather, we just wanted to add a default message ("Hello World!") to our ACD file which would be printed if nothing was given, i.e. by typing:

  % helloworld message.dat

we want "Hello World!" (the default message) to be printed to the file message.dat.

The above ACD will not do that. That is because all the ACD items are set to be "parameter". Typing the above would result in "message.dat" appearing as the string to be printed and you'd then be prompted for an output file name. In short, not the desired behaviour. This is where command line qualifiers come in.

2. Qualifiers

Qualifiers can appear anywhere on the command line but you must always refer to them by their label. So the reference to our message text will look like:

     -message "Hello World!"

This would allow you to type:

  % helloworld -message "Hello World!" message.dat      or
  % helloworld message.dat -message "Hello World!"

To get this to work you must specify the string to be a "standard" or an "additional" qualifier, instead of "parameter" that's currently specified. To recap from before:

Parameter: "Y" means that the data item is a parameter, i.e. you do not have to use the data label to specify a value for it on the command line. e.g. "myprog 10".
Standard: "Y" and Additional: "Y" mean that the data item is a qualifier, i.e. you DO have to use the data label to specify a value for it on the command line. e.g. "myprog -somevalue 10".
Values for parameters and standard qualifiers are always prompted for (with their default value) if not specified on the command line.
Values for additional qualifiers are not prompted for (a default value will be used) unless '-option' is given on the command line. A default value for additional qualifiers should always be given in the ACD file.

Your ACD file should now look like:

application: helloworld 
[
    documentation: "Prints something arguably uninteresting"
]

string: message 
[
    standard: "Y"
]

outfile: outfile 
[
    parameter: "Y"
]

The two invocations of helloworld shown above will now work. What happens, though, if you don't mention the message at all on the command line? That is, what happens if you just type:

  % helloworld message.dat

The answer is that "message.dat" will be taken as a parameter and used as the output file name; the program will then prompt you for a string to print out. Close, but still not the desired behaviour. What we wanted was for it to go ahead and run with a default string.

3. Default actions

You can associate a default value for most ACD definitions using the 'default' attribute. The ACD can be modified as below:

application: helloworld 
[
  documentation: "Prints something arguably uninteresting"
]

string: message 
[
  standard: "Y"
  default: "Hello World!"
]

outfile: outfile 
[
  parameter: "Y"
]

Now try typing

  % helloworld message.dat

You'll notice that although we specified a default, it's still prompting us for a value. Remember that values for standard data items are always prompted for, regardless of whether a default is specified or not. We need to specify message as being an additional data item, which is not normally prompted for. So your ACD file should look like:

application: helloworld 
[
  documentation: "Prints something arguably uninteresting"
]

string: message 
[
  additional: "Y"
  default: "Hello World!"
]

outfile: outfile 
[
  parameter: "Y"
]

This ACD finally does what is needed. Typing:

  % helloworld message.dat

will print "Hello World!" to the file message.dat. Hurrah! You can override the default message by specifying the message on the command line:

  % helloworld -message "Goodbye World!" message.dat

will print a rather morbid message to the output file. To reiterate, you must always supply a default value for "additional" ACD data items as EMBOSS will not prompt you for one if you omit to specify a value on the command line. EMBOSS would generate an error if you tried, from within your C source code, to access the value of that unspecified data item. In contrast, values for "standard" and "parameter" data items are always prompted for if they're not specified on the command line.

4. Confusing the picture

Having explained the above it should now be obvious that "parameters" are almost the same as "standard" data items. It would be perfectly acceptable to invoke the modified ACD using:

  % helloworld -message "Hello World!" -outfile message.dat

The difference with parameters is that you don't have to mention the label of the datatype whereas with qualifiers you do. If you omit the label for parameters then their values must appear on the command line in the order in which they appear in the ACD file.

5. Maxima and minima

It is often either useful or vital to be able to set limits on the maximum and/or minimum values to be associated with an ACD datatype definition. This is done in an intuitive way:

integer: window 
[
    standard: "Y"
    default:  10
    minimum:  5
    maximum:  100
]

6. Setting the prompts

EMBOSS will always provide default prompt text. So, for the "window" example in the last section the user would be prompted as follows:

  -window : Enter a number [10]:

While adequate, it's not entirely friendly. You can set the prompt for a datatype definition by using the "information" attribute. So,

integer: window [
  standard: "Y"
  default: 10
  minimum: 5
  maximum: 100
  information: "Window size"
]

will print the following as the prompt.

  Window size [10]:

which is much more meaningful.

7. Setting help information

Every EMBOSS application accepts built-in qualifiers -help and -verbose (the latter used in combination with the former if at all). These print out all the program parameters and qualifiers with explanatory text alongside. To set this information the "help" attribute is used. An example would be:

integer: window 
[
    standard: "N"
    default: 10
    minimum: 5
    maximum: 100
    information: "Window size"
    help: "Number of residues used to calculate the value for each point"
]

Practical 7 (Part III) - Sequence Handling

For background information, see Talk 7 Part III

ACD files support various biological data types and many file formats saving you huge effort. You'll deepen your knowledge of ACD by modifying your program to handle real biological sequences from any source.

1. Types of sequence input

There are three methods that must be catered for to access sequence information. First, a program may operate on a single sequence. Secondly, a program may need to read in sequences from a database one after the other. Thirdly, a program may need to read in a set of sequences at once. These three options are covered by the following ACD datatypes.

  sequence:          input a single sequence
  seqall:            read many sequences sequentially
  seqset:            read in a sequence set e.g. an alignment

These are defined just like any other ACD datatype. These datatypes have extra attributes though. One of the most used is the type attribute. Not surprisingly, this is used to limit the kind of sequence that EMBOSS will accept. The sequence type can be one of the following:

  any                   any sequence without gaps
  dna                   a DNA sequence without gaps
  rna                   an RNA sequence without gaps
  nucleotide            DNA or RNA without gaps
  protein               a protein sequence without gaps
  puredna               DNA without ambiguities
  purerna               RNA without ambiguities
  purenucleotide        DNA or RNA without ambiguities
  pureprotein           protein without ambiguities
  gapdna                DNA allowing gaps
  gaprna                RNA allowing gaps
  gapnucleotide         DNA or RNA allowing gaps
  gapprotein            protein allowing gaps
  stopprotein           protein allowing for "stop codon" equivalents
  gapany                any sequence at all

You can see the behaviour for yourself by, for example, coding an ACD data item to be a nucleotide sequence then giving it a protein sequence. Try it!

An ACD definition allowing the input of a single sequence (any type at all) would be:

sequence: fubar  
[
    parameter: "Y"
    type: "gapany"
]

In this case, the distinction between the datatype (sequence) and the label (fubar) is clear. It is conventional though to use the label "sequence" for all sequence inputs. So, although the ACD above would work it should be written as:

sequence: sequence  
[
    parameter: "Y"
    type: "gapany"
]

It is also customary (where possible/sensible) to have EMBOSS use the default prompts i.e. no "information" attribute is given. The seqall and seqset equivalents are:

seqall: sequence  
[
  parameter: "Y"
  type: "gapany"
]

seqset: sequence  
[
  parameter: "Y"
  type: "gapany"
]

2. Retrieving sequence input within a program

This is done just like any other datatype within the C program. Each of the three access methods has its own object type.

AjPSeq - a single sequence
AjPSeqall - multiple sequential sequences (one sequence at a time)
AjPSeqset - a set of sequences

Recovery is done using the associated ajAcdGet calls.

2a. Recovering a sequence

 ...
 AjPSeq seq=NULL;
 ...
 seq = ajAcdGetSeq("sequence");

2b. Recovering a sequence set

 ...
 AjPSeqset seqset=NULL;
 ...
 seqset = ajAcdGetSeqset("sequence");

You can then use AJAX library functions to recover information about the whole set of sequences or just individual sequences e.g. ajSeqsetSize. Hint: Do an ID search on EFUNC (SRS) using the search term "ajseqset".

A particularly useful function is ajSeqsetSeq which will give you a character pointer to the start of the n'th sequence in a set.

2c. Recovering sequences sequentially

 ...
 AjPSeqall seqall=NULL;
 ...
 seqall = ajAcdGetSeqall("sequence");

However, a seqall object is really a means to an end of returning individual sequences in a loop. A code segment that is often used for this is:

 ...
 AjPSeqall seqall = NULL;
 AjPSeq    seq    = NULL;
 ...

 seqall = ajAcdGetSeqall("sequence");
 ...
 while(ajSeqallNext(seqall, &seq))
 {
     /* Do something with 'seq' */
 }

Note: You may use the in-built qualifiers -sbegin and -send on the EMBOSS command line to specify a start and end position for a sequence. These are used to set values for corresponding elements in the appropriate object. Note however that regardless of the -sbegin and -send values, you still get all of the sequence accessible to you in memory. The function ajSeqTrim may be used to convert a sequence object to hold just that area of sequence. The function ajSeqOffset will return the -sbegin value the user specified, i.e. the offset from the start of the sequence.

3. Sequence output in ACD

There are three datatypes for sequence output which match their data input equivalents, namely:

seqout
seqoutset
seqoutall

By convention the label "outseq" is used with them.

seqout: outseq 
[
  parameter: "Y"
]

seqoutset: outseq 
[
  parameter: "Y"
]

seqoutall: outseq 
[
  parameter: "Y"
]

There is no associated type attribute as the output format is determined from within the application (see below).

4. Retrieving sequence output information in the application

This is done as you'd expect with associated objects and ajAcdGet calls.

A single object is used for all three forms of output:

AjPSeqout

The associated ajAcdGet calls are:

...
AjPSeqout seqout = NULL;
...
seqout = ajAcdGetSeqout("outseq");       or
seqout = ajAcdGetSeqoutset("outseq");    or
seqout = ajAcdGetSeqoutall("outseq");

5. Outputting sequence information

There are three AJAX function calls used for writing out sequence information. They are:

void ajSeqWrite(AjPSeqout outseq, AjPSeq seq);
void ajSeqsetWrite(AjPSeqout outseq, AjPSeqset seq);
void ajSeqAllWrite(AjPSeqout outseq, AjPSeq seq);

Note that for both ajSeqWrite and ajSeqAllWrite the object that is written is an AjPSeq, whereas for ajSeqsetWrite it is an AjPSeqset.

6. Setting the output format

The fasta format is used by default. The output format can be changed by adding "-osformat format" on the command line or by giving it in the USA (Uniform Sequence Address) of the output filename e.g. "embl::myfile.seq". There are many possible output formats which include:

"gcg"
"gcg8"
"embl"
"em"
"swiss"
"sw"

and many more. Some of the formats are synonyms for another (e.g. "embl" and "em").

See Talk 7 Part III for more details on supported formats and USAs.

7. Closing the output file

When you have finished writing the sequences then you can close the file by calling the AJAX function ajSeqWriteClose. Of course, you must remember to manage the memory for all your objects.

Practical 7 (Part IV) - ACD Files : Advanced Skills

For background information, see Talk 7 (Part IV)

The definition of an application interface frequently involves calculations, conditional statements and the use of variables and menus. You'll learn how EMBOSS supports all these operations and also lets you probe the attributes of input datatypes such as sequences.

To complete your coverage of ACD files and their processing, work through the examples below. Once again, you can just read the examples or edit your ACD file and source code to correspond to what's shown.

ACD menus

Selecting from a list of options is often necessary in molecular biology programs. ACD provides two methods of doing this, the select datatype and the list datatype.

1. The list datatype
Here is a typical list definition in ACD.

list: frame  [
  standard: "Y"
  help: "Allows selection from a set of reading frames"
  default: "1"
  minimum: "1"
  maximum: "1"
  header: "Translation frames"
  values: "1:1, 2:2, 3:3, F:Forward three frames, -1:-1, -2:-2, -3:-3, R:Reverse three frames, 6:All six frames"
  delimiter: ","
  codedelimiter: ":"
  information: "Frame(s) to translate"
]

The user will be presented with the title of the menu from the header attribute. After that will appear the tag/text information from the values attribute. Following that will be the prompt from the information attribute. It will look something like this:

Translation frames

   1     1
   2     2
   3     3
   F     Forward three frames
  -1    -1
  -2    -2
  -3    -3
   R     Reverse three frames
   6     All six frames

Frame(s) to translate[1]:

The user is allowed (generally) to supply a comma-separated list of options from the above depending on the minimum and maximum values. In the above example both the minimum number of selections and the maximum number of selections are set to one, therefore only one selection value is allowed. Selection is done by typing the tag values, therefore if the maximum count had been set to 3 then a user entry of "-1,F,6" would be a valid input.

The codedelimiter attribute allows you to select any character to separate the tags from the text in the values attribute. The delimiter attribute allows you to choose a character to separate the tag/text pairs within the values attribute.

1a Recovering list selections from the application source code
Taking the above list as an example it would be recovered as follows:

...
AjPStr *flist=NULL;
...
flist = ajAcdGetList("frame");
...

The declaration shows that flist will be an array of string objects. The values held in the strings are the tags from the values list and the list is terminated by a null string. So, using our example, as only one value is allowed and let's assume the user had answered '6' to the prompt the resulting array would be:

    flist[0]     a string object that contains "6"
    flist[1]     a null string object

Your code would likely step through the list, if maximum is greater than 1, using something like the following:

n = 0;
while(flist[n])
{
    if (ajStrMatchC(flist[n],"6"))
    {
       /* Do something */
    }
    ...
    ++n;
}

2. The select datatype
This is very similar to the list datatype. Here is an example:

select: order  
[
  casesensitive: "N"
  default: "score"
  delimiter: ","
  header: "Sort order of results"
  help: "Name of the output file which holds the results of the analysis."
  maximum: "1"
  minimum: "1"
  information: "Select sort order of results"
  standard: "Y"
  values: "length, position, score"
]

Note that the values attribute just contains a 'delimiter' separated list of text (leading whitespace is ignored). Selection takes place on the text values, there are no tags. Also, note that there is a casesensitive attribute which is often set to "N" so that shifted and non-shifted characters are equally acceptable. The user can select on the text down to a length of text that has no ambiguity with other values.

2a Recovering 'select' selections within the application source code
Taking the above select as an example it would be recovered as follows:

...
AjPStr *oselect=NULL;
...
    oselect = ajAcdGetSelect("order");
...

Otherwise the code is the same as for the list datatype with the exception that it would test for the strings "length", "position" and "score".

Calculated attributes, conditions, calculations, variables and options

1. Calculated attributes of sequences

When writing a program to insert one sequence into another, one way to make sure that the insertion position is not greater than the length of the first sequence is to use code like the following:

    if(position > ajSeqLen(seq))
        ajFatal("Insertion position out of bounds");

The problem is that the program terminates only once it is running, after the user has gone to the effort of entering all the inputs. It would be better if the interface itself forced the correct input, and calculated attributes in the ACD file achieve exactly that. There are eight calculated attributes you can use. Assuming you have the following ACD snippet:

sequence: sequence 
[
  parameter: "Y"
  type: protein
]

The eight calculated attributes are:

sequence.begin      Start residue (-sbegin value)
sequence.end        End residue (-send value)
sequence.length     Length
sequence.protein    True if sequence is protein
sequence.nucleic    True if sequence is nucleic
sequence.name       Name
sequence.weight     Alignment weight for a seqset
sequence.count      Number of sequences in a seqset

You access them with the ACD "get the value of" syntax which consists of surrounding a term in parentheses and putting a dollar sign at the front. They therefore become:

    $(sequence.begin)
    $(sequence.end)
    $(sequence.length)    etc

Therefore, to make sure your insert program doesn't try inserting off the end of the sequence you just need to add:

    maximum: $(sequence.end)

to the integer definition of the insertion position. These calculated attributes are also useful for conditional statements (see below).

2. ACD calculations

Calculations can be performed in ACD using the '@' syntax. A rather silly, but legal, calculation would be:

       @(5 + 9)

which equates to the value 14. You can add, subtract, multiply or divide.

Calculations can be used to test for equality, inequality, greater than or less than using: == != > <. For example the following ACD would be legal but possibly not useful in practice.

   standard: "@($(sequence.length) == 20)"

Here you can see the standard attribute being set to either "Y" or "N". Up till now you've only ever specified "Y" after parameter, standard or additional, but "N" is also supported. An "N" overrides the default behaviour of these attributes so that prompting for a value is turned off, which is useful in some situations. In this case, the calculation switches the prompt on only if the sequence length is equal to 20. You can use calculations with most attributes of datatypes where they make sense. An example might be:

sequence: sequence 
[
  parameter: "Y"
  type: pureprotein
]

integer: window 
[
  standard: "Y"
  etc
]

integer: start 
[
  standard: "Y"
  maximum:  "@(@($(sequence.length) - $(window)) + 1)"
  etc
]

This sets the maximum value of start to the sequence length minus the window size, plus one. Note that there are two separate calculations here, so each needs to be surrounded by its own @() syntax. Long calculations can get messy. If you need to use them then you possibly need to rethink your ACD logic. If they can't be avoided then they can be tidied up with the use of variables (described later).
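To see why the "+ 1" is needed, the same arithmetic can be written as a small C function. This is only an illustrative sketch: max_window_start is a hypothetical name, not an EMBOSS function, and it assumes residue positions are counted from 1.

```c
#include <assert.h>

/* Hypothetical helper: largest 1-based start position at which a window
** of 'window' residues still fits inside a sequence of 'length' residues.
** Mirrors the ACD expression @(@($(sequence.length) - $(window)) + 1). */
static long max_window_start(long length, long window)
{
    assert(window >= 1 && window <= length);

    return (length - window) + 1;
}
```

For a sequence of length 100 and a window of 10 the result is 91: the last window that fits covers residues 91 to 100. Without the "+ 1" the final window position would be lost.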

Equality and inequality tests can also be used on strings, as indeed can greater or less than but these don't usually make sense.

3. Conditional statements in ACD

There are two kinds of conditional statements in ACD, unary and ternary.

A typical use for unary conditionals is to switch prompts on or off. Let us assume that a window size should only be prompted for if the sequence turns out to be a protein. The ACD to accomplish this would look as follows:

sequence: sequence 
[
  parameter: "Y"
  type: gapany
]

integer: window 
[
  standard: "$(sequence.protein)"
  etc
]

If the sequence is a protein then the standard statement is equivalent to:

   standard: "Y"

and the prompt is switched on. If the sequence is nucleic the statement is equivalent to:

   standard: "N"

This will effectively disable the prompt.

Ternary conditionals are described below.

4. Negation

Negation often finds a use in ACD files. Let's assume that your application can produce both graphic and textual output. Assume further that you only want textual output if the user hasn't selected graphical output. First you would set up a toggle ACD datatype definition as follows:

toggle: plot 
[
  standard: "Y"
  default: "N"
  information: "Plot a graph"
]

A toggle is a special type of Boolean datatype that is used exclusively to control the prompting of other attributes.

The value of $(plot) will be "Y" if the user adds "-plot" to the command line. The value is "N" if either the user doesn't add anything to the command line or if the user adds "-noplot" to the command line.

The output file can now be defined as:

outfile: outfile 
[
 standard: "@(!$(plot))"
]

This becomes equivalent to standard: "Y" only if plot is not true. The negation operator (!) is a calculation so the term must be surrounded by @().

Unfortunately this doesn't work as written, although the logic is sound. The reason is that EMBOSS handles file input/output operations differently from other datatypes. If it sees one of the file (e.g. outfile) or sequence (e.g. seqout) definitions it will always try to open it. If the term equates to standard: "N", and no filename has been specified on the command line or as a default (and you wouldn't normally specify a default name for a file), then ACD parsing will try to open a file with no name, which causes an error.

There is a way around this and that is to use the "nullok" attribute. A definition of outfile that works is:

outfile: outfile 
[
  standard: "@(!$(plot))"
  nullok: "Y"
]

The nullok statement above means that it's OK to continue (do not generate an error) if no filename is given.

5. Boolean tests (& and |)

Boolean tests can also be performed using calculations. Here is an ACD code snippet:

integer: fubar 
[
   standard: "Y"
   default: 5
   etc
]

integer: rtfm 
[
   standard: "@(@($(fubar)==3) | @($(fubar)==7))"
   etc
]

The integer rtfm will only be prompted for if the value of fubar is either 3 or 7. Each of the equality tests is a calculation and the boolean test is another calculation. There are therefore three @() instances. The AND (&) operator can also be used in such calculations.

6. Ternary conditional

This calculation has the form:

  @(conditional  ? value-if-true : value-if-false)

It is useful, for example, when setting gap penalty values differently for proteins and nucleic acids in alignment programs:

integer: penalty 
[
  standard: "N"
  default: "@($(sequence.protein) ? 14 : 16)"
  etc
]

This will set the penalty to 14 for proteins and 16 for nucleic acids.
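If you are more at home in C than in ACD, it may help to read the negation, boolean and ternary calculations from sections 4 to 6 as their C equivalents. The functions below are hypothetical and exist only to show the correspondence; note that ACD writes OR and AND as single characters (| and &) where C uses || and &&.

```c
#include <assert.h>

/* @(!$(plot)) : prompt for the outfile only if plot is false */
static int outfile_is_prompted(int plot)
{
    return !plot;
}

/* @(@($(fubar)==3) | @($(fubar)==7)) : prompt if fubar is 3 or 7 */
static int rtfm_is_prompted(int fubar)
{
    return (fubar == 3) || (fubar == 7);
}

/* @($(sequence.protein) ? 14 : 16) : default depends on sequence type */
static int default_penalty(int is_protein)
{
    assert(is_protein == 0 || is_protein == 1);

    return is_protein ? 14 : 16;
}
```

In each case the ACD calculation is evaluated when the command line is processed, so the application itself never needs to repeat the test.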

7. Variables: keeping things tidy

Variables are useful for holding partial calculations or values. They can keep ACD looking neat and tidy. The syntax for them is:

    variable: label value

As an example, here is the window calculation again from section 2:

integer: start 
[
  standard: "Y"
  maximum:  "@(@($(sequence.length) - $(window)) + 1)"
  etc
]

This can be tidied by storing one of the calculations in a partial result as follows:

variable: lminusw "@($(sequence.length) - $(window))"

integer: start 
[
  standard: "Y"
  maximum: "@($(lminusw) + 1)"
  etc
]

Practical 7 (Part V) - Software Consolidation

For background information, see Talk 7 (Part V)

Learn how to consolidate your work through coding standards, documentation and quality assurance tests. You'll apply these methods to your application.

1. Coding standards

First, make a copy of your source code.

After reviewing the EMBOSS C programming standards document which is summarised in Talk 7 (Part V), modify your code so that it conforms to the standards (make as many changes as time allows).

Modify your application so that the functionality is in distinct functions, if it is not already like this. Pay particular attention to the general arrangement of your code (variable declarations at the top, memory allocation in one block where possible, etc.) so that you achieve an intuitive layout.

Compare your revision to your original. It should be significantly easier to read.

2. Code documentation

First, review the EMBOSS code documentation standards, which are summarised in Talk 7 (Part V).

Add comments to your code to describe:

a broad overview at the beginning of the application.
tricky steps within the program or functions.
major logical steps within the program or functions.

Ensure you include the GPL license.

Ensure you have documented the main() function.

If you have defined any functions, document them now using the standard EMBOSS method.

If you have defined any new data structures, document them now using the standard EMBOSS method.

3. Application documentation

Review the notes on application documentation in Talk 7 (Part V).

If you have time, generate application documentation (in html format) using the described method.

4. Code quality assurance

Review the notes on quality assurance testing in Talk 7 (Part V).

As time permits, write and run one or more QA tests for your application using the described method.

DAY 3: The main objective is to get a feel for some of the advanced programming features. This day is open-ended; you can pick tasks to learn about the advanced features, or consolidate what you did on the previous days.

Practical 8 - Data Input : Using Features

For background information, see Talk 8

Features are specific regions of interest in a biological sequence. You'll learn how EMBOSS supports a variety of common feature formats and modify your program to read and write sequence features.

First, review the notes on features given in Talk 8. Your program will likely already use a sequence, seqset or seqsetall datatype (for sequence input) and a seqout, seqoutall or seqoutset datatype (for sequence output). If it doesn't, then modify it so that it reads and writes a sequence using, say, the sequence and seqout datatypes. See Talk 7 (Part III).

Use the features ACD attribute (note "attribute" and not "datatype"!) so that sequence feature input is supported, and so that the sequence output will include feature information.
Experiment to see what the different feature formats look like. You'll need to use -offormat on the command-line along with "embl", "gff", "swissprot", "swiss", "pir" or "nbrf". You should notice that if the output format you specify does not support features, the features will be written to a separate file (by default in GFF format).
Experiment by using the "-oufo" or "-offormat" and "-ofname" command-line qualifiers to force your application to write the sequence features to a raw feature table file, i.e. just the feature table without the other database records.
Edit this raw feature table by hand, changing the value of some of the features there. It doesn't matter what you change so long as you don't break the file format.
Now experiment by using '-ufo' or '-fformat' and '-fopenfile' to read in your edited raw feature table, and so replace any existing feature table that was read from the input sequence (e.g. file or database entry). You should confirm the features were overwritten by examining your output file.

You'll now have a good grounding in features. To deepen your knowledge, you might wish to look at:

showfeat for displaying features.
extractfeat for extracting the sequences of features.

Practical 9 - Data Input : Using Reports

For background information, see Talk 9

The standardisation of application input / output is essential for interoperability. You'll learn how EMBOSS achieves this by modifying your program to use one of the standard EMBOSS report formats.

First, review the notes on reports given in Talk 9. There are several exercises to try:

Compile the program listed in Talk 9 and experiment by running the program with the -rformat qualifier and the list of permissible report names.
Try modifying the program to use a seqall datatype and report values for several sequences between the head and tail. Hint: look at ajFeattableClear
Modify the program to print out multiple reports (one per sequence) in the same file.

and / or ...

You may well have already incorporated the functionality of infoseq into your application. If you have not done so, look at it now.
Ignore the html output options and try implementing the infoseq functionality using a report.

Practical 10 - Objects, Pointers and Memory Management

For background information, see Talk 10

To support new biological datatypes you'll need a deeper knowledge of memory management under EMBOSS. You'll learn how to program your own data structures and functions for their manipulation.

Testing your understanding

After working through the material in Talk 10, figure out exactly what is going on in the destructor function below. If you can do that, then you can be happy that you are on your way to becoming adept at objects, pointers and memory management in EMBOSS.

 
 /* @func ajXyzPdbtospDel ***********************************************************
 **
 ** Destructor for Pdbtosp object.
 **
 ** @param [w] thys [AjPPdbtosp*] Pdbtosp object pointer
 **
 ** @return [void]
 ** @@
 ******************************************************************************/
 
 void ajXyzPdbtospDel(AjPPdbtosp *thys)
 {
     AjPPdbtosp pthis = NULL;
     ajint i;

     if(!thys)	return;
     pthis = *thys;
     if(!pthis)     return;
 
     ajStrDel(&pthis->Pdb);
 
     if(pthis->n)
     {
 	for(i=0; i< pthis->n; i++)
 	{
 	    ajStrDel(&pthis->Acc[i]);
 	    ajStrDel(&pthis->Spr[i]);
 	}
 	AJFREE(pthis->Acc);
 	AJFREE(pthis->Spr);
     }
 
     AJFREE(pthis);
     (*thys)=NULL;
 
     return;
 }

Exercise

A very worthwhile exercise would be to implement a data structure of your own design and accompanying constructor and destructor functions, and modify your program to use these. This data structure could, for example, hold all of the output data that are calculated by your application. You may need to refer to the notes in Talk 4 and Talk 10 to do this.
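As a starting point, here is a minimal sketch of a constructor/destructor pair written in plain C with malloc and free, so that it can be compiled without EMBOSS. The Result structure and the result_new and result_del functions are hypothetical; in your own code you would use the EMBOSS string and memory functions described in Talk 10 instead. Note how the destructor takes the address of the pointer so that it can set the caller's pointer to NULL.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical object holding the output of an application */
typedef struct Result
{
    char   *name;      /* name of the sequence analysed */
    size_t  n;         /* number of scores */
    double *scores;    /* array of n scores */
} Result;

/* Constructor: allocate the object and everything it owns.
** Returns NULL if any allocation fails. */
static Result *result_new(const char *name, size_t n)
{
    Result *r;
    size_t i;

    assert(name != NULL);

    r = malloc(sizeof *r);
    if (!r)
        return NULL;

    r->n = n;
    r->name = malloc(strlen(name) + 1);
    r->scores = n ? malloc(n * sizeof *r->scores) : NULL;

    if (!r->name || (n && !r->scores))
    {
        free(r->scores);
        free(r->name);
        free(r);
        return NULL;
    }

    strcpy(r->name, name);
    for (i = 0; i < n; ++i)
        r->scores[i] = 0.0;

    return r;
}

/* Destructor: free what the object owns, then the object itself,
** then reset the caller's pointer so it cannot be used again. */
static void result_del(Result **thys)
{
    Result *pthis;

    if (!thys)
        return;
    pthis = *thys;
    if (!pthis)
        return;

    free(pthis->scores);
    free(pthis->name);
    free(pthis);
    *thys = NULL;
}
```

Because the destructor checks for NULL at both levels, calling it twice on the same pointer, or on a pointer that was never allocated, is harmless.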

Practical 11 - The NUCLEUS Library

For background information, see Talk 11

The NUCLEUS Library incorporates various algorithms for molecular biology. This practical will give you a taste of what's available.

First, review the notes on NUCLEUS in Talk 11.

Your task, if time permits, is to adapt your program so that it has an option to perform a Needleman-Wunsch global alignment on two input sequences. This is a very significant challenge, so don't worry if you don't succeed; the main thing is to get a feel for how you might proceed.

In order to do this, you'll need (at least) the following ACD data types:

sequence (sequence a)
sequence (sequence b)
matrixf (residue substitution matrix)
float (gap insertion penalty)
float (gap extension penalty)

To handle the ACD data items, you'll need the following in your code:

AjPSeq (sequence a)
AjPSeq (sequence b)
AjPMatrixf (residue substitution matrix)
float (gap insertion penalty)
float (gap extension penalty)

To perform the alignment, you'll need to call the following functions:

embAlignPathCalc
embAlignScoreNWMatrix
embAlignWalkNWMatrix
embAlignReportGlobal

which, as you can see from their "emb" prefix, are all NUCLEUS functions.

You'll need a fair bit of other code too, for instance, to convert your ACD data types into a format that the alignment functions expect, and for memory management.

Try and implement the functionality by reading the function documentation and by applying what you've learnt so far. You can of course refer to the source code for needle if you get stuck, but avoid the temptation of merely copying and pasting code.
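If you want to remind yourself of what the algorithm computes before you start, here is a minimal plain C sketch of the Needleman-Wunsch scoring recurrence. It is deliberately simplified and is not the NUCLEUS implementation: it uses a fixed match/mismatch score instead of a substitution matrix, a single linear gap score instead of separate insertion and extension penalties, and it returns only the optimal score, not the alignment. The names nw_score and max3 are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

static int max3(int a, int b, int c)
{
    int m = (a > b) ? a : b;

    return (m > c) ? m : c;
}

/* Hypothetical helper: optimal global alignment score of a and b.
** 'gap' is the score added for each gap position (normally negative).
** Returns INT_MIN if memory cannot be allocated. */
static int nw_score(const char *a, const char *b,
                    int match, int mismatch, int gap)
{
    size_t n, m, i, j;
    int *prev, *curr, *tmp;
    int score;

    assert(a != NULL && b != NULL);

    n = strlen(a);
    m = strlen(b);

    /* only two rows of the scoring matrix are needed for the score */
    prev = malloc((m + 1) * sizeof *prev);
    curr = malloc((m + 1) * sizeof *curr);
    if (!prev || !curr)
    {
        free(prev);
        free(curr);
        return INT_MIN;
    }

    /* first row: b aligned entirely against gaps */
    for (j = 0; j <= m; ++j)
        prev[j] = (int) j * gap;

    for (i = 1; i <= n; ++i)
    {
        curr[0] = (int) i * gap;          /* a aligned against gaps */

        for (j = 1; j <= m; ++j)
        {
            int s = (a[i - 1] == b[j - 1]) ? match : mismatch;

            curr[j] = max3(prev[j - 1] + s,   /* align a[i] with b[j] */
                           prev[j] + gap,     /* gap in b */
                           curr[j - 1] + gap);/* gap in a */
        }

        tmp = prev;
        prev = curr;
        curr = tmp;
    }

    score = prev[m];
    free(prev);
    free(curr);

    return score;
}
```

With match 1, mismatch -1 and gap -1, aligning "ACGT" with "AGT" scores 2: three matches and one gap. The NUCLEUS functions listed above additionally keep the full matrix so that the path can be walked back to recover the alignment itself.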

Free for All

You may continue work on your sequence manipulation program, aiming to either provide complete coverage of the functions of the applications mentioned in Practical 7 Part I, or to implement one or more of the other features mentioned today.

Otherwise, feel free to discuss with us how EMBOSS could be used with your own bioinformatics projects. Or go home / to the pub early if you've had enough :)

Last modified on 2005 Jon Ison.