Bioinformatics Software Development Course
Apr 18-20 2006

Details of Practicals

DAY 1: The main objective is to get an overview of EMBOSS programming and become familiar with string handling. Your main task is to implement a program that performs some simple string manipulation tasks.

Practical 1 - Introducing EMBOSS

For background information, see Talk 1.

To begin, you'll learn how to install and compile EMBOSS on a PC running the Linux operating system.

1. The download area

Go to the EMBOSS homepage, click on the "Downloads" link and briefly review the Downloads page. You will notice there are two flavours of download:
2. First time installation

Before you start, make sure you're in your home directory. From the Downloads page follow the link to "Developers (CVS) release". Type or cut and paste the following lines:

    cd
    cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/emboss login

You will be asked for a password, which is "cvs". You are then logged onto the CVS server. Now check out (download) EMBOSS by typing the following command:

    cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/emboss checkout emboss

This will take some time as it's downloading several megabytes of source code and data from the USA (the EMBOSS code repository lives on the same machine as the BIO* projects). Once the download is complete, issue the following command to terminate your CVS session:

    cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/emboss logout

You only need to check out EMBOSS once for each machine you install a copy on. After this you'll just need to keep it updated. Read the notes from Talk 1 on how to do this.

3. Examine the checked-out directory structure

Let's assume your HOME directory is /home/fred. After having checked out the source code you'll have a /home/fred/emboss directory. To see what's there type:

    cd /home/fred/emboss
    ls -l

You'll see two directories: one is called "CVS" and the other is "emboss". cd into the emboss directory, i.e. to the directory /home/fred/emboss/emboss. Listing this directory reveals the contents of the EMBOSS source tree. You should see directories called "ajax" and "nucleus" (programming libraries), "emboss" (application code) and "embassy" (EMBASSY application code), amongst other stuff. The reason why there is an "extra" emboss directory (/home/fred/emboss/) is that it provides a convenient place to install the bin, lib and share directories when a "make install" command is issued (more on this shortly).

4. Prerequisites for compilation

To compile EMBOSS the following GNU configuration tools must be installed on your system:
5. Configuration and compilation

The first thing to do is configure the package, but you must first build a script to do that. From the second EMBOSS directory (/home/fred/emboss/emboss) type:

    aclocal -I m4
    autoconf
    automake -a

(Note: aclocal is part of the automake package.) This will build the "configure" script from the files "Makefile.am" and "configure.in" (in the EMBOSS checkout) and "aclocal.m4". Specifically:
Running "./configure" will
"./configure" is controlled by command-line arguments (generally used to switch on features autoconf was unable to detect) and by environment variables (generally used to set build information such as compiler options). If you intend to compile using "make install" (see below) you must specify an installation area for the executables and supporting files. It's good practice to specify this even if you intend to compile with a plain "make". Further, as you'll be using the gcc compiler it's a good idea to turn warnings on (in EMBOSS most warnings point to serious errors). To do this, type:

    ./configure --prefix=/home/fred/emboss --enable-warnings

Note that the '--enable-warnings' switch is for GCC compilers only. Finally, type:

    make

This will use the Makefiles to compile all the source files into executable binaries within e.g. /home/fred/emboss/emboss/emboss. Alternatively, compiling with "make install" will install the binaries and supporting files into bin, lib and share directories under the directory specified after "--prefix" on the configure line; in this case, the "extra" (first) emboss directory level from the CVS checkout (/home/fred/emboss). Had you not specified "--prefix=/home/fred/emboss" they'd be installed to "/usr/local" by default instead.

6. Environment variables and PATH

Assuming you are using a csh-style shell, the following commands will set up the path and the current graphics library location:

    set path=(/home/fred/emboss/emboss/emboss $path)
    rehash
    setenv PLPLOT_LIB /home/fred/emboss/emboss/plplot/lib

Or, if you'd done a "make install" rather than just a "make":

    set path=(/home/fred/emboss/bin $path)
    rehash
    setenv PLPLOT_LIB /home/fred/emboss/share/EMBOSS/

7. Adding databases to the development installation

Database locations are defined in the files emboss.default or .embossrc. The emboss.default file should be kept under the third emboss directory and you'll find a template file in the distribution called e.g.
/home/fred/emboss/emboss/emboss.default.template. For this course, however, we recommend using a .embossrc file, which should live in your home directory. For the moment this need only give access to the EMBL and swissprot databases via SRS. Your .embossrc file therefore needs to look like this:

    DB swissprot [
        type: P
        method: url
        format: swiss
        url: "http://srs.sanger.ac.uk/srs7bin/cgi-bin/wgetz?-ascii+-e+[swissprot:'%s']"
        comment: "Swissprot via Sanger SRS"
    ]

    DB embl [
        type: N
        method: url
        format: embl
        url: "http://srs.sanger.ac.uk/srs7bin/cgi-bin/wgetz?-ascii+-e+[embl:'%s']"
        comment: "EMBL via Sanger SRS"
    ]

8. Testing

EMBOSS makes use of the GNU libtool, therefore the executables can be run from either the directory they're compiled in or from their 'make install' location. You should now be able to type:

    seqret embl:hsfau
    seqret swissprot:mel_apido

and get a sequence retrieved in both cases.

Practical 2 - Navigating EMBOSS

For background information, see Talk 2.

EMBOSS is big. You will learn how to efficiently navigate the package and its libraries to find stuff you need, by using SRS, documentation on the web and the source code itself.

1. Introduction

EMBOSS contains hundreds of library calls within the AJAX (low-level) and NUCLEUS (algorithm) libraries; navigating it can be daunting at first. Three methods are provided for this purpose:
2. SRS

Getting started

The source code is divided into two SRS databases, "EMBOSS Data Structures" (EDATA), and "EMBOSS Functions" (EFUNC). EDATA contains information about the objects, EFUNC about the functions. Both EDATA and EFUNC are available for the latest stable release and for the developers (CVS) release. To access these databases go to http://www.ebi.ac.uk/srs/. From there ...
The documentation here is broken into several sections. The first 3 give the name, description and "aliases" of the object. AjSStr is the formal name of the object, AjOStr is the datatype name for the object whereas AjPStr is the datatype name for the object pointer. A further explanation is available in Talk 2. After the "Aliase(s)" section you'll see several more blocks which correspond to groups of functions. Each block contains a list of available functions within that group. The groups you see will depend upon the library file and are subject to change from the current code refactoring exercise, but might include:
At the bottom of the page you'll see the "Related Data", "Attributes" and "Body" sections. "Related Data" lists objects that are related to the AjSStr object, "Attributes" lists the elements of the object (C data structure) whereas "Body" gives you the C code for the object definition. Try clicking on an entry for a function e.g. ajStrAppendS. You will see the marked-up source code for the function.

Searching EFUNC directly
It should be obvious from looking at the names that the likely candidates for what you want are those in the ajStrAppend family. Have a look at those functions (and some of the others) to see why the search picked them up. You'll see that some of the functions accept other string objects, character strings or just single characters. This method is of course limited by the vocabulary used in the descriptions of the functions. For instance, we use the term "append" rather than "catenate". Prove this to yourself by repeating the above search using "catenate & string". To show the advantage of limiting the search, change the "Description" field back to "AllText" and repeat the "string & append" query. You'll see that there is a significant amount of noise in the results list.

Viewing the source code
The most useful pieces of information for a user of the library are the Input, Returns and Prototype fields. The "Input" field shows that this function takes the address of a string object pointer as its first parameter and a string object pointer per se as its second parameter. The "Returns" field shows, as expected, the return value (AjBool; a boolean value) of the function. All this information is given at a glance in the "Prototype" field for the function (the prototypes are built into the library and you don't declare them in your programs). A prototype tells the compiler what a function is expecting and what it will return.

Below the prototype is the body of the function, which contains its source code. C language reserved words are highlighted in red. The source code is marked up with any calls to other EMBOSS functions; unhighlighted function calls are standard C library calls. You could click on, for example, ajFatal and see the code for that function. Clicking on the red arrow on the prototype line will show all the EMBOSS functions that use this particular function. Clicking on the blue arrow will show all the EMBOSS functions that are called by this particular function.

As an EMBOSS application programmer you really don't need to know most of the detailed information above, just the inputs and returns. As a library developer, all the information is useful.

3. Web documentation

Details are subject to change with progress in revisions to the on-line documentation.

Another useful route to the function documentation is via the EMBOSS website.
4. The source code

The source code itself is a vital reference, especially once you're more familiar with the libraries.
Searching for keywords in the .c files is a direct method to find example code for the task at hand. If you are unsure how to do a particular task, e.g. read in a data file, then find a program that does what you need and look at the source code to see an example of how it can be done. Bear in mind there are many ways to solve a problem and the example you see might not necessarily be the best.

Practical 3 - Your first EMBOSS application

For background information, see Talk 3.

Learn the basic steps needed to develop any EMBOSS application by implementing helloworld under EMBOSS. To develop this application, which prints "Hello, World!" to the screen, you will write an ACD file and a file of C source code that uses the EMBOSS libraries.

Follow the steps below to write the application. Only limited information is given below; if you get stuck, refer to Talk 3. For this practical and throughout the course you will need to look up appropriate functions for the task at hand. Background information on how to navigate EMBOSS is given in Practical 2.
Practical 4 - Introduction to Objects Using Strings

For background information, see Talk 4.

EMBOSS programming objects (C data structures and functions) form the core of the EMBOSS libraries. You'll get a gentle introduction to objects by adapting helloworld.c to perform some simple string manipulations. You'll learn about two special types of functions called constructors and destructors that are used to allocate and free memory for objects.

Following the steps below, modify helloworld.c to perform various string manipulation tasks. For each task you'll need to find the appropriate function(s) using SRS etc., as you did in the previous practical. When using the functions be very careful to pass the object pointer or its address as required. If you get stuck on the use of pointers, refer to Talk 4.

1. Modify helloworld.c to use an AjPStr (call it string1); don't do anything with it yet other than correctly declaring and initialising the object, allocating it (use the default constructor function) and freeing it (use the destructor function).

2. Comment out the line that prints "Hello, World!" to the screen. Now, using the appropriate functions, copy the text "Hello, World!" into string1. Print string1 to the screen; you'll have to use %S with the print function to print an AjPStr, rather than %s which is used for char * strings.

3. Can you find a different constructor function that does the job of both the default constructor and your assignment function? Implement the change!

4. Now add two new AjPStrs to your program to make a total of 3. You'll now copy your string1 to both of the new strings. For one of them (call this string2), use a constructor function that uses the reference count, i.e. makes a reference to (not a real copy of) string1. For the other (call this string3) make a genuine copy instead.
Keep your memory management clean, which means calling a destructor for all of your strings - including the string you construct by reference, so that the reference count and ultimately the memory are managed correctly (see Talk 3 for more information). Print all 3 strings out to screen and print the usage count of string1 after each copy and after calling the destructor functions.
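The difference between a reference and a genuine copy can be pictured with a toy model written in plain C. This is only a sketch under stated assumptions: the ToyStr type and the toy_* functions are hypothetical names invented for illustration, they are not the AJAX AjSStr definition or any AJAX function, and they ignore details such as the "NULL String" and reserved sizes.

```c
#include <stdlib.h>
#include <string.h>

/* Toy reference-counted string. NOT the AJAX AjSStr definition;
   every name in this block is hypothetical. */
typedef struct ToyStr {
    char  *text;   /* NUL-terminated character data */
    size_t use;    /* number of pointers sharing this object */
} ToyStr;

/* Constructor: allocate a new object holding a copy of txt. */
ToyStr *toy_new(const char *txt)
{
    ToyStr *s = malloc(sizeof *s);
    if (!s) return NULL;
    s->text = malloc(strlen(txt) + 1);
    if (!s->text) { free(s); return NULL; }
    strcpy(s->text, txt);
    s->use = 1;
    return s;
}

/* Reference: no new memory, just increment the usage count. */
ToyStr *toy_ref(ToyStr *s)
{
    if (s) s->use++;
    return s;
}

/* Genuine copy: an independent object with its own usage count. */
ToyStr *toy_copy(const ToyStr *s)
{
    return s ? toy_new(s->text) : NULL;
}

/* Destructor: drop one reference, free the memory when the last
   reference goes, and always set the caller's pointer to NULL. */
void toy_del(ToyStr **ps)
{
    if (!ps || !*ps) return;
    if (--(*ps)->use == 0) {
        free((*ps)->text);
        free(*ps);
    }
    *ps = NULL;
}
```

In this model, taking a reference to a string raises its usage count to 2 and both pointers hold the same address, whereas a copy has a different address and a count of 1. Calling the destructor on the reference only lowers the count; the memory is released when the last holder calls the destructor.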
You might notice that in certain circumstances you can get away without explicitly calling the constructor and destructor functions. This is because most of the EMBOSS functions will allocate memory for you as required if the application code hasn't done so. This should be considered a safety mechanism which you should not rely on. Instead, strive to write clean code which calls constructor and destructor functions as logic dictates. In principle, the operating system should free any memory allocated to your program once it terminates; however, it's bad programming practice to rely on that, so you should clean up as described before exiting.

If you used the ajStrNew constructor function you might be surprised at the value of the usage count, which is higher than you might have expected. The explanation is that a call to ajStrNew doesn't immediately instantiate an AjSStr object, it just returns the address of the "NULL String" (an object which is defined globally in AJAX), whose reference count may well be in the hundreds owing to the call to embInit (which itself makes, indirectly, many calls to ajStrNew). It's only when the char * string is given a non-NULL value (by whatever means) that memory for a string object proper is allocated. AJAX is programmed in this way for maximum speed and efficiency of string handling. You can illustrate this idea nicely:
5. Note that if you do anything (e.g. via an EMBOSS library call) to the original string1 such that its length would exceed the reserved string length, forcing it to be reallocated (done by AJAX for you automatically), string2 will still point to the original string and will not be a reference to the new string1. The use of pointers is a common source of error in C programming, and in EMBOSS too. The important point is this: always ask yourself whether anything is really gained from referencing rather than true copying; occasionally it has a place but not very often. Things are much more intuitive (and so your code is cleaner and more readable) if you make real copies. So now is a good time to convert the code so that string2 is a genuine copy of string1.

6. It's easy to find out whether a function reallocated a string that was passed to it. Most of the functions in ajstr.c that have an AjPStr* parameter return an AjBool. The '*' indicates an address of the string object pointer and that the function might change the object or the data pointed to, i.e. the string might be reallocated. The AjBool (boolean value) may have a value of ajTrue or ajFalse; in this case, ajTrue means that the string was reallocated. To illustrate this:
You should see that it doesn't take long for the string to be reallocated. Automatic dynamic memory allocation of strings is one of the great features of EMBOSS: you don't have to worry about resizing memory for your string, EMBOSS does it all for you. This is also the basis for the safety mechanism mentioned above. EMBOSS also includes dynamic arrays and you might get to use them later.

7. Now we'll look at functions for copying substrings to string objects and how EMBOSS strings (AjPStr) can be used with normal C strings (char *). Your task is to copy the substring "Hello" from string1 to string2, and copy the substring "World" from string1 to string3. There are two functions (you'll need to find them) to copy a substring, one that works with an EMBOSS string (AjPStr) and another that uses a C-type string (char *). Experiment with both. For the latter, you'll need to get to the char * element within string1: your code shouldn't probe the structure directly, therefore use the library function ajStrGetPtr instead.
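Looking back at step 6, the idea of an append function that grows the string when needed and reports whether it did so can be modelled in plain C. The sketch below is a toy under stated assumptions: ToyBuf and the toybuf_* functions are hypothetical names, the doubling growth policy is an arbitrary choice for illustration, and none of this is taken from ajstr.c.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Toy growable string; hypothetical, not an AJAX type. */
typedef struct ToyBuf {
    char  *text;  /* NUL-terminated data */
    size_t len;   /* current length, excluding the NUL */
    size_t res;   /* reserved size in bytes, including the NUL */
} ToyBuf;

/* Create an empty buffer with res bytes reserved (at least 1). */
ToyBuf *toybuf_new(size_t res)
{
    ToyBuf *b = malloc(sizeof *b);
    if (!b) return NULL;
    if (res == 0) res = 1;
    b->text = malloc(res);
    if (!b->text) { free(b); return NULL; }
    b->text[0] = '\0';
    b->len = 0;
    b->res = res;
    return b;
}

/* Append txt. Returns true only if the reserved block had to grow.
   If allocation fails the buffer is left unchanged and false is
   returned (a simplification made for this sketch). */
bool toybuf_append(ToyBuf *b, const char *txt)
{
    size_t add  = strlen(txt);
    size_t need = b->len + add + 1;
    bool   grew = false;

    if (need > b->res) {
        size_t newres = b->res;
        while (newres < need) newres *= 2;   /* assumed growth policy */
        char *p = realloc(b->text, newres);
        if (!p) return false;
        b->text = p;
        b->res  = newres;
        grew    = true;
    }
    memcpy(b->text + b->len, txt, add + 1);  /* copies the NUL too */
    b->len += add;
    return grew;
}

/* Free the buffer and set the caller's pointer to NULL. */
void toybuf_del(ToyBuf **pb)
{
    if (!pb || !*pb) return;
    free((*pb)->text);
    free(*pb);
    *pb = NULL;
}
```

With 8 bytes reserved, appending "Hello" fits and the function returns false; appending ", World!" afterwards needs 14 bytes, so the block grows and the function returns true. Any pointer that still held the old address of the character data would now be stale, which is the hazard step 5 warns about.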
8. That completes our first treatment of string handling, which is probably the most developed of all the EMBOSS library files. We've only skimmed the surface of ajstr.c but later today in Practical 6 you'll review the entire string library by way of a worked example. There are some further exercises there too (at the end of the practical): you can start that practical now if you've completed this one early.

Practical 5 - ACD Files : Basic Skills

For background information, see Talk 5.

EMBOSS allows very flexible definition of application interfaces via the ACD language. You'll get a gentle introduction to ACD programming by adapting your application for user-defined input and output and file handling. This is a good treatment of the basics of writing ACD files and the code required to process them. You'll learn about various ACD datatypes and attributes, and file handling.

1. Getting started

Your first task is to modify helloworld.c to print any user-defined string a user-defined number of times to the screen. Then modify your program to read the string from an input file instead, and write it to an output file.
    application: helloworld [
        documentation: "My first EMBOSS program"
    ]

2. Application groups

Now add an appropriate groups attribute to the application definition.
3. Getting started

Now modify your ACD file to support the replacement of the hard-coded string "Hello, World!" with a user-defined string.
Now modify helloworld so that the only string it prints out is the one specified in the ACD file; comment out all hard-coded print statements that print "Hello, World!" or some derivative. When you run the program, you should get output that looks like this:

    Unix % helloworld
    My first EMBOSS program
    Hello, World!

It should be obvious that it's using the default value. The string can be user-defined though: you just have to specify the appropriate qualifier (the label of the data item, "text" in the example below) on the command line when you invoke the program:

    Unix % helloworld -text "Hello, Sailor!"
    My first EMBOSS program
    Hello, Sailor!

4. Watch out for memory leaks!

These can occur when, in your source code, you lose a reference to an allocated block of memory, usually by accidentally making a pointer that holds the address of that memory point somewhere else, without first freeing the memory or keeping track of where it's held. Bear in mind that the call to embInit will have allocated all the memory required for holding the ACD data items, which means you don't have to. Manage your memory carefully!

When retrieving, for example, a string (AjPStr) data item by using a call to ajAcdGetString, the function will return a pointer to the string created by embInit. This means that you do not have to allocate memory for the string first. In fact, if you do, you will have caused a memory leak; your call to the string constructor function would set aside some memory, but you'd then lose your handle on (reference to) this memory (creating the memory leak) when you catch the value returned by ajAcdGetString. You do have to free the string afterwards though to clear the memory allocated by embInit (although in the future a call to ajExit will do this). The following pseudocode snippet illustrates this:

    AjPStr mystring = NULL;

    embInit(...);                 /* This allocates memory for all ACD data items
                                     and assigns their values. */
    mystring = ajStrNewRes(...);  /* Allocating some memory ... unnecessarily. */
    mystring = ajAcdGetXXX(...);  /* Now mystring points somewhere else, losing the
                                     handle on the memory allocated by ajStrNewRes -
                                     this is a memory leak! */
    ajStrDel(&mystring);          /* This must free the memory allocated by embInit.
                                     Omit this and you'll have another leak. */

5. Qualifiers and parameters

Having to use the '-text' option is a bit cumbersome; it would be nice if we could specify the text to print without having to know the label name. For this we use the 'parameter' ACD attribute. If a data item is defined with the parameter attribute (parameter:), using the name of the parameter on the command line is not mandatory, i.e. you can just type helloworld "Hello, Sailor!".
Your next task is to add an integer data item to your ACD file; this integer should control how many times the user-defined string is printed to the screen.
6. File input and output

You'll now learn about basic file input and output. All the functions you need for this are in ajfile.c. We're going straight in at the deep end: your task is to adapt helloworld so that it reads text from an input file, and writes this same text a user-defined number of times to an output file. Use the following steps as a guide, but you'll have to fill in the detail from what you've learned so far:
When you're done, make a safe copy of helloworld.c and helloworld.acd. It'll be handy to have a basic, working program which you can experiment with when learning about ACD files and EMBOSS programming. It's possible to achieve the same ends as your helloworld program without even having a file of C source code. Can you think how?

Practical 6 - EMBOSS String Handling

For background information, see Talk 6.

Efficient string handling is fundamental to molecular sequence analysis. You'll deepen your familiarity with the powerful AJAX string library by way of a worked example.

EMBOSS will soon include, for each library file, an application which illustrates the correct usage of each function in that library. Currently, these demo applications are kept in the applications directory and have the prefix "demo". For example, /home/fred/emboss/emboss/emboss/demostring.c illustrates the use of the string library. Bear in mind this is a work in progress: you'll notice demo applications are only available for a few of the library files, and of those, only demostring.c is guaranteed to be up-to-date and reliable. Of course, there is an ACD file for each demo* application, and you could compile them as you would any other EMBOSS application. However, the Makefile.am files are already set up for you:
You'll notice it occurs in the check_PROGRAMS section of the "executables" Makefile.am file. This section is reserved for programs that are undergoing active development or have yet to be fully tested, or which should be considered incomplete for some other reason, e.g. they are undocumented or quality-assurance test data are not yet available for them. The demo* applications are in there, reflecting their "work in progress" status.
You should now have a working demostring application. Your objective is to get a feel for the scope of the string library. To do that you'll need to
Once you've had enough of that, apply your knowledge of ACD files and your deeper knowledge of the string library to implement various simple string manipulations of your own choosing in your own program. Choose manipulations which involve some input from the user, such as an integer or a string value, and which thus require you to extend your ACD file and the C code to process it. Here are a few further exercises to test your knowledge of strings, which build on what you did for helloworld:
DAY 2: The main objective is to deepen your knowledge of programming and software development under EMBOSS. Your main task is to adapt your program to incorporate the functions of several existing EMBOSS applications.

Practical 7 (Part I) - The AJAX Library & The Software Cycle

For background information, see Talk 7 (Part I).

Here you'll get an overview of the AJAX low-level library functions. But there's much more to software than just coding; you'll plan your day's work considering your design and the essential steps for efficient development.

1. Planning your work

Before you start programming today, make sure you read the whole of the text below. Your main task for today is to develop your application into a general purpose sequence manipulation program which combines several existing EMBOSS applications into one. This will deepen your understanding of ACD files. You will have to apply much of what you have learnt so far, as well as methods for processing biological sequences that you'll cover today. You should aim to incorporate functionality from at least two applications. It's up to you what to incorporate, but here are some suggestions:
The first thing to do is to decide which applications take your fancy and will be incorporated. You need not restrict yourself to the list above.

2. The Software Development Cycle

Having decided on the basic functions of your program you are probably eager to get coding. There is however much more to software development than coding and you should first work up a design for your application and consider the basic steps to implementation.

First of all, read the relevant notes from Talk 7 (Part I). You'll not need to apply all of it today, but it should inform you of the scale of the task and the steps involved. After reading the notes, you should work to your own plan for the rest of the day, or follow (some or all of) the steps below:
3. Overview of the AJAX library

For this practical and throughout the rest of the course you'll need to use many functions other than those in the string library. You should read the relevant notes from Talk 7 (Part I) to get an overview of what's available.

Before you start coding, ensure you have an appreciation of what functions you'll need (see implementation steps above). It's sensible to research these functions before you start coding, so that everything you need is available to you. Conceptually, this is just the same as collecting all your tools together before starting a DIY project. If you start coding before you know what you need and what's available, your progress will be slower as you'll have to repeatedly break off to research functions, when what you should be doing is concentrating on the program logic.

4. Getting started

We recommend you make a start by incorporating the functionality of the pasteseq program.

You can adapt helloworld.c or start from scratch if you think that'd be cleaner. Your application will use string handling, so you might want to look back over previous talks. It will also require sequence handling and ACD topics that are covered in later practicals and talks. Therefore, you might want to look at Talk 7 (Part I) and Talk 7 (Part II) now, which are really just a continuation from what you've covered so far. Of course it's possible to cheat by simply pasting in the code from pasteseq.c, but you'll learn much more if you have a go from first principles and ask questions if you get stuck! To help you out, here, in no particular order, are some functions which may be of use (remember though there are many different ways to solve a problem):
Practical 7 (Part II) - ACD Files : Intermediate Skills

For background information, see Talk 7 Part II.

It's easy to create a friendly and intuitive application interface using the ACD syntax. This practical shows you how.

Work through the examples below. You can just read the examples but it's best if you edit your ACD file to correspond with what's shown, and type in the example commands. So far you've used the application definition and the string, infile and outfile specifications in ACD. You'll have written an ACD file that contains something like:

    application: helloworld [
        documentation: "Prints something arguably uninteresting"
    ]

    string: message [
        parameter: "Y"
    ]

    outfile: outfile [
        parameter: "Y"
    ]

Both message and outfile are defined to be parameters.

1. Parameters

Values for parameter data items must be specified on the command line in the order they appear in the ACD file. So to operate the above from the command line you'd have to type:

    % helloworld "Hello World!" message.dat

That would cause Hello World! to be printed to the output file message.dat. But what if we didn't want to force the user to specify a message; rather, we just wanted to add a default message ("Hello World!") to our ACD file which would be printed if nothing was given? That is, by typing:

    % helloworld message.dat

we want "Hello World!" (the default message) to be printed to the file message.dat. The above ACD will not do that. That is because all the ACD items are set to be "parameter". Typing the above would result in "message.dat" appearing as the string to be printed and you'd then be prompted for an output file name. In short, not the desired behaviour. This is where command line qualifiers come in.

2. Qualifiers

Qualifiers can appear anywhere on the command line but you must always refer to them by their label. So the reference to our message text will look like:

    -message "Hello World!"

This would allow you to type:

    % helloworld -message "Hello World!" message.dat

or

    % helloworld message.dat -message "Hello World!"

To get this to work you must specify the string to be a "standard" or an "additional" qualifier, instead of "parameter" that's currently specified. To recap from before:
Your ACD file should now look like:

    application: helloworld [
        documentation: "Prints something arguably uninteresting"
    ]

    string: message [
        standard: "Y"
    ]

    outfile: outfile [
        parameter: "Y"
    ]

The two invocations of helloworld shown above will now work. What happens, though, if you don't mention the message at all on the command line? That is, what happens if you just type:

    % helloworld message.dat

The answer is that "message.dat" will be taken as a parameter and used as the output file name; the program will then prompt you for a string to print out. Close, but still not the desired behaviour. What we wanted was for it to go ahead and run with a default string.

3. Default actions

You can associate a default value with most ACD definitions using the 'default' attribute. The ACD can be modified as below:

    application: helloworld [
        documentation: "Prints something arguably uninteresting"
    ]

    string: message [
        standard: "Y"
        default: "Hello World!"
    ]

    outfile: outfile [
        parameter: "Y"
    ]

Now try typing:

    % helloworld message.dat

You'll notice that although we specified a default, it's still prompting us for a value. Remember that values for standard data items are always prompted for, regardless of whether a default is specified or not. Additional data items are not normally prompted for, so we need to specify message as one of those. Your ACD file should look like:

    application: helloworld [
        documentation: "Prints something arguably uninteresting"
    ]

    string: message [
        optional: "Y"
        default: "Hello World!"
    ]

    outfile: outfile [
        parameter: "Y"
    ]

This ACD finally does what is needed. Typing:

    % helloworld message.dat

will print "Hello World!" to the file message.dat. Hurrah! You can override the default message by specifying the message on the command line:

    % helloworld -message "Goodbye World!" message.dat

will print a rather morbid message to the output file.
To reiterate, you must always supply a default value for "additional" ACD data items, as EMBOSS will not prompt you for one if you omit the value on the command line. EMBOSS would generate an error if you tried, from within your C source code, to access the value of that unspecified data item. In contrast, values for "standard" and "parameter" data items are always prompted for if they're not specified on the command line.

4. Confusing the picture

Having explained the above, it should now be obvious that "parameters" are almost the same as "standard" data items. It would be perfectly acceptable to invoke the modified ACD using:

    % helloworld -message "Hello World!" -outfile message.dat

The difference with parameters is that you don't have to mention the label of the datatype, whereas with qualifiers you do. If you omit the label for parameters then their values must appear on the command line in the order in which they appear in the ACD file.

5. Maxima and minima

It is often either useful or vital to be able to set limits on the maximum and/or minimum values to be associated with an ACD datatype definition. This is done in an intuitive way:

    integer: window [
      standard: "Y"
      default: 10
      minimum: 5
      maximum: 100
    ]

6. Setting the prompts

EMBOSS will always provide default prompt text. So, for the "window" example in the last section the user would be prompted as follows:

    -window : Enter a number [10]:

While adequate, it's not entirely friendly. You can set the prompt for a datatype definition by using the "information" attribute. So,

    integer: window [
      standard: "Y"
      default: 10
      minimum: 5
      maximum: 100
      information: "Window size"
    ]

will print the following as the prompt:

    Window size [10]:

which is much more meaningful.

7. Setting help information

Every EMBOSS application accepts the built-in qualifiers -help and -verbose (the latter used in combination with the former, if at all). These print out all the program parameters and qualifiers with explanatory text alongside.
To set this information the "help" attribute is used. An example would be:

    integer: window [
      standard: "N"
      default: 10
      minimum: 5
      maximum: 100
      information: "Window size"
      help: "Number of residues used to calculate the value for each point"
    ]

Practical 7 (Part III) - Sequence Handling

For background information, see Talk 7 Part III

ACD files support various biological data types and many file formats, saving you huge effort. You'll deepen your knowledge of ACD by modifying your program to handle real biological sequences from any source.

1. Types of sequence input

There are three methods that must be catered for to access sequence information. First, a program may operate on a single sequence. Secondly, a program may need to read in sequences from a database one after the other. Thirdly, a program may need to read in a set of sequences at once. These three options are covered by the following ACD datatypes.

    sequence:  input a single sequence
    seqall:    read many sequences sequentially
    seqset:    read in a sequence set e.g. an alignment

These are defined just like any other ACD datatype. These datatypes have extra attributes though. One of the most used is the type attribute. Not surprisingly, this is used to limit the kind of sequence that EMBOSS will accept. The sequence type can be one of the following:

    any             any sequence without gaps
    dna             a DNA sequence without gaps
    rna             an RNA sequence without gaps
    nucleotide      DNA or RNA without gaps
    protein         a protein sequence without gaps
    puredna         DNA without ambiguities
    purerna         RNA without ambiguities
    purenucleotide  DNA or RNA without ambiguities
    pureprotein     protein without ambiguities
    gapdna          DNA allowing gaps
    gaprna          RNA allowing gaps
    gapnucleotide   DNA or RNA allowing gaps
    gapprotein      protein allowing gaps
    stopprotein     protein allowing for "stop codon" equivalents
    gapany          any sequence at all

You can see the behaviour for yourself by, for example, coding an ACD data item to be a nucleotide sequence then giving it a protein sequence. Try it!
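To get a feel for what a type restriction such as puredna or gapdna implies, here is a rough plain-C sketch of an alphabet check. It is not how EMBOSS implements sequence types: the function name seq_fits_alphabet, the use of '-' as the gap character and the "ACGT" alphabet are all assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Returns 1 if every character of seq is in alphabet (case-insensitive),
** optionally also allowing '-' as a gap character; returns 0 otherwise.
** An empty sequence is rejected. */
int seq_fits_alphabet(const char *seq, const char *alphabet, int allow_gaps)
{
    assert(seq != NULL && alphabet != NULL);

    if (*seq == '\0')
        return 0;

    for (; *seq != '\0'; ++seq) {
        int c = toupper((unsigned char) *seq);

        if (c == '-') {
            if (allow_gaps)
                continue;
            return 0;               /* gap found but gaps not allowed */
        }

        if (strchr(alphabet, c) == NULL)
            return 0;               /* character outside the alphabet */
    }

    return 1;
}
```

With this sketch, seq_fits_alphabet("MKVLA", "ACGT", 0) fails in the same spirit as giving a protein sequence to a nucleotide-only data item, while "AC-GT" passes only when gaps are allowed.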
An ACD definition allowing the input of a single sequence (any type at all) would be:

    sequence: fubar [
      parameter: "Y"
      type: "gapany"
    ]

In this case, the distinction between the datatype (sequence) and the label (fubar) is clear. It is conventional though to use the label "sequence" for all sequence inputs. So, although the ACD above would work it should be written as:

    sequence: sequence [
      parameter: "Y"
      type: "gapany"
    ]

It is also customary (where possible/sensible) to have EMBOSS use the default prompts, i.e. no "information" attribute is given. The seqall and seqset equivalents are:

    seqall: sequence [
      parameter: "Y"
      type: "gapany"
    ]

    seqset: sequence [
      parameter: "Y"
      type: "gapany"
    ]

2. Retrieving sequence input within a program

This is done just like any other datatype within the C program. Each of the three access methods has its own object type.
Recovery is done using the associated ajAcdGet calls.

2a. Recovering a sequence

    ...
    AjPSeq seq=NULL;
    ...
    seq = ajAcdGetSeq("sequence");

2b. Recovering a sequence set

    ...
    AjPSeqset seqset=NULL;
    ...
    seqset = ajAcdGetSeqset("sequence");

You can then use AJAX library functions to recover information about the whole set of sequences or just individual sequences, e.g. ajSeqsetSize. Hint: Do an ID search on EFUNC (SRS) using the search term "ajseqset". A particularly useful function is ajSeqsetSeq which will give you a character pointer to the start of the n'th sequence in a set.

2c. Recovering sequences sequentially

    ...
    AjPSeqall seqall=NULL;
    ...
    seqall = ajAcdGetSeqall("sequence");

However, a seqall object is really a means to an end of returning individual sequences in a loop. A code segment that is often used for this is:

    ...
    AjPSeqall seqall = NULL;
    AjPSeq seq = NULL;
    ...
    seqall = ajAcdGetSeqall("sequence");
    ...
    while(ajSeqallNext(seqall, &seq))
    {
        /* Do something with 'seq' */
    }

Note: You may use the in-built qualifiers -sbegin and -send on the EMBOSS command line to specify a start and end position for a sequence. These are used to set values for corresponding elements in the appropriate object. Note however that regardless of the -sbegin and -send values, you still get all of the sequence accessible to you in memory. The function ajSeqTrim may be used to convert a sequence object to hold just that area of sequence. The function ajSeqOffset will return the -sbegin value the user specified, i.e. the offset from the start of the sequence.

3. Sequence output in ACD

There are three datatypes for sequence output which match their data input equivalents, namely:
By convention the label "outseq" is used with them.

    seqout: outseq [ parameter: "Y" ]
    seqoutset: outseq [ parameter: "Y" ]
    seqoutall: outseq [ parameter: "Y" ]

There is no associated type attribute as the output format is determined from within the application (see below).

4. Retrieving sequence output information in the application

This is done as you'd expect with associated objects and ajAcdGet calls. A single object is used for all three forms of output:
    ...
    AjPSeqout seqout = NULL;
    ...
    seqout = ajAcdGetSeqout("outseq");

or

    seqout = ajAcdGetSeqoutset("outseq");

or

    seqout = ajAcdGetSeqoutall("outseq");

5. Outputting sequence information

There are three AJAX function calls used for writing out sequence information. They are:

    void ajSeqWrite(AjPSeqout outseq, AjPSeq seq);
    void ajSeqsetWrite(AjPSeqout outseq, AjPSeqset seq);
    void ajSeqAllWrite(AjPSeqout outseq, AjPSeq seq);

Note that for both ajSeqWrite and ajSeqAllWrite the object that is written is an AjPSeq, whereas for sequence sets it is an AjPSeqset.

6. Setting the output format

FASTA format is used by default. The output format can be changed by adding "-osformat format" on the command line or by giving it in the USA (Uniform Sequence Address) of the output filename e.g. "embl::myfile.seq". There are many possible output formats which include:
See Talk 7 Part III for more details on supported formats and USAs.

7. Closing the output file

When you have finished writing the sequences you can close the file by calling the AJAX function ajSeqWriteClose. Of course, you must remember to manage the memory for all your objects.

Practical 7 (Part IV) - ACD Files : Advanced Skills

For background information, see Talk 7 (Part IV)

The definition of an application interface frequently involves calculations, conditional statements and the use of variables and menus. You'll learn how EMBOSS supports all these operations and also lets you probe the attributes of input datatypes such as sequences. To complete your coverage of ACD files and their processing, work through the examples below. Once again, you can just read the examples or edit your ACD file and source code to correspond to what's shown.

ACD menus

Selecting from a list of options is often necessary in molecular biology programs. ACD provides two methods of doing this, the select datatype and the list datatype.

1. The list datatype
    list: frame [
      standard: "Y"
      help: "Allows selection from a set of reading frames"
      default: "1"
      minimum: "1"
      maximum: "1"
      header: "Translation frames"
      values: "1:1, 2:2, 3:3, F:Forward three frames, -1:-1, -2:-2, -3:-3, R:Reverse three frames, 6:All six frames"
      delimiter: ","
      codedelimiter: ":"
      information: "Frame(s) to translate"
    ]

The user will be presented with the title of the menu from the header attribute. After that will appear the tag/text information from the values attribute. Following that will be the prompt from the information attribute. It will look something like this:

    Translation frames
             1 1
             2 2
             3 3
             F Forward three frames
            -1 -1
            -2 -2
            -3 -3
             R Reverse three frames
             6 All six frames
    Frame(s) to translate[1]:

The user is allowed (generally) to supply a comma-separated list of options from the above, depending on the minimum and maximum values. In the above example both the minimum number of selections and the maximum number of selections are set to one, therefore only one selection value is allowed. Selection is done by typing the tag values, therefore if the maximum count had been set to 3 then a user entry of "-1,F,6" would be a valid input. The codedelimiter attribute allows you to select any character to separate the tags from the text in the values attribute. The delimiter attribute allows you to choose a character to separate the tag/text pairs within the values attribute.

1a Recovering list selections from the application source code
    ...
    AjPStr *flist=NULL;
    ...
    flist = ajAcdGetList("frame");
    ...

The declaration shows that flist will be an array of string objects. The values held in the strings are the tags from the values list and the list is terminated by a null string. So, using our example, where only one value is allowed, and assuming the user had answered '6' to the prompt, the resulting array would be:

    flist[0]  a string object that contains "6"
    flist[1]  a null string object

Your code would likely step through the list, if maximum is greater than 1, using something like the following:

    n = 0;
    while(flist[n])
    {
        if (ajStrMatchC(flist[n],"6"))
        {
            /* Do something */
        }
        ...
        ++n;
    }

2. The select datatype
    select: order [
      casesensitive: "N"
      default: "score"
      delimiter: ","
      header: "Sort order of results"
      help: "Name of the output file which holds the results of the analysis."
      maximum: "1"
      minimum: "1"
      information: "Select sort order of results"
      standard: "Y"
      values: "length, position, score"
    ]

Note that the values attribute just contains a 'delimiter' separated list of text (leading whitespace is ignored). Selection takes place on the text values; there are no tags. Also, note that there is a casesensitive attribute which is often set to "N" so that upper-case and lower-case characters are equally acceptable. The user can abbreviate a value to the shortest length of text that is not ambiguous with the other values.

2a Recovering 'select' selections within the application source code
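The abbreviation rule for select values (any prefix that matches exactly one value, ignoring case when casesensitive is "N") can be sketched in plain C. This is an illustration only, not the ACD implementation; resolve_selection and its return codes are hypothetical names, and the sketch does not treat an exact match specially when it is also a prefix of a longer value.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if value starts with prefix, ignoring case. The end of value
** compares unequal to any remaining prefix character, so a prefix longer
** than the value never matches. */
static int has_prefix_nocase(const char *value, const char *prefix)
{
    for (; *prefix != '\0'; ++value, ++prefix)
        if (tolower((unsigned char) *value) !=
            tolower((unsigned char) *prefix))
            return 0;

    return 1;
}

/* Returns the index of the single value that typed abbreviates,
** -1 if nothing matches, or -2 if the abbreviation is ambiguous. */
int resolve_selection(const char *const values[], size_t nvalues,
                      const char *typed)
{
    int found = -1;
    size_t i;

    assert(values != NULL && typed != NULL);

    if (*typed == '\0')
        return -1;

    for (i = 0; i < nvalues; ++i) {
        if (has_prefix_nocase(values[i], typed)) {
            if (found >= 0)
                return -2;          /* a second match: ambiguous */
            found = (int) i;
        }
    }

    return found;
}
```

For the values "length, position, score", a single letter is already unambiguous, so "s" resolves to "score"; with values such as "length" and "left" the user would need to type at least "len" or "lef".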
    ...
    AjPStr *oselect=NULL;
    ...
    oselect = ajAcdGetSelect("order");
    ...

Otherwise the code is the same as for the list datatype with the exception that it would test for the strings "length", "position" and "score".

Calculated attributes, conditions, calculations, variables and options

1. Calculated attributes of sequences

When writing a program to insert one sequence into another, one way to make sure that the insertion position is not greater than the length of the first sequence is to use code like the following:

    if(position > ajSeqLen(seq))
        ajFatal("Insertion position out of bounds");

The problem with that is that the user, having gone to the effort of entering all the inputs, only finds out about the mistake when the program terminates at run time. What would be better is if the interface forced the correct input, and there is a way to achieve that by using calculated attributes in the ACD file itself. There are eight calculated attributes you can use. Assuming you have the following ACD snippet:

    sequence: sequence [
      parameter: "Y"
      type: protein
    ]

The eight calculated attributes are:
You access them with the ACD "get the value of" syntax, which consists of surrounding a term in parentheses and putting a dollar sign at the front. They therefore become:

    $(sequence.begin)
    $(sequence.end)
    $(sequence.length)
    etc

Therefore, to make sure your insert program doesn't try inserting off the end of the sequence you just need to add:

    maximum: $(sequence.end)

to the integer definition of the insertion position. These calculated attributes are also useful for conditional statements (see below).

2. ACD calculations

Calculations can be performed in ACD using the '@' syntax. A rather silly, but legal, calculation would be:

    @(5 + 9)

which equates to the value 14. You can add, subtract, multiply or divide. Calculations can be used to test for equality, inequality, greater than or less than using: == != > <. For example the following ACD would be legal but possibly not useful in practice.

    standard: "@($(sequence.length) == 20)"

Here you can see the standard attribute being set to either "Y" or "N". Up till now, you've only ever specified "Y" after parameter, standard, or additional but "N" is in fact supported. An "N" will override the default behaviour of these attributes such that prompting for a value will be turned off. This is useful in some situations. In this case, the calculation will switch a prompt on only if the sequence length is equal to 20. You can use calculations with most attributes of datatypes where they make sense. An example might be:

    sequence: sequence [
      parameter: "Y"
      type: pureprotein
    ]

    integer: window [
      standard: "Y"
      etc
    ]

    integer: start [
      standard: "Y"
      maximum: "@(@($(sequence.length) - $(window)) + 1)"
      etc
    ]

This would set some sort of start condition to have a maximum value of the sequence length minus a window size value plus one. Note that there are two separate calculations here so each needs to be surrounded by an @() syntax. Long calculations can get messy. If you need to use them then you possibly need to rethink your ACD logic.
If they can't be avoided then they can be tidied up with the use of variables (described later). Equality and inequality tests can also be used on strings, as indeed can greater or less than, but these don't usually make sense.

3. Conditional statements in ACD

There are two kinds of conditional statements in ACD, unary and ternary. A typical use for unary conditionals is to switch prompts on or off. Let us assume that a window size should only be prompted for if the sequence turns out to be a protein. The ACD to accomplish this would look as follows:

    sequence: sequence [
      parameter: "Y"
      type: gapany
    ]

    integer: window [
      standard: "$(sequence.protein)"
      etc
    ]

If the sequence is a protein then the standard attribute is equivalent to:

    standard: "Y"

and the prompt is switched on. If the sequence is nucleic the attribute is equivalent to:

    standard: "N"

This will effectively disable the prompt. Ternary conditionals are described below.

4. Negation

Negation often finds a use in ACD files. Let's assume that your application can produce both graphic and textual output. Assume further that you only want textual output if the user hasn't selected graphical output. First you would set up a toggle ACD datatype definition as follows:

    toggle: plot [
      standard: "Y"
      default: "N"
      information: "Plot a graph"
    ]

A toggle is a special type of Boolean datatype that is used exclusively to control the prompting of other attributes. The value of $(plot) will be "Y" if the user adds "-plot" to the command line. The value is "N" if either the user doesn't add anything to the command line or if the user adds "-noplot" to the command line. The output file can now be defined as:

    outfile: outfile [
      standard: "@(!$(plot))"
    ]

This becomes equivalent to standard: "Y" only if plot is not true. The negation operator (!) is a calculation so the term must be surrounded by @(). The only sad thing about this is that it doesn't work as written, though not for any reason involving the logic.
The reason is because EMBOSS handles file input/output operations in a different manner to other datatypes. If it sees one of the file (e.g. outfile) or sequence (e.g. seqout) definitions it will always try and open it.
There is a way around this and that is to use the "nullok" attribute. A definition of outfile that works is:

    outfile: outfile [
      standard: "@(!$(plot))"
      nullok: "Y"
    ]

The nullok statement above means that it's OK to continue (do not generate an error) if no filename is given.

5. Boolean tests (& and |)

Boolean tests can also be performed using calculations. Here is an ACD code snippet:

    integer: fubar [
      standard: "Y"
      default: 5
      etc
    ]

    integer: rtfm [
      standard: "@(@($(fubar)==3) | @($(fubar)==7))"
      etc
    ]

The integer rtfm will only be prompted for if the value of fubar is either 3 or 7. Each of the equality tests is a calculation and the boolean test is another calculation. There are therefore three @() instances. The AND (&) operator can also be used in such calculations.

6. Ternary conditional

This calculation has the form:

    @(conditional ? value-if-true : value-if-false)

It is useful, for example, when setting gap penalty values differently for proteins and nucleic acids in alignment programs.

    integer: penalty [
      standard: "N"
      default: "@($(sequence.protein) ? 14 : 16)"
      etc
    ]

This will set the penalty to 14 for proteins and 16 for nucleic acids.

7. Variables: keeping things tidy

Variables are useful for holding partial calculations or values. They can keep ACD looking neat and tidy. The syntax for them is:

    variable: label value

As an example, here is the window calculation again from section 2:

    integer: start [
      standard: "Y"
      maximum: "@(@($(sequence.length) - $(window)) + 1)"
      etc
    ]

This can be tidied by storing one of the calculations in a partial result as follows:

    variable: lminusw "@($(sequence.length) - $(window))"

    integer: start [
      standard: "Y"
      maximum: "@($(lminusw) + 1)"
      etc
    ]

Practical 7 (Part V) - Software Consolidation

For background information, see Talk 7 (Part V)

Learn how to consolidate your work through coding standards, documentation and quality assurance tests. You'll apply these methods to your application.

1. Coding standards

First, make a copy of your source code. After reviewing the EMBOSS C programming standards document, which is summarised in Talk 7 (Part V), modify your code so that it conforms to the standards (make as many changes as time allows). Modify your application so that the functionality is in distinct functions, if it is not already like this. Pay particular attention to the general arrangement of your code (variable declarations at the top, memory allocation in one block where possible etc) so that you achieve an intuitive layout. Compare your revision to your original. It should be significantly easier to read.

2. Code documentation

First, review the EMBOSS code documentation standards, which are summarised in Talk 7 (Part V). Add comments to your code to describe
Ensure you include the GPL license. Ensure you have documented the main() function. If you have defined any functions, document them now using the standard EMBOSS method. If you have defined any new data structures, document them now using the standard EMBOSS method. 3. Application documentationReview the notes on application documentation in Talk 7 (Part V).If you have time, generate application documentation (in html format) using the described method. 4. Code quality assuranceReview the notes on quality assurance testing in Talk 7 (Part V).As time permits, write and run one or more QA tests for your application using the described method.
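As a reminder of the shape of a documented function, here is a small plain-C sketch whose header comment imitates the layout of the ajXyzPdbtospDel example shown in Practical 10 (@func, @param, @return, @@). The function countResidue is hypothetical, and the "[r]" (read-only) parameter code is an assumption; check the documentation standards in Talk 7 (Part V) for the exact codes.

```c
#include <assert.h>
#include <stddef.h>

/* @func countResidue ********************************************************
**
** Counts how many times a residue character occurs in a sequence string.
**
** @param [r] seq [const char*] Null-terminated sequence string
** @param [r] residue [char] Residue character to count
**
** @return [size_t] Number of occurrences of the residue
** @@
******************************************************************************/

size_t countResidue(const char *seq, char residue)
{
    size_t count = 0;

    assert(seq != NULL);

    /* Walk the string once, counting exact (case-sensitive) matches */
    for(; *seq; ++seq)
        if(*seq == residue)
            ++count;

    return count;
}
```

In a real EMBOSS application the argument would more likely be an AJAX string or sequence object rather than a bare char pointer; plain C types are used here only to keep the sketch self-contained.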
DAY 3: The main objective is to get a feel for some of the advanced programming features. This day is open-ended; you can pick tasks to learn about the advanced features, or consolidate what you did on the previous days.

Practical 8 - Data Input : Using Features

For background information, see Talk 8

Features are specific regions of interest in a biological sequence. You'll learn how EMBOSS supports a variety of common feature formats and modify your program to read and write sequence features. First, review the notes on features given in Talk 8. Your program will likely already use a sequence, seqset or seqall datatype (for sequence input) and a seqout, seqoutall or seqoutset datatype (for sequence output). If it doesn't, then modify it so that it reads and writes a sequence, and uses, say, the sequence and seqout datatypes. See Talk 7 (Part III)
Practical 9 - Data Input : Using Reports

For background information, see Talk 9

The standardisation of application input / output is essential for interoperability. You'll learn how EMBOSS achieves this by modifying your program to use one of the standard EMBOSS report formats. First, review the notes on reports given in Talk 9. There are several exercises to try:
and / or ...
Practical 10 - Objects, Pointers and Memory Management

For background information, see Talk 10

To support new biological datatypes you'll need a deeper knowledge of memory management under EMBOSS. You'll learn how to program your own data structures and functions for their manipulation.

Testing your understanding

After working through the material in Talk 10, figure out exactly what is going on in the destructor function below. If you can do that, then you can be happy that you are on your way to becoming adept at objects, pointers and memory management in EMBOSS.

    /* @func ajXyzPdbtospDel ***********************************************************
    **
    ** Destructor for Pdbtosp object.
    **
    ** @param [w] thys [AjPPdbtosp*] Pdbtosp object pointer
    **
    ** @return [void]
    ** @@
    ******************************************************************************/

    void ajXyzPdbtospDel(AjPPdbtosp *thys)
    {
        AjPPdbtosp pthis = NULL;
        ajint i;

        if(!thys)
            return;

        pthis = *thys;

        if(!pthis)
            return;

        ajStrDel(&pthis->Pdb);

        if(pthis->n)
        {
            for(i=0; i< pthis->n; i++)
            {
                ajStrDel(&pthis->Acc[i]);
                ajStrDel(&pthis->Spr[i]);
            }
            AJFREE(pthis->Acc);
            AJFREE(pthis->Spr);
        }

        AJFREE(pthis);
        (*thys)=NULL;

        return;
    }

Exercise

A very worthwhile exercise would be to implement a data structure of your own design and accompanying constructor and destructor functions, and modify your program to use these. This data structure could, for example, hold all of the output data that are calculated by your application. You may need to refer to the notes in Talk 4 and Talk 10 to do this.

Practical 11 - The NUCLEUS Library

For background information, see Talk 11

The NUCLEUS Library incorporates various algorithms for molecular biology. This practical will give you a taste of what's available. First, review the notes on NUCLEUS in Talk 11. Your task, if time permits, is to adapt your program so that it has an option to perform a Needleman-Wunsch global alignment on two input sequences.
This is a very significant challenge so don't worry if you don't succeed, the main thing is to get a feel for how you might proceed. In order to do this, you'll need (at least) the following ACD data types:
To handle the ACD data items, you'll need the following in your code:
To perform the alignment, you'll need to call the following functions:
You'll need a fair bit of other code too, for instance, to convert your ACD data types into a format that the alignment functions expect, and for memory management. Try to implement the functionality by reading the function documentation and by applying what you've learnt so far. You can of course refer to the source code for needle if you get stuck, but avoid the temptation of merely copying and pasting code.

Free for All

You may continue work on your sequence manipulation program, aiming either to provide complete coverage of the functions of the applications mentioned in Practical 7 Part I, or to implement one or more of the other features mentioned today. Otherwise, feel free to discuss with us how EMBOSS could be used with your own bioinformatics projects. Or go home / to the pub early if you've had enough :)
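If you want to check your understanding of the Needleman-Wunsch task in Practical 11 before wiring in the NUCLEUS functions, the core recurrence can be sketched in plain C. This is not the NUCLEUS implementation: nw_score is a hypothetical name, it uses a simple match/mismatch score and a linear gap penalty instead of a substitution matrix, and it returns only the optimal global score without recovering the alignment.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int max3(int a, int b, int c)
{
    int m = (a > b) ? a : b;
    return (m > c) ? m : c;
}

/* Global alignment score of a against b, keeping two rows of the
** dynamic programming matrix in memory at a time. */
int nw_score(const char *a, const char *b, int match, int mismatch, int gap)
{
    size_t la, lb, i, j;
    int *prev, *curr, *tmp, score;

    assert(a != NULL && b != NULL);
    la = strlen(a);
    lb = strlen(b);

    prev = malloc((lb + 1) * sizeof *prev);
    curr = malloc((lb + 1) * sizeof *curr);
    if (prev == NULL || curr == NULL)
        abort();                        /* out of memory: give up */

    /* Row 0: aligning the empty prefix of a against j gap columns */
    for (j = 0; j <= lb; ++j)
        prev[j] = (int) j * gap;

    for (i = 1; i <= la; ++i) {
        curr[0] = (int) i * gap;        /* column 0: i gap rows */
        for (j = 1; j <= lb; ++j) {
            int diag = prev[j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
            int up   = prev[j] + gap;       /* gap in b */
            int left = curr[j - 1] + gap;   /* gap in a */
            curr[j] = max3(diag, up, left);
        }
        tmp = prev; prev = curr; curr = tmp;    /* last row is now prev */
    }

    score = prev[lb];
    free(prev);
    free(curr);
    return score;
}
```

With match 1, mismatch -1 and gap -1, aligning "GAT" against "GT" scores 1 (two matches and one gap). A full solution would also keep a traceback so that the aligned sequences can be written out.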
Last modified on
2005 Jon Ison. |