Bioinformatics Software Development Course
Apr 18-20 2006

Details of Talks

Talk 1 - Introducing EMBOSS
Background information for Practical 1

1. Introduction to EMBOSS

There is a short overview of EMBOSS and its key features which you should read now if you aren't already familiar with the project. EMBOSS is a big project and has much to offer to a diverse group including systems administrators, application programmers, bioinformaticians, training and education experts and, of course, the biologist end-user. This course is primarily for software developers. You might need to use all the following pages to find background information, detailed information in a specific area or to solve a particular problem:
The user documentation is well worth a visit because it summarises some of the major themes in EMBOSS. Practical 1 should give all the information on installing and compiling EMBOSS that you'll need for this course. There are, however, some notes below on how to keep your installation up-to-date and on configuration options, to read once you've done the practical. Further information is in the administrator documentation, especially our very comprehensive Administrators Guide.

2. Keeping your EMBOSS CVS copy up to date

You only need to check out EMBOSS once; after that you need to update it. It is advisable to update your copy of EMBOSS regularly.

To update, first cd to the emboss directory containing the ajax and nucleus subdirectories. For example, if you checked out EMBOSS in the directory /home/fred then you should "cd" into the directory /home/fred/emboss/emboss. Then type the following commands:

    cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/emboss login
    [password "cvs"]
    cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/emboss update -d
    cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/emboss update -P
    cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/emboss logout

The first ("login") command logs you on to the CVS server. The second ("update -d") ensures you'll pick up any new directory structures the core development team have added. The third ("update -P") removes obsolete directory structures from your copy by pruning directories left empty after the core development team have removed their files. The last command ("logout") logs you off the server. Note: The -d and -P flags are case-sensitive.

3. Associated libraries and compilation

On Linux systems (including those used on this course) and many other systems, most support libraries are installed in /lib, /usr/lib, /usr/X11R6/lib etc.
However, whereas Linux distributions include RPMs for libgd, libpng and libz (which are required for PNG support in EMBOSS), other operating systems do not.

If you are installing these libraries and include files somewhere other than /usr then you must specify their location when configuring. Assuming you have installed them in the /usr/local area (i.e. in /usr/local/lib and /usr/local/include) you would add the following switch to the configuration command line:

    --with-pngdriver=/usr/local

Talk 2 - Navigating EMBOSS
Background information for Practical 2

1. Meaning of AjSStr, AjOStr, AjPStr

AjSStr is the formal name of the string object, AjOStr is the datatype name for the object whereas AjPStr is the datatype name for the object pointer.

AjOStr could, in principle, be used in code to create an instance of the object in memory, but in practice AjPStr is used (memory is allocated to the pointer - more on this later). You'll notice in SRS that "AjPStr" is given after "Name" and that's because AjSStr and AjOStr are never really used in the code (other than in the object definition). For the sake of brevity, we often say "object" (to refer to an AjPStr for example) when what we really mean is "object pointer". Watch out for that in the course so that you're clear about what we're referring to.

Talk 3 - Your first EMBOSS application
Background information for Practical 3

1. Programming guide

For your future reference, there is a guide to programming EMBOSS applications. However, during this course (or the first day at least) you should stick to the material in the talks and practicals.

2. helloworld in C

Here's the source code of a C program that prints "Hello, World!" to the screen.

    #include <stdio.h>

    int main(void)
    {
        printf("Hello, World!\n");
        return 0;
    }

The first line is a preprocessor directive telling the compiler to include the `header file' stdio.h.
The program consists of a single function (main) which does not have any parameters and has an integer return type; in this case it returns 0 to the operating system after printing "Hello, World!" to the screen. Let's say we saved the source code to a file called helloworld.c. To get an executable (runnable) version of the program you have to compile the source. Typing the following line would do it:

    Unix % gcc helloworld.c -o helloworld     (using gcc, an ANSI C compiler)

Presuming there are no errors during compilation you will end up with an executable file called 'helloworld'. If you omitted '-o helloworld' the executable would be called 'a.out'. To run your program you simply type `helloworld' at the Unix prompt (or `./helloworld' if the current directory is not in your PATH):

    Unix % helloworld

3. helloworld in EMBOSS

A few more steps are involved in EMBOSS. The first thing to understand is that, in addition to writing the source code, you must also write an ACD file for your new application. ACD files control all of the user input operations. All of the parameters required for an application are prompted for before the application proper begins. The input values are read and held in memory, files are opened as required and so forth, so that all the parameters are available when the application proper starts. An EMBOSS application cannot ask the user for more information after several hours of processing!

It's good practice to write your ACD file before the source code because this forces you to think closely about the application inputs and outputs and exactly what's required from the user. You should then test the ACD file by using an EMBOSS application called acdc (more on this shortly). Finally, the application is added to EMBOSS, compiled and run.
1) Decide on inputs and outputs
2) Write ACD file
    application: helloworld [
      documentation: "My first EMBOSS program"
    ]

Every ACD file must have the file extension .acd and it's sensible (but not mandatory) that the filename (without .acd extension) is identical to the application name. Every ACD file must contain an application definition, and this should come first in the file. The definition consists of the application: token, followed by the application name and a block of attributes between square brackets. The definition above contains a single documentation: attribute. The text should be a succinct description of the program and will be printed to screen when the program is run. If the documentation attribute is missing, a warning will be issued when you run the program. ACD files will be covered in great detail later on.

3) Test ACD file
    Unix % acdc helloworld

acdc reads helloworld.acd and reads in any required data just as if the application itself was running. It will also test anything you add on the command line. In this case there is no required data and nothing else on the command line, and all is well.

4) Write source code
    /* @source helloworld application
    **
    ** @author: Copyright (C) Arthur Geek (ageek@ebi.ac.uk)
    **
    ** @@
    **
    ** This program is free software; you can redistribute it and/or
    ** modify it under the terms of the GNU General Public License
    ** as published by the Free Software Foundation; either version 2
    ** of the License, or (at your option) any later version.
    **
    ** This program is distributed in the hope that it will be useful,
    ** but WITHOUT ANY WARRANTY; without even the implied warranty of
    ** MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    ** GNU General Public License for more details.
    **
    ** You should have received a copy of the GNU General Public License
    ** along with this program; if not, write to the Free Software
    ** Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
    ******************************************************************************/

    #include "emboss.h"

    /* @prog helloworld **********************************************************
    **
    ** Prints "Hello, World!" to the screen.
    **
    ******************************************************************************/

    int main(int argc, char **argv)
    {
        embInit("helloworld", argc, argv);

        ajFmtPrint("Hello, World!\n");

        ajExit();

        return 0;
    }

The source begins with a comment block including the copyright notice, disclaimer and author's contact details. EMBOSS applications are licensed under the GNU General Public License so it's essential that these comments are included in the source.
    Unix % more nucleus/emboss.h
    #ifndef emboss_h
    #define emboss_h

    #include "ajax.h"
    #include "ajgraph.h"
    #include "embaln.h"
    #include "embcom.h"
    #include "embcons.h"
    #include "embdbi.h"
    .
    .

    Unix % more ajax/ajax.h
    #ifdef __cplusplus
    extern "C"
    {
    #endif

    #ifndef ajax_h
    #define ajax_h

    #include "ajarch.h"
    #include <stdarg.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include "ajassert.h"
    #include "ajdefine.h"
    #include "ajstr.h"
    #include "ajtime.h"
    #include "ajfile.h"
    .
    .

Every function, including main(), in an EMBOSS application must be documented. Undocumented code often has little value, with the exception of code that is self-explanatory. Even then it's helpful to provide at least basic documentation.
The source code proper begins with the main() function. The command line must be available therefore main must include it. This is done in the parameter list using int main(int argc, char **argv). Three calls to the EMBOSS libraries are made. All EMBOSS applications must contain a call to embInit. This is the function that handles all of the user input processing and so must be called right at the start of the application. embInit reads in local database definitions, finds the right ACD file to use (the first argument is "helloworld" so it looks for "helloworld.acd" in the ACD directory) and processes the command line (it uses argc and argv from main). If our ACD file was more complicated, and required a sequence as input and a file as output for example, then by the time the call returned it would have read in the sequence and put it somewhere in memory and also opened the output file. ajFmtPrint is used to print text to the screen. ajExit calls some internal clean-up routines before calling exit with a successful code (zero).
    /home/fubar/emboss/emboss/emboss        /* The 'executables directory' (for C source files and executables) */
    /home/fubar/emboss/emboss/emboss/acd    /* The 'acd directory' (for ACD files) */

The files we have to edit are:

    /home/fred/emboss/emboss/emboss/Makefile.am
    /home/fred/emboss/emboss/emboss/acd/Makefile.am

The Makefile.am in the executables directory contains information about each C source file. First, you must add your program name to the "bin_PROGRAMS" list. This is usually done in alphabetical order. The before and after editing stages will look something like this:

i) Before editing

    bin_PROGRAMS = aaindexextract abiview acdc antigenic \
    ...
    garnier geecee getorf helixturnhelix hmoment \
    ...

ii) After editing

    bin_PROGRAMS = aaindexextract abiview acdc antigenic \
    ...
    garnier geecee getorf helixturnhelix helloworld hmoment \
    ...

N.B. Line continuation characters ('\') must be explicitly added (it's bad practice to have lines longer than 80 characters).

Secondly you must add the name of your source file to a SOURCES line. Again this is usually done alphabetically. The before and after appearance of the file would be as follows, given the above example.

iii) Before editing

    ...
    geecee_SOURCES = geecee.c
    getorf_SOURCES = getorf.c
    helixturnhelix_SOURCES = helixturnhelix.c
    hmoment_SOURCES = hmoment.c
    iep_SOURCES = iep.c
    infoalign_SOURCES = infoalign.c
    ...

iv) After editing

    ...
    geecee_SOURCES = geecee.c
    getorf_SOURCES = getorf.c
    helixturnhelix_SOURCES = helixturnhelix.c
    helloworld_SOURCES = helloworld.c
    hmoment_SOURCES = hmoment.c
    iep_SOURCES = iep.c
    infoalign_SOURCES = infoalign.c

The Makefile.am in the ACD directory contains information about each ACD file. All that needs to be done for this file is to add the name of the new ACD file. Again, it is usual to do this alphabetically. Here are the before and after positions:

i) Before editing

    pkgdata_DATA = codes.english \
    aaindexextract.acd abiview.acd ajbad.acd ajfeatest.acd ajtest.acd \
    ...
    garnier.acd geecee.acd getorf.acd helixturnhelix.acd hmoment.acd \
    histogramtest.acd iep.acd infoalign.acd infoseq.acd isochore.acd \
    lindna.acd listor.acd \
    marscan.acd maskfeat.acd maskseq.acd \
    matcher.acd

ii) After editing

    pkgdata_DATA = codes.english \
    aaindexextract.acd abiview.acd ajbad.acd ajfeatest.acd ajtest.acd \
    ...
    garnier.acd geecee.acd getorf.acd helixturnhelix.acd helloworld.acd \
    hmoment.acd histogramtest.acd iep.acd infoalign.acd infoseq.acd \
    isochore.acd lindna.acd listor.acd \
    marscan.acd maskfeat.acd maskseq.acd \
    matcher.acd

Again, line continuation characters ('\') must be added explicitly.

6) Compile

After the two Makefile.am files have been correctly edited, you can compile the application by typing "make helloworld" from the executables directory (quickest way) or "make" from either the executables or second emboss directory (slower as this will compile everything). The GNU tools will recognise whether the Makefile.am files have been edited and reconstruct the Makefile files when a "make" command is given. It is bad practice to edit the Makefile files themselves.

    Unix % pwd
    /home/fred/emboss/emboss/emboss/
    Unix % make helloworld
    /bin/sh ../libtool --tag=CC --mode=link gcc -O2 -Wall -fno-strict-aliasing -o helloworld helloworld.o ../nucleus/libnucleus.la ../ajax/libajaxg.la ../ajax/libajax.la ../plplot/libplplot.la -L/usr/X11R6/lib -lX11 -lm -lgd -lpng -lz -lm
    gcc -O2 -Wall -fno-strict-aliasing -o .libs/helloworld helloworld.o ../nucleus/.libs/libnucleus.so ../ajax/.libs/libajaxg.so ../ajax/.libs/libajax.so ../plplot/.libs/libplplot.so -L/usr/X11R6/lib -lX11 -lgd -lpng -lz -lm -Wl,--rpath -Wl,/home/jison/emboss_test_installation_for_course/emboss/lib
    creating helloworld

And at last:

    Unix % helloworld
    My first EMBOSS program
    Hello, World!
    Unix %

Talk 4 - Introduction to Objects Using Strings
Background information for Practical 4

1. EMBOSS objects

EMBOSS borrows the concept of objects from C++.
In C++, an object can be thought of as a 'black box' which takes a defined input and produces a defined output. An object stores its own data (member data) and knows how to perform certain actions (via member functions). From the perspective of the user of the object, it doesn't matter what is going on inside so long as the interface, i.e. the inputs and outputs, remains the same. The use of objects allows the programmer to model their program on the problem, breaking it down into smaller, easily managed pieces.

It's important to make the distinction between an object and an instance of an object. Strictly speaking, an object is a definition, or a template for instances of that object. The instance is the actual thing that can be manipulated. If you want to do anything you must create an instance, i.e. instantiate the object.

The objects in EMBOSS are C data structure definitions. Elements in the structure are the member data. There are no member functions as such; however, all the functions that use an object are documented above the structure definitions themselves, and are accessible from the web and SRS. In this way EMBOSS maintains the link between the data and the functions that act upon it. You can see that for yourself by inspecting a library header file, e.g. ajstr.h. EMBOSS objects should be considered as black boxes: you should never access the elements of an instance directly; that is what the library functions are there for.

The string object is one of the simplest so let's have a look at it now:

    typedef struct AjSStr {
      ajint Res;
      ajint Len;
      ajint Use;
      char *Ptr;
    } AjOStr;

    #define AjPStr AjOStr*
    typedef AjPStr* AjPPStr;

C programmers will see that we have defined a structure called AjSStr of 4 elements (Res, Len, Use and Ptr) and have created three new datatype names: AjOStr for the object itself, AjPStr for the object pointer, and AjPPStr for a pointer to an AjPStr. Don't worry about the elements of the structure, we'll come onto them in a bit.
2. Managing objects

In principle, it's possible to instantiate the object in this way:

    #include "emboss.h"

    int main(void)
    {
        AjOStr my_structure;
    }

This approach is NOT taken in EMBOSS however because it does not give the programmer the freedom to manage the memory of the object. Even if you only need one structure, you should avoid using AjOStr because it would be inconsistent with the rest of EMBOSS. The standard way to instantiate an object is to dynamically allocate memory to the object pointer. It's for this reason (and because AjOStr is almost never used) that we often say 'object' instead of the more cumbersome 'object pointer' when referring to an AjPStr etc - so make sure you understand the difference. All AJAX pointers start with "AjP" so it's easy to guess that a sequence object is an AjPSeq.

All objects should be allocated dynamically and freed once you're done with them. This is easy because a constructor function (for memory allocation) and destructor function (for freeing memory) are provided for every type of object. Here is a code snippet illustrating this:

    #include "emboss.h"

    int main(int argc, char **argv)
    {
        AjPStr my_pointer=NULL;

        embInit("helloworld", argc, argv);

        my_pointer = ajStrNew();

        ajStrAssC(&my_pointer, "Hello, World!\n");
        ajFmtPrint("%S", my_pointer);

        ajStrDel(&my_pointer);

        ajExit();

        return 0;
    }

AjPStr my_pointer=NULL; declares the object pointer and initialises it to NULL. Pointers should always be set to NULL when they are declared because EMBOSS functions presume that non-NULL pointers have had memory allocated to them. If you do not set the pointer to NULL, it may hold some junk value when the program runs and any function that uses it might mistakenly think memory had been allocated for it - which might lead to a segmentation fault!

ajStrNew() is the constructor function. This allocates a block of memory for the object and returns the memory address of the allocated block. The memory address is held in the variable my_pointer.
Disregard the calls to ajStrAssC and ajFmtPrint for the time being.

ajStrDel() is the destructor function; this must not only free the memory but also set the pointer back to NULL so that it is ready for re-use. You see we pass it the address of my_pointer. You may be wondering, as my_pointer is a pointer anyway, why do we need to pass a pointer to it to get the memory freed? The answer is simple if you remember that in C, function arguments are passed "by value". A temporary copy of each argument is created and passed to the function, rather than the originals. While a copy of the pointer would be enough to free the memory that is pointed to, we need a handle on (a pointer to) the original if we want to set the original to NULL, hence the requirement for &my_pointer.

3. The golden rule

For consistency, all functions in EMBOSS use the following rule:

If the function is to change the object or the data pointed to in any way, the address of the object pointer must be passed. If the function merely reads the data pointed to, just pass in the object pointer.

Armed with this rule, we can see that ajStrAssC, which copies text into an AjPStr, must receive the address of my_pointer, whereas ajFmtPrint, which merely prints an AjPStr, only requires my_pointer. The above rule is one of the most important keys to coding in EMBOSS. There'll be more on pointers and objects later in the course, so don't worry if you don't fully understand what's going on yet - just remember and apply the rule and you'll be fine. The library functions are well documented as to whether an object or address of an object is required so you can always find out what's needed if you're not sure.

4. Elements of the string object

Now back to the AjPStr:

    typedef struct AjSStr {
      ajint Res;
      ajint Len;
      ajint Use;
      char *Ptr;
    } AjOStr;

    #define AjPStr AjOStr*
    typedef AjPStr* AjPPStr;

The char *Ptr; is just a standard C pointer which holds a character string and the ajint Len; is its length.
The character string may or may not be null terminated; the library functions for printing AjPStr objects look at the length field for how many characters to print; they won't stop at the first null character if there is one.

The ajint Res; element internally lets the library know how much reserved dynamic memory is associated with the object. This is always at least equal to Len but is often more. Res is and should be outside your direct control. If you use a library call to add anything to the string and it fits within the memory given by Res, the operation is performed immediately; if the memory required is larger than Res then more memory is allocated and the Res item is updated. A little more memory than required is usually allocated.

ajint Use; is the string usage counter. Sometimes, you'll want two or more references to a single string rather than making a genuine copy. EMBOSS functions that do this increment the string's usage counter. The usage counter is decremented when a call to destroy the string itself or a reference to it is made. When the usage counter reaches zero the object will be deleted.

All of this of course is inside the "black box" which means you don't need to worry about it, so long as you don't play with the object internals directly. The important message is: If you intend altering the contents of an object then safety is guaranteed if you use the available library functions.

Talk 5 - ACD Files : Basic Skills
Background information for Practical 5

The notes below should describe everything you need to do this practical. If you need further information on ACD files, more comprehensive documentation is available, and will serve as a reference for the rest of the course. Please note there might be some slight differences in the nomenclature used.

1. 
Introduction to ACD files

The ACD (Ajax Command Definitions) file describes the data that the program needs to run; this means all the program parameters including things such as input and output files. It indicates whether data items are mandatory or not and whether certain data have to be within limits; a gap penalty for an alignment must be higher than 0 for instance. It can also indicate whether the value of one data item is dependent on the value or the presence of another. For example, if the input sequence for an alignment program is DNA, it should not accept a protein comparison matrix. We'll cover some of these issues in this session and others later in the course.

2. General ACD syntax

All ACD definitions ("data items") have the same general syntax which is:

    datatype: label [ attributes ]

The available datatypes are pre-defined and include sequence, integer, float and many other types. There is a complete list of supported datatypes.

The label can be anything you like within reason. This is the name by which the data item is referred to both from within your C source code and on the command line. The programmer must have a handle on each ACD data item from within the C source code and the label is used for this purpose, as you'll see later. On the command line, the value of a data item can be set if the label is specified. For this reason the label can be referred to as the qualifier name.

The attributes allow you to specify such things as informative text, default values, maxima and minima and so on. Global attributes apply to all datatypes whereas others are datatype-specific, and are assigned values (calculated in some cases) after the data item is validated, for example once a sequence has been read in from file. Here is a complete list of the supported global and datatype-specific attributes.

Comment lines can be added to an ACD file; begin each comment line with '#'. Any whitespace in an ACD file is ignored.

3. 
Application group

The application definition was mentioned in Practical 3 and supports various attributes. Today, we're interested in the groups attribute which associates the application with programs which do similar things or different things in the same general area. The groups attribute is used by the seealso application which takes the name of an existing program in EMBOSS and gives a list of the programs which share some functionality with it. Please refer to the list of valid group names. The group names there can be preceded by 'DNA:' or 'PROTEIN:' if appropriate; 'PROTEIN: Alignment consensus' is valid for example.

4. Retrieving ACD values

Let's reiterate that for EMBOSS applications, all input is read and held in memory before the application proper starts. What this means in terms of C source code is that you've made a call to embInit which, amongst other things, will have read all the data items and placed them in memory by the time it returns.

The ajAcdGet family of functions is used to retrieve values for ACD data items. These functions are in the library file ajacd.c that is summarised here.

5. Qualifiers and parameters

As mentioned above, the value of a data item can be specified by the user on the command line and the label (or qualifier name) is used for this purpose. In fact, each ACD data item can be specified to be one of the following:
Parameter: "Y" means that the data item is a parameter, i.e. you do not have to use the data label to specify a value for it on the command line. e.g. "myprog 10". Standard: "Y" and Additional: "Y" mean that the data item is a qualifier, i.e. you DO have to use the data label to specify a value for it on the command line. e.g. "myprog -somevalue 10". Values for parameters and standard qualifiers are always prompted for (with their default value) if not specified on the command line.
The standard and additional qualifiers also differ in where the information from the built-in "-help" qualifier is shown. Help info. from 'standard' qualifiers appears in the 'required' section of the help display. Help info. from the 'additional' qualifiers appears in the 'advanced' section of the help display. If none of Parameter: "Y", Standard: "Y" or Additional: "Y" are specified then the data item defaults to an Advanced qualifier. An advanced qualifier is never prompted for and would appear in the "advanced" section of the documentation if the program was run with the '-help' qualifier. To specify a value for an advanced qualifier on the command line you must use the data label.

You wouldn't normally specify Parameter: "N", Standard: "N" or Additional: "N". The "Y" in the previous definitions is given for consistency because every ACD attribute, being a label:value pair, has to have a value. You'll see later, however, that the "N" can in exceptional circumstances be used to override the default behaviour of these attributes.

Talk 6 - EMBOSS String Handling
Background information for Practical 6

All the information you need is given in the practical.

Talk 7 (Part I) - The AJAX Library & The Software Cycle
Background information for Practical 7 (Part I)

1. Overview of the AJAX library

By now you should be reasonably familiar with the string library. There is however much, much more to AJAX than just strings. AJAX is the core library used by all EMBOSS applications and covers standard data structures, including strings, sequences, file handles, queues, hashes, heaps, lists, dictionaries, trees and dynamic arrays. It also covers standard algorithms including comparisons, pattern matching, sorting, and iterators.

You can't hope to cover everything, but you'll need to become familiar with several other library files to do the practicals today. The main libraries you'll need are:
2. The Software Development Cycle
"The first step toward the management of disease was replacement of demon theories and humours theories by the germ theory. That very step, the beginning of hope, in itself dashed all hopes of magical solutions. It told workers that progress would be made stepwise, at great effort, and that a persistent, unremitting care would have to be paid to a discipline of cleanliness. So it is with software engineering today." - Fred Brooks, No Silver Bullet.
1) Suggested implementation steps under EMBOSS There's much more to software than just coding; especially when programming for EMBOSS which is very widely distributed and deployed in production environments. You should familiarise yourself with the basic steps now, so that you know what's involved before starting a project and so can plan your work. This should help you develop your software efficiently and deliver your projects on time. As mentioned before, the basic steps to write an EMBOSS application are:
Each of these steps is essential and takes time, so the first thing that should be obvious is that a software project will take much more time than that required to simply write the code. 2) Consider your users The steps above would lead to a finished product, but that's only the start of the lifetime of your software. In practice, you must consider your users and there are four fundamental processes in this regard:
You should now appreciate that writing the first version of the code is the beginning rather than the end of the project. The four steps above have many implications, especially in terms of additional requirements (e.g. how to survey user feedback) and strategy (e.g. how best to implement for evolving requirements). These issues are beyond the scope of this course, but the take home message is: Stay in touch with your users (even if you're the user:) at all stages in the software development cycle. 3) Models for developing and releasing software In terms of developing and releasing software, there is no universal model but the following stages are typical:
Of course in practice early releases often contain many bugs and for this reason people are usually wary of software until it has matured over a period of months or even years. In some cases, usually to adapt the software for evolving requirements, it may be necessary to move the software back into beta or even to start this entire cycle from scratch in cases where a complete redesign is necessary. 4) Software engineering and project management There is a well-defined standard (IEEE 1074) developed by the Institute of Electrical and Electronics Engineers for creating a software life cycle process. The standard is intended for use by process architects (e.g. project managers) but should be useful for managing & performing software projects. You'll certainly not need it for this course but you might benefit from a quick look now: IEEE 1074: Standard for Developing Life Cycle Processes
If you're interested in learning more, there's a nice introduction to software engineering. Here are a few useful definitions of the term:
But really the take-home message for today is ... Take a disciplined, organised step-wise approach to implementing your software.

Talk 7 (Part II) - ACD files : Intermediate Skills
Further background information for Practical 7 (Part II)

1. Parameters and Qualifiers

Within ACD, all application parameters ("parameter" here is used in the computer science sense of the word, i.e. any value that's passed to an application) correspond to an ACD data item and are defined via the appropriate ACD attribute to be one of "parameter", "standard" or "additional" with the default of "advanced", using the following syntax:

    Parameter: "Y"
    Standard: "Y"      (used to be 'required')
    Additional: "Y"    (used to be 'optional')

Their behaviour is as follows:
2. Summary

Parameter: "Y" means that the data item is a parameter, i.e. you do not have to use the data label to specify a value for it on the command line, e.g. myprog 10

Standard: "Y" and Additional: "Y" mean that the data item is a qualifier, i.e. you DO have to use the data label to specify a value for it on the command line, e.g. myprog -somevalue 10

Values for parameters and standard qualifiers are always prompted for (with their default value) if not specified on the command line.
The standard and additional qualifiers also differ in where the information from the built-in "-help" qualifier is shown. Help info. from 'standard' attributes appears in the 'required' section of the help display. Help info. from the 'additional' attribute appears in the 'advanced' section of the help display. If none of Parameter: "Y", Standard: "Y" or Additional: "Y" are specified then the data item defaults to an Advanced qualifier.
You'd never normally specify Paramter: "N", Standard: "N" or Additional: "N". Talk 7 (Part III) - Sequence HandlingFurther background information for Practical 7 (Part III) 1. Uniform Sequence AddressesThe Uniform Sequence Address , or USA, is a standard sequence naming scheme used by all EMBOSS applications.The USA syntax has the following types:
Where "format" is the database format of a file ("file") you have provided and "entry" is the database entry code. Alternatively an entry can be retrieved from an installed database, "dbname". "listfile" is the name of a file which itself contains a list of file names. The "::" and ":" syntax is to allow, for example, "embl" and "pir" to be both database names and sequence formats.

2. Sequence formats

Many different sequence formats are supported. You can specify the format of your input file by adding "-sformat format" on the command line or by giving it in the USA (Uniform Sequence Address) of the input filename, e.g. embl::myfile.seq

The format is not required, however. When reading in a sequence, EMBOSS will guess the sequence format by trying all known formats until one succeeds. When writing out a sequence, EMBOSS will use fasta format by default. The output format can be changed by adding "-osformat format" on the command line or by giving it in the USA of the output filename, e.g. gcg::myresults.seq

3. Input sequence command-line qualifiers

There are other built-in command-line qualifiers that change the behaviour of the sequence input.

    -sbegin       integer  first base used
    -send         integer  last base used, default=seq length
    -sreverse     boolean  reverse (if DNA)
    -sask         boolean  ask for begin/end/reverse
    -snucleotide  boolean  sequence is nucleotide
    -sprotein     boolean  sequence is protein
    -slower       boolean  make lower case
    -supper       boolean  make upper case
    -sformat      string   input sequence format
    -sopenfile    string   input filename
    -sdbname      string   database name
    -sid          string   entryname
    -ufo          string   UFO features
    -fformat      string   features format
    -fopenfile    string   features file name

4. Output sequence command-line qualifiers

There are other command-line qualifiers that change the behaviour of the sequence output.

    -osformat     string   output sequence file format
    -osextension  string   file name extension
    -osname       string   base file name
    -osdirectory  bool     output sequence file directory
    -osdbname     string   database name to add
    -ossingle     bool     create a separate output file for each entry
    -oufo         string   feature file to create
    -offormat     string   features format
    -ofname       string   features file name
    -ofdirectory  string   features output directory

5. Supported output format

By default, fasta format will be used for output. The output format can be changed by adding "-osformat format" on the command line or by giving it in the USA of the output filename, e.g. "embl::myfile.seq". There are many possible output formats which include:
6. AJAX library files for sequence handling
Talk 7 (Part IV) - ACD Files : Advanced Skills

Background information for Practical 7 (Part IV)

If you need further information on ACD files, very comprehensive documentation is available.

Talk 7 (Part V) - Software Consolidation

Background information for Practical 7 (Part V)

1. Coding standards

To ensure consistency in the EMBOSS code, all application and library code that you write should conform to basic standards. These are formally defined in the EMBOSS coding standards document. Currently, you will probably find exceptions to these standards but in the future they will be enforced (or at least exceptions raised) by software which automatically checks submitted code against the standards. You should at least familiarise yourself with the standards, most of which concern the layout of code and are summarised below.

Ease of reading

Your code should be easy to read. This is perhaps more important than the code actually working (if it's easy to read then someone else stands a chance of fixing it).

a) Line length limit
No line should be longer than 79/80 characters. Line-wrap is ugly on the screen and sometimes even disappears on printouts.

Style

a) Braces
Matching braces should appear in the same column. Do not use braces unnecessarily.

Indentation

Indentation of 4 characters is recommended. If you find that indentation within nested loops results in many of the lines wrapping then you should check whether the code structure can be improved.

Position of main()

Declare your main() function as the first function after the preamble. This saves people from having to wade through countless functions before they find it. It also helps to prevent implicit declarations cropping up.

Implicit declarations

Examples of these would be:

    extern fubar(int x);
    main(int argc, char **argv)

Do not use them. Always specify what a function should return. Having main() as your first function can go some way towards alleviating the problem.

Use of 'int', 'ajint', 'long' and 'ajlong'

EMBOSS assumes an ajint is at least 32 bits. Use ajint, if 32 bits is enough, instead of 'int'. EMBOSS assumes an ajlong is at least 64 bits. Use this instead of 'long'. This circumvents any 'long' or 'long long' problems. Of course, if you are using an Alpha box then both your ints and longs will be 64 bits. In this case don't just use 'ajlong' out of laziness as your code will run more slowly on other platforms. Match your datatype to what you need. Cases where int and long should be used are (e.g.) as parameters to C system library functions.

Global variables

Do not use them.

Static variables in functions

Such functions are not re-entrant so do not use them at all if you'd like the code to work in multi-threaded contexts. Otherwise, do not use them without very strong reason.

Variable declarations

Declare all variables at the top of each function. Don't be lazy, declare one variable per line. Note that it is easier to read if both datatypes and variables line up. Always initialise Object pointer variables to NULL. Initialise other datatypes as appropriate in the code. Align initialisations for easy reading. Consider documenting key variables but do not document individual house-keeping variables, for example, loop counters.

Precedence of operators

Avoid the confusion that comes from relying on operator precedence.

Organisation

a) Source File
Use the following organisation for source files:
b) Header FileIn header files, use the following organisation:
It is not necessary to make the declarations 'extern' (although it is arguably safer to do so. Current EMBOSS library code doesn't do this.) Never use nested includes! (Look at Solaris header files to see how not to do things). Avoid exporting names outside individual C source files; i.e., declare as static every function that you possibly can. Always use full ANSI C prototypes. Structures and unionsAlways use the EMBOSS method of declaring structures and unions i.e. use typedefs. They must contain a structure name, object name and pointer name even if these are not used. You should only ever need to use the pointer declaration within your code.
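The structure, object and pointer naming described here can be sketched in plain C. This is only an illustration: the names ExSPoint, ExOPoint and ExPPoint (and the two helper functions) are hypothetical and do not exist in AJAX or NUCLEUS, and calloc() and free() are used directly only so that the sketch compiles without the EMBOSS libraries.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical example object: ExSPoint, ExOPoint and ExPPoint are
** illustrative names only, not part of AJAX or NUCLEUS. */
typedef struct ExSPoint          /* structure name ('S') */
{
    int x;
    int y;
} ExOPoint;                      /* object name ('O')    */

typedef ExOPoint* ExPPoint;      /* pointer name ('P')   */

/* Constructor: returns a zero-initialised object, or NULL on failure.
** Real EMBOSS code would use the AJAX allocation macros instead. */
static ExPPoint exPointNew(void)
{
    ExPPoint ret = NULL;

    ret = calloc(1, sizeof(*ret));

    return ret;
}

/* Destructor: frees the object and resets the caller's pointer to NULL */
static void exPointDel(ExPPoint *pthis)
{
    assert(pthis != NULL);

    free(*pthis);
    *pthis = NULL;

    return;
}
```

Calling code would declare "ExPPoint point = NULL;" and only ever handle the pointer type, never the structure or object names directly.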
Use AjS, AjO, AjP or EmbS, EmbO and EmbP as appropriate. Use of this convention, i.e. 'P' for pointers, avoids problems with these abstract datatypes where a pointer could appear to be an object in its own right.

Use of the preprocessor

For constants, consider using:

    enum { Red = 0xF00, Blue = 0x0F0, Green = 0x00F };
    static const float pi = 3.14159265358;

instead of #defines, which are rarely visible in debuggers. Macros should avoid side effects. If possible, mention each argument exactly once. Put parentheses around all arguments. When the macro is an expression, put parentheses around the whole macro body. If the macro is the inline expansion of some function then capitalise the name and precede it with 'M'. Try to write macros so that they are syntactically expressions, i.e. so you can put a semi-colon at the end of their use. The preprocessor can be used for:
Function names
Argument names
Loops

Try to avoid the use of 'continue'.

Goto

Do not use it.

Memory and functions

Do not define large arrays as declarations within functions. They will go on the stack and can cause many problems. Do not explicitly use malloc() or calloc(), instead use the AJAX macros e.g. AJALLOC, AJNEW, AJCNEW, AJNEW0, AJCNEW0 etc. Use constructor functions explicitly, i.e.

    AjPStr tmpstr = NULL;
    tmpstr = ajStrNew();
    ajStrAssC(&tmpstr,"Hello");

and not:

    AjPStr tmpstr = NULL;
    ajStrAssC(&tmpstr,"Hello");

If at all possible put constructors at the start of the function. They will act as a reminder to put destructor functions at the bottom of the function. Doing the above will solve the single biggest cause of memory leaks. A good function has the following structure:
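One possible reading of that structure is sketched below: declarations at the top (pointers initialised to NULL), construction at the start, the working code in the middle, and destruction immediately before the single normal return. The function exIsPalindrome is hypothetical and written in plain C with malloc() and free() so that it compiles without AJAX; in EMBOSS code the constructor and destructor would be AJAX calls instead.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not an AJAX function.
** Returns 1 if 'text' reads the same in both directions (ignoring case),
** 0 if it does not, and -1 if memory could not be allocated. */
static int exIsPalindrome(const char *text)
{
    char   *tmp = NULL;          /* declarations first, one per line */
    size_t  len = 0;
    size_t  i   = 0;
    int     ret = 1;

    assert(text != NULL);

    /* Constructor at the start of the function */
    len = strlen(text);
    tmp = malloc(len + 1);

    if(!tmp)
        return -1;

    /* Body: work on a lower-case copy of the input */
    for(i = 0; i < len; i++)
        tmp[i] = (char) tolower((unsigned char) text[i]);

    tmp[len] = '\0';

    for(i = 0; i < len / 2; i++)
        if(tmp[i] != tmp[len - 1 - i])
            ret = 0;

    /* Destructor at the bottom, mirroring the constructor above */
    free(tmp);
    tmp = NULL;

    return ret;
}
```

Keeping the constructor and destructor as a visible pair at the two ends of the function makes a missing free easy to spot when reading the code.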
Separate functions by suitable whitespace (4 newlines are recommended).

Applications

A general rule is: A separate application should only be written if it differs by more than one extra major parameter/qualifier to an existing program. New applications should be put into the 'make check' area of the emboss/Makefile.am until full documentation has been submitted. See the documentation standards document for details. Code should be tested for memory leakage before committing it to CVS. If you are unclear how to do this then ask.

General guidelines

Duplicated code

Duplicated code is error-prone and difficult to maintain. Do not duplicate blocks of code, write a function instead. Where two functions do essentially the same thing but have different arguments, make one function simply call the other.

Long functions

Big functions are difficult to understand. Smaller functions are easier to document and therefore easier for the programmer to identify. Functionality split into smaller functions is more likely to be re-used. Consider breaking big functions down into smaller ones. If necessary, retain the function with the original name which can call the new, smaller functions. Avoid too many levels of function calls though (see "Nesting of functions" below).

Long parameter lists

Functions with many parameters are difficult to understand, use and maintain. Where possible, consider passing an object pointer rather than the individual elements of an object. If the parameters do not belong to an object, consider defining a new object to encapsulate them and pass a pointer to that instead.

Managing change to code

Your code should be easy to modify for new functionality. Where you find yourself modifying multiple objects or functions to implement a single change it's likely your data model or program structure is not ideal. Consider defining a new object containing the elements you need or new functions as appropriate.

Managing variables

Functions with long lists of variables are difficult to understand and maintain. Where a group of variables are always used together, consider encapsulating them in a new object, especially where the group reoccurs elsewhere in your code.

Switch statements

Consider using "switch" statements to improve the readability of code where you have excessively long chains of "if else" statements. Where the same switch statement is duplicated throughout the code, however, this will be difficult to maintain. Consider changing your code (probably the data model) so that the switch is not needed.

Over-engineered code

A common mistake is to waste time implementing functionality that you think you'll need one day, but never actually do. Over-engineered code is confusing and difficult to maintain. Only program what you need today, but design your code so that it can, if necessary, be extended in the future.

Keep objects clean

The purpose of each element in an object should be obvious. Objects containing variables that are only rarely used, for instance for house-keeping or to hold temporary variables, are difficult to understand. Review your code and establish whether the variable really needs to be in the object or whether it can be moved somewhere else.

Nesting of functions

Code which uses deeply nested chains of functions is extremely difficult to understand. Review your code and simplify it if necessary.

Object overlap

Where two or more different objects share common elements there is likely scope for removing redundancy throughout your code. Consider whether a new object encapsulating the common elements would make your code easier to understand and maintain.

Use of libraries

It is very wasteful to write code unnecessarily: as often as not the functionality you seek will be available in the AJAX or NUCLEUS library.
Check the libraries before implementing new functionality and contribute any new code so that it can be incorporated into the libraries.2. Code documentationa) Code documentation standardsAll EMBOSS applications should adhere to formally defined code documentation standards which currently cover:
b) CommentsComments can add immensely to the readability of a program, but used heavily or poorly placed they can render good code completely incomprehensible. It is better to err on the side of too few comments rather than far too many - at least then people can find the code! Also, if your code needs a comment to be understood, then you should look for ways to rewrite the code to be clearer. Do not write comments that might get out of date. An inaccurate or misleading comment hurts more than a good comment helps. Be sure that your comments stay correct.Good places to put comments are:
Avoid fancy layout or decoration.

c) GPL licence
This should be used where possible. The appropriate header (see any EMBOSS application) should be placed at the top of each program.

d) main() function
This should be preceded by a header block matching the following format.

    /* @prog water ****************************************************************
    **
    ** Smith-Waterman local alignment
    **
    ******************************************************************************/

e) Application functions
Documentation blocks should appear before the function. Functions should have a name beginning with "applicationname_" and adhere to the following format.
    /* @funcstatic tcode_readdata ********************************************
    **
    ** Read Etcode.dat data file
    **
    ** @param [w] table1 [AjPFTestcode*] data object
    ** @param [r] datafile [AjPFile] data file object
    ** @return [AjBool] true if successful read
    ** @@
    ******************************************************************************/

All function parameters should be marked read [r] or write [w], give the parameter variable name (e.g. datafile), the datatype (e.g. AjPFile) and a short description. Return values should be stated and described. Functions that return void use

    ** @return [void]

See the code documentation standards for a list of supported tags (@param etc).

f) Library functions
These should be documented in the same way as application functions. Static functions are labelled "@funcstatic" whereas exported functions are labelled "@func". For exported functions a prototype is declared in an associated header file (see library header files for examples of header file documentation).

3. Application documentation

Documentation of the code itself can be invaluable to other developers but is of little value to the biologist end-user. For this audience an entirely different sort of documentation is required. Full application documentation in a format suitable for end-user biologists is available on-line. Every EMBOSS application should be well documented and should adhere to the EMBOSS style; see, for example, the documentation for the seqret application.

a) APPLICATION DOCUMENTATION PROCESS

To document a new program, ensure you have an up-to-date set of programs compiled, and that any programs you've written have had their executable deleted, otherwise references to them might occur in the automatically-generated "See Also" sections (see below).

To generate the documentation, run the script autodoc.pl on each application you wish to document in turn:

    /home/fred/emboss/emboss/scripts/autodoc.pl application_name
    (for EMBOSS applications)

    /home/fred/emboss/emboss/scripts/autodoc.pl -embassy=embassy_package_name application_name
    (for EMBASSY applications)

You should replace embassy_package_name and application_name with something sensible. The following instructions presume you are working in the EMBASSY package "myemboss" and are writing a program called myprogram.
mytest 'Demonstration of sequence reading' Doing test mytest-ex /homes/pmr/cvsemboss/embassy/myemboss/emboss_doc/html/mytest.html *created* /homes/pmr/cvsemboss/embassy/myemboss/emboss_doc/text/mytest.txt *replaced* The script will run wossname to check that application_name really exists, then generate a template documentation file (for you to fill in) with include directives, plus include files for:
Use an editor of your choice to edit the template, adding documentation text. The template should be adequately commented for you to see how to fill it out. You should see the directives to read the include files, which are created for you by autodoc.pl and by the QA test procedure. Once you complete the template and save it, the application index file (to appear on the web - you don't need to worry about this file) will need to be updated. The entry for the new application will be inserted into the correct (alphabetic) position in the index file. In brief, the QA test is performed as follows:
Example of entry in qatest.dat

    ID myprogram-ex
    AB myemboss
    AA myprogram
    IN
    FI stderr
    FC = 2
    FP 0 /Warning: /
    FP 0 /Error: /
    FP 0 /Died: /
    FI paamir.myprogram
    FC = 5
    FP /^Usa: tembl-id:PAAMIR\n/
    FP /^Length: 2167\n/
    //

To run the test type:
If it fails, check directory myprogram-ex which contains all the files. Update the test definition and try again. When it works, rerun autodoc.pl as above and it will create the remaining 3 include files (usage example, input and output files). When all is done, the complete HTML documentation is created in embassy/myemboss/emboss_doc/html/myprogram.html

For EMBOSS applications, work in the doc/programs/master/emboss/apps/ directory instead. Leave out the -embassy=myemboss qualifier from the autodoc.pl command line. The test definition does not have the "AB myemboss" line, and has "AP myprogram" instead of "AA myprogram". Final documentation goes to doc/programs/html/myprogram.html

There is a little work to do to update the index.html file and the Makefile.am files in the html and text directories - we hope to automate this in time for release 4.0.0.

b) DOCUMENTATION FILES

The example below is for seqret. All paths are relative to the doc directory, e.g. /home/fred/emboss/emboss/doc/.

i) Template html file

    ./programs/master/emboss/apps/seqret.html

This is the template mentioned above with include directives. This should be copied along with the include files to the EMBOSS website on sourceforge (the EMBOSS team have a script for doing this).

ii) Include files

These files (and in most cases their contents) are generated automatically by autodoc.pl.

    ./programs/master/inc/seqret.ione     One-line description. Taken from ACD file.
    ./programs/master/inc/seqret.ihelp    Help table*. Generated by running the application with "-help".
    ./programs/master/inc/seqret.itable   Documentation table**. Generated by running acdtable.
    ./programs/master/inc/seqret.isee     "See also" list. Generated by running seealso.
    ./programs/master/inc/seqret.usage    Usage example. Generated via QA tests.
    ./programs/master/inc/seqret.input    Application input files. Generated from QA tests.
    ./programs/master/inc/seqret.output   Application output files. Generated from QA tests.
    ./programs/master/inc/seqret.comment  Comments. Written manually - usually blank.
    ./programs/master/inc/seqret.history  History. Written manually - usually blank.

    (*  Application parameters and appropriate "in-built" qualifiers)
    (** Application parameters)

iii) Raw documentation text

    ./programs/text/seqret.txt   Documentation (with included text) in plain text format and organised into sections. Used for manual pages and displayed when running application_name -help. This is generated by autodoc.pl.
    ./programs/html/seqret.html  Final html file, with all data included. Documentation for the web.

These might be moved to ./doc/programs/html/apps/ or ./doc/programs/html/embassy/* in the future.

4. Application quality assurance tests

Each EMBOSS application is run on test data to ensure that it works as advertised. These tests are performed nightly to ensure that the applications are not broken, e.g. by recent changes to the library code. A set of test data consists of input files, application parameters and the corresponding output files. As many sets of test data should be provided as possible, especially for unusual input conditions, to provide as robust a test as possible.

The tests are defined in the file /home/fred/emboss/emboss/test/qatest.dat. There is documentation at the start of that file which describes the records used to define a test. Each test you define should write its output files to the same directory as the application is run in. It is possible to create sub-directories and write files to them though.

To run the test for a specific application, from the /home/fred/emboss/emboss/test/test/qa directory:

    ../../scripts/qatest.pl test_name

where test_name is the name of the test given on the ID line of the appropriate entry in qatest.dat. If qatest.pl is run on something not defined in qatest.dat it will report "Tests total: 0".
To perform tests, you must edit the .embossrc file located in the test directory. Make sure to set "emboss_qadata" to the appropriate test directory (e.g. /home/fred/emboss/emboss/test).

Talk 8 - Data Input : Using Features

Background information for Practical 8

The notes below are a summary of the on-line feature documentation.

What is a feature?

A feature is a region of interest in a nucleic or protein sequence and consists of:
A feature table is a group of features. Examples of biological data corresponding to features include restriction enzyme cut sites, probabilities of the three states of a protein secondary structure prediction and tables of the start and end positions of things like predicted exons or motif matches. As most sequence analysis programs generate interesting regions in one form or another, there are a huge number of diverse file formats corresponding to features.

Feature formats

In EMBOSS, features are represented in a variety of standard file formats. The standard formats provide a consistent look and feel to features, helping the user compare the features from different programs more easily. Standardisation also facilitates application interoperability (applications can more easily share their input and output). The standard file formats will become the default way of reporting sequence features as the EMBOSS project matures.

What are the formats?

EMBOSS uses the well-defined and flexible feature formats that were developed for the major sequence databases (EMBL, Genbank, SwissProt, PIR) and for the input of features into the genome databases (GFF, acedb). Feature tables are stored in one of three ways:
In all cases, the feature format is identical to that used in the sequence database format of the same name, e.g. EMBL feature format is the same as the (subset of the) EMBL sequence format. This holds for when a raw feature table is output too. The following feature formats are understood by EMBOSS.
Uniform Feature ObjectEMBOSS defines a 'UFO' (Uniform Feature Object) as a standard way to refer to a feature file. The UFO specifies:
UFOs can be used for input and output. If no format is specified, the default 'GFF' format is used. You can override the default with a different format when you run the program (see below). ACD datatypes and built-in command line qualifiers for handling featuresACD provides two datatypes for handling features:
These command-line qualifiers change the behaviour of a features ACD datatype:

    -fformat       string   features format                Default: ""
    -fopenfile     string   features file name             Default: ""
    -fask          bool     prompt for begin/end/reverse   Default: N
    -fbegin        integer  first base used                Default: 0
    -fend          integer  last base used, def=max length Default: 0
    -freverse      bool     reverse (if DNA)               Default: N

These command-line qualifiers change the behaviour of a featout ACD datatype:

    -offormat      string   output feature format          Default: ""
    -ofopenfile    string   features file name             Default: ""
    -ofextension   string   file name extension            Default: ""
    -ofname        string   base file name                 Default: ""
    -ofsingle      bool     separate file for each entry   Default: N
    -ofdirectory   bool     Output feature file directory  Default: ""

Their use is described below. Note that the sequence, seqall, seqset & seqsetall datatypes (for sequence input) and seqout, seqoutall & seqoutset datatypes (for sequence output) can also read / write features if their features ACD attribute is set. If set, the sequence output will include feature information either in the same file (if the sequence format supports it) or in a separate file (by default in GFF format).

Reading features

If the feature table is included in the sequence input file (as is generally the case when you are reading the sequence from a database), then the feature table will be read with no problem. To read a raw feature table from file, you must specify the -ufo in-built qualifier on the command line, e.g. '-ufo gff:results.dat'. Alternatively, the '-fformat' and '-fopenfile' qualifiers can be used to specify the feature format and the file name individually instead of as part of a UFO. Using '-ufo' or '-fopenfile' to read in a feature table will cause the new feature table to replace any existing feature table that is part of the sequence data.
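To make the relationship between '-ufo gff:results.dat' and the separate '-fformat' / '-fopenfile' qualifiers concrete, here is a small sketch that splits a UFO string into its format and file name parts, falling back to "gff" when no format is given. It assumes the simple "format:filename" form used in the example above; exUfoSplit and EX_UFO_MAX are hypothetical names, and the real parser inside AJAX is certainly more elaborate.

```c
#include <assert.h>
#include <string.h>

#define EX_UFO_MAX 256   /* hypothetical buffer size for this sketch */

/* Hypothetical helper, not part of AJAX: splits "format:filename" into
** two caller-supplied buffers of EX_UFO_MAX bytes each.
** With no ':' the whole string is the file name and the format is "gff".
** Returns 1 on success, 0 if a part would not fit in its buffer. */
static int exUfoSplit(const char *ufo, char *format, char *filename)
{
    const char *colon = NULL;
    size_t      flen  = 0;

    assert(ufo && format && filename);

    colon = strchr(ufo, ':');

    if(colon)
    {
        flen = (size_t) (colon - ufo);

        if(flen >= EX_UFO_MAX || strlen(colon + 1) >= EX_UFO_MAX)
            return 0;

        memcpy(format, ufo, flen);
        format[flen] = '\0';
        strcpy(filename, colon + 1);
    }
    else
    {
        if(strlen(ufo) >= EX_UFO_MAX)
            return 0;

        strcpy(format, "gff");      /* default feature format */
        strcpy(filename, ufo);
    }

    return 1;
}
```

Under these assumptions, "gff:results.dat" carries the same information as giving "-fformat gff -fopenfile results.dat" separately.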
In-built command line qualifiers for feature input -ufo string UFO features -fformat string features format -fopenfile string features file name If you wish to combine feature table files from various sources, then the easiest way is to concatenate the feature files (must be in the same format!) into one file and to specify that file using '-ufo'. Currently, programs that read and use the feature table of an input sequence include diffseq, extractfeat, maskfeat, seqret and showfeat. Currently there aren't any programs that read in a raw feature table (i.e. one that's not part of sequence). Writing featuresIf a program is capable of writing out sequences with features (for example run "seqret -feature"), then the feature table will be written out as part of the output sequence file, if the format of the sequence file is one of embl, gff, swissprot or pir (i.e. if the sequence field was designed to hold a feature table). If the sequence format cannot hold a feature table (e.g.'fasta'), then a file ('unknown.gff') is written with the raw feature table in GFF format. This behaviour can be overridden by using the command-line qualifiers below. Even if a sequence format that is capable of holding a feature table has been specified, these will enable you to specify a name and format for output to a raw feature table file. Output sequence command-line qualifiers -oufo string UFO features -offormat string features format -ofname string features file name Many programs are capable of writing raw feature tables. The default output format for raw features tables is 'gff', but this can be changed by specifying '-offormat' followed by the format name. Calculated attributesThe features ACD datatype has the following calculated attributes (these are "properties" of an input feature that can be queried within ACD):fbegin (integer) - start of the features to be used. fend (integer) - end of the features to be used. 
flength (integer) - total length of sequence fprotein (boolean) - feature table is protein fnucleic (boolean) - feature table is nucleotide fname (string) - the name of the feature table fsize (string) - number of features

and the following specific attributes: type: (string) - defines whether the feature is "protein" or "nucleotide". There is a default based on the type of an input sequence, but a value should always be specified. nullok: (boolean) - allows a default name for a feature to be replaced by an empty string or by -noxxx (where "xxx" is the ACD label for the feature) on the command line. The application must be able to run without feature input. See below.

The featout ACD datatype has the following specific attributes: format: (string) Default feature format. name: (string) Default base file name (use of -ofname is preferred). Default: "" extension: (string) Default file extension (use of -offormat preferred) type: (string) Defines whether the feature output is "protein" or "nucleotide". There is a default based on the type of any input sequence, but a value should always be specified. multiple: (Y/N) Features for multiple sequences Default: N nullok: (Y/N) Allows a default name for a feature to be replaced by an empty string or by -noxxx (where "xxx" is the ACD label for the feature) on the command line. The application must be able to run without feature output. See below. nulldefault: (Y/N) Defaults to 'no file' Default: N

The output filename is constructed from the name: and extension: attributes in a $(name).$(extension) format. If the name: attribute is not defined in the ACD file, it will default to the calculated attribute name: of the FIRST sequence that is read in (or $(asequence.name), for a sequence parameter named "asequence"). The nulldefault: attribute overrides the default name generation, and uses an empty string (no feature output) as the default for programs where feature output is only occasionally required.
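The default-name rules just described can be sketched as a small function. This is an illustration of the rules as stated in these notes, not the ACD implementation: exFeatoutName and its arguments are hypothetical, and command-line overrides are not modelled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of AJAX: writes the default feature
** output file name, "name.extension", into 'out'.
** An empty 'name' stands for "name: not defined in the ACD file", in
** which case the name of the first sequence read is used instead.
** Returns 1 if a file name was written; returns 0 and leaves 'out'
** empty if nulldefault is set or the result does not fit. */
static int exFeatoutName(const char *name, const char *seqname,
                         const char *extension, int nulldefault,
                         char *out, size_t outsize)
{
    const char *base    = NULL;
    int         written = 0;

    assert(name && seqname && extension && out);
    assert(outsize > 0);

    out[0] = '\0';

    /* nulldefault: the default is "no feature output" */
    if(nulldefault)
        return 0;

    base = (name[0] != '\0') ? name : seqname;

    written = snprintf(out, outsize, "%s.%s", base, extension);

    if(written < 0 || (size_t) written >= outsize)
    {
        out[0] = '\0';
        return 0;
    }

    return 1;
}
```

For a sequence named "paamir" with no name: attribute and an extension of "gff", the sketch produces "paamir.gff"; with nulldefault set it produces an empty string.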
If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line. Talk 9 - Data Output : Using ReportsBackground information for Practical 9
1. Too many formats
Report formats take their origin from the need to deal with EMBL, GenBank and PIR feature tables. It was therefore a natural choice to extend these to cope with other output data. The first thing to consider is that all of the standard sequence feature tables (see Talk 8) are also report formats. These include:
2. Report format theory
3. Method to output reports Each report has a header, body and tail information. The process of outputting using reports is as follows.
4. Example application This looks hard but isn't. As an example we develop below a simple application that will take a sequence and produce a tabular report format showing its length and molecular weight.
4.1 ACD for a molecular weight application.
    application: wreport [
      documentation: "Example report program"
    ]

    section: input [
      info: "input Section"
      type: page
    ]

    sequence: sequence [
      parameter: "Y"
      type: "Protein"
    ]

    endsection: input

    section: advanced [
      info: "advanced Section"
      type: page
    ]

    datafile: aadata [
      information: "Amino acid data file"
      help: "Molecular weight data for amino acids"
      default: "Eamino.dat"
    ]

    endsection: advanced

    section: output [
      info: "output Section"
      type: page
    ]

    report: outfile [
      parameter: "Y"
      rformat: "table"
      multiple: "N"
      precision: "1"
      taglist: "float:molwt int:len"
    ]

    endsection: output

The section and endsection definitions provide a means by which GUIs can be instructed to organize the ACD information on the screen. Each section must always have a corresponding endsection. It is standard practice to have at least an input and an output section definition, adding others as appropriate.

The sequence input has been met before. The datafile datatype is just there so a file of molecular weight data in the EMBOSS data area can be read. It is the report datatype we are interested in ...

The rformat attribute is the equivalent of the "default" attribute in other datatypes. It is the default report format that will be printed if the user doesn't change it with -rformat on the command line.

The multiple attribute says whether multiple reports will be given in the output. This will generally have a value of "N" if you are using the "sequence" datatype and "Y" if you're using the "seqall" datatype.

The precision attribute is for floating point numbers. It gives how many decimal places will be printed in the output.

The most interesting attribute is taglist. This shows, in order, the datatype/column name pairs that will be used in the report. float:molwt therefore means that one of the columns is called 'molwt' and it will contain floating point values. Typical taglist datatypes are :
4.2 C source code
    /* @source wreport application
    **
    ** Show sequence length and molwt as a report
    **
    ** @author: Copyright (C) Alan Bleasby (ableasby@embnet.org)
    ** @@
    **
    ** This program is free software; you can redistribute it and/or
    ** modify it under the terms of the GNU General Public License
    ** as published by the Free Software Foundation; either version 2
    ** of the License, or (at your option) any later version.
    **
    ** This program is distributed in the hope that it will be useful,
    ** but WITHOUT ANY WARRANTY; without even the implied warranty of
    ** MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    ** GNU General Public License for more details.
    **
    ** You should have received a copy of the GNU General Public License
    ** along with this program; if not, write to the Free Software
    ** Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
    ******************************************************************************/

    #include "emboss.h"

    /* @prog wreport **************************************************************
    **
    ** Show sequence length and molwt as a report
    **
    ******************************************************************************/

    int main(int argc, char **argv)
    {
        AjPSeq       seq    = NULL;
        AjPReport    report = NULL;
        AjPFeattable ftable = NULL;
        AjPFeature   feat   = NULL;
        AjPStr       tmpstr = NULL;

        double molwt;
        int    len;

        AjPStr  datafn = NULL;
        AjPFile mfptr  = NULL;

        embInit ("wreport", argc, argv);

        seq    = ajAcdGetSeq ("sequence");
        report = ajAcdGetReport("outfile");

        /* This bit just reads in an EMBOSS data table of molwt info */
        mfptr = ajAcdGetDatafile("aadata");
        embPropAminoRead(mfptr);
        /* End of data file reading */

        /* Calculate the values to output */
        len   = ajSeqLen(seq);
        molwt = embPropCalcMolwt(ajSeqChar(seq),0,len-1);

        /* Create a feature table */
        ftable = ajFeattableNewSeq(seq);

        tmpstr = ajStrNew();

        /* Fill head and tail information for the report */
        ajFmtPrintS(&tmpstr,"This is some Header Text");
        ajReportSetHeader(report, tmpstr);
        ajFmtPrintS(&tmpstr,"This is some Tail Text");
        ajReportSetTail(report, tmpstr);

        /* Create feature object and load with the output values */
        feat = ajFeatNewII(ftable,1,len);
        ajFmtPrintS(&tmpstr,"*molwt %.1f", (float)molwt);
        ajFeatTagAdd(feat,NULL,tmpstr);
        ajFmtPrintS(&tmpstr,"*len %d", len);
        ajFeatTagAdd(feat,NULL,tmpstr);

        /* Write report and clean up */
        ajReportWrite(report,ftable,seq);

        ajFeattableDel(&ftable);
        ajStrDel(&tmpstr);
        ajFileClose(&mfptr);

        ajExit ();
        return 0;
    }

This is a minimalistic program for clarity (e.g. no notice is taken of any -sbegin or -send values). Taking it step by step, here are the declarations:

    AjPSeq       seq    = NULL;
    AjPReport    report = NULL;
    AjPFeattable ftable = NULL;
    AjPFeature   feat   = NULL;
    AjPStr       tmpstr = NULL;

A sequence (AjPSeq) object is obviously needed. A report object (AjPReport) is declared which is used to pick up the ACD information using ajAcdGetReport. The feature table object (AjPFeattable) will be used to hold the feature objects (AjPFeature) containing the column values. A temporary string object is declared into which the column values will be printed. This string will then be used to load the feature object.

After a bit of code to load in a data file containing amino acid molecular weight data, the molecular weight and length values to be reported are calculated by the lines:

    len   = ajSeqLen(seq);
    molwt = embPropCalcMolwt(ajSeqChar(seq),0,len-1);

Note that the code uses len and molwt for both the variable names and for the column names. This is not necessary but does make the code more clear.

The feature table object is then instantiated. Only one feature table is required per report:

    ftable = ajFeattableNewSeq(seq);

The sequence object is passed as a parameter so that the sequence name can be automatically loaded into the internal GFF format.

Next, some head and tail information is loaded into the report object. This could have been done any time after the ajAcdGetReport but before the report is printed.
They are optional but recommended. The temporary string object is used for this.

    ajFmtPrintS(&tmpstr,"This is some Header Text");
    ajReportSetHeader(report, tmpstr);
    ajFmtPrintS(&tmpstr,"This is some Tail Text");
    ajReportSetTail(report, tmpstr);

A feature object can now be created into which the column values can be loaded:

    feat = ajFeatNewII(ftable,1,len);

This requires three parameters. The first is the feature table object; the last two are the sequence start and end positions. Rather naughtily they are hard coded in this example (as mentioned above). A feature object has to be created per line of output in the final report.

Now, the molecular weight and length values are loaded into the feature object:

    ajFmtPrintS(&tmpstr,"*molwt %.1f", (float)molwt);
    ajFeatTagAdd(feat,NULL,tmpstr);
    ajFmtPrintS(&tmpstr,"*len %d", len);
    ajFeatTagAdd(feat,NULL,tmpstr);

This is the bit where the ACD column names (molwt and len) must match the ones in the ajFmtPrintS calls. Within the C program these names must be preceded by an asterisk! The datatypes specified in the ACD file (float and int) must also match what's given in the C code. The NULL parameter just means that only a value is being added (the library will add the /note tag automatically).

All that remains is to print out the report.

    ajReportWrite(report,ftable,seq);

Three objects are passed. The sequence object is passed in case you choose a report format that prints out the (sub)sequence used.

Finally, the dynamic memory is recovered in a clean-up.

    ajFeattableDel(&ftable);

Talk 10 - Objects, Pointers and Memory Management
Background information for Practical 10

No course in EMBOSS programming would be complete without a treatment of pointers. Here you will get a clear explanation of the use of pointers with particular reference to the management of memory for objects in EMBOSS. This coverage of pointers was inspired by Chuck Allison's 'Code Capsules' (find it on the web!)
Pointer basics
Pointers are one of the most feared aspects of C and their misuse leads to more problems than any other part of the language. That is not to say that pointers are the problem; it's just that many programmers aren't ready for them yet. With a proper understanding of the underlying principles, it is easier than you imagine to get to grips with all aspects of using pointers and their specific implementation in EMBOSS. To become good at EMBOSS programming you have to master at least the basics of pointers. The trick with pointers is that they're easy to use, so long as you understand the principles, so ...

The very first thing to understand is that, with the exception of register variables, every variable you declare in your program resides somewhere in memory; that "somewhere" is the memory address of the variable. So, when this line of the program is executed ...

    ajint x=0;

... sufficient memory to hold an integer (usually 4 bytes) will be reserved for use by our program. The value of those 4 bytes is set to zero.

To find the memory address of our variable, we use the & (address) operator. To get to the value held at a particular memory address, we use the * (pointer) operator (this is called dereferencing the pointer or getting a value by indirection).

If you're O.K. with the idea that 'x is an integer' then in the same way understand that 'a pointer is a memory address'. Spelling that out ...

A POINTER IS MERELY A VARIABLE WHICH HOLDS A MEMORY ADDRESS

If you don't overcomplicate the above idea, you've already gone a long way to understanding pointers.

Example pointer code
Consider the following:

    #include <stdio.h>
    #include "emboss.h"

    int main(void)
    {
              ajint x=0;
        /*1*/ printf("Value of x                : %d\n", x);
        /*2*/ printf("Memory address of x       : %p\n", (void *) &x);
        /*3*/ printf("Value of x by indirection : %d\n", *(&x));
              return 0;
    }

    /* Output:
    Value of x                : 0
    Memory address of x       : #1
    Value of x by indirection : 0

    (In reality, a hexadecimal number would be printed instead of '#1',
    but '#1' is easier to follow).
    */

The variable name x is our handle on the reserved memory; it is used to refer to an integer value that happens to live at memory address #1. We usually say that "x is an integer" or "x holds an integer" rather than the more accurate and cumbersome "x is a variable name referring to a reserved area of memory of sufficient size to hold an integer". In the code:
In practice, a pointer holds the memory address of a specific data object such as an integer, C data structure or even another pointer. You have to specify the type of data pointed at when you declare your pointer. This is not because the memory address of an integer is any different to that of a float; it's so that the compiler knows how the pointer can be used in the source code. For example, the compiler must know the type of data pointed at to be able to print a value by indirection.

Pointers to pointers
So how do we declare a pointer variable? Easy ...

    ajint *ptr=NULL;
The * means that ptr is a memory address and the ajint tells us that it's the address of an integer. When that line of the program is executed, sufficient memory to hold a memory address will be reserved for use by our program. On a 32-bit system this, like an integer, is normally 4 bytes (on a 64-bit system a pointer is normally 8 bytes). The value of these bytes is set to NULL.

It's important to know that it's only in the context of a variable declaration that *ptr=NULL means "set the value of the pointer to NULL". If *ptr=NULL was found elsewhere in the program it would mean "set the value held at memory address ptr to NULL".

The final thing to mention is that we've used NULL for the pointer and 0 for the integer in the declarations. Both say "nothing here yet", but they should not be used interchangeably: NULL is a null pointer constant, and most compilers will complain if you try to assign it to an integer! You can see that in the code below:

    #include <stdio.h>
    #include "emboss.h"

    int main(void)
    {
        /*1*/ ajint x=0;
        /*1*/ ajint *ptr=NULL;

        /*2*/ printf("Value of x : %d\n", x);
        /*3*/ ptr = &x;
        /*4*/ *ptr=5;
        /*2*/ printf("Value of x : %d\n", x);
              return 0;
    }

    /* Output:
    Value of x : 0
    Value of x : 5
    */

In the code:
In the above example, you would normally say that "ptr holds the address of x" or simply "ptr points to x".

It was mentioned above that a pointer can hold the memory address of another pointer. This is obvious when you think that a pointer, like any variable, resides somewhere in memory. So if a pointer that holds the memory address of an integer is a 'pointer to an int', then a pointer that holds the memory address of another pointer is, of course, 'a pointer to a pointer'.

This bit of code shows how we declare a pointer to a pointer-to-an-int:

    ajint **ptrto=NULL;

The second * means that ptrto is a memory address. The ajint * tells us that it's the address of a pointer-to-an-integer. When the code is executed, enough memory to hold an address is reserved for our use and the value of the bytes is set to NULL.

Of course, the & (address) and the * (pointer) operators still work with pointers to pointers. Where you have multiple levels of pointers you can use multiple * (pointer) operators for dereferencing. *ptrto would dereference once and retrieve an address (a pointer to an integer). **ptrto would dereference twice and retrieve an integer. You can see this in the code below:

    #include <stdio.h>
    #include "emboss.h"

    int main(void)
    {
        /*1*/ ajint x=0;            /* an integer                           */
        /*1*/ ajint *ptr=NULL;      /* a pointer to an integer              */
        /*1*/ ajint **ptrto=NULL;   /* a pointer to a pointer-to-an-integer */

        /*2*/ printf("Address of x     : %p\n", (void *) &x);
        /*2*/ printf("Address of ptr   : %p\n", (void *) &ptr);
        /*2*/ printf("Address of ptrto : %p\n", (void *) &ptrto);

        /*3*/ ptr   = &x;
        /*3*/ ptrto = &ptr;

        /*4*/ printf("Value of x     : %d\n", x);
        /*4*/ printf("Value of ptr   : %p\n", (void *) ptr);
        /*4*/ printf("Value of ptrto : %p\n", (void *) ptrto);

        /*5*/ printf("Value of x by dereferencing ptr   : %d\n", *ptr);
        /*5*/ printf("Value of x by dereferencing ptrto : %d\n", **ptrto);
              return 0;
    }

    /* Output:
    Address of x     : #1
    Address of ptr   : #2
    Address of ptrto : #3
    Value of x     : 0
    Value of ptr   : #1    (i.e. the address of x)
    Value of ptrto : #2    (i.e. the address of ptr)
    Value of x by dereferencing ptr   : 0
    Value of x by dereferencing ptrto : 0
    */

There are no new concepts in the above code; it's merely an extension of what you already know about pointers. In the code we:
You already know what *ptr means. Further on you see we dereference ptrto twice, which is what you've got to do if you want to get to the integer from it. The first time you dereference ptrto you get to ptr; the second time you are effectively dereferencing ptr, which takes you to x. Simple!

This, and in fact every problem in pointers, is very easily understood if you sketch what's happening on a piece of paper. If you've got this far with your head intact then you're closer than you think to having the basics of pointers licked, and you've got more or less all you need to master objects, pointers and memory management in EMBOSS.

Object definition
Consider the following object definition:

    /* @data AjPPdbtosp *******************************************************
    **
    ** Ajax Pdbtosp object.
    **
    ** Holds swissprot codes and accession numbers for a PDB code.
    **
    ** Pdb is the pdb code.
    ** n is the number of Acc / Spr pairs for this pdb code.
    ** Acc is the accession number
    ** Spr is the swissprot code
    **
    ** AjPPdbtosp is implemented as a pointer to a C data structure.
    **
    ** @alias AjSPdbtosp
    ** @alias AjOPdbtosp
    **
    ** @@
    ******************************************************************************/

    typedef struct AjSPdbtosp
    {
        AjPStr  Pdb;
        ajint   n;
        AjPStr *Acc;
        AjPStr *Spr;
    } AjOPdbtosp, *AjPPdbtosp;

Note how the structure is nicely documented - your object definitions should do the same! There is nothing new here other than Acc and Spr, which are both pointers to AjPStr objects. As an AjPStr is itself a pointer (to the AjOStr object proper) you can see that we're dealing with pointers to pointers. In this case, Acc and Spr are going to be used to create two arrays of strings, as we can see in the constructor function below:

Constructor function

    /* @func ajXyzPdbtospNew ***********************************************************
    **
    ** Pdbtosp object constructor. Fore-knowledge of the number of entries is
    ** required. This is normally called by the ajXyzPdbtospReadC / ajXyzPdbtospRead
    ** functions.
    **
    ** @param [r] n [ajint] Number of entries
    **
    ** @return [AjPPdbtosp] Pointer to a Pdbtosp object
    ** @@
    ******************************************************************************/

    AjPPdbtosp ajXyzPdbtospNew(ajint n)      /*1*/
    {
        AjPPdbtosp ret = NULL;               /*2*/
        ajint i=0;

        AJNEW0(ret);                         /*3*/

        ret->Pdb = ajStrNew();               /*5*/

        if(n)
        {
            AJCNEW0(ret->Acc,n);             /*4*/
            AJCNEW0(ret->Spr,n);             /*4*/
            for(i=0; i

We'll go through this line by line.
Destructor function

    /* @func ajXyzPdbtospDel ***********************************************************
    **
    ** Destructor for Pdbtosp object.
    **
    ** @param [w] thys [AjPPdbtosp*] Pdbtosp object pointer
    **
    ** @return [void]
    ** @@
    ******************************************************************************/

    void ajXyzPdbtospDel(AjPPdbtosp *thys)
    {
        AjPPdbtosp pthis = NULL;
        ajint i;

        if(!thys)
            return;

        pthis = *thys;

        if(!pthis)
            return;

        ajStrDel(&pthis->Pdb);

        if(pthis->n)
        {
            for(i=0; i< pthis->n; i++)
            {
                ajStrDel(&pthis->Acc[i]);
                ajStrDel(&pthis->Spr[i]);
            }
            AJFREE(pthis->Acc);
            AJFREE(pthis->Spr);
        }

        AJFREE(pthis);
        (*thys)=NULL;

        return;
    }

It is your task in Practical 10 to figure out exactly what is going on in this destructor function. If you can do that, then you can be happy that you are on your way to becoming adept at objects, pointers and memory management in EMBOSS.

Suffice it to say that this function safely clears up all of the memory that was allocated by the constructor. This is achieved by calling appropriate destructor functions and by using AJFREE. AJFREE will free the memory pointed to by its argument. It is used in three places: twice to free the arrays and once to free the Pdbtosp object itself. Note! AJFREE as used here will free the arrays but will not free the string objects proper that are pointed to (this is the job of the ajStrDel calls in the preceding code).

The function also sets the object that was passed in to NULL. This is a requirement of all destructor functions for reasons explained in Talk 4.

Calling constructor and destructor functions
Following is a code snippet illustrating how the object and constructor and destructor functions could be used.
You'll notice they're used in just the same way as you've been managing memory for strings.

    #include "emboss.h"

    int main(void)
    {
        AjPPdbtosp ptr=NULL;

        ptr = ajXyzPdbtospNew(10);
        ajXyzPdbtospDel(&ptr);

        /* ptr will have been reset to NULL now, and is ready for reuse */
        ptr = ajXyzPdbtospNew(10);
        ajXyzPdbtospDel(&ptr);

        return 0;
    }

EMBOSS memory allocation macros
The final thing is to give a summary of the EMBOSS memory allocation macros:
For non-C programmers: "malloc" allocates memory but leaves its contents undefined, whereas "calloc" allocates memory and sets every byte to zero.

Talk 11 - The NUCLEUS Library
Background information for Practical 11

The NUCLEUS library provides higher-level functions and algorithms, mostly for molecular sequence analysis, including sequence comparisons, translation, codon usage and annotation. See the NUCLEUS Library Documentation for the documentation of individual functions and datatypes. The available NUCLEUS libraries are listed below.
In contrast to the applications and the AJAX library, NUCLEUS is less well developed and less well documented. In future code refactoring, some of the libraries may be merged together. In several cases, algorithms and data structures that you would expect to find in NUCLEUS are in fact kept in AJAX. An example of this is the domain handling code, most of which is kept in ajdomain.c/h rather than embdomain.c/h. The reason is a compilation dependency: the ACD file-handling code (which is part of AJAX) must call these functions, which therefore must live in AJAX. This may change in a future code refactoring exercise.
Last modified on
2005 by Jon Ison. |