The EMBOSS package consists of a large number of separate programs, each with a specific function. They usually take one or more input files and some parameters that are important to the function, and produce output in the form of files, plots, web pages or simple text output.
The programs can be invoked in many ways. A program's name could be entered on the command line with all parameters, so that the program has all the information it needs at once. A more interactive way is a query-answer session, in which the user is asked to enter one piece of information at a time. A third way could be a web interface where a user chooses the options for the program using lists, checkboxes, radio buttons etc. In EMBOSS, the way a program interacts with the user, its interface, is independent of the actual program.
At the moment, EMBOSS programs are called by giving their name on the UNIX command line either with or without parameters. Many parameters can have qualifiers that will give more information about a parameter. For instance, the format of the information in a sequence file that is used as an input file could be specified on the command line, like:
% seqret filename.seq -sformat fasta
In this example the EMBOSS program 'seqret' is called with the filename 'filename.seq' as its first parameter. '-sformat fasta' indicates that the sequence file is in 'fasta' format. A complete description of the command line syntax will follow in section 2 Formal Description of the ACD language. The percentage sign '%' indicates that the command was entered on the UNIX command line. This will be used throughout the documentation.
Every EMBOSS program will be accompanied by a so-called ACD (Ajax Command Definitions) file, which describes the parameters that the program needs. It contains information about its input and output files and other parameters the program may need. It will indicate whether any of the parameters are mandatory (like an input sequence file) or whether certain parameters must lie within certain limits (a gap penalty for an alignment must be higher than 0, for instance). It can also indicate whether one parameter's value is dependent on the value or the presence of another. (An example: if the input sequence for an alignment program is DNA, it should not accept a protein comparison matrix.)
The parameters are defined in a special purpose language called Ajax Command Definitions or ACD, specially designed for EMBOSS. It will specify everything that can appear on the command line or can be used in another interface like web pages. It is a very 'forgiving' language in that it does not restrict the available syntax any more than is strictly necessary.
ACD files are simple text files that contain the definitions. The files usually have the same prefix as the program, but this is not required. ACD files use the extension '.acd'. This is mandatory.
Formalised:
token: token [ definition ]
is equivalent to
token=token [
definition
]
The first token in the file must be "application" directly followed by a colon ':' or an equal sign '='. The second token is the application name with which this ACD file is associated. The application name is followed by (required) application attributes enclosed in square brackets.
Formalised:
application: appname [
attributes
]
Example:
application: wossname [
documentation: "Finds programs by keywords"
groups: "Display"
]
The first token of a parameter definition is an Ajax datatype, directly followed by a colon ':' (preferred) or equal sign '='. The second token is the name by which this parameter is going to be known (this is also the name that is used by the EMBOSS program to get the value of the parameter). After the name, definitions are in mandatory square brackets, [], which can make a definition span multiple lines.
Formalised:
datatype: parametername [
definition
]
Example:
sequence: asequence [
standard: "Y"
]
Tokens representing data types can be abbreviated up to the point where they are not ambiguous. For example, default: can be abbreviated to def: or even d: although the latter is not recommended due to lack of clarity.
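The abbreviation rule can be illustrated with a short Python sketch. This is not the EMBOSS implementation: the function name resolve_abbreviation and the small vocabulary in the usage example are hypothetical, and the sketch assumes that an exact match always wins and that an abbreviation is a leading part of the full token.

```python
def resolve_abbreviation(token, known):
    """Expand token to the single name in known that it abbreviates.

    Raises ValueError when the token matches nothing or is ambiguous.
    """
    if token in known:  # assumption: an exact match always wins
        return token
    matches = [name for name in known if name.startswith(token)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"unknown token: {token!r}")
    raise ValueError(f"ambiguous token {token!r}: could be {sorted(matches)}")


# Hypothetical vocabulary, chosen only for illustration.
KNOWN = ["default", "minimum", "maximum", "information"]

print(resolve_abbreviation("d", KNOWN))    # accepted, but hard to read
print(resolve_abbreviation("min", KNOWN))  # unambiguous among KNOWN
```

With this vocabulary "m" would be rejected, because it could mean either minimum or maximum.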
Values can be delimited (i.e. treated as one token) by double quotes.
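To make the general shape of a definition concrete, the following Python sketch reads definitions of the form "token: name [ attributes ]", accepting either ':' or '=' and treating a double-quoted value as one token. It is a deliberately small illustration, not the ACD parser used by EMBOSS: the function name parse_acd is hypothetical, and the sketch assumes that no value contains a ']' character.

```python
import re

# One definition: "<token>: <name> [ <attributes> ]", where ':' may also be '='.
DEFINITION = re.compile(r'(\w+)\s*[:=]\s*(\w+)\s*\[(.*?)\]', re.DOTALL)
# One attribute: "<name>: <value>", the value being double-quoted or a bare token.
ATTRIBUTE = re.compile(r'(\w+)\s*[:=]\s*("([^"]*)"|\S+)')


def parse_acd(text):
    """Return a list of (token, name, attributes) tuples found in text."""
    result = []
    for token, name, body in DEFINITION.findall(text):
        attributes = {}
        for attr, raw, quoted in ATTRIBUTE.findall(body):
            # A quoted value counts as one token; the quotes are dropped.
            attributes[attr] = quoted if raw.startswith('"') else raw
        result.append((token, name, attributes))
    return result


example = '''
application: wossname [
  documentation: "Finds programs by keywords"
  groups: "Display"
]
sequence=asequence [
  standard: "Y"
]
'''

for token, name, attributes in parse_acd(example):
    print(token, name, attributes)
```

Both definitions are read the same way, whether the first token is followed by ':' or by '='.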
The first token of an ACD file must be the application: token, followed by the application name. The application name and the ACD filename (without the .acd extension) are usually identical, but this is not mandatory. When a program calls the embInit("program") function with "program" as its parameter, the function will only look for an ACD file called program.acd. It will not compare the parameter with the string given after the application: token.
The application: token has a documentation: attribute which is followed by a string describing the function of the program. This documentation string will be used to generate the description of the program when the program is run or the user specifies the -help qualifier. When the documentation: attribute is missing, a warning will be issued.
Formalised:
application: appname [
documentation: string
]
Example:
ACD file definition (partly):
application: seqret [
documentation: "Reads and writes (returns) a sequence"
]
Command line:
% seqret
Reads and writes (returns) a sequence
Input sequence :
The ACD file starts with the definition of the program seqret. The documentation: attribute is followed by a string briefly explaining the function of the program and this string is shown after the program is invoked and before it prompts the user for any input. The documentation: string is also searched by the wossname utility, which finds applications by keyword (in the doc string) and group.
The length of the documentation: string should be kept to 63 characters or shorter in order to allow the wossname utility to display each program name and its documentation on one 80-character line.
The documentation: string should not end with a '.' character.
Any acronyms or capitalised abbreviations in the documentation: string should be written in upper case. (e.g.: SNPs, EST, DNA, ABI, SRS, ASCII, CDS, mRNA, B-DNA, RNA, CpG, ORFs, MAR/SAR, PCR, STS, REBASE, SCOP, PROSITE, PRINTS, EMBL, TRANSFAC, AAINDEX, BLAST, GCG, EMBOSS)
The documentation: string should start with an upper-case letter.
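The mechanical rules for the documentation: string (length, final character, initial capital) can be checked automatically. The Python sketch below is one way to do so; the function name check_documentation is hypothetical, and the rule about acronyms is not checked because it needs a word list.

```python
def check_documentation(doc, max_length=63):
    """Return a list of problems found in a documentation: string."""
    problems = []
    if len(doc) > max_length:
        problems.append(f"longer than {max_length} characters ({len(doc)})")
    if doc.endswith("."):
        problems.append("ends with '.'")
    if not doc[:1].isupper():
        problems.append("does not start with an upper-case letter")
    return problems


print(check_documentation("Reads and writes (returns) a sequence"))
print(check_documentation("finds programs by keywords."))
```

An empty list means that the string passes all three checks.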
The groups: attribute allows the EMBOSS programs to be grouped together based on their functionality. The groups: attribute is followed by a string value, containing the name(s) of the group(s). When an application belongs to more than one group, the group names must be separated by either a comma (,) or semi-colon (;); i.e. a group name is not a token, but a list of tokens.
The groups: string is also searched by the wossname utility, which finds applications by keyword (in the doc string) and group.
Formalised:
application: appname [
groups: "group name1, group2, ... "
]
Example: ACD file definition (partly):
application: seqret [
groups: "Display"
]
Group names can have spaces in them.
The group names can be split into sub-levels by
the use of a ':' character:
First Level : Second Level
Several third-party interfaces are starting to rely upon there being a maximum
of 2 levels, so do not use more than one ':' in a group name.
The group name is now checked against a list of accepted values in the file groups.standard which is defined and installed in the same directory as the ACD files. This file contains one line for each known group, with subgroups defined with a ":" delimiter, and spaces replaced by underscores. Each group also has a short description.
The table in the following section lists all groups currently defined.
The First and Second level group names are given below with some explanation of what might be expected to be placed in the group.
If a group is composed of two levels, such as
Alignment : Consensus
then the group specification must not use the group names singly, (i.e. you
must not use "Alignment" or "Consensus").
If the group consists of only one level, such as
Display
then please don't start adding sub-levels to
it. (i.e. you must not use "Display : Features")
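The group rules above can be summarised in a small Python sketch. It is only an illustration: the names KNOWN_GROUPS, normalise and check_groups are hypothetical, the set of known groups is a short hand-written excerpt rather than the contents of groups.standard, and exact, case-sensitive matching is an assumption.

```python
import re

# Hypothetical excerpt of standard groups, written as in groups.standard:
# sub-levels joined by ':' and spaces replaced by underscores.
KNOWN_GROUPS = {"Display", "Alignment:Consensus", "Alignment:Global", "Enzyme_Kinetics"}


def normalise(group):
    """Turn 'First Level : Second Level' into 'First_Level:Second_Level'."""
    levels = [level.strip().replace(" ", "_") for level in group.split(":")]
    return ":".join(levels)


def check_groups(value, known=KNOWN_GROUPS):
    """Return (group, problem) pairs for the value of a groups: attribute."""
    problems = []
    for group in re.split(r"[,;]", value):  # names are separated by ',' or ';'
        group = group.strip()
        if not group:
            continue
        if group.count(":") > 1:
            problems.append((group, "more than two levels"))
        elif normalise(group) not in known:
            problems.append((group, "not a standard group"))
    return problems


print(check_groups("Alignment : Consensus, Display"))
print(check_groups("Alignment; Display : Features"))
```

Because the full name is looked up as a whole, a two-level group used singly and a one-level group given a sub-level are both reported.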
You are strongly encouraged to use the following groups structure. This is the set of groups defined by the groups.standard file. We have found that most things will fit in one or more of these groups. When, however, a completely new category of program is written, please discuss the creation of the new group name with the developers' mailing list. Sometimes a new group is required (for example the group "Enzyme Kinetics" which had to be created to hold 'findkm').
| Top Level | Second Level | Description |
| Acd | | ACD file utilities |
| Alignment | Consensus | Merging sequences to make a consensus |
| Alignment | Differences | Finding differences between sequences |
| Alignment | Dot_plots | Dot plot sequence comparisons |
| Alignment | Global | Global sequence alignment |
| Alignment | Local | Local sequence alignment |
| Alignment | Multiple | Multiple sequence alignment |
| Assembly | Fragment_assembly | DNA sequence assembly |
| Display | | Publication-quality display |
| Edit | | Sequence editing |
| Enzyme_Kinetics | | Enzyme kinetics calculations |
| Feature_tables | | Manipulation and display of sequence annotation |
| HMM | | Hidden Markov Model analysis |
| Information | | Information and general help for users |
| Menus | | Menu interface(s) |
| Nucleic | 2D_structure | Nucleic acid secondary structure |
| Nucleic | Codon_usage | Codon usage analysis |
| Nucleic | Composition | Composition of nucleotide sequences |
| Nucleic | CpG_islands | CpG island detection and analysis |
| Nucleic | Gene_finding | Predictions of genes and other genomic features |
| Nucleic | Motifs | Nucleic acid motif searches |
| Nucleic | Mutation | Nucleic acid sequence mutation |
| Nucleic | Profiles | Nucleic acid profile generation and searching |
| Nucleic | Primers | Primer prediction |
| Nucleic | Repeats | Nucleic acid repeat detection |
| Nucleic | RNA_folding | RNA folding methods and analysis |
| Nucleic | Restriction | Restriction enzyme sites in nucleotide sequences |
| Nucleic | Transcription | Transcription factors, promoters and terminator prediction |
| Nucleic | Translation | Translation of nucleotide sequence to protein sequence |
| Phylogeny | Consensus | Phylogenetic consensus methods |
| Phylogeny | Continuous_characters | Phylogenetic continuous character methods |
| Phylogeny | Discrete_characters | Phylogenetic discrete character methods |
| Phylogeny | Distance_matrix | Phylogenetic distance matrix methods |
| Phylogeny | Gene_frequencies | Phylogenetic gene frequency methods |
| Phylogeny | Molecular_sequence | Phylogenetic molecular sequence methods |
| Phylogeny | Tree_drawing | Phylogenetic tree drawing methods |
| Phylogeny | Misc | Phylogenetic other tools |
| Protein | 2D_structure | Protein secondary structure |
| Protein | 3D_structure | Protein tertiary structure |
| Protein | Composition | Composition of protein sequences |
| Protein | Motifs | Protein motif searches |
| Protein | Mutation | Protein sequence mutation |
| Protein | Profiles | Protein profile generation and searching |
| Test | | Testing tools, not for general use. |
| Utils | Database_creation | Database installation |
| Utils | Database_indexing | Database indexing |
| Utils | Misc | Utility tools |
Table 1. Standard application groups
ACD files describe the parameters that a program needs, in an object-oriented manner. The most important types or objects are file objects, sequence objects, number objects, Boolean objects and string objects. The current objects are listed in Table 2.
| Data type / Object | Description | Calculated Attributes | Specific Attributes | Command Line Qualifiers |
| All data types | | | | |
| | All data types | | additional: "N" | |
| Simple types | | | | |
| array | List of floating point numbers | | minimum: (-FLT_MAX) | |
| boolean | Boolean value Yes/No | | | |
| float | Floating point number | | minimum: (-FLT_MAX) | |
| integer | Integer | | minimum: (INT_MIN) | |
| range | Sequence range | | minimum: 1 | |
| string | String value | length (integer) | minlength: 0 | |
| toggle | Toggle value Yes/No | | | |
| Input types | | | | |
| codon | Codon usage file in EMBOSS data path | | name: "Ehum.cut" | format: "" |
| cpdb | Clean PDB file | | nullok: N | format: "" |
| datafile | Data file | | name: "" | |
| directory | Directory | | fullpath: N | |
| dirlist | Directory with files | | fullpath: N | |
| discretestates | Discrete states file | | length: 0 | |
| distances | Distance matrix | distancecount (integer) | size: 1 | |
| features | Readable feature table | fbegin (integer) | type: "" | fformat: "" |
| filelist | Comma-separated file list | | nullok: N | |
| frequencies | Frequency value(s) | freqlength (integer) | length: 0 | |
| infile | Input file | | nullok: N | |
| matrix | Comparison matrix file in EMBOSS data path | | pname: "EBLOSUM62" | |
| matrixf | Comparison matrix file in EMBOSS data path | | pname: "EBLOSUM62" | |
| pattern | Property value(s) | | minlength: 1 | pformat: "" |
| properties | Property value(s) | propertylength (integer) | length: 0 | |
| regexp | Regular expression pattern | length (integer) | minlength: 1 | pformat: "" |
| scop | Clean PDB file | | nullok: N | format: "" |
| sequence | Readable sequence | begin (integer) | type: "" | sbegin: "0" |
| seqall | Readable sequence(s) | begin (integer) | type: "" | sbegin: "0" |
| seqset | Readable set of sequences | begin (integer) | type: "" | sbegin: "0" |
| seqsetall | Readable sets of sequences | begin (integer) | type: "" | sbegin: "0" |
| tree | Phylogenetic tree | treecount (integer) | size: 0 | |
| Selection lists types | | | | |
| list | Choose from menu list of values | | minimum: 1 | |
| selection | Choose from selection list of values | | minimum: 1 | |
| Output types | | | | |
| align | Alignment output file | | type: "" | aformat: "" |
| featout | Writeable feature table | | name: "" | offormat: "" |
| outcodon | Codon usage file | | name: "" | odirectory: "" |
| outcpdb | Cleaned PDB file | | nulldefault: N | |
| outdata | Formatted output file | | type: "" | odirectory: "" |
| outdir | Output directory | | fullpath: N | |
| outdiscrete | Discrete states file | | nulldefault: N | odirectory: "" |
| outdistance | Distance matrix | | nulldefault: N | |
| outfile | Output file | | name: "" | odirectory: "" |
| outfileall | Multiple output files | | name: "" | odirectory: "" |
| outfreq | Frequency value(s) | | nulldefault: N | odirectory: "" |
| outmatrix | Comparison matrix file | | nulldefault: N | odirectory: "" |
| outmatrixf | Comparison matrix file | | nulldefault: N | odirectory: "" |
| outproperties | Property value(s) | | nulldefault: N | odirectory: "" |
| outscop | Scop entry | | nulldefault: N | odirectory: "" |
| outtree | Phylogenetic tree | | name: "" | odirectory: "" |
| report | Report output file | | type: "" | rformat: "" |
| seqout | Writeable sequence | | name: "" | osformat: "" |
| seqoutall | Writeable sequence(s) | | name: "" | osformat: "" |
| seqoutset | Writeable sequences | | name: "" | osformat: "" |
| Graphics types | | | | |
| graph | Graph device for a general graph | | nulldefault: N | gprompt: "N" |
| xygraph | Graph device for a 2D graph | | multiple: 1 | gprompt: "N" |
Table 2. Available Data Types/Objects in ACD.
Array parameters are lists of numbers, either integer or floating point. The ACD attributes control validation, for example the number of values, or a list of numbers that adds to a given total. The data value is a list of numbers separated by spaces or commas.
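As an illustration of the array data value and its validation, the Python sketch below splits a value on spaces or commas and optionally checks the number of values and their total. The function name parse_array and its parameters size, total and tolerance are names chosen for this sketch; they are not presented as the ACD attribute names.

```python
import math
import re


def parse_array(value, size=None, total=None, tolerance=1e-6):
    """Parse a space- or comma-separated list of numbers and validate it."""
    numbers = [float(item) for item in re.split(r"[,\s]+", value.strip()) if item]
    if size is not None and len(numbers) != size:
        raise ValueError(f"expected {size} values, found {len(numbers)}")
    if total is not None and not math.isclose(sum(numbers), total, abs_tol=tolerance):
        raise ValueError(f"values add up to {sum(numbers)}, expected {total}")
    return numbers


print(parse_array("0.25, 0.25 0.5", size=3, total=1.0))
```

A tolerance is used for the total because floating point sums are rarely exact.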
Boolean parameters are simple switches. If they are entered on the command line the value will be Y (True), if they are absent from the command line the value will be the default value. The name can also be prefixed by 'no' to force the value to be N (False). This is needed if the default value is Y (True). The data value is Y for yes and N for no.
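The three cases (present, prefixed by 'no', absent) can be written down as a short Python sketch. The function name boolean_value and the qualifier name "verbose" are hypothetical, and the sketch only looks for the qualifier in a list of command line words; it does not reproduce the full EMBOSS command line processing.

```python
def boolean_value(name, default, args):
    """Return the value of boolean qualifier name for the command line args."""
    if f"-{name}" in args:    # qualifier present: Y (True)
        return True
    if f"-no{name}" in args:  # 'no' prefix: N (False)
        return False
    return default            # absent: the default value applies


print(boolean_value("verbose", False, ["-verbose"]))
print(boolean_value("verbose", True, ["-noverbose"]))
print(boolean_value("verbose", True, []))
```

The second call shows why the 'no' prefix is needed: it is the only way to obtain N when the default is Y.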
The integer data type can hold simple integer values. The value range can be controlled by minimum and maximum values (a minimum value of 0 or 1 is often useful).
The float data type can hold simple floating point values. The value range can be controlled by minimum and maximum ACD attributes (a minimum value of 0.0 is often useful).
The range data type holds ranges of sequence positions. Originally defined as a simple list of paired numbers, ranges can now be specified in files with the range syntax "@filename", as pairs of numbers with text comments. For example:
# this is my set of ranges
12 23
4 5 this is like 12-23, but smaller
67 10348 interesting region
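A range file of this kind could be read as in the Python sketch below. It assumes one range per line, with '#' starting a comment line and any text after the two numbers treated as a comment; the function name parse_ranges is hypothetical and this is not the EMBOSS range reader.

```python
def parse_ranges(text):
    """Return (start, end, comment) tuples from the text of a range file."""
    ranges = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split(None, 2)
        if len(fields) < 2:
            raise ValueError(f"expected two positions in line: {line!r}")
        start, end = int(fields[0]), int(fields[1])
        comment = fields[2] if len(fields) > 2 else ""
        ranges.append((start, end, comment))
    return ranges


example = """# this is my set of ranges
12 23
4 5 this is like 12-23, but smaller
67 10348 interesting region
"""

for start, end, comment in parse_ranges(example):
    print(start, end, comment)
```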
The string data type holds any string value. The length can be controlled by ACD attributes, and a regular expression pattern can provide more general validation if necessary. Most string values are free text, although strings can be used by a program for any input that is not covered by a defined ACD type.
Toggle parameters are simple switches, and work in the same way as "boolean" parameters. Toggle parameters are intended for use in turning on/off other parameters. When ACD parameters are grouped in sections, a clean ACD file will have all the "required" parameters in the "required" section and all the "additional" parameters in the "additional" section. Some of these will have calculated values for the "standard" and "additional" attributes, controlled by the value of another parameter. The "toggle" parameters are designed to be used in these calculated values, and can be in the "required" section even if not themselves defined as "standard".
Exactly like "boolean" parameters, if they are entered on the command line the value will be Y (True), if they are absent from the command line the value will be the default value. The name can also be prefixed by 'no' to force the value to be N (False). This is needed if the default value is Y (True). The data value is Y for yes and N for no.
Codon usage tables are simple files read from the EMBOSS data search path, and are distributed in the emboss/data directory.
Codon usage files can be read in several formats, including "gcg".
Cpdb (Cleaned PDB) files are simple input files in CPDB format. See the documentation for pdbparse, part of the EMBASSY domainatrix package, which generates CPDB files from PDB file input.
Datafile input refers to a formatted data file to be read from the standard EMBOSS data file locations (see the EMBOSS Administrator's guide for full details).
EMBOSS looks for data files in the local/share/EMBOSS/data directory, or in various user directories.
Most data files are already defined as their own ACD types: matrix, matrixf, codon. Others are hard-coded file names that do not need their own ACD definition, although users are free to define their own file with the appropriate name to override the default file provided.
Directory defines a directory that can be used for input or output definitions.
Directory is intended for future use to replace string definitions of directory names in some applications, and to provide additional validation of the user input specific to directory specifications.
Dirlist defines a set (list) of directories that can be used for input or output definitions.
Dirlist is intended for future use to replace string definitions of directory names in some applications, and to provide additional validation of the user input specific to directory specifications.
Discretestates is a new ACD type implemented specifically for the "phylipnew" EMBASSY package. Discretestates input is used by the phylip "discrete character" applications. By defining a specific ACD type EMBOSS can provide detailed type checking, and can automatically detect and validate the various alternative formats that phylip supports without the need for complex extra command line options.
Distances is a new ACD type implemented specifically for the "phylipnew" EMBASSY package. Distances input is used by the phylip "distance matrix" applications. By defining a specific ACD type EMBOSS can provide detailed type checking, and can automatically detect and validate the various alternative formats that phylip supports without the need for complex extra command line options.
Feature annotation in any known feature format. Features can also be read from a sequence and written with a sequence.
Filelist defines a set (list) of input files.
Filelist is intended for future use to replace string definitions of input file names in some applications, and to provide additional validation of the user input specific to multiple input files.
Frequencies is a new ACD type implemented specifically for the "phylipnew" EMBASSY package. Frequencies input is used by the phylip "gene frequency and continuous character" applications. By defining a specific ACD type EMBOSS can provide detailed type checking, and can automatically detect and validate the various alternative formats that phylip supports without the need for complex extra command line options.
Non-sequence-related data file. This data type refers to files that are to be used in the program and usually do not contain sequence data. The type of data can be identified by a "knowntype" attribute and matched to Outfile standard types, or to report, align, featout, or seqout formats.
Comparison matrix files are used by many programs. They are data files read from the EMBOSS data search path, and are distributed in the emboss/data directory. For preference, we use the matrix files distributed with BLAST.
Integer matrices are usually faster and are preferred by most applications. Floating-point matrix files are also available if needed, and an integer matrix file can of course also be read as floating point.
The matrix data type has an attribute to force selection of a nucleic acid or protein comparison matrix. In ACD files, the type of the input sequence is often used here.
Remember that any application which uses gap penalties will need to set them separately for each matrix.
Floating point comparison matrices are required by some algorithms. An integer matrix file can of course be used equally well as a floating point matrix.
Pattern definitions files allow multiple search patterns to be described, each with a name.
Pattern files are used for PROSITE syntax sequence patterns. The same syntax is used for "regexp" input. Pattern files also allow mismatch values to be defined for each pattern, and a "-pmismatch" qualifier sets the mismatch default for all patterns in the file. Mismatches are not appropriate for regular expression matches.
Properties is a new ACD type implemented specifically for the "phylipnew" EMBASSY package. Properties input is used by the phylip applications to define weights, ancestral states and factors (multi-state characters). By defining a specific ACD type EMBOSS can provide detailed type checking, and can automatically detect and validate the various alternative formats that phylip supports without the need for complex extra command line options.
Any regular expression value, or (new in release 4.0.0) a file containing regular expressions and names.
The length can be validated and controlled by ACD attributes. The case can be set to upper or lower case only. The regular expression must be supported by the EMBOSS regular expression library.
EMBOSS uses the "Perl-Compatible Regular Expression Library" (PCRE), so any regular expression that is valid in Perl 5.0 should be valid here.
SCOP files are simple input files in SCOP format.
USA (database reference or file) indicating a single sequence. The type of sequence can be restricted by specific attribute "type" (for example, the program should only accept DNA files). Can also read features if the "features" ACD attribute is set.
A set of single sequences that can be addressed one after another (for example a set of sequences that will be used in a multiple alignment). The type of sequence can be restricted by specific attribute "type" (for example, the program should only accept DNA files). Can also read features if the "features" ACD attribute is set.
A set of single sequences that can be used all at the same time (for example a database of some sort that is to be used for a pattern search). The type of sequence can be restricted by specific attribute "type" (for example, the program should only accept DNA files). Can also read features if the "features" ACD attribute is set.
One or more sets of single sequences that can be used all at the same time (for example a database of some sort that is to be used for a pattern search). The type of sequence can be restricted by specific attribute "type" (for example, the program should only accept DNA files). Can also read features if the "features" ACD attribute is set.
Tree is a new ACD type implemented specifically for the "phylipnew" EMBASSY package. Tree input is used by the phylip applications to define one or more phylogenetic trees. By defining a specific ACD type EMBOSS can provide detailed type checking, and can automatically detect and validate the various alternative formats that phylip supports without the need for complex extra command line options. The trees are currently parsed by phylip itself, but in the near future we will implement parsing methods in ACD processing.
Selection lists are a way to present the user with a limited list of options he/she can choose from. For the user, the difference between the list and selection data type is minimal and lies only in the way the choices are labelled. In a selection data type, the choices are numbered automatically from 1 up. In a list data type the choices can be labelled by any arbitrary text label. The user can choose one of the options by either typing the number (for a selection type) or the text of the label (for a list type) or a non-ambiguous part of the value of the choice. In practice, the list data type is much preferred for this reason.
A list of text descriptions with short labels. The user can enter one (or sometimes more) labels, or can specify partial text descriptions. The program is given a list of text labels as input.
A list of text descriptions (usually short, unlike list data), with generated numbers. The user can enter one (or sometimes more) numbers, or can specify partial text descriptions. The program is given a list of text descriptions as input. The list data type is usually preferred.
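The way a user's answer is matched against the choices can be sketched in Python as follows. The function names choose and selection_choices are hypothetical, and two points are assumptions made for the sketch: a "non-ambiguous part" is taken to mean a leading part of the text description, and matching ignores case. For simplicity the function returns the label (or number) of the matched choice, whichever data type is being modelled.

```python
def selection_choices(values):
    """Number the values from 1 up, as a selection data type does."""
    return {str(number): text for number, text in enumerate(values, start=1)}


def choose(answer, choices):
    """Return the label of the choice matching answer.

    choices maps each label (or generated number) to its text description.
    """
    if answer in choices:  # exact label or number
        return answer
    matches = [label for label, text in choices.items()
               if text.lower().startswith(answer.lower())]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"no choice matches {answer!r}")
    raise ValueError(f"{answer!r} is ambiguous: {sorted(matches)}")


# Hypothetical choices, for illustration only.
numbered = selection_choices(["Blosum62", "Pam250"])
labelled = {"b": "Blosum62", "p": "Pam250"}

print(choose("2", numbered))    # by generated number
print(choose("blo", numbered))  # by partial text description
print(choose("p", labelled))    # by label
```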
An output file for sequence alignments. Defined in the same way as a plain text "Outfile" but with extra qualifiers to allow a choice of alignment formats, and attributes to specify whether the alignment will have 2 or more sequences (which limits the possible formats). The data is stored as sequences, so the available formats include the most common sequence formats.
Feature annotation in any known feature format. Can also be stored with the sequence if the sequence output "features" attribute is set.
Output file containing codon usage data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.
Output file containing cleaned PDB protein structure data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.
Output file containing cleaned formatted data as tables or lists. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.
Multiple outdata definitions are by default appended to a single file. The individual ACD definitions allow the format of each file section to be defined.
Output directory for multiple output files to be written. Specifying an outdir allows other properties to be defined, including the default file extension with the "extension" attribute.
Output file containing phylogenetics discrete characteristics data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.
Output file containing phylogenetics distance matrix data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.
Non-sequence-related data file, usually plain text. This data type refers to files that are to be produced by the program and usually do not contain sequence data. The type of data can be identified by a "knowntype" attribute and matched to an Infile standard type for use as input to another program.
Non-sequence-related data files, usually plain text. This data type refers to files that are to be produced by the program and usually do not contain sequence data. The type of data can be identified by a "knowntype" attribute and matched to an Infile standard type for use as input to another program.
Output file containing phylogenetics character frequency data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.
Output file containing integer comparison matrix data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.
Output file containing floating point comparison matrix data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.
Output file containing phylogenetics property data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.
Output file containing SCOP protein domain data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.
Output file containing phylogenetic tree data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.
An output file for sequence annotation. Defined in the same way as a plain "Outfile" but with extra qualifiers to allow a choice of report formats. Report data is stored internally as a feature table, so the available formats include the most common feature formats.
USA (database reference or file) indicating a single sequence. Can also write features if the "features" ACD attribute is set.
The default file extension is the sequence format, but can be specifically set with the "osextension" attribute, for example where applications produce two or more sequence outputs.
A set of single sequences to be written to a single file. Can also write features if the "features" ACD attribute is set.
The default file extension is the sequence format, but can be specifically set with the "osextension" attribute, for example where applications produce two or more sequence outputs.
A set of single sequences stored in memory together, usually a multiple sequence alignment. Can also write features if the "features" ACD attribute is set.
The default file extension is the sequence format, but can be specifically set with the "osextension" attribute, for example where applications produce two or more sequence outputs.
For graphical output of any general kind, including dotplots. The data value is the graphics device, as specified by the "PLPLOT" graphics library used in EMBOSS at present. Example values include "ps" for PostScript, "png" for PNG files, and "X11" for X-Windows. A value of "?" in answer to the prompt will list the available graphics devices on your installation.
For graphical output as a simple two dimensional (2D) XY plot with the sequence along the x-axis. The data value is the graphics device, as specified by the "PLPLOT" graphics library used in EMBOSS at present. Example values include "ps" for PostScript, "png" for PNG files, and "X11" for X-Windows. A value of "?" in answer to the prompt will list the available graphics devices on your installation.
ACD objects have mandatory names.
Formalised:
datatype: parametername [ ]
Example:
sequence: asequence [ ]
This defines asequence to be the name of a sequence object.
In order to assign a value to a parameter, the name of the parameter can be specified on the command line (in a number of ways, see section 4) followed by a value that is appropriate for that data type.
Example:
ACD file definition (partly):
sequence: asequence [ ]
Command line :
% acddemo -asequence filename.seq
This defines filename.seq to be the value of the parameter named asequence for the EMBOSS program acddemo.
If a parameter is defined with the special parameter attribute (parameter: "Y"), using the name of the parameter on the command line is not mandatory (see section 3.4). This is commonly used for input data and for output filenames.
The name of an object is also used, in the EMBOSS program, to refer to the value of the parameter. After the initiation call using the EMBOSS function embInit(), the values of the parameters have been read in and checked (see 1.4). The program must then assign the parameter to an actual EMBOSS object, like sequence (AjPSeq), string (AjPStr) etc. The actual function calls are beyond the scope of this document, and the reader is referred to the AJAX documentation (http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz?-fun+pagelibinfo+-info+EDATA for the SRS searchable Object documentation), but some examples can be found in section 1.4 and 1.5.
The name can also be used in the definition of other ACD parameters. The value of the parameter (or variable) is retrieved using the dollar sign '$' followed by the name of the parameter enclosed in a pair of parentheses.
Formalised:
$(parametername)
Example:
integer: gappenalty [
standard: Y
default: 10
]
integer: gapextpenalty [
default: $(gappenalty)
]
This defines the default for parameter gapextpenalty as the value of parameter gappenalty.
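The substitution rule can be modelled in a few lines of Python. This is only an illustrative sketch of the $(parametername) lookup, not the ACD parser itself: the function name resolve_default and the dictionary of already-defined values are assumptions made for the example.

```python
import re

# Matches a reference of the form $(parametername).
REFERENCE = re.compile(r"\$\(([A-Za-z0-9_]+)\)")


def resolve_default(text, values):
    """Replace each $(name) in text with the value stored under name.

    values holds the parameters defined earlier in the ACD file.
    A reference to a name that is not in values raises KeyError.
    """
    return REFERENCE.sub(lambda match: str(values[match.group(1)]), text)


# Mirrors the example above: gapextpenalty defaults to the value of gappenalty.
values = {"gappenalty": 10}
values["gapextpenalty"] = resolve_default("$(gappenalty)", values)
print(values["gapextpenalty"])
```

Because the lookup only sees parameters that are already in the dictionary, the sketch also reflects the rule that a default can depend only on parameters defined earlier.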
Naming conventions
Although everybody is free to use any (valid) name for a parameter, we would like to propose a naming convention, to streamline the development of ACD files.
| Name | Datatype | Usage |
| sequence | sequence | primary input sequence, generally required |
| outseq | outseq | primary output sequence, generally required, generally should default to the primary input sequence name, extension defaults to the name of the output sequence format. |
| outfile | outfile | primary output non-sequence results file, generally required. The file extension should be allowed to default to the application name. |
| data | infile | primary auxiliary input data file, generally optional |
| minlen | int | minimal length of sequence feature to be found |
| maxlen | int | maximum length of sequence feature to be found |
| wordsize | int | word size for hash tables etc. generally minimum=2 for protein, 4 for DNA |
| window | int | window length for calculating dotplots/features/etc. |
| shift | int | amount by which window is shifted in each iteration |
| consensus | bool | flag for whether consensus sequence should be output |
| gap | float | gap penalty |
| gapext | float | gap extension penalty |
| from | int | position of start of input sequence to specify for an operation (e.g. deletion), defaults to start of sequence, minimum value = 1, maximum value = sequence length |
| to | int | position of end of input sequence to specify for an operation (e.g. deletion), defaults to the 'from' value, minimum value = 'from' value, maximum value = sequence length |
| threshold | float/int | threshold for various operations |
| left | bool | operation should be done at the start of the sequence |
| right | bool | operation should be done at the end of the sequence |
| pattern | string | pattern to search for in sequence |
| patterns | infile | file of patterns to search for in sequence |
Table 3. Recommended naming conventions.
There are two types of attributes for parameters. 'Global' attributes can be defined for any ACD data type. Each data type then has its own set of 'specific' attributes. These definitions can refer to 'calculated' attributes generated automatically by ACD processing. The 'global' and 'specific' attributes are part of the parameter definition and are placed between the square brackets.
Formalised:
datatype: parametername [
attribute: "value"
]
Attributes of a parameter can specify its default value and the requirements for a correct value. They can specify whether the parameter is mandatory and what the limits are for a valid value. There are global attributes that apply to all data types and there are data type-specific attributes.
default:
Defines the default value for the parameter, which can be dependent on the values of parameters defined earlier.
Each data type has a default value, which can be valid (for example a boolean will default to "N") or invalid (many input types will default to an empty string).
information:
A string giving information about the parameter, for use on Web forms and in GUIs and also as the default prompt to the user.
For some data types (sequence is a good example) there are standard prompts so no value is expected, and the acdvalid utility will issue a warning if an information attribute is found.
parameter:
Defines a parameter on the command line which can appear without a qualifier name. Also implies that the value is required and will be prompted for if missing.
standard:
Indicates whether a parameter is mandatory and will be prompted for if missing.
additional:
Indicates if the parameter should be queried for when the -options qualifier is set on the command line.
help:
The string shown when the -help qualifier is used on the command line.
Help is usually only defined if a specific string is needed. If help is not defined, the value of the "information" attribute, or the default prompt, will be used.
expect:
A string used in the "Default" column of the command line syntax table in the documentation. This table is automatically generated from the ACD file, and in most cases there is a reasonable value generated. Where there is no suitable value, this attribute should be used to provide one.
valid:
A string used in the "Allowed values" column of the command line syntax table in the documentation. This table is automatically generated from the ACD file, and in most cases there is a reasonable value generated. Where there is no suitable value, this attribute should be used to provide one.
knowntype:
The knowntype attribute defines one of a controlled vocabulary of known value types. Some ACD data types require a knowntype attribute.
These standard values are read from a file knowntypes.standard which is stored and installed in the ACD file directory. A few other values are accepted, for example "(programname) output" for an outfile data type. These are documented under each output type. The acdvalid utility will check all knowntype values in an ACD file, and report any missing values for data types that require a knowntype.
prompt:
The string used if the user has to be queried for a value, though information can be used instead and usually only one will be defined. information is preferred.
missing:
Indicates whether a qualifier can have no value, especially when it appears on the command line (for example to override a default value in the ACD file).
needed:
Indicates whether a parameter is expected to be included in a GUI form. Some parameters are available on the command line, but are not generally useful to users, or can cause confusion when presented in a GUI form with all other options.
outputmodifier:
Indicates that this qualifier modifies the output in ways that can break parsers, for example by changing text output into HTML. Authors of wrappers can use this to test for qualifiers that can be hardcoded to fix the output syntax and content. Please let the EMBOSS team know if any other qualifiers are candidates for marking as output modifiers.
code:
A code word (no spaces) which is searched for in the file codes.english to give a standard prompt, for example when asking for an alignment gap penalty. The standard default prompts are in the same file. The code word is not case-sensitive. information is preferred.
comment:
A comment, provided for use by the EBI's SoapLab project but not defined in the standard ACD files.
style:
Provided for use by the EBI's SoapLab project but not defined in the standard ACD files.
Any global or specific attribute must have a second token representing the value of the attribute. The attribute must be followed by a colon ':' and usually the value will be enclosed in double quotes.
The syntax of the global attributes is
Formalised:
help: "String"
information: "String"
default: "value"
additional: "Y"/"N"
parameter: "Y"/"N"
standard: "Y"/"N"
Example:
sequence: asequence [
standard: "Y"
information: "Enter filename"
]
The parameter: attribute is a boolean attribute, defining the order of the parameters on the command line, if the parameter name is not explicitly entered on the command line. If set to Y, the parameter can be entered on the command line without using the parameter name.
Formalised:
datatype: parametername [
parameter: Y/N
]
Example:
ACD file definition (partly) :
application: acddemo [
documentation: ""
groups: ""
]
sequence: asequence [
]
Command line :
% acddemo -asequence filename.seq
Is equivalent to:
ACD file definition (partly) :
sequence: asequence [
parameter: Y
]
Command line:
% acddemo filename.seq
In both examples filename.seq is the value of the parameter named asequence for the EMBOSS program acddemo.
The second example will also allow the command line from the first, as parameter names are accepted as qualifiers.
If more than one parameter: attribute is used, the order in which the parameters appear in the ACD file is the same as the order in which their values appear on the command line.
Example:
ACD file definition (partly) :
application: acddemo [
documentation: ""
groups: ""
]
sequence: asequence [
parameter: Y
]
outseq: outseq [
parameter: Y
]
Command line :
% acddemo infilename.seq outfilename.seq
will assign the name infilename.seq to parameter asequence, and outfilename.seq to parameter outseq.
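The ordering rule can be sketched in Python as follows. This is a simplified model written for illustration, not EMBOSS code: the function assign_values and its (name, is_parameter) input format are assumptions, and only the two command-line forms shown above (a bare value, or "-name value") are handled.

```python
def assign_values(definitions, argv):
    """Map command-line tokens to parameter names.

    definitions: list of (name, is_parameter) pairs in ACD file order.
    argv: command-line tokens after the program name.
    """
    names = {name for name, _ in definitions}
    # Names that may be given without a qualifier, in ACD file order.
    positional = [name for name, is_parameter in definitions if is_parameter]
    assigned = {}
    tokens = iter(argv)
    for token in tokens:
        if token.startswith("-") and token[1:] in names:
            # Named form: "-asequence filename.seq".
            assigned[token[1:]] = next(tokens)
        else:
            # Unnamed value: use the first positional parameter still unset.
            name = next((n for n in positional if n not in assigned), None)
            if name is None:
                raise ValueError("unexpected value: " + token)
            assigned[name] = token
    return assigned


definitions = [("asequence", True), ("outseq", True)]
print(assign_values(definitions, ["infilename.seq", "outfilename.seq"]))
```

With the same definitions, the mixed form ["-asequence", "a.seq", "b.seq"] assigns a.seq to asequence by name and b.seq to outseq, the first positional parameter left unset.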
Any program is expected to have one or more required inputs. An ACD data type that is defined as a "parameter:" (see section 2.4.1.1.1) is automatically counted as required. All other required inputs should have the "standard:" attribute set.
When the program runs, the user will be prompted for any "required" values that are not already on the command line.
The only difference between "parameter:" and "standard:" is that a "parameter" can appear on the command line as the simple value with no name, to provide simple command lines.
When the additional: attribute is set, the user is queried for the parameter only when the -options qualifier is in effect, whether it is given on the command line, set as a system default using an environment variable (see 3.7), or set by any other means. If the -options qualifier is not set, the user will not be queried for this parameter when it is omitted (i.e. not given on the command line or in any other way).
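The prompting rules for parameter:, standard: and additional: can be summarised in a short Python sketch. It is a model of the rules as described here, not the ACD implementation; the function parameters_to_prompt and the dictionary layout are assumptions made for the example.

```python
def parameters_to_prompt(definitions, supplied, options=False):
    """Return the names the user would be asked for, in ACD file order.

    definitions: list of dicts with a "name" key and optional boolean
    "parameter", "standard" and "additional" keys.
    supplied: set of names that already have a value on the command line.
    options: True when the -options qualifier is in effect.
    """
    prompts = []
    for definition in definitions:
        if definition["name"] in supplied:
            continue
        # parameter: and standard: both mark a value as required.
        required = definition.get("parameter") or definition.get("standard")
        # additional: values are only asked for when -options is set.
        wanted = options and definition.get("additional")
        if required or wanted:
            prompts.append(definition["name"])
    return prompts


definitions = [
    {"name": "asequence", "parameter": True},
    {"name": "gappenalty", "standard": True},
    {"name": "gapextpenalty", "additional": True},
    {"name": "outseq", "parameter": True},
]
print(parameters_to_prompt(definitions, {"asequence"}))
print(parameters_to_prompt(definitions, {"asequence"}, options=True))
```

The first call leaves out gapextpenalty because -options is not set; the second includes it.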
The information: attribute defines the text hint to the user entering a data value. The same text is intended for use in the prompt to the user at a terminal, and as the text in an HTML form or a GUI.
In rare cases where the information: string is misleading, a prompt: string can be defined for use as a terminal prompt. For general use, information: is now preferred.
To provide standard prompts for common ACD data, there are default information: strings for most data types. These can be found in the file codes.english with the names DEFXXXX where XXXX is the name of the ACD data type.
Common practice is to use the default prompt for input and output ACD data types.
The help: attribute is shown in the help information when the user requests assistance using the -help qualifier on the command line, or when help is requested in another format (for example a Web page).
Again, there is a default help string in the codes.english file with the name HELPXXXX where XXXX is the name of the ACD data type.
The codes.english file includes some additional standard prompts such as GAP for gap penalties. These prompts can be used with the code: attribute, for example code: "GAP", but GUI developers found these hard to use, so we have replaced them with normal information: attributes.
The default set of attributes is available for all ACD data type definitions.
Each ACD type has its own set of specific attributes, summarized in Table 1 and described in more detail below.
Formalised:
| Data type | Attribute definition | Description |
| array | minimum: float | Minimum value |
| | maximum: float | Maximum value |
| | increment: float | (Not used by ACD) Increment for GUIs |
| | precision: integer | (Not used by ACD) Floating precision for GUIs |
| | warnrange: Y/N | Warning if values are out of range |
| | size: integer | Number of values required |
| | sum: float | Total for all values |
| | sumtest: Y/N | Test sum of all values |
| | tolerance: float | Tolerance (sum +/- tolerance) of the total |
| float | minimum: float | Minimum value |
| | maximum: float | Maximum value |
| | increment: float | (Not used by ACD) Increment for GUIs |
| | precision: integer | Precision for printing values |
| | warnrange: Y/N | Warning if values are out of range |
| integer | minimum: integer | Minimum value |
| | maximum: integer | Maximum value |
| | increment: integer | (Not used by ACD) Increment for GUIs |
| | warnrange: Y/N | Warning if values are out of range |
| range | minimum: integer | Minimum value |
| | maximum: integer | Maximum value |
| | size: integer | Exact number of values required |
| | minsize: integer | Minimum number of values required |
| string | minlength: integer | Minimum length |
| | maxlength: integer | Maximum length |
| | pattern: string | Regular expression for validation |
| | upper: Y/N | Convert to upper case |
| | lower: Y/N | Convert to lower case |
| | word: Y/N | Disallow whitespace in strings |
Table 4.1. Simple data types - attributes.
The value for an array is a set of floating point numbers separated by white space or commas. The size: attribute sets the number of elements in the array. As for the float data type, the minimum: and maximum: attributes define the lower and upper value limits and default to the boundaries specified by the system setup. For validation purposes, the sum: attribute defines the total for all values in the array (tested unless the sumtest: attribute is false), and the tolerance: attribute specifies how closely the sum should match the total. Remember that most floating point fractions cannot be represented accurately in binary form.
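The array checks described above can be sketched in Python. This is an illustrative model rather than the ACD code: the functions parse_array and check_array are hypothetical, the keyword total stands in for the sum: attribute, and the default tolerance of 0.01 is an arbitrary choice for the example.

```python
import re


def parse_array(text):
    """Split on white space or commas and convert each token to float."""
    return [float(token) for token in re.split(r"[\s,]+", text.strip()) if token]


def check_array(values, size=None, minimum=None, maximum=None,
                total=None, sumtest=True, tolerance=0.01):
    """Return a list of problems; an empty list means the array is acceptable."""
    problems = []
    if size is not None and len(values) != size:
        problems.append("expected %d values, found %d" % (size, len(values)))
    if minimum is not None and any(v < minimum for v in values):
        problems.append("value below minimum")
    if maximum is not None and any(v > maximum for v in values):
        problems.append("value above maximum")
    # Compare against the total with a tolerance instead of testing equality,
    # because sums of binary floating point fractions are rarely exact.
    if sumtest and total is not None and abs(sum(values) - total) > tolerance:
        problems.append("sum differs from %g by more than %g" % (total, tolerance))
    return problems


values = parse_array("0.1, 0.2 0.3,0.4")
print(check_array(values, size=4, minimum=0.0, maximum=1.0, total=1.0))
```

Testing the sum with a tolerance rather than strict equality is what makes the check usable for values such as 0.1 and 0.2, which have no exact binary representation.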
Although there are (currently) no specific attributes for a boolean ACD type, care should be taken over the definition of the information: and help: attributes. These are used to prompt the user (interactively or via a GUI), and to provide help text. The text provided in each case should reflect the expected default value of the boolean option, which may be the opposite of what the name implies. For example, if set to "Y" by default, then the command line option would typically be "-noxxx" where "xxx" is the qualifier. If set to "N" by default, then the default action may be the opposite of what the information or help text implies. If the value is calculated, the user may need some extra guidance.
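The interaction between the default and the "-noxxx" form can be illustrated with a small Python sketch. It is a simplified model, not the ACD command-line parser: the function boolean_value is hypothetical, and letting the last occurrence win is an assumption made for the example.

```python
def boolean_value(name, default, argv):
    """Return the value of a boolean qualifier given command-line tokens.

    "-name" sets the value to True, "-noname" sets it to False, and the
    default applies when neither form is present. In this sketch the last
    occurrence on the command line wins.
    """
    value = default
    for token in argv:
        if token == "-" + name:
            value = True
        elif token == "-no" + name:
            value = False
    return value


# With a default of "Y", only the "-no" form changes the outcome.
print(boolean_value("consensus", True, []))
print(boolean_value("consensus", True, ["-noconsensus"]))
```

This is why the prompt and help text for an option that defaults to "Y" should be worded around turning the behaviour off.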
The outputmodifier: attribute is set where this parameter changes the content or syntax of the output. This is provided for the developers of other interfaces and parsers of EMBOSS output so that they can fix the value, or provide parsers for each alternative.
The minimum: and maximum: attributes define the lower and upper value limits and default to the boundaries specified by the system setup.
The increment: attribute defines the steps that this parameter is allowed to take, in case there is a need to iterate this parameter. The increment: attribute can be any valid float value.
The precision: attribute defines the maximum number of significant decimal places that will be taken into account for this value.
The integer data type can hold simple integer values. The minimum: and maximum: attributes define the boundaries and default to the boundaries specified by the system setup. The increment: attribute defines the steps that this parameter is allowed to take, in case there is a need to iterate this parameter.
Sequence ranges have similar attributes to integers. The minimum: and maximum: attributes define the boundaries and default to the boundaries specified by the system setup. The minsize: attribute defines the minimum number of values required.
The size: attribute defines an exact number of values required. The minsize: attribute defines a minimum number of values required for ranges that can be any length. Only one of these values should be defined for any range.
The value provided by the user is a list of sequence position pairs to be interpreted by the application. The upper and lower bounds (sequence positions can be negative to count back from the end) will depend on the length of the sequence to which they are applied.
The minlength: attribute defines the minimum length the string must be, the maxlength: attribute defines the maximum length the string can be. The default minimum length is zero. There is no default maximum.
The pattern: attribute defines a regular expression used to check the string value. ACD uses the Perl-compatible regular expression library (PCRE) so any Perl-compatible regular expression should be usable.
The word: attribute requires the result to be a valid word with no whitespace. The default minimum length of zero allows an empty string but this is not accepted as a word. This may change in future.
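The string attributes can be combined into a single check, sketched here in Python. Several points are assumptions made for the example and are not stated in this document: the function check_string is hypothetical, Python's re module stands in for PCRE, case conversion is applied before validation, and the pattern may match anywhere in the string unless it is anchored.

```python
import re


def check_string(value, minlength=0, maxlength=None, pattern=None,
                 upper=False, lower=False, word=False):
    """Apply case conversion, then validate; return (value, problems)."""
    if upper:
        value = value.upper()
    if lower:
        value = value.lower()
    problems = []
    if len(value) < minlength:
        problems.append("shorter than minlength")
    if maxlength is not None and len(value) > maxlength:
        problems.append("longer than maxlength")
    if pattern is not None and re.search(pattern, value) is None:
        problems.append("does not match pattern")
    # An empty string passes the default minlength of zero but is not a word.
    if word and (value == "" or re.search(r"\s", value)):
        problems.append("not a single word")
    return value, problems


print(check_string("acgt", pattern="^[ACGT]+$", upper=True))
print(check_string("", word=True))
```

The second call shows the case noted above: the empty string satisfies the default minimum length of zero but is still rejected as a word.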
Although there are (currently) no specific attributes for a toggle ACD type, care should be taken over the definition of the information: and help: attributes. These are used to prompt the user (interactively or via a GUI), and to provide help text. The text provided in each case should reflect the expected default value of the toggle option, which may be the opposite of what the name implies. For example, if set to "Y" by default, then the command line option would typically be "-noxxx" where "xxx" is the qualifier. If set to "N" by default, then the default action may be the opposite of what the information or help text implies. If the value is calculated, the user may need some extra guidance.
The outputmodifier: attribute is set where this parameter changes the content or syntax of the output. This is provided for the developers of other interfaces and parsers of EMBOSS output so that they can fix the value, or provide parsers for each alternative.
Formalised:
| Data type | Attribute definition | Description |
| codon | name: string | Codon table name |
| | nullok: Y/N | Can accept a null filename as 'no file' |
| cpdb | nullok: Y/N | Can accept a null filename as 'no file' |
| datafile | name: string | Default file base name |
| | extension: string | Default file extension |
| | directory: string | Default installed data directory |
| | nullok: Y/N | Can accept a null filename as 'no file' |
| directory | fullpath: Y/N | Require full path in value |
| | nulldefault: Y/N | Defaults to 'no file' |
| | nullok: Y/N | Can accept a null filename as 'no file' |
| | extension: string | Default file extension |
| dirlist | fullpath: Y/N | Require full path in value |
| | nullok: Y/N | Can accept a null filename as 'no file' |
| | extension: string | Default file extension |
| discretestates | length: integer | Number of discrete state values per set |
| | size: integer | Number of discrete state sets |
| | characters: string | Allowed discrete state characters (default is '' for all non-space characters) |
| | nullok: Y/N | Can accept a null filename as 'no file' |
| distances | size: integer | Number of rows |
| | nullok: Y/N | Can accept a null filename as 'no file' |
| | missval: integer | Can have missing values (replicates zero) |
| features | type: string | Feature type (protein, nucleotide, etc.) |
| | nullok: Y/N | Can accept a null filename as 'no file' |
| filelist | nullok: Y/N | Can accept a null filename as 'no file' |
| | binary: Y/N | File contains binary data |
| frequencies | length: integer | Number of frequency loci/values per set |
| | size: integer | Number of frequency sets |
| | continuous: Y/N | Continuous character data only |
| | genedata: Y/N | Gene frequency data only |
| | within: Y/N | Continuous data for multiple individuals |
| | nullok: Y/N | Can accept a null filename as 'no file' |
| infile | nullok: Y/N | Can accept a null filename as 'no file' |
| | trydefault: Y/N | Default filename may not exist if nullok is true |
| | binary: Y/N | File contains binary data |
| matrix | pname: string | Default name for protein matrix |
| | nname: string | Default name for nucleotide matrix |
| | protein: Y/N | Protein matrix |
| matrixf | pname: string | Default name for protein matrix |
| | nname: string | Default name for nucleotide matrix |
| | protein: Y/N | Protein matrix |
| pattern | minlength: integer | Minimum pattern length |
| | maxlength: integer | Maximum pattern length |
| | maxsize: integer | Maximum number of patterns |
| | upper: Y/N | Convert to upper case |
| | lower: Y/N | Convert to lower case |
| | type: string | Type (nucleotide, protein) |
| properties | length: integer | Number of property values per set |
| | size: integer | Number of property sets |
| | characters: string | Allowed property characters (default is '' for all non-space characters) |
| | nullok: Y/N | Can accept a null filename as 'no file' |
| regexp | minlength: integer | Minimum pattern length |
| | maxlength: integer | Maximum pattern length |
| | maxsize: integer | Maximum number of patterns |
| | upper: Y/N | Convert to upper case |
| | lower: Y/N | Convert to lower case |
| | type: string | Type (string, nucleotide, protein) |
| scop | nullok: Y/N | Can accept a null filename as 'no file' |
| sequence | type: string | Input sequence type (protein, gapprotein, etc.) |
| | features: Y/N | Read features if any |
| | entry: Y/N | Read whole entry text |
| | nullok: Y/N | Can accept a null filename as 'no file' |
| seqall | type: string | Input sequence type (protein, gapprotein, etc.) |
| | features: Y/N | Read features if any |
| | entry: Y/N | Read whole entry text |
| | minseqs: integer | Minimum number of sequences |
| | maxseqs: integer | Maximum number of sequences |
| | nullok: Y/N | Can accept a null filename as 'no file' |
| seqset | type: string | Input sequence type (protein, gapprotein, etc.) |
| | features: Y/N | Read features if any |
| | aligned: Y/N | Sequences are aligned |
| | minseqs: integer | Minimum number of sequences |
| | maxseqs: integer | Maximum number of sequences |
| | nulldefault: Y/N | Defaults to 'no file' |
| | nullok: Y/N | Can accept a null filename as 'no file' |
| seqsetall | type: string | Input sequence type (protein, gapprotein, etc.) |
| | features: Y/N | Read features if any |
| | aligned: Y/N | Sequences are aligned |
| | minseqs: integer | Minimum number of sequences |
| | maxseqs: integer | Maximum number of sequences |
| | minsets: integer | Minimum number of sequence sets |
| | maxsets: integer | Maximum number of sequence sets |
| | nulldefault: Y/N | Defaults to 'no file' |
| | nullok: Y/N | Can accept a null filename as 'no file' |
| tree | size: integer | Number of trees (0 means any number) |
| | nullok: Y/N | Can accept a null filename as 'no file' |
Table 4.2. Input data types - attributes.
Codon usage tables are species-specific, and in some cases specific to a class of genes within a species. This makes it useful to specify a default value for a codon usage table name. Internally, a default is set in the ACD source code. Usually this is "Ehum.cut", the human codon usage table provided in the EMBOSS distribution.
Individual codon inputs can set their own default names with the name: attribute which in the current version has the same effect as setting the default: attribute.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a data file. The application must be able to accept a null value for this qualifier.
Cleaned PDB file input has a default value (typically "1azu") set in the ACD source code although this is not really a good idea.
Individual cpdb inputs can set their own default names with the name: attribute which in the current version has the same effect as setting the default: attribute.
The default datafile name is defined by two ACD attributes, name: and extension:. The directory: attribute defines the EMBOSS data subdirectory to be searched.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a data file. The application must be able to accept a null value for this qualifier.
The extension: attribute sets the extension for all files read from the directory. Files with other extensions will not be read.
The fullpath: attribute can be used to require a full rather than a relative path specification for a directory.
If a null value (the current directory) is allowed, the nullok: attribute must be set true.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a directory. The application must be able to accept a null value for this qualifier.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no directory) as the default for programs where a directory is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The extension: attribute sets the extension for all files read from the directories. Files with other extensions will not be read.
The fullpath: attribute can be used to require a full rather than a relative path specification for a directory.
If a null value (the current directory) is allowed, the nullok: attribute must be set true.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a directory. The application must be able to accept a null value for this qualifier.
The discretestates data type can be replaced by a simple input file in GUIs, with the user required to provide the correct data format.
The attributes define characteristics required for Phylip programs.
The length: attribute defines the number of state values (the length of the discrete characters string) in each set.
The size: attribute defines the number of sets of values, usually 1 but some programs will accept multiple sets.
The characters: attribute defines which discrete state characters can be specified. This is defined as a string containing all possible characters.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a discretestates file.
The distances data type can be replaced by a simple input file in GUIs, with the user required to provide the correct data format.
The attributes define characteristics required for Phylip programs. The distance matrices accepted by ACD include all the formats read by Phylip, with automatic interconversion.
The size: attribute defines the number of rows in the distance matrix.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a distance file.
The type: attribute defines whether the feature input is "protein" or "nucleotide". There is a default based on the type of any input sequence, but a value should always be specified.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without features input. The application must be able to accept a null value for this qualifier.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no feature input) as the default for programs where feature input is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
Filelist is equivalent to infile, but allows the user to specify one or more input files.
The nullok: attribute specifies that a missing input file is acceptable to the application, and that -noxxx can be used on the command line to avoid reading the default input file (if any).
The frequencies data type can be replaced by a simple input file in GUIs, with the user required to provide the correct data format.
The attributes define characteristics required for Phylip programs. The frequencies file formats accepted by ACD include all the formats read by Phylip, with automatic interconversion.
The length: attribute defines the number of loci (or values) in the frequencies file.
The size: attribute defines the number of sets of values, usually 1 but some programs will accept multiple sets.
The continuous: attribute specifies a frequencies file with continuous character data values.
The genedata: attribute specifies a frequencies file with genetic locus data values.
The within: attribute specifies a frequencies file with continuous data for multiple individuals (additional values on each line).
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a frequencies file.
The nullok: attribute specifies that a missing input file is acceptable to the application, and that -noxxx can be used on the command line to avoid reading the default input file (if any).
The trydefault: attribute specifies that the default filename may not exist. If nullok: is also defined as true then no error is reported.
The protein: attribute will determine if the scoring matrix is used as a DNA or Protein matrix.
The protein: attribute will determine if the scoring matrix is used as a DNA or Protein matrix.
Patterns are processed by an internal set of library functions designed to handle PROSITE-style pattern definitions.
The minlength: attribute defines the minimum length the pattern string must be, and the maxlength: attribute defines the maximum length the pattern string can be.
The upper: and lower: attributes convert an input pattern to upper or lower case before compiling.
The type: attribute describes the pattern as applying to nucleotide or protein sequence. Nucleotide patterns are compared in both directions.
The properties data type can be replaced by a simple input file in GUIs, with the user required to provide the correct data format.
The attributes define characteristics required for Phylip programs. The properties files accepted by ACD include all the formats read by Phylip, with automatic interconversion.
The length: attribute defines the number of values in the properties file.
The size: attribute defines the number of sets of values, usually 1 but some programs will accept multiple sets.
The characters: attribute defines which property characters can be specified. This is defined as a string containing all possible characters.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a properties file.
Regular expressions are processed by the "Perl-Compatible Regular Expression Library" (PCRE). Any value must be accepted by this library's compilation function. Some additional attributes are provided for further validation by ACD.
The minlength: attribute defines the minimum length the regular expression string must be, and the maxlength: attribute defines the maximum length the regular expression string can be.
The upper: and lower: attributes convert an input regular expression to upper or lower case before compiling.
The type: attribute describes the pattern as applying to nucleotide or protein sequence. Nucleotide patterns are compared in both directions.
Scop file input has a default value (typically "d3sdha") set in the ACD source code although this is not really a good idea.
Individual scop inputs can set their own default names with the name: attribute which in the current version has the same effect as setting the default: attribute.
The type: attribute will force the sequence to be of the given type. By default, any sequence type is accepted.
We recommend always defining the type: attribute so that the accepted input sequence type is always clear.
If the features: attribute is set, the sequence input will include feature information either in the same file (if the sequence format supports it) or in a separate file (by default in GFF format).
If the entry: attribute is set, the sequence input will include the full original text of the input sequence or database entry.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a sequence input. The application must be able to accept a null value for this qualifier.
The sask: attribute sets the default for the -sask qualifier and, if set to "Y", specifies that the program will prompt the user for a sequence begin and end position, and prompt for the reversing of a nucleotide sequence. The EMBOSS "yank" program works with fragments of sequences, and uses the sask: attribute to prompt the user.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no sequence input) as the default for programs where sequence input is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
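The interaction between nulldefault: and nullok: can be hard to follow in prose, so here is a minimal Python sketch of the decision as described above. The function resolve_input and its arguments are hypothetical. The sketch models only an explicit empty string on the command line; it does not model -noxxx or the missing: attribute. Rejecting an empty value when neither attribute is set is an assumption.

```python
def resolve_input(given, standard_default, nulldefault=False, nullok=False):
    """Return the value a qualifier ends up with.

    given            -- None if the qualifier was not on the command line,
                        otherwise the string supplied (possibly empty)
    standard_default -- the name that default generation would produce
    """
    if given is None:
        # Nothing on the command line: nulldefault replaces the
        # generated default with an empty string (no input).
        return "" if nulldefault else standard_default

    if given == "":
        if nulldefault:
            # With nulldefault, an explicit empty string switches the
            # standard default back on.
            return standard_default
        if nullok:
            # With nullok alone, an empty string means "no input".
            return ""
        # Assumption: without either attribute an empty value is an error.
        raise ValueError("an empty value is not accepted for this qualifier")

    return given


# Usage: null by default, turned on from the command line.
print(repr(resolve_input(None, "input.seq", nulldefault=True, nullok=True)))
print(repr(resolve_input("", "input.seq", nulldefault=True, nullok=True)))
```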
The type: attribute will force the sequence(s) to be of the given type. By default, any sequence type is accepted.
We recommend always defining the type: attribute so that the accepted input sequence type is always clear.
If the features: attribute is set, the sequence input will include feature information either in the same file (if the sequence format supports it) or in a separate file (by default in GFF format).
If the entry: attribute is set, the sequence input will include the full original text of the input sequence or database entry.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a sequence input. The application must be able to accept a null value for this qualifier.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no sequence input) as the default for programs where sequence input is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The minseqs: attribute specifies a minimum number of sequences to be read. By default, a single sequence is acceptable.
The type: attribute will force the sequence set to be of the given type. By default, any sequence type is accepted.
We recommend always defining the type: attribute so that the accepted input sequence type is always clear.
The aligned: attribute, if true, specifies that all sequences in the input are expected to be aligned. If false, then the sequences are assumed to be unaligned, and are simply read into memory together for processing. We recommend always defining the aligned: attribute so that the nature of the sequence set is clearly defined.
If the features: attribute is set, the sequence input will include feature information either in the same file (if the sequence format supports it) or in a separate file (by default in GFF format).
If the entry: attribute is set, the sequence input will include the full original text of the input sequence or database entry.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a sequence input. The application must be able to accept a null value for this qualifier.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no sequence input) as the default for programs where sequence input is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The minseqs: attribute specifies a minimum number of sequences to be read. By default, a single sequence is acceptable.
The type: attribute will force the sequence set(s) to be of the given type. By default, any sequence type is accepted.
We recommend always defining the type: attribute so that the accepted input sequence type is always clear.
The aligned: attribute, if true, specifies that all sequences in the input are expected to be aligned. If false, then the sequences are assumed to be unaligned, and are simply read into memory together for processing. We recommend always defining the aligned: attribute so that the nature of the sequence set is clearly defined.
If the features: attribute is set, the sequence input will include feature information either in the same file (if the sequence format supports it) or in a separate file (by default in GFF format).
If the entry: attribute is set, the sequence input will include the full original text of the input sequence or database entry.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a sequence input. The application must be able to accept a null value for this qualifier.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no sequence input) as the default for programs where sequence input is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The minseqs: attribute specifies a minimum number of sequences to be read for each set. By default, a single sequence is acceptable.
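As a rough illustration of what minseqs: and aligned: imply for a sequence set, the following Python sketch checks a list of sequence strings. The function check_sequence_set is hypothetical. It also assumes that "aligned" means every sequence, gaps included, has the same length, which the text above does not state explicitly.

```python
def check_sequence_set(sequences, minseqs=1, aligned=False):
    """Check one sequence set against minseqs: and aligned:.

    sequences -- list of sequence strings (gap characters included)
    Raises ValueError if a constraint is not met, otherwise returns True.
    """
    if len(sequences) < minseqs:
        raise ValueError(
            f"need at least {minseqs} sequence(s), got {len(sequences)}"
        )

    if aligned:
        # Assumption: aligned sequences all share one length.
        lengths = {len(seq) for seq in sequences}
        if len(lengths) > 1:
            raise ValueError("sequences in an aligned set differ in length")

    return True


# Usage: two gapped sequences of equal length pass the aligned check.
print(check_sequence_set(["ACG-T", "AC-GT"], minseqs=2, aligned=True))
```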
The tree data type can be replaced by a simple input file in GUIs, with the user required to provide the correct data format.
The attributes define characteristics required for Phylip programs. The tree files accepted by ACD include all the formats read by Phylip, with automatic interconversion.
The size: attribute defines the number of trees in the input file, usually 0 but some programs will accept multiple sets. Some can only accept a single tree (so the value should be set to "1" for these).
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a tree file.
Formalised:
| Data type | Attribute definition | Description |
|---|---|---|
| list | minimum: integer | Minimum number of selections |
| | maximum: integer | Maximum number of selections |
| | button: Y/N | (Not used by ACD) Prefer checkboxes in GUI |
| | casesensitive: Y/N | Case sensitive |
| | header: string | Header description for list |
| | delimiter: string | Delimiter for parsing values |
| | codedelimiter: string | Delimiter for parsing |
| | values: string | Codes and values with delimiters |
| selection | minimum: integer | Minimum number of selections |
| | maximum: integer | Maximum number of selections |
| | button: Y/N | (Not used by ACD) Prefer radiobuttons in GUI |
| | casesensitive: Y/N | Case sensitive matching |
| | header: string | Header description for selection list |
| | delimiter: string | Delimiter for parsing values |
| | values: string | Values with delimiters |
Table 4.3. Selection data types - attributes.
For both selection list types, the values that the user can choose from are defined in the values: attribute as a string, delimited by the character given by the delimiter: attribute (which defaults to the semicolon ";"). For the list data type there is a second delimiter character (codedelimiter:) that separates the label from the value (defaults to the colon ":"). The minimum: and maximum: attributes define the number of choices that this parameter can handle. The header: attribute holds the text that is displayed above the option list. The casesensitive: attribute indicates whether the options are case sensitive, but the value of the parameter will be exactly what the list value is. The button: attribute, which can be either Y(es) or N(o), is used by web front ends to indicate whether radio buttons, checkboxes or selection lists are to be used, or whether the list is simply displayed with a text entry box beneath it in which to enter the option.
The values: attribute contains the list of valid code names and values. The delimiter: and codedelimiter: attributes specify how to parse this string into individual list items.
The minimum: attribute specifies the minimum number of selections required. By default, 1 selection is required.
The maximum: attribute specifies the maximum number of selections allowed. By default, the maximum is 1, so exactly 1 selection is required. A higher value allows multiple selections.
The header: attribute defines text to appear before the list is presented to the user. The information: attribute defines text to be used as a prompt after the list.
The delimiter: attribute specifies the character used in the values: string to separate list items.
The codedelimiter: attribute specifies the character used in the values: string to separate codes (names) and descriptions of list items.
The button: attribute suggests whether a list is best represented as checkboxes or radio buttons in an interface (value "Y") or as a pull-down list.
The casesensitive: attribute defines whether the input must match the exact case of the selection list item.
Example:
list: matrix [
  default: "blosum"                # default value
  minimum: 1 maximum: 1            # must select exactly 1
  header: "Comparison matrices"    # printed before list
  values: "B:blosum, P:pam, I:id"  # 3 valid values
  delimiter: ","                   # delimiter default ";"
  codedelimiter: ":"               # label delimiter default ":"
  prompt: "Select one"             # prompt after list
  button: Y                        # use radio buttons rather than
                                   # checkboxes in HTML,
                                   # ignored by ACD
]
What you get is:
Comparison matrices
B : blosum
P : pam
I : id
Select one [blosum] : PAM
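To make the roles of delimiter: and codedelimiter: concrete, the Python sketch below splits the values: string from the list example into (code, label) pairs and prints a menu in the same layout as the output above. The function parse_list_values is hypothetical, and trimming the whitespace around each item is an assumption about how the string is handled.

```python
def parse_list_values(values, delimiter=";", codedelimiter=":"):
    """Split a list values: string into (code, label) pairs.

    delimiter     -- separates the individual list items
    codedelimiter -- separates each item's code from its label
    """
    items = []
    for entry in values.split(delimiter):
        entry = entry.strip()  # assumption: surrounding blanks are ignored
        if not entry:
            continue
        code, _, label = entry.partition(codedelimiter)
        items.append((code.strip(), label.strip()))
    return items


# Usage: the values string from the example, with "," as the delimiter.
menu = parse_list_values("B:blosum, P:pam, I:id", delimiter=",")
print("Comparison matrices")
for code, label in menu:
    print(f"{code} : {label}")
```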
The values: attribute contains the list of valid values. The delimiter: attribute specifies how to parse this string into individual selection list items.
The minimum: attribute specifies the minimum number of selections required. By default, 1 selection is required.
The maximum: attribute specifies the maximum number of selections allowed. By default, the maximum is 1, so exactly 1 selection is required. A higher value allows multiple selections.
The header: attribute defines text to appear before the selection list is presented to the user. The information: attribute defines text to be used as a prompt after the list.
The delimiter: attribute specifies the character used in the values: string to separate list items.
The button: attribute suggests whether a selection list is best represented as checkboxes or radio buttons in an interface (value "Y") or as a pull-down list.
The casesensitive: attribute defines whether the input must match the exact case of the selection list item.
Example:
select: matrix [
  default: "blosum"                # default value
  minimum: "1" maximum: "1"        # must select exactly 1
  header: "Comparison matrices"    # printed before list
  values: "blosum, pam, id"        # valid values
  delimiter: ","                   # delimiter default ";"
  information: "Select one"        # prompt after list
  button: "Y"                      # use radio buttons rather than
                                   # checkboxes in HTML,
                                   # ignored by ACD
]
What you get is:
Comparison matrices
1 : blosum
2 : pam
3 : id
Select one [blosum] : PAM
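In the session above the user types "PAM" although the list item is "pam". The Python sketch below shows one way such an answer could be resolved when casesensitive: is off, returning the value exactly as it appears in the list. The function match_selection is hypothetical. Accepting the item number as an answer is an assumption suggested by the numbered menu, and any abbreviation matching that ACD may perform is not modelled.

```python
def match_selection(answer, values, delimiter=";", casesensitive=False):
    """Resolve a user's answer against a selection values: string.

    Returns the matching item exactly as written in the list,
    or raises ValueError if nothing matches.
    """
    options = [v.strip() for v in values.split(delimiter) if v.strip()]

    # Assumption: the number shown in the menu is also a valid answer.
    if answer.isdecimal() and 1 <= int(answer) <= len(options):
        return options[int(answer) - 1]

    for option in options:
        if casesensitive:
            same = option == answer
        else:
            same = option.lower() == answer.lower()
        if same:
            return option  # the list's own spelling, not the user's

    raise ValueError(f"'{answer}' is not a valid selection")


# Usage: the answer from the example session.
print(match_selection("PAM", "blosum, pam, id", delimiter=","))
```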
Formalised:
| Data type | Attribute definition | Description |
|---|---|---|
| align | type: string | [P]rotein or [N]ucleotide |
| | taglist: string | Extra tags to report |
| | minseqs: integer | Minimum number of sequences |
| | maxseqs: integer | Maximum number of sequences |
| | multiple: Y/N | More than one alignment in one file |
| | nulldefault: Y/N | Defaults to 'no file' |
| | nullok: Y/N | Can accept a null filename as 'no file' |
| featout | name: string | Default base file name (use of -ofname preferred) |
| | extension: string | Default file extension (use of -offormat preferred) |
| | type: string | Feature type (protein, nucleotide, etc.) |
| | multiple: Y/N | Features for multiple sequences |
| | nulldefault: Y/N | Defaults to 'no file' |
| | nullok: Y/N | Can accept a null UFO as 'no output' |
| outcodon | name: string | Default file name |
| | extension: string | Default file extension |
| | nulldefault: Y/N | Defaults to 'no file' |
| | nullok: Y/N | Can accept a null filename as 'no file' |
| outcpdb | nulldefault: Y/N | Defaults to 'no file' |
| | nullok: Y/N | Can accept a null filename as 'no file' |
| outdata | type: string | Data type |
| | nulldefault: Y/N | Defaults to 'no file' |
| | nullok: Y/N | Can accept a null filename as 'no file' |
| | binary: Y/N | File contains binary data |
| outdir | fullpath: Y/N | Require full path in value |
| | nulldefault: Y/N | Defaults to 'no file' |
| | nullok: Y/N | Can accept a null filename as 'no file' |
| | extension: string | Default file extension |
| | binary: Y/N | Files contain binary data |
| | temporary: Y/N | Scratch directory for temporary files deleted on completion |
| outdiscrete | nulldefault: Y/N | Defaults to 'no file' |
| | nullok: Y/N | Can accept a null filename as 'no file' |
| outdistance | nulldefault: Y/N | Defaults to 'no file' |
| | nullok: Y/N | Can accept a null filename as 'no file' |
| outfile | name: string | Default file name |
| | extension: string | Default file extension |
| | append: Y/N | Append to an existing file |
| | nulldefault: Y/N | Defaults to 'no file' |
| | nullok: Y/N | Can accept a null filename as 'no file' |
| | binary: Y/N | File contains binary data |
| outfileall | name: string | Default file name |
| | extension: string | Default file extension |
| | nulldefault: Y/N | Defaults to 'no file' |
| | nullok: Y/N | Can accept a null filename as 'no file' |
| | binary: Y/N | Files contain binary data |
| outfreq | nulldefault: Y/N | Defaults to 'no file' |
| | nullok: Y/N | Can accept a null filename as 'no file' |
| outmatrix | nulldefault: Y/N | Defaults to 'no file' |
| | nullok: Y/N | Can accept a null filename as 'no file' |
| outmatrixf | nulldefault: Y/N | Defaults to 'no file' |
| | nullok: Y/N | Can accept a null filename as 'no file' |
| outproperties | nulldefault: Y/N | Defaults to 'no file' |
| | nullok: Y/N | Can accept a null filename as 'no file' |
outscop |
n |