ACD Syntax 3.0.0

 

1. Introduction

The EMBOSS package consists of a large number of separate programs, each with a specific function. A program usually takes one or more input files and some parameters that are important to its function, and produces output in the form of files, plots, web pages or simple text output.

The programs can be invoked in several ways. A program's name can be entered on the command line with all its parameters, so that the program has all the information it needs at once. A more interactive way is a question-and-answer session, in which the user is asked to enter one piece of information at a time. A third way could be a web interface where a user chooses the options for the program using lists, checkboxes, radio buttons etc. In EMBOSS, the way a program interacts with the user, its interface, is independent of the actual program.

 

1.1 Command line syntax

At the moment, EMBOSS programs are called by giving their name on the UNIX command line either with or without parameters. Many parameters can have qualifiers that will give more information about a parameter. For instance, the format of the information in a sequence file that is used as an input file could be specified on the command line, like:

% seqret filename.seq -sformat fasta

In this example the EMBOSS program 'seqret' is called with the filename 'filename.seq' as its first parameter. '-sformat fasta' indicates that the sequence file is in 'fasta' format. A complete description of the command line syntax will follow in section 2, Formal description of the ACD language. The percentage sign '%' indicates that the command was entered on the UNIX command line. This will be used throughout the documentation.

 

1.2 ACD files

Every EMBOSS program is accompanied by a so-called ACD (Ajax Command Definitions) file, which describes the parameters that the program needs. It contains information about the program's input and output files and any other parameters the program may need. It indicates whether a parameter is mandatory (like an input sequence file) and whether its value must lie within certain limits (a gap penalty for an alignment must be higher than 0, for instance). It can also indicate whether one parameter's value depends on the value or the presence of another (for example, if the input sequence for an alignment program is DNA, the program should not accept a protein comparison matrix).

 

1.3 ACD language

The parameters are defined in a special purpose language called Ajax Command Definitions or ACD, specially designed for EMBOSS. It will specify everything that can appear on the command line or can be used in another interface like web pages. It is a very 'forgiving' language in that it does not restrict the available syntax any more than is strictly necessary.

 

2. Formal description of the ACD language

 

2.1 ACD file overview

ACD files are simple text files that contain the definitions. The files usually have the same prefix as the program, but this is not required. ACD files use the extension '.acd'. This is mandatory.

Formalised:

token: token [
   definition
]

is equivalent to

token=token [
    definition
]

The first token in the file must be "application" directly followed by a colon ':' or an equal sign '='. The second token is the application name with which this ACD file is associated. The application name is followed by (required) application attributes enclosed in square brackets.

Formalised:

  application: appname [
     attributes
  ]

Example:

application: wossname [
    documentation: "Finds programs by keywords"
    groups: "Display"
  ]

The first token of a parameter definition is an Ajax datatype, directly followed by a colon ':' (preferred) or equal sign '='. The second token is the name by which this parameter is going to be known (this is also the name that is used by the EMBOSS program to get the value of the parameter). After the name, definitions are in mandatory square brackets, [], which can make a definition span multiple lines.

Formalised:

datatype: parametername [
    definition
  ]

Example:

sequence: asequence [
    standard: "Y"
  ]

Tokens representing data types and attributes can be abbreviated up to the point where they are not ambiguous. For example, default: can be abbreviated to def: or even d: although the latter is not recommended due to lack of clarity.

Values can be delimited (i.e. treated as one token) by double quotes.
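As an illustration of the abbreviation and quoting rules above, the following sketch shows two alternative ways of writing the same definition. The parameter name window and its values are hypothetical, and the unquoted value assumes that quotes are optional around a single-token value. Only one of the two forms would appear in a real ACD file.

```
# Form 1: full attribute name, quoted value
# ('window' is a hypothetical parameter name)
integer: window [
    default: "10"
    information: "Window size"
]

# Form 2 (an alternative to form 1, not an addition to it):
# '=' after the data type, 'def' as an abbreviation of 'default',
# and a single-token value left unquoted
integer=window [
    def: 10
    information: "Window size"
]
```

The first form is easier to read and matches the preferred colon style described above.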

 

2.2 The application definition

The first token of an ACD file must be the application: token, followed by the application name. The application name and the ACD filename (without the .acd extension) are usually identical, but this is not mandatory. When a program calls the embInit("program") function with "program" as its parameter, the function will only look for an ACD file called program.acd. It will not compare the parameter with the string given after the application: token.

 

2.2.1 Application attributes

 

2.2.1.1 The documentation attribute

The application: token has a documentation: attribute which is followed by a string describing the function of the program. This documentation string will be used to generate the description of the program when the program is run or the user specifies the -help qualifier. When the documentation: attribute is missing, a warning will be issued.

Formalised:

application: appname [
    documentation: string
  ]

Example:

ACD file definition (partly):

application: seqret [
    documentation: "Reads and writes (returns) a sequence"
  ]

Command line:

% seqret
Reads and writes (returns) a sequence
Input sequence :

The ACD file starts with the definition of the program seqret. The documentation: attribute is followed by a string briefly explaining the function of the program and this string is shown after the program is invoked and before it prompts the user for any input. The documentation: string is also searched by the wossname utility, which finds applications by keyword (in the doc string) and group.

The length of the documentation: string should be kept to 63 characters or shorter in order to allow the wossname utility to display each program name and its documentation on one 80-character line.

The documentation: string should not end with a '.' character.

Any acronyms or capitalised abbreviations in the documentation: string should be written in upper case. (e.g.: SNPs, EST, DNA, ABI, SRS, ASCII, CDS, mRNA, B-DNA, RNA, CpG, ORFs, MAR/SAR, PCR, STS, REBASE, SCOP, PROSITE, PRINTS, EMBL, TRANSFAC, AAINDEX, BLAST, GCG, EMBOSS)

The documentation: string should start with an upper-case letter.
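The guidelines for the documentation: string can be seen together in a short sketch. The application name dnacount is hypothetical and is used only for illustration.

```
# Hypothetical application 'dnacount'.
# The documentation string is well under 63 characters, starts with an
# upper-case letter, writes the acronym DNA in upper case and has no
# trailing '.' character.
application: dnacount [
    documentation: "Counts bases in a DNA sequence"
    groups: "Nucleic:Composition"
]

# A string that would break the guidelines (lower-case start, lower-case
# acronym, trailing '.'):
#     documentation: "counts bases in a dna sequence."
```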

 

2.2.1.2 The groups attribute

The groups: attribute allows the EMBOSS programs to be grouped together based on their functionality. The groups: attribute is followed by a string value containing the name(s) of the group(s). When an application belongs to more than one group, the group names must be separated by either a comma (,) or a semi-colon (;); i.e. the value is not a single token, but a list of tokens.

The groups: string is also searched by the wossname utility, which finds applications by keyword (in the doc string) and group.

Formalised:

application: appname [
    groups: "group name1, group2, ... "
  ]

Example: ACD file definition (partly):

application: seqret [
    groups: "Display"
  ]
 

2.2.1.3 Groups format

Group names can have spaces in them.

The group names can be split into sub-levels by the use of a ':' character:
First Level : Second Level
Several third-party interfaces are starting to rely upon there being a maximum of 2 levels, so do not use more than one ':' in a group name.

The group name is now checked against a list of accepted values in the file groups.standard which is defined and installed in the same directory as the ACD files. This file contains one line for each known group, with subgroups defined with a ":" delimiter, and spaces replaced by underscores. Each group also has a short description.
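A minimal sketch of a groups: value that follows this format is given here. The application name alignshow is hypothetical; the group names are taken from the standard groups. The application is placed in one two-level group and one single-level group, separated by a comma.

```
# Hypothetical application 'alignshow', belonging to two groups:
#   Alignment : Global   (two levels, separated by a single ':')
#   Display              (one level only)
application: alignshow [
    documentation: "Aligns two sequences and displays the result"
    groups: "Alignment:Global, Display"
]
```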

The table in the following section lists all groups currently defined.

 

2.2.1.4 The groups structure

The First and Second level group names are given below with some explanation of what might be expected to be placed in the group.

If a group is composed of two levels, such as
Alignment : Consensus
then the group specification must not use the group names singly, (i.e. you must not use "Alignment" or "Consensus").
If the group consists of only one level, such as
Display
then please don't start adding sub-levels to it. (i.e. you must not use "Display : Features")

You are strongly encouraged to use the following groups structure. This is the set of groups defined by the groups.standard file. We have found that most things will fit in one or more of these groups. When, however, a completely new category of program is written, please discuss the creation of the new group name with the developers' mailing list. Sometimes a new group is required (for example the group "Enzyme Kinetics" which had to be created to hold 'findkm').

Top Level        Second Level           Description

Acd                                     ACD file utilities
Alignment        Consensus              Merging sequences to make a consensus
                 Differences            Finding differences between sequences
                 Dot_plots              Dot plot sequence comparisons
                 Global                 Global sequence alignment
                 Local                  Local sequence alignment
                 Multiple               Multiple sequence alignment
Assembly         Fragment_assembly      DNA sequence assembly
Display                                 Publication-quality display
Edit                                    Sequence editing
Enzyme_Kinetics                         Enzyme kinetics calculations
Feature_tables                          Manipulation and display of sequence annotation
HMM                                     Hidden Markov Model analysis
Information                             Information and general help for users
Menus                                   Menu interface(s)
Nucleic          2D_structure           Nucleic acid secondary structure
                 Codon_usage            Codon usage analysis
                 Composition            Composition of nucleotide sequences
                 CpG_islands            CpG island detection and analysis
                 Gene_finding           Predictions of genes and other genomic features
                 Motifs                 Nucleic acid motif searches
                 Mutation               Nucleic acid sequence mutation
                 Profiles               Nucleic acid profile generation and searching
                 Primers                Primer prediction
                 Repeats                Nucleic acid repeat detection
                 RNA_folding            RNA folding methods and analysis
                 Restriction            Restriction enzyme sites in nucleotide sequences
                 Transcription          Transcription factors, promoters and terminator prediction
                 Translation            Translation of nucleotide sequence to protein sequence
Phylogeny        Consensus              Phylogenetic consensus methods
                 Continuous_characters  Phylogenetic continuous character methods
                 Discrete_characters    Phylogenetic discrete character methods
                 Distance_matrix        Phylogenetic distance matrix methods
                 Gene_frequencies       Phylogenetic gene frequency methods
                 Molecular_sequence     Phylogenetic molecular sequence methods
                 Tree_drawing           Phylogenetic tree drawing methods
                 Misc                   Phylogenetic other tools
Protein          2D_structure           Protein secondary structure
                 3D_structure           Protein tertiary structure
                 Composition            Composition of protein sequences
                 Motifs                 Protein motif searches
                 Mutation               Protein sequence mutation
                 Profiles               Protein profile generation and searching
Test                                    Testing tools, not for general use.
Utils            Database_creation      Database installation
                 Database_indexing      Database indexing
                 Misc                   Utility tools

 

Table 1. Standard application groups

 

2.3 Ajax Data Types

ACD files describe the parameters that a program needs in an object-oriented manner. The most important types or objects are file objects, sequence objects, number objects, Boolean objects and string objects. The current objects are listed in Table 2.

Each entry gives the data type (object), its description, its calculated attributes, its specific attributes and its command line qualifiers.

All data types
    Specific attributes: additional: "N", code: "", comment: "", default: "",
        expected: "", help: "", information: "", knowntype: "", missing: "N",
        needed: "y", outputmodifier: "N", parameter: "N", prompt: "",
        qualifier: "", relations: "", standard: "N", style: "", template: "",
        valid: ""

Simple types

array    List of floating point numbers
    Specific attributes: minimum: (-FLT_MAX), maximum: (FLT_MAX),
        increment: 0, precision: 0, warnrange: Y, size: 1, sum: 1.0,
        sumtest: Y, tolerance: 0.01

boolean    Boolean value Yes/No

float    Floating point number
    Specific attributes: minimum: (-FLT_MAX), maximum: (FLT_MAX),
        increment: 1.0, precision: 3, warnrange: Y

integer    Integer
    Specific attributes: minimum: (INT_MIN), maximum: (INT_MAX),
        increment: 0, warnrange: Y

range    Sequence range
    Specific attributes: minimum: 1, maximum: (INT_MAX), size: 0, minsize: 0

string    String value
    Calculated attributes: length (integer)
    Specific attributes: minlength: 0, maxlength: (INT_MAX), pattern: "",
        upper: N, lower: N, word: N

toggle    Toggle value Yes/No

Input types

codon    Codon usage file in EMBOSS data path
    Specific attributes: name: "Ehum.cut", nullok: N
    Command line qualifiers: format: ""

cpdb    Clean PDB file
    Specific attributes: nullok: N
    Command line qualifiers: format: ""

datafile    Data file
    Specific attributes: name: "", extension: "", directory: "", nullok: N

directory    Directory
    Specific attributes: fullpath: N, nulldefault: N, nullok: N,
        extension: ""

dirlist    Directory with files
    Specific attributes: fullpath: N, nullok: N, extension: ""

discretestates    Discrete states file
    Specific attributes: length: 0, size: 1, characters: "01", nullok: N

distances    Distance matrix
    Calculated attributes: distancecount (integer), distancesize (integer),
        replicates (boolean), hasmissing (boolean)
    Specific attributes: size: 1, nullok: N, missval: N

features    Readable feature table
    Calculated attributes: fbegin (integer), fend (integer),
        flength (integer), fprotein (boolean), fnucleic (boolean),
        fname (string), fsize (string)
    Specific attributes: type: "", nullok: N
    Command line qualifiers: fformat: "", fopenfile: "", fask: "N",
        fbegin: "0", fend: "0", freverse: "N"

filelist    Comma-separated file list
    Specific attributes: nullok: N, binary: N

frequencies    Frequency value(s)
    Calculated attributes: freqlength (integer), freqsize (integer),
        freqloci (integer), freqgenedata (boolean),
        freqcontinuous (boolean), freqwithin (boolean)
    Specific attributes: length: 0, size: 1, continuous: N, genedata: N,
        within: N, nullok: N

infile    Input file
    Specific attributes: nullok: N, trydefault: N, binary: N

matrix    Comparison matrix file in EMBOSS data path
    Specific attributes: pname: "EBLOSUM62", nname: "EDNAFULL", protein: Y

matrixf    Comparison matrix file in EMBOSS data path
    Specific attributes: pname: "EBLOSUM62", nname: "EDNAFULL", protein: Y

pattern    Property value(s)
    Specific attributes: minlength: 1, maxlength: (INT_MAX),
        maxsize: (INT_MAX), upper: N, lower: N, type: "string"
    Command line qualifiers: pformat: "", pmismatch: "", pname: ""

properties    Property value(s)
    Calculated attributes: propertylength (integer), propertysize (integer)
    Specific attributes: length: 0, size: 1, characters: "", nullok: N

regexp    Regular expression pattern
    Calculated attributes: length (integer)
    Specific attributes: minlength: 1, maxlength: (INT_MAX),
        maxsize: (INT_MAX), upper: N, lower: N, type: "string"
    Command line qualifiers: pformat: "", pname: ""

scop    Clean PDB file
    Specific attributes: nullok: N
    Command line qualifiers: format: ""

sequence    Readable sequence
    Calculated attributes: begin (integer), end (integer), length (integer),
        protein (boolean), nucleic (boolean), name (string), usa (string)
    Specific attributes: type: "", features: N, entry: N, nullok: N
    Command line qualifiers: sbegin: "0", send: "0", sreverse: "N",
        sask: "N", snucleotide: "N", sprotein: "N", slower: "N",
        supper: "N", sformat: "", sdbname: "", sid: "", ufo: "",
        fformat: "", fopenfile: ""

seqall    Readable sequence(s)
    Calculated attributes: begin (integer), end (integer), length (integer),
        protein (boolean), nucleic (boolean), name (string), usa (string)
    Specific attributes: type: "", features: N, entry: N, minseqs: 1,
        maxseqs: (INT_MAX), nullok: N
    Command line qualifiers: sbegin: "0", send: "0", sreverse: "N",
        sask: "N", snucleotide: "N", sprotein: "N", slower: "N",
        supper: "N", sformat: "", sdbname: "", sid: "", ufo: "",
        fformat: "", fopenfile: ""

seqset    Readable set of sequences
    Calculated attributes: begin (integer), end (integer), length (integer),
        protein (boolean), nucleic (boolean), name (string), usa (string),
        totweight (float), count (integer)
    Specific attributes: type: "", features: N, aligned: N, minseqs: 1,
        maxseqs: (INT_MAX), nulldefault: N, nullok: N
    Command line qualifiers: sbegin: "0", send: "0", sreverse: "N",
        sask: "N", snucleotide: "N", sprotein: "N", slower: "N",
        supper: "N", sformat: "", sdbname: "", sid: "", ufo: "",
        fformat: "", fopenfile: ""

seqsetall    Readable sets of sequences
    Calculated attributes: begin (integer), end (integer), length (integer),
        protein (boolean), nucleic (boolean), name (string), usa (string),
        totweight (float), count (integer), multicount (integer)
    Specific attributes: type: "", features: N, aligned: N, minseqs: 1,
        maxseqs: (INT_MAX), minsets: 1, maxsets: (INT_MAX), nulldefault: N,
        nullok: N
    Command line qualifiers: sbegin: "0", send: "0", sreverse: "N",
        sask: "N", snucleotide: "N", sprotein: "N", slower: "N",
        supper: "N", sformat: "", sdbname: "", sid: "", ufo: "",
        fformat: "", fopenfile: ""

tree    Phylogenetic tree
    Calculated attributes: treecount (integer), speciescount (integer),
        haslengths (boolean)
    Specific attributes: size: 0, nullok: N

Selection list types

list    Choose from menu list of values
    Specific attributes: minimum: 1, maximum: 1, button: N,
        casesensitive: N, header: "", delimiter: ";", codedelimiter: ":",
        values: ""

selection    Choose from selection list of values
    Specific attributes: minimum: 1, maximum: 1, button: N,
        casesensitive: N, header: "", delimiter: ":", values: ""

Output types

align    Alignment output file
    Specific attributes: type: "", taglist: "", minseqs: 1,
        maxseqs: (INT_MAX), multiple: N, nulldefault: N, nullok: N
    Command line qualifiers: aformat: "", aextension: "", adirectory: "",
        aname: "", awidth: "0", aaccshow: "N", adesshow: "N",
        ausashow: "N", aglobal: "N"

featout    Writeable feature table
    Specific attributes: name: "", extension: "", type: "", multiple: N,
        nulldefault: N, nullok: N
    Command line qualifiers: offormat: "", ofopenfile: "", ofextension: "",
        ofdirectory: "", ofname: "", ofsingle: "N"

outcodon    Codon usage file
    Specific attributes: name: "", extension: "", nulldefault: N, nullok: N
    Command line qualifiers: odirectory: "", oformat: ""

outcpdb    Cleaned PDB file
    Specific attributes: nulldefault: N, nullok: N

outdata    Formatted output file
    Specific attributes: type: "", nulldefault: N, nullok: N, binary: N
    Command line qualifiers: odirectory: "", oformat: ""

outdir    Output directory
    Specific attributes: fullpath: N, nulldefault: N, nullok: N,
        extension: "", binary: N, temporary: N

outdiscrete    Discrete states file
    Specific attributes: nulldefault: N, nullok: N
    Command line qualifiers: odirectory: "", oformat: ""

outdistance    Distance matrix
    Specific attributes: nulldefault: N, nullok: N

outfile    Output file
    Specific attributes: name: "", extension: "", append: N, nulldefault: N,
        nullok: N, binary: N
    Command line qualifiers: odirectory: ""

outfileall    Multiple output files
    Specific attributes: name: "", extension: "", nulldefault: N, nullok: N,
        binary: N
    Command line qualifiers: odirectory: ""

outfreq    Frequency value(s)
    Specific attributes: nulldefault: N, nullok: N
    Command line qualifiers: odirectory: "", oformat: ""

outmatrix    Comparison matrix file
    Specific attributes: nulldefault: N, nullok: N
    Command line qualifiers: odirectory: "", oformat: ""

outmatrixf    Comparison matrix file
    Specific attributes: nulldefault: N, nullok: N
    Command line qualifiers: odirectory: "", oformat: ""

outproperties    Property value(s)
    Specific attributes: nulldefault: N, nullok: N
    Command line qualifiers: odirectory: "", oformat: ""

outscop    Scop entry
    Specific attributes: nulldefault: N, nullok: N
    Command line qualifiers: odirectory: "", oformat: ""

outtree    Phylogenetic tree
    Specific attributes: name: "", extension: "", nulldefault: N, nullok: N
    Command line qualifiers: odirectory: "", oformat: ""

report    Report output file
    Specific attributes: type: "", taglist: "", multiple: N, precision: 3,
        nulldefault: N, nullok: N
    Command line qualifiers: rformat: "", rname: "", rextension: "",
        rdirectory: "", raccshow: "N", rdesshow: "N", rscoreshow: "Y",
        rstrandshow: "Y", rusashow: "N", rmaxall: "0", rmaxseq: "0"

seqout    Writeable sequence
    Specific attributes: name: "", extension: "", features: N, type: "",
        nulldefault: N, nullok: N
    Command line qualifiers: osformat: "", osextension: "", osname: "",
        osdirectory: "", osdbname: "", ossingle: "N", oufo: "",
        offormat: "", ofname: "", ofdirectory: ""

seqoutall    Writeable sequence(s)
    Specific attributes: name: "", extension: "", features: N, type: "",
        minseqs: 1, maxseqs: (INT_MAX), nulldefault: N, nullok: N
    Command line qualifiers: osformat: "", osextension: "", osname: "",
        osdirectory: "", osdbname: "", ossingle: "N", oufo: "",
        offormat: "", ofname: "", ofdirectory: ""

seqoutset    Writeable sequences
    Specific attributes: name: "", extension: "", features: N, type: "",
        minseqs: 1, maxseqs: (INT_MAX), nulldefault: N, nullok: N,
        aligned: N
    Command line qualifiers: osformat: "", osextension: "", osname: "",
        osdirectory: "", osdbname: "", ossingle: "N", oufo: "",
        offormat: "", ofname: "", ofdirectory: ""

Graphics types

graph    Graph device for a general graph
    Specific attributes: nulldefault: N, nullok: N
    Command line qualifiers: gprompt: "N", gdesc: "", gtitle: "",
        gsubtitle: "", gxtitle: "", gytitle: "", goutfile: "",
        gdirectory: ""

xygraph    Graph device for a 2D graph
    Specific attributes: multiple: 1, nulldefault: N, nullok: N
    Command line qualifiers: gprompt: "N", gdesc: "", gtitle: "",
        gsubtitle: "", gxtitle: "", gytitle: "", goutfile: "",
        gdirectory: ""

 

Table 2. Available Data Types/Objects in ACD.

 

2.3.1 Description of the data types

 

2.3.1.1 Simple

Array

Array parameters are lists of numbers, either integer or floating point. The ACD attributes control validation, for example the number of values, or a list of numbers that adds to a given total. The data value is a list of numbers separated by spaces or commas.
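A possible array definition is sketched here using the array attributes from the data types table. The parameter name basefreq and all values are hypothetical, and the sketch assumes that sumtest: turns on the check that the values add up to sum: within tolerance:.

```
# Hypothetical parameter: four base frequencies that should add up to 1.0
array: basefreq [
    additional: "Y"
    information: "Frequencies for A, C, G and T"
    size: "4"
    minimum: "0.0"
    maximum: "1.0"
    sum: "1.0"
    sumtest: "Y"
    tolerance: "0.01"
    default: "0.25 0.25 0.25 0.25"
]
```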

Boolean

Boolean parameters are simple switches. If they are entered on the command line the value will be Y (True), if they are absent from the command line the value will be the default value. The name can also be prefixed by 'no' to force the value to be N (False). This is needed if the default value is Y (True). The data value is Y for yes and N for no.
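The behaviour described above can be illustrated with a boolean whose default is Y. The parameter name overlap and the program name prog are hypothetical.

```
# Hypothetical switch whose default value is Y (True)
boolean: overlap [
    default: "Y"
    information: "Allow overlapping matches"
]

# Resulting values on the command line ('prog' is a hypothetical program):
#   % prog -overlap       the value is Y
#   % prog                the value is Y (the default)
#   % prog -nooverlap     the value is N
```

Because the default is Y, the 'no' prefix is the only way to switch this parameter off from the command line.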

Integer

The integer data type can hold simple integer values. The value range can be controlled by minimum and maximum values (a minimum value of 0 or 1 is often useful).

Float

Simple float values. The value range can be controlled by minimum and maximum ACD attributes (a minimum value of 0.0 is often useful).
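As a sketch of range control for the two numeric types, the following definitions restrict an integer and a float with the minimum: and maximum: attributes. The parameter names and limits are hypothetical.

```
# Hypothetical integer parameter restricted to 1..20
integer: wordsize [
    standard: "Y"
    minimum: "1"
    maximum: "20"
    default: "6"
    information: "Word size"
]

# Hypothetical float parameter restricted to 0.0..100.0
float: gapopen [
    additional: "Y"
    minimum: "0.0"
    maximum: "100.0"
    default: "10.0"
    information: "Gap opening penalty"
]
```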

Range

Ranges of sequence positions. Originally defined as a simple list of paired numbers, ranges can now be specified in files with the range syntax "@filename", as pairs of numbers with text comments. For example:

# this is my set of ranges
 12      23
  4      5       this is like 12-23, but smaller
 67      10348   interesting region

String

Any string value. The length can be controlled by ACD attributes, and a regular expression pattern can provide more general validation if necessary. Most string values are free text, although strings can be used by a program for any input that is not covered by a defined ACD type.
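A string definition with both kinds of validation might look like the following sketch. The parameter name label and its limits are hypothetical, and the sketch assumes that the pattern: attribute takes a regular expression that the whole value must match.

```
# Hypothetical string parameter: 1 to 12 characters, restricted by a
# regular expression to letters, digits and underscores
string: label [
    additional: "Y"
    minlength: "1"
    maxlength: "12"
    pattern: "^[A-Za-z0-9_]+$"
    information: "Short label for the output"
]
```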

Toggle

Toggle parameters are simple switches, and work in the same way as "boolean" parameters. Toggle parameters are intended for use in turning on/off other parameters. When ACD parameters are grouped in sections, a clean ACD file will have all the "required" parameters in the "required" section and all the "additional" parameters in the "additional" section. Some of these will have calculated values for the "standard" and "additional" attributes, controlled by the value of another parameter. The "toggle" parameters are designed to be used in these calculated values, and can be in the "required" section even if not themselves defined as "standard".

Exactly like "boolean" parameters, if they are entered on the command line the value will be Y (True), if they are absent from the command line the value will be the default value. The name can also be prefixed by 'no' to force the value to be N (False). This is needed if the default value is Y (True). The data value is Y for yes and N for no.
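One way a toggle could control other parameters is sketched below. The parameter names are hypothetical. The sketch also assumes the ACD variable reference $(name) and the calculation form @(...), which are not described in this section, so treat those two expressions as illustrative rather than definitive.

```
# Hypothetical toggle that chooses between graphical and text output
toggle: plot [
    standard: "Y"
    default: "N"
    information: "Produce a plot"
]

# Prompted for only when -plot is set (assumed $(name) reference syntax)
xygraph: graph [
    standard: "$(plot)"
]

# Prompted for only when -plot is not set (assumed @(...) calculation syntax)
outfile: outfile [
    standard: "@(!$(plot))"
]
```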

   

2.3.1.2 Input[1]

Codon

Codon usage tables are simple files read from the EMBOSS data search path, and are distributed in the emboss/data directory.

Codon usage files can be read in several formats, including "gcg".

Cpdb

Cpdb (Cleaned PDB) files are simple input files in CPDB format. See the documentation for pdbparse, part of the EMBASSY domainatrix package, which generates CPDB files from PDB file input.

Datafile

Datafile input refers to a formatted data file to be read from the standard EMBOSS data file locations (see the EMBOSS Administrator's guide for full details).

EMBOSS looks for data files in the local/share/EMBOSS/data directory, or in various user directories.

Most data files are already defined as their own ACD types - matrix, matrixf, codon. Others are hard-coded file names that do not need their own ACD definition, although users are free to define their own file with the appropriate name to override the default file provided.

Directory

Directory defines a directory that can be used for input or output definitions.

Directory is intended for future use to replace string definitions of directory names in some applications, and to provide additional validation of the user input specific to directory specifications.

Dirlist

Dirlist defines a set (list) of directories that can be used for input or output definitions.

Dirlist is intended for future use to replace string definitions of directory names in some applications, and to provide additional validation of the user input specific to directory specifications.

Discretestates

Discretestates is a new ACD type implemented specifically for the "phylipnew" EMBASSY package. Discretestates input is used by the phylip "discrete character" applications. By defining a specific ACD type EMBOSS can provide detailed type checking, and can automatically detect and validate the various alternative formats that phylip supports without the need for complex extra command line options.

Distances

Distances is a new ACD type implemented specifically for the "phylipnew" EMBASSY package. Distances input is used by the phylip "distance matrix" applications. By defining a specific ACD type EMBOSS can provide detailed type checking, and can automatically detect and validate the various alternative formats that phylip supports without the need for complex extra command line options.

Features

Feature annotation in any known feature format. Features can also be read from a sequence and written with a sequence.

Filelist

Filelist defines a set (list) of input files.

Filelist is intended for future use to replace string definitions of input file names in some applications, and to provide additional validation of the user input specific to multiple input files.

Frequencies

Frequencies is a new ACD type implemented specifically for the "phylipnew" EMBASSY package. Frequencies input is used by the phylip "gene frequency and continuous character" applications. By defining a specific ACD type EMBOSS can provide detailed type checking, and can automatically detect and validate the various alternative formats that phylip supports without the need for complex extra command line options.

Infile

Non-sequence-related data file. This data type refers to files that are to be used in the program and usually do not contain sequence data. The type of data can be identified by a "knowntype" attribute and matched to Outfile standard types, or to report, align, featout, or seqout formats.

Matrix

Comparison matrix files are used by many programs. They are data files read from the EMBOSS data search path, and are distributed in the emboss/data directory. For preference, we use the matrix files distributed with BLAST.

Integer matrices are usually faster and are preferred by most applications. Floating-point matrix files are also available if needed, and an integer matrix file can of course also be read as floating point.

The matrix data type has an attribute to force selection of a nucleic acid or protein comparison matrix. In ACD files, the type of the input sequence is often used here.

Remember that any application which uses gap penalties will need to set them separately for each matrix.
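The use of the input sequence type to select the matrix can be sketched as follows. The parameter names are hypothetical, and the reference $(asequence.protein) assumes that a calculated attribute of another parameter can be read with the $(name.attribute) form, which is not described in this section.

```
# Hypothetical input sequence of any type
sequence: asequence [
    parameter: "Y"
    type: "any"
]

# The matrix is forced to protein or nucleic acid according to the
# calculated 'protein' attribute of the input sequence
# (assumed $(name.attribute) reference syntax)
matrix: datafile [
    additional: "Y"
    information: "Matrix file"
    protein: "$(asequence.protein)"
]
```

With this arrangement a DNA input sequence would lead to a nucleic acid matrix being selected, which is the dependency between parameters that ACD files are intended to express.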

Matrixf

Floating point comparison matrices are required by some algorithms. An integer matrix file can of course be used equally well as a floating point matrix.

Pattern

Pattern definitions files allow multiple search patterns to be described, each with a name.

Pattern files are used for PROSITE syntax sequence patterns. The same syntax is used for "regexp" input. Pattern files also allow mismatch values to be defined for each pattern, and a "-pmismatch" qualifier sets the mismatch default for all patterns in the file. Mismatches are not appropriate for regular expression matches.

Properties

Properties is a new ACD type implemented specifically for the "phylipnew" EMBASSY package. Properties input is used by the phylip applications to define weights, ancestral states and factors (multi-state characters). By defining a specific ACD type EMBOSS can provide detailed type checking, and can automatically detect and validate the various alternative formats that phylip supports without the need for complex extra command line options.

Regexp

Any regular expression value, or (new in release 4.0.0) a file containing regular expressions and names.

The length can be validated and controlled by ACD attributes. The case can be set to upper or lower case only. The regular expression must be supported by the EMBOSS regular expression library.

EMBOSS uses the "Perl-Compatible Regular Expression Library" (PCRE), so any regular expression that is valid in Perl 5.0 should be valid here.

Scop

SCOP files are simple input files in SCOP format.

Sequence

USA (database reference or file) indicating a single sequence. The type of sequence can be restricted by specific attribute "type" (for example, the program should only accept DNA files). Can also read features if the "features" ACD attribute is set.
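A single-sequence input that is restricted by type and also reads features might be defined as in this sketch. The parameter names are hypothetical, the type value "dna" is an assumed example of a sequence type, and the $(asequence.length) reference assumes the $(name.attribute) form for reading a calculated attribute.

```
# Hypothetical DNA-only input sequence that also reads its features
sequence: asequence [
    parameter: "Y"
    type: "dna"
    features: "Y"
]

# The calculated 'length' attribute of the sequence limits a later parameter
integer: window [
    standard: "Y"
    minimum: "1"
    maximum: "$(asequence.length)"
    default: "10"
    information: "Window size"
]
```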

Seqall

Set of single sequences that can be addressed one after another (for example a set of sequences that will be used in a multiple alignment). The type of sequence can be restricted by specific attribute "type" (for example, the program should only accept DNA files). Can also read features if the "features" ACD attribute is set.

Seqset

Set of single sequences that can be used all at the same time (for example a database of some sort that is to be used for a pattern search). The type of sequence can be restricted by specific attribute "type" (for example, the program should only accept DNA files). Can also read features if the "features" ACD attribute is set.

Seqsetall

One or more sets of single sequences that can be used all at the same time (for example a database of some sort that is to be used for a pattern search). The type of sequence can be restricted by specific attribute "type" (for example, the program should only accept DNA files). Can also read features if the "features" ACD attribute is set.
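For the multi-sequence input types, a sketch of a seqset definition using the attributes listed in the data types table is given here. The parameter name and values are hypothetical, and "protein" is an assumed example of a sequence type.

```
# Hypothetical input: a set of at least two aligned protein sequences,
# all held in memory at the same time
seqset: sequences [
    parameter: "Y"
    type: "protein"
    aligned: "Y"
    minseqs: "2"
]
```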

Tree

Tree is a new ACD type implemented specifically for the "phylipnew" EMBASSY package. Tree input is used by the phylip applications to define one or more phylogenetic trees. By defining a specific ACD type EMBOSS can provide detailed type checking, and can automatically detect and validate the various alternative formats that phylip supports without the need for complex extra command line options. The trees are currently parsed by phylip itself, but in the near future we will implement parsing methods in ACD processing.

 

2.3.1.3 Selection lists

Selection lists present the user with a limited list of options to choose from. For the user, the difference between the list and selection data types is minimal and lies only in the way the choices are labelled. In a selection data type, the choices are numbered automatically from 1 up. In a list data type, the choices can be labelled with any arbitrary text label. The user chooses an option by typing its number (for a selection type) or its label (for a list type), or a non-ambiguous part of the value of the choice. In practice, the list data type is much preferred because of this more flexible labelling.

List

A list of text descriptions with short labels. The user can enter one (or sometimes more) labels, or can specify partial text descriptions. The program is given a list of text labels as input.

Selection

A list of text descriptions (usually short, unlike list data), with generated numbers. The user can enter one (or sometimes more) numbers, or can specify partial text descriptions. The program is given a list of text descriptions as input. The list data type is usually preferred.
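The difference in labelling between the two types can be seen in a pair of sketched definitions. The parameter names and the choices are hypothetical, and the delimiters are set explicitly rather than relying on defaults.

```
# Hypothetical list: each choice has a short label and a description,
# separated by the code delimiter ':'; choices are separated by ';'
list: method [
    standard: "Y"
    minimum: "1"
    maximum: "1"
    header: "Tree-building methods"
    values: "nj:Neighbor-joining; upgma:UPGMA"
    delimiter: ";"
    codedelimiter: ":"
    information: "Method to use"
]

# Hypothetical selection: descriptions only, numbered automatically from 1
selection: frame [
    standard: "Y"
    minimum: "1"
    maximum: "1"
    header: "Reading frames"
    values: "Forward;Reverse;Both"
    delimiter: ";"
    information: "Frames to translate"
]
```

For the list, the user would type a label such as nj. For the selection, the user would type a number such as 2 or part of a description.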

 

2.3.1.4 Output [1]

Align

An output file for sequence alignments. Defined in the same way as a plain text "Outfile" but with extra qualifiers to allow a choice of alignment formats, and attributes to specify whether the alignment will have 2 or more sequences (which limits the possible formats). The data is stored as sequences, and the available formats include the most common sequence formats.

Featout

Feature annotation in any known feature format. Can also be stored with the sequence if the sequence output "features" attribute is set.

Outcodon

Output file containing codon usage data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.

Outcpdb

Output file containing cleaned PDB protein structure data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.

Outdata

Output file containing cleaned formatted data as tables or lists. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.

Multiple outdata definitions are by default appended to a single file. The individual ACD definitions allow the format of each file section to be defined.

Outdir

Output directory for multiple output files to be written. Specifying an outdir allows other properties to be defined, including the default file extension with the "extension" attribute.

Outdiscrete

Output file containing phylogenetics discrete characteristics data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.

Outdistance

Output file containing phylogenetics distance matrix data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.

Outfile

Non-sequence-related data file, usually plain text. This data type refers to files that are to be produced by the program and usually do not contain sequence data. The type of data can be identified by a "knowntype" attribute and matched to an Infile standard type for use as input to another program.

Outfileall

Non-sequence-related data files, usually plain text. This data type refers to files that are to be produced by the program and usually do not contain sequence data. The type of data can be identified by a "knowntype" attribute and matched to an Infile standard type for use as input to another program.

Outfreq

Output file containing phylogenetics character frequency data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.

Outmatrix

Output file containing integer comparison matrix data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.

Outmatrixf

Output file containing floating point comparison matrix data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.

Outproperties

Output file containing phylogenetics property data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.

Outscop

Output file containing SCOP protein domain data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.

Outtree

Output file containing phylogenetic tree data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.

Report

An output file for sequence annotation. Defined in the same way as a plain "Outfile" but with extra qualifiers to allow a choice of report formats. Report data is stored internally as a feature table, so the available formats include the most common feature formats.

Seqout

USA (database reference or file) indicating a single sequence. Can also write features if the "features" ACD attribute is set.

The default file extension is the sequence format, but can be specifically set with the "osextension" attribute, for example where applications produce two or more sequence outputs.

Seqoutall

A set of single sequences to be written to a single file. Can also write features if the "features" ACD attribute is set.

The default file extension is the sequence format, but can be specifically set with the "osextension" attribute, for example where applications produce two or more sequence outputs.

Seqoutset

A set of single sequences stored in memory together, usually a multiple sequence alignment. Can also write features if the "features" ACD attribute is set.

The default file extension is the sequence format, but can be specifically set with the "osextension" attribute, for example where applications produce two or more sequence outputs.
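
A minimal sketch of an application with two sequence outputs is shown below. The parameter names and the extension value are hypothetical; only the "features" and "osextension" attributes are taken from the descriptions above.

```acd
# Primary output: a single sequence, written with its features
seqout: outseq [
    parameter: "Y"
    features: "Y"
  ]

# Second output: a set of sequences, given its own file extension
seqoutall: rejected [
    standard: "Y"
    osextension: "rejected"
  ]
```

Giving the second output its own extension keeps the two default filenames distinct, which is the situation the "osextension" attribute is described as being for.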

 

2.3.1.5 Graphics

Graph

For graphical output of any general kind, including dotplots. The data value is the graphics device, as specified by the "PLPLOT" graphics library used in EMBOSS at present. Example values include "ps" for Postscript, "png" for PNG files, and "X11" for X-Windows. A value of "?" in answer to the prompt will list the available graphics devices on your installation.

Xygraph

For graphical output as a simple two dimensional (2D) XY plot with the sequence along the x-axis. The data value is the graphics device, as specified by the "PLPLOT" graphics library used in EMBOSS at present. Example values include "ps" for Postscript, "png" for PNG files, and "X11" for X-Windows. A value of "?" in answer to the prompt will list the available graphics devices on your installation.
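
As an illustration, a graphics output might be declared as follows. This is a sketch only: the parameter name graph is an assumption, and acddemo is the example program name used elsewhere in this document.

```acd
# The value supplied for this parameter is the graphics device
xygraph: graph [
    standard: "Y"
  ]
```

Command line :

% acddemo filename.seq -graph png

Here "png" is one of the example device values listed above.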

 

2.3.2 Parameter names

ACD objects have mandatory names.

Formalised:

datatype: parametername [
 ]

Example:

sequence: asequence [
 ]

This defines asequence to be the name of a sequence object.

In order to assign a value to a parameter, the name of the parameter can be specified on the command line (in a number of ways, see section 4) followed by a value that is appropriate for that data type.

Example:

ACD file definition (partly):

sequence: asequence [
  ]

Command line :

% acddemo -asequence filename.seq

This defines filename.seq to be the value of the parameter named asequence for the EMBOSS program acddemo.

If a parameter is defined with a special parameter attribute (parameter: "Y"), using the name of the parameter on the command line is not mandatory (see section 3.4). This is commonly used for input data and for output filenames.

The name of an object is also used, in the EMBOSS program, to refer to the value of the parameter. After the initiation call using the EMBOSS function embInit(), the values of the parameters have been read in and checked (see 1.4). The program must then assign the parameter to an actual EMBOSS object, like sequence (AjPSeq), string (AjPStr) etc. The actual function calls are beyond the scope of this document, and the reader is referred to the AJAX documentation (http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz?-fun+pagelibinfo+-info+EDATA for the SRS searchable Object documentation), but some examples can be found in section 1.4 and 1.5.

The name can also be used in the definition of other ACD parameters. The value of the parameter (or variable) is retrieved using the dollar sign '$' followed by the name of the parameter enclosed in a pair of parentheses.

Formalised:

$(parametername)

Example:

integer: gappenalty    [
    standard: Y
    default: 10
  ]


integer: gapextpenalty [
    default: $(gappenalty)
  ]

This defines the default for parameter gapextpenalty as the value of parameter gappenalty.

Naming conventions

Although everybody is free to use any (valid) name for a parameter, we would like to propose a naming convention, to streamline the development of ACD files.

Name       Datatype   Usage
sequence   sequence   primary input sequence, generally required
outseq     outseq     primary output sequence, generally required, generally should default to the primary input sequence name, extension defaults to the name of the output sequence format
outfile    outfile    primary output non-sequence results file, generally required. The file extension should be allowed to default to the application name
data       infile     primary auxiliary input data file, generally optional
minlen     int        minimum length of sequence feature to be found
maxlen     int        maximum length of sequence feature to be found
wordsize   int        word size for hash tables etc., generally minimum=2 for protein, 4 for DNA
window     int        window length for calculating dotplots/features/etc.
shift      int        amount by which window is shifted in each iteration
consensus  bool       flag for whether consensus sequence should be output
gap        float      gap penalty
gapext     float      gap extension penalty
from       int        position of start of input sequence to specify for an operation (e.g. deletion), defaults to start of sequence, minimum value = 1, maximum value = sequence length
to         int        position of end of input sequence to specify for an operation (e.g. deletion), defaults to the 'from' value, minimum value = 'from' value, maximum value = sequence length
threshold  float/int  threshold for various operations
left       bool       operation should be done at the start of the sequence
right      bool       operation should be done at the end of the sequence
pattern    string     pattern to search for in sequence
patterns   infile     file of patterns to search for in sequence

  Table 3. Recommended naming conventions.

 

2.4 Attributes

There are two types of attributes for parameters. 'Global' attributes can be defined for any ACD data type. Each data type then has its own set of 'specific' attributes. These definitions can refer to 'calculated' attributes generated automatically by ACD processing. The 'global' and 'specific' attributes are part of the parameter definition and are placed between the square brackets.

Formalised:

datatype: parametername [ 
    attribute: "value"
  ]
 

2.4.1 Attributes

Attributes of a parameter can specify its default value and the requirements for a correct value. They can specify whether the parameter is mandatory and what the limits are for a valid value. There are global attributes that apply to all data types and there are data type-specific attributes.

 

2.4.1.1 The global attributes

default:

Defines the default value for the parameter, which can be dependent on the values of parameters defined earlier.

Each data type has a default value, which can be valid (for example a boolean will default to "N") or invalid (many input types will default to an empty string).

information:

The string giving information about the parameter, for use on Web forms and in GUIs, and also as a default prompt to the user.

For some data types (sequence is a good example) there are standard prompts so no value is expected, and the acdvalid utility will issue a warning if an information attribute is found.

parameter:

Defines a parameter on the command line which can appear without a qualifier name. Also implies that the value is required and will be prompted for if missing.

standard:

Indicates whether a parameter is mandatory and will be prompted for if missing.

additional:

Indicates if the parameter should be queried for when the -options qualifier is set on the command line.

help:

The string shown when the -help qualifier is used on the command line.

Help is usually only defined if a specific string is needed. If help is not defined, the value of the "information" attribute, or the default prompt, will be used.

expect:

A string used in the "Default" column of the command line syntax table in the documentation. This table is automatically generated from the ACD file, and in most cases there is a reasonable value generated. Where there is no suitable value, this attribute should be used to provide one.

valid:

A string used in the "Allowed values" column of the command line syntax table in the documentation. This table is automatically generated from the ACD file, and in most cases there is a reasonable value generated. Where there is no suitable value, this attribute should be used to provide one.

knowntype:

The knowntype attribute defines one of a controlled vocabulary of known value types. Some ACD data types require a knowntype attribute.

These standard values are read from a file knowntypes.standard which is stored and installed in the ACD file directory. A few other values are accepted, for example "(programname) output" for an outfile data type. These are documented under each output type. The acdvalid utility will check all knowntype values in an ACD file, and report any missing values for data types that require a knowntype.

prompt:

The string used if the user has to be queried for a value, though information can be used instead and usually only one will be defined. information is preferred.

missing:

Indicates whether a qualifier can have no value, especially when it appears on the command line (for example to override a default value in the ACD file).

needed:

Indicates whether a parameter is expected to be included in a GUI form. Some parameters are available on the command line, but are not generally useful to users, or can cause confusion when presented in a GUI form with all other options.

outputmodifier:

Indicates that this qualifier modifies the output in ways that can break parsers, for example by changing text output into HTML. Authors of wrappers can use this to test for qualifiers that can be hardcoded to fix the output syntax and content. Please let the EMBOSS team know if any other qualifiers are candidates for marking as output modifiers.

code:

A code word (no spaces) which is searched for in the file codes.english to give a standard prompt, for example when asking for an alignment gap penalty. The standard default prompts are in the same file. The code word is not case-sensitive. information is preferred.

comment:

A comment, provided for use by the EBI's SoapLab project but not defined in the standard ACD files.

style:

Provided for use by the EBI's SoapLab project but not defined in the standard ACD files.

Any global or specific attribute must have a second token representing the value of the attribute. The attribute must be followed by a colon ':' and usually the value will be enclosed in double quotes.

The syntax of the global attributes is

Formalised:

help: "String"
information: "String"
default: "value"
additional: "Y"/"N"
parameter: "Y"/"N"
standard: "Y"/"N"

Example:

sequence: asequence [
    standard: "Y"
    information: "Enter filename"
  ]
 
2.4.1.1.1 Parameter: attribute

The parameter: attribute is a boolean attribute, defining the order of the parameters on the command line, if the parameter name is not explicitly entered on the command line. If set to Y, the parameter can be entered on the command line without using the parameter name.

Formalised:

datatype: parametername [
    parameter: Y/N
  ]

Example:

ACD file definition (partly) :

application: acddemo [
    documentation: ""
    groups: ""
  ]

sequence: asequence [
  ]

Command line :

% acddemo -asequence filename.seq

Is equivalent to:

ACD file definition (partly) :

sequence: asequence [
    parameter: Y
  ]

Command line:

% acddemo filename.seq

In both examples filename.seq is the value of the parameter named asequence for the EMBOSS program acddemo.

The second example will also allow the command line from the first, as parameter names are accepted as qualifiers.

If more than one parameter: attribute is used, the order in which they appear in the ACD file is the same as the order in which they appear on the command line.

Example: ACD file definition (partly) :

application: acddemo [
    documentation: ""
    groups: ""
  ]

sequence: asequence [
    parameter: Y
  ]
outseq: outseq [
    parameter: Y
  ]

Command line :

% acddemo infilename.seq outfilename.seq

will assign the name infilename.seq to parameter asequence, and outfilename.seq to parameter outseq.

 
2.4.1.1.2 Standard: attribute

Any program is expected to have one or more required inputs. An ACD data type that is defined as a "parameter:" (see section 2.4.1.1.1) is automatically counted as required. All other required inputs should have the "standard:" attribute set.

When the program runs, the user will be prompted for any "required" values that are not already on the command line.

The only difference between "parameter:" and "standard:" is that a "parameter" can appear on the command line as the simple value with no name, to provide simple command lines.
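
The difference can be sketched with the definitions below, where the input sequence is a "parameter:" and a window size is a "standard:" value. The parameter name window and its prompt text are assumptions chosen for illustration.

```acd
application: acddemo [
    documentation: ""
    groups: ""
  ]

# Required, and may be given without its name
sequence: asequence [
    parameter: "Y"
  ]

# Required, but must be named on the command line
integer: window [
    standard: "Y"
    information: "Window size"
  ]
```

Command line :

% acddemo filename.seq -window 20

If -window is left off the command line, the user is prompted for it, because it is a required value.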

 
2.4.1.1.3 Additional: attribute

When the additional: attribute is set, the user is queried for the parameter only when the -options qualifier is set, whether on the command line, through a system default set using an environment variable (see 3.7), or in any other way. If the -options qualifier is not set and the parameter is omitted in the program execution (i.e. not given on the command line or in any other way), the user will not be queried for it.
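
As a sketch, an optional window shift could be declared as an additional parameter. The parameter name, default and prompt text are hypothetical.

```acd
# Only prompted for when -options is set
integer: shift [
    additional: "Y"
    default: "1"
    information: "Window shift"
  ]
```

Command lines :

% acddemo filename.seq

% acddemo filename.seq -options

The first command line uses the default value without asking. The second queries the user for shift, because -options is set and no value was supplied.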

 
2.4.1.1.4 The prompt: help: and information: attributes

The information: attribute defines the text hint to the user entering a data value. The same text is intended for use in the prompt to the user at a terminal, and as the text in an HTML form or a GUI.

In rare cases where the information: string is misleading, a prompt: string can be defined for use as a terminal prompt. For general use, information: is now preferred.

To provide standard prompts for common ACD data, there are default information: strings for most data types. These can be found in the file codes.english with the names DEFXXXX where XXXX is the name of the ACD data type.

Common practice is to use the default prompt for input and output ACD data types.

The help: attribute is shown in the help information when the user requests assistance using the -help qualifier on the command line, or when help is requested in another format (for example a Web page).

Again, there is a default help string in the codes.english file with the name HELPXXXX where XXXX is the name of the ACD data type.

The codes.english file includes some additional standard prompts such as GAP for gap penalties. These prompts can be used with the code: attribute, for example code: "GAP", but GUI developers found these hard to use, so we have replaced them with normal information: attributes.
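
A minimal sketch of a definition that sets both a prompt and a longer help string is given below. The parameter name follows the naming conventions in Table 3; the default and both text strings are assumptions written for this example.

```acd
float: gap [
    standard: "Y"
    default: "10.0"
    # Used for the terminal prompt and as the label in a form or GUI
    information: "Gap opening penalty"
    # Shown only when the user asks for help, so it can be longer
    help: "Penalty subtracted from the score each time a gap is opened in the alignment"
  ]
```

If help: were omitted here, the information: string would be used for the help output instead, as described for the global attributes.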

 

2.4.1.2 Data type-specific attributes

The default set of attributes is available for all ACD data type definitions.

Each ACD type has its own set of specific attributes, summarized in Table 1 and described in more detail below.

 
2.4.1.2.1 Simple

Formalised:

Data type       Attribute definition  Default      Description
array           minimum: float        (-FLT_MAX)   Minimum value
                maximum: float        (FLT_MAX)    Maximum value
                increment: float      0            (Not used by ACD) Increment for GUIs
                precision: integer    0            (Not used by ACD) Floating precision for GUIs
                warnrange: Y/N        Y            Warning if values are out of range
                size: integer         1            Number of values required
                sum: float            1.0          Total for all values
                sumtest: Y/N          Y            Test sum of all values
                tolerance: float      0.01         Tolerance (sum +/- tolerance) of the total
float           minimum: float        (-FLT_MAX)   Minimum value
                maximum: float        (FLT_MAX)    Maximum value
                increment: float      1.0          (Not used by ACD) Increment for GUIs
                precision: integer    3            Precision for printing values
                warnrange: Y/N        Y            Warning if values are out of range
integer         minimum: integer      (INT_MIN)    Minimum value
                maximum: integer      (INT_MAX)    Maximum value
                increment: integer    0            (Not used by ACD) Increment for GUIs
                warnrange: Y/N        Y            Warning if values are out of range
range           minimum: integer      1            Minimum value
                maximum: integer      (INT_MAX)    Maximum value
                size: integer         0            Exact number of values required
                minsize: integer      0            Minimum number of values required
string          minlength: integer    0            Minimum length
                maxlength: integer    (INT_MAX)    Maximum length
                pattern: string       ""           Regular expression for validation
                upper: Y/N            N            Convert to upper case
                lower: Y/N            N            Convert to lower case
                word: Y/N             N            Disallow whitespace in strings

Table 4.1. Simple data types - attributes.

Array

The value for an array is a set of floating point numbers separated by white space or commas. The size: attribute sets the number of elements in the array. As for the float data type, the minimum: and maximum: attributes define the lower and upper value limits and default to the boundaries as specified by the system's set-up. For validation purposes, the sum: attribute defines the total for all values in the array (tested unless the sumtest: attribute is false), and the tolerance: attribute specifies how closely the sum should match the total. Remember that most floating point fractions cannot be represented accurately in binary form.
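
As an illustration, an array of four base frequencies that must add up to 1.0 could be sketched as follows. The parameter name and prompt are hypothetical; the attributes are those listed in Table 4.1.

```acd
array: basefreq [
    standard: "Y"
    size: "4"
    minimum: "0.0"
    maximum: "1.0"
    # The four values must total 1.0, give or take 0.01
    sum: "1.0"
    tolerance: "0.01"
    information: "Base frequencies for A C G T"
  ]
```

Command line :

% acddemo filename.seq -basefreq "0.25,0.25,0.25,0.25"

A set of values such as 0.3,0.3,0.3,0.3 would fail the sum test, because 1.2 lies outside the range 0.99 to 1.01.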

Boolean

Although there are (currently) no specific attributes for a boolean ACD type, care should be taken over the definition of the information: and help: attributes. These are used to prompt the user (interactively or via a GUI), and to provide help text. The text provided in each case should reflect the expected default value of the boolean option, which may be the opposite of what the name implies. For example, if set to "Y" by default, then the command line option would typically be "-noxxx" where "xxx" is the qualifier. If set to "N" by default, then the default action may be the opposite of what the information or help text implies. If the value is calculated, the user may need some extra guidance.

The outputmodifier: attribute is set where this parameter changes the content or syntax of the output. This is provided for the developers of other interfaces and parsers of EMBOSS output so that they can fix the value, or provide parsers for each alternative.
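
A minimal sketch of a boolean that is on by default is shown below. The name follows the naming conventions in Table 3; the prompt text is an assumption, worded so that it still reads correctly when the default is "Y".

```acd
boolean: consensus [
    default: "Y"
    information: "Output the consensus sequence"
  ]
```

Command line :

% acddemo filename.seq -noconsensus

Because the default is "Y", the useful command line form is the negated one, which turns the consensus output off.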

Float

The minimum: and maximum: attributes define the lower and upper value limits and default to the boundaries as specified by the system's set-up.

The increment: attribute defines the steps that this parameter is allowed to take, in case there is a need to iterate this parameter. The increment: attribute can be any valid float value.

The precision: attribute defines the maximum number of significant decimal places that will be taken into account for this value.

Integer

The integer data type can hold simple integer values. The minimum: and maximum: attributes define the boundaries and default to the boundaries as specified by the system's setup. The increment: attribute defines the steps that this parameter is allowed to take, in case there is a need to iterate this parameter.
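
The numeric limits for the float and integer data types can be sketched as follows. The parameter names follow the naming conventions in Table 3, while the limits, defaults and prompts are hypothetical values chosen for the example.

```acd
integer: wordsize [
    standard: "Y"
    minimum: "2"
    maximum: "10"
    default: "4"
    information: "Word size"
  ]

float: threshold [
    standard: "Y"
    minimum: "0.0"
    maximum: "1.0"
    default: "0.5"
    precision: "2"
    information: "Score threshold"
  ]
```

With these limits, a value such as -wordsize 12 is out of range, and the warnrange: attribute (which defaults to "Y") controls whether a warning is issued.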

Range

Sequence ranges have similar attributes to integers. The minimum: and maximum: attributes define the boundaries and default to the boundaries as specified by the system's setup.

The size: attribute defines an exact number of values required. The minsize: attribute defines a minimum number of values required for ranges that can be any length. Only one of these values should be defined for any range.

The value provided by the user is a list of sequence position pairs to be interpreted by the application. The upper and lower bounds (sequence positions can be negative to count back from the end) will depend on the length of the sequence to which they are applied.
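
As a sketch, a range that needs at least one region could be declared as below. The parameter name and prompt are hypothetical, and the start-end form of the value on the command line is an assumption about how the position pairs are written.

```acd
range: regions [
    standard: "Y"
    # Any number of regions, but at least one
    minsize: "1"
    information: "Regions to mask"
  ]
```

Command line :

% acddemo filename.seq -regions "4-57,78-94"

Only minsize: is set here, in line with the advice to define either size: or minsize: but not both.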

String

The minlength: attribute defines the minimum length the string must be, the maxlength: attribute defines the maximum length the string can be. The default minimum length is zero. There is no default maximum.

The pattern: attribute defines a regular expression used to check the string value. ACD uses the Perl-compatible regular expression library (PCRE) so any Perl-compatible regular expression should be usable.

The word: attribute requires the result to be a valid word with no whitespace. The default minimum length of zero allows an empty string but this is not accepted as a word. This may change in future.
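
The string attributes can be combined as in the sketch below. The parameter name, limits and regular expression are hypothetical; the expression simply requires the value to begin with a letter.

```acd
string: tag [
    standard: "Y"
    minlength: "1"
    maxlength: "12"
    # Perl-compatible regular expression: must start with a letter
    pattern: "^[A-Za-z]"
    # A single word with no whitespace
    word: "Y"
    information: "Identifier tag"
  ]
```

Setting minlength: to 1 makes the rejection of an empty string explicit, instead of relying on the word: behaviour described above.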

Toggle

Although there are (currently) no specific attributes for a toggle ACD type, care should be taken over the definition of the information: and help: attributes. These are used to prompt the user (interactively or via a GUI), and to provide help text. The text provided in each case should reflect the expected default value of the toggle option, which may be the opposite of what the name implies. For example, if set to "Y" by default, then the command line option would typically be "-noxxx" where "xxx" is the qualifier. If set to "N" by default, then the default action may be the opposite of what the information or help text implies. If the value is calculated, the user may need some extra guidance.

The outputmodifier: attribute is set where this parameter changes the content or syntax of the output. This is provided for the developers of other interfaces and parsers of EMBOSS output so that they can fix the value, or provide parsers for each alternative.  

2.4.1.2.2 Input

Formalised:

Data type       Attribute definition  Default      Description
codon           name: string          "Ehum.cut"   Codon table name
                nullok: Y/N           N            Can accept a null filename as 'no file'
cpdb            nullok: Y/N           N            Can accept a null filename as 'no file'
datafile        name: string          ""           Default file base name
                extension: string     ""           Default file extension
                directory: string     ""           Default installed data directory
                nullok: Y/N           N            Can accept a null filename as 'no file'
directory       fullpath: Y/N         N            Require full path in value
                nulldefault: Y/N      N            Defaults to 'no file'
                nullok: Y/N           N            Can accept a null filename as 'no file'
                extension: string     ""           Default file extension
dirlist         fullpath: Y/N         N            Require full path in value
                nullok: Y/N           N            Can accept a null filename as 'no file'
                extension: string     ""           Default file extension
discretestates  length: integer       0            Number of discrete state values per set
                size: integer         1            Number of discrete state sets
                characters: string    "01"         Allowed discrete state characters (default is '' for all non-space characters)
                nullok: Y/N           N            Can accept a null filename as 'no file'
distances       size: integer         1            Number of rows
                nullok: Y/N           N            Can accept a null filename as 'no file'
                missval: Y/N          N            Can have missing values (replicates zero)
features        type: string          ""           Feature type (protein, nucleotide, etc.)
                nullok: Y/N           N            Can accept a null filename as 'no file'
filelist        nullok: Y/N           N            Can accept a null filename as 'no file'
                binary: Y/N           N            File contains binary data
frequencies     length: integer       0            Number of frequency loci/values per set
                size: integer         1            Number of frequency sets
                continuous: Y/N       N            Continuous character data only
                genedata: Y/N         N            Gene frequency data only
                within: Y/N           N            Continuous data for multiple individuals
                nullok: Y/N           N            Can accept a null filename as 'no file'
infile          nullok: Y/N           N            Can accept a null filename as 'no file'
                trydefault: Y/N       N            Default filename may not exist if nullok is true
                binary: Y/N           N            File contains binary data
matrix          pname: string         "EBLOSUM62"  Default name for protein matrix
                nname: string         "EDNAFULL"   Default name for nucleotide matrix
                protein: Y/N          Y            Protein matrix
matrixf         pname: string         "EBLOSUM62"  Default name for protein matrix
                nname: string         "EDNAFULL"   Default name for nucleotide matrix
                protein: Y/N          Y            Protein matrix
pattern         minlength: integer    1            Minimum pattern length
                maxlength: integer    (INT_MAX)    Maximum pattern length
                maxsize: integer      (INT_MAX)    Maximum number of patterns
                upper: Y/N            N            Convert to upper case
                lower: Y/N            N            Convert to lower case
                type: string          "string"     Type (nucleotide, protein)
properties      length: integer       0            Number of property values per set
                size: integer         1            Number of property sets
                characters: string    ""           Allowed property characters (default is '' for all non-space characters)
                nullok: Y/N           N            Can accept a null filename as 'no file'
regexp          minlength: integer    1            Minimum pattern length
                maxlength: integer    (INT_MAX)    Maximum pattern length
                maxsize: integer      (INT_MAX)    Maximum number of patterns
                upper: Y/N            N            Convert to upper case
                lower: Y/N            N            Convert to lower case
                type: string          "string"     Type (string, nucleotide, protein)
scop            nullok: Y/N           N            Can accept a null filename as 'no file'
sequence        type: string          ""           Input sequence type (protein, gapprotein, etc.)
                features: Y/N         N            Read features if any
                entry: Y/N            N            Read whole entry text
                nullok: Y/N           N            Can accept a null filename as 'no file'
seqall          type: string          ""           Input sequence type (protein, gapprotein, etc.)
                features: Y/N         N            Read features if any
                entry: Y/N            N            Read whole entry text
                minseqs: integer      1            Minimum number of sequences
                maxseqs: integer      (INT_MAX)    Maximum number of sequences
                nullok: Y/N           N            Can accept a null filename as 'no file'
seqset          type: string          ""           Input sequence type (protein, gapprotein, etc.)
                features: Y/N         N            Read features if any
                aligned: Y/N          N            Sequences are aligned
                minseqs: integer      1            Minimum number of sequences
                maxseqs: integer      (INT_MAX)    Maximum number of sequences
                nulldefault: Y/N      N            Defaults to 'no file'
                nullok: Y/N           N            Can accept a null filename as 'no file'
seqsetall       type: string          ""           Input sequence type (protein, gapprotein, etc.)
                features: Y/N         N            Read features if any
                aligned: Y/N          N            Sequences are aligned
                minseqs: integer      1            Minimum number of sequences
                maxseqs: integer      (INT_MAX)    Maximum number of sequences
                minsets: integer      1            Minimum number of sequence sets
                maxsets: integer      (INT_MAX)    Maximum number of sequence sets
                nulldefault: Y/N      N            Defaults to 'no file'
                nullok: Y/N           N            Can accept a null filename as 'no file'
tree            size: integer         0            Number of trees (0 means any number)
                nullok: Y/N           N            Can accept a null filename as 'no file'

Table 4.2. Input data types - attributes.

Codon

Codon usage tables are species-specific, and in some cases specific to a class of genes within a species. This makes it useful to specify a default value for a codon usage table name. Internally, a default is set in the ACD source code. Usually this is "Ehum.cut", the human codon usage table provided in the EMBOSS distribution.

Individual codon inputs can set their own default names with the name: attribute which in the current version has the same effect as setting the default: attribute.

The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a data file. The application must be able to accept a null value for this qualifier.
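
A codon usage input that the application can run without might be sketched as follows. The parameter name cfile is hypothetical; "Ehum.cut" is the default table name mentioned above.

```acd
codon: cfile [
    name: "Ehum.cut"
    # The application must cope with having no table at all
    nullok: "Y"
  ]
```

Command line :

% acddemo filename.seq -nocfile

The -nocfile form replaces the default table name with a null value, which is only acceptable because nullok: is set.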

Cpdb

Cleaned PDB file input has a default value (typically "1azu") set in the ACD source code although this is not really a good idea.

Individual cpdb inputs can set their own default names with the name: attribute which in the current version has the same effect as setting the default: attribute.

Datafile

The default datafile name is defined by two ACD attributes, name: and extension:. The directory: attribute defines the EMBOSS data subdirectory to be searched.

The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a data file. The application must be able to accept a null value for this qualifier.
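
As a sketch, a datafile definition that builds its default filename from a base name and an extension could look like this. The parameter name, file name, extension and subdirectory are all hypothetical.

```acd
datafile: scores [
    # Default file name is built from name: and extension:
    name: "myscores"
    extension: "dat"
    # EMBOSS data subdirectory to be searched
    directory: "MYDATA"
    nullok: "Y"
  ]
```

Under these assumptions the default file would be myscores.dat, looked for in the MYDATA data subdirectory, and the user could switch it off with -noscores.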

Directory

The extension: attribute sets the extension for all files read from the directory. Files with other extensions will not be read.

The fullpath: attribute can be used to require a full rather than a relative path specification for a directory.

If a null value (the current directory) is allowed, the nullok: attribute must be set true.

The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a directory. The application must be able to accept a null value for this qualifier.

The nulldefault: attribute overrides the default name generation, and uses an empty string (no directory) as the default for programs where a directory is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
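
The combination described above could be sketched as follows. The qualifier name (indir) and the extension are hypothetical, and the comment on missing: reflects an assumption about how that attribute behaves; its own description elsewhere in this document is authoritative.

```acd
directory: indir [
    information: "Directory of optional input files"
    extension: "seq"      # hypothetical: only *.seq files are read
    nullok: "Y"           # the application can run without a directory
    nulldefault: "Y"      # default is an empty string (no directory)
    missing: "Y"          # assumed: -indir may be given with no value
]
```

Under these assumptions the qualifier is null unless the user turns it on from the command line.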

Dirlist

The extension: attribute sets the extension for all files read from the directories. Files with other extensions will not be read.

The fullpath: attribute can be used to require a full rather than a relative path specification for a directory.

If a null value (the current directory) is allowed, the nullok: attribute must be set to true.

The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a directory. The application must be able to accept a null value for this qualifier.

Discretestates

The discretestates data type can be replaced by a simple input file in GUIs, with the user required to provide the correct data format.

The attributes define characteristics required for Phylip programs.

The length: attribute defines the number of state values (the length of the discrete characters string) in each set.

The size: attribute defines the number of sets of values, usually 1 but some programs will accept multiple sets.

The characters: attribute defines which discrete state characters can be specified. This is defined as a string containing all possible characters.

The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a discretestates file.
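
A discretestates definition using these attributes might look like the sketch below. The qualifier name (infile) and all values are hypothetical and chosen only to show the shape of the definition.

```acd
discretestates: infile [
    information: "Discrete states file"
    length: "20"          # hypothetical: 20 state values in each set
    size: "1"             # a single set of values
    characters: "01"      # only the characters 0 and 1 are accepted
    nullok: "Y"           # allow -noinfile or an empty string
]
```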

Distances

The distances data type can be replaced by a simple input file in GUIs, with the user required to provide the correct data format.

The attributes define characteristics required for Phylip programs. The distance matrices accepted by ACD include all the formats read by Phylip, with automatic interconversion.

The length: attribute defines the number of rows in the distance matrix.

The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a distance file.

Features

The type: attribute defines whether the feature input is "protein" or "nucleotide". There is a default based on the type of any input sequence, but a value should always be specified.

The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without features input. The application must be able to accept a null value for this qualifier.

The nulldefault: attribute overrides the default name generation, and uses an empty string (no feature input) as the default for programs where feature input is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.

Filelist

Filelist is equivalent to infile, but allows the user to specify one or more input files.

The nullok: attribute specifies that a missing input file is acceptable to the application, and that -noxxx can be used on the command line to avoid reading the default input file (if any).

Frequencies

The frequencies data type can be replaced by a simple input file in GUIs, with the user required to provide the correct data format.

The attributes define characteristics required for Phylip programs. The frequencies file formats accepted by ACD include all the formats read by Phylip, with automatic interconversion.

The length: attribute defines the number of loci (or values) in the frequencies file.

The size: attribute defines the number of sets of values, usually 1 but some programs will accept multiple sets.

The continuous: attribute specifies a frequencies file with continuous character data values.

The genedata: attribute specifies a frequencies file with genetic locus data values.

The within: attribute specifies a frequencies file with continuous data for multiple individuals (additional values on each line).

The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a frequencies file.

Infile

The nullok: attribute specifies that a missing input file is acceptable to the application, and that -noxxx can be used on the command line to avoid reading the default input file (if any).

The trydefault: attribute specifies that the default filename may not exist. If nullok: is also defined as true then no error is reported.
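
As a rough illustration of the two attributes working together, the sketch below defines an optional input file. The qualifier name (rules) and the default file name are hypothetical.

```acd
infile: rules [
    information: "Optional rules file"
    default: "rules.txt"  # hypothetical default file name
    nullok: "Y"           # -norules avoids reading any input file
    trydefault: "Y"       # the default file is allowed not to exist
]
```

Because nullok: is also true here, a missing default file should not be reported as an error.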

Matrix

The protein: attribute will determine if the scoring matrix is used as a DNA or Protein matrix.

Matrixf

The protein: attribute will determine if the scoring matrix is used as a DNA or Protein matrix.

Pattern

Patterns are processed by an internal set of library functions designed to handle PROSITE-style pattern definitions.

The minlength: attribute defines the minimum length of the pattern string, and the maxlength: attribute defines its maximum length.

The upper: and lower: attributes convert an input pattern to upper or lower case before compiling.

The type: attribute describes the pattern as applying to nucleotide or protein sequence. Nucleotide patterns are compared in both directions.
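
A pattern definition using these attributes could be sketched as below. The qualifier name (motif) and the length limits are hypothetical values chosen for illustration.

```acd
pattern: motif [
    information: "PROSITE-style pattern"
    type: "protein"       # the pattern applies to protein sequences
    minlength: "2"        # hypothetical lower limit on pattern length
    maxlength: "50"       # hypothetical upper limit on pattern length
    upper: "Y"            # convert the input pattern to upper case
]
```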

Properties

The properties data type can be replaced by a simple input file in GUIs, with the user required to provide the correct data format.

The attributes define characteristics required for Phylip programs. The properties files accepted by ACD include all the formats read by Phylip, with automatic interconversion.

The length: attribute defines the number of values in the properties file.

The size: attribute defines the number of sets of values, usually 1 but some programs will accept multiple sets.

The characters: attribute defines which property characters can be specified. This is defined as a string containing all possible characters.

The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a properties file.

Regexp

Regular expressions are processed by the "Perl-Compatible Regular Expression Library" (PCRE). Any value must be accepted by this library's compilation function. Some additional attributes are provided for further validation by ACD.

The minlength: attribute defines the minimum length of the regular expression string, and the maxlength: attribute defines its maximum length.

The upper: and lower: attributes convert an input regular expression to upper or lower case before compiling.

The type: attribute describes the pattern as applying to nucleotide or protein sequence. Nucleotide patterns are compared in both directions.

Scop

Scop file input has a default value (typically "d3sdha") set in the ACD source code, although relying on a hard-coded default like this is not really a good idea.

Individual scop inputs can set their own default names with the name: attribute, which in the current version has the same effect as setting the default: attribute.

Sequence

The type: attribute will force the sequence to be of the given type. By default, any sequence type is accepted.

We recommend always defining the type: attribute so that the accepted input sequence type is always clear.

If the features: attribute is set, the sequence input will include feature information either in the same file (if the sequence format supports it) or in a separate file (by default in GFF format).

If the entry: attribute is set, the sequence input will include the full original text of the input sequence or database entry.

The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a sequence input. The application must be able to accept a null value for this qualifier.

The sask: attribute sets the default for the -sask qualifier. If set to "Y", the program will prompt the user for a sequence begin and end position, and prompt for the reversing of a nucleotide sequence. The EMBOSS "yank" program works with fragments of sequences, and uses the sask: attribute to prompt the user.

The nulldefault: attribute overrides the default name generation, and uses an empty string (no sequence input) as the default for programs where sequence input is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
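
A sequence definition that follows the recommendation to always set type: might look like the sketch below. The qualifier name (asequence) is illustrative, and the combination of attributes is an assumption made for the example rather than a copy of any distributed ACD file.

```acd
sequence: asequence [
    information: "Protein sequence"
    type: "protein"       # accept protein sequences only
    features: "Y"         # also read feature information
    sask: "Y"             # prompt for begin and end positions
]
```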

Seqall

The type: attribute will force the sequence(s) to be of the given type. By default, any sequence type is accepted.

We recommend always defining the type: attribute so that the accepted input sequence type is always clear.

If the features: attribute is set, the sequence input will include feature information either in the same file (if the sequence format supports it) or in a separate file (by default in GFF format).

If the entry: attribute is set, the sequence input will include the full original text of the input sequence or database entry.

The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a sequence input. The application must be able to accept a null value for this qualifier.

The nulldefault: attribute overrides the default name generation, and uses an empty string (no sequence input) as the default for programs where sequence input is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.

The minseqs: attribute specifies a minimum number of sequences to be read. By default, a single sequence is acceptable.

Seqset

The type: attribute will force the sequence set to be of the given type. By default, any sequence type is accepted.

We recommend always defining the type: attribute so that the accepted input sequence type is always clear.

The aligned: attribute, if true, specifies that all sequences in the input are expected to be aligned. If false, then the sequences are assumed to be unaligned, and are simply read into memory together for processing. We recommend always defining the aligned: attribute so that the nature of the sequence set is clearly defined.

If the features: attribute is set, the sequence input will include feature information either in the same file (if the sequence format supports it) or in a separate file (by default in GFF format).

If the entry: attribute is set, the sequence input will include the full original text of the input sequence or database entry.

The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a sequence input. The application must be able to accept a null value for this qualifier.

The nulldefault: attribute overrides the default name generation, and uses an empty string (no sequence input) as the default for programs where sequence input is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.

The minseqs: attribute specifies a minimum number of sequences to be read. By default, a single sequence is acceptable.
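
The recommendations above (always define type: and aligned:) could be followed as in the sketch below. The qualifier name (sequences) and the minimum of two sequences are hypothetical.

```acd
seqset: sequences [
    information: "Aligned protein sequences"
    type: "protein"       # accept protein sequences only
    aligned: "Y"          # the input is expected to be aligned
    minseqs: "2"          # hypothetical: require at least two sequences
]
```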

Seqsetall

The type: attribute will force the sequence set(s) to be of the given type. By default, any sequence type is accepted.

We recommend always defining the type: attribute so that the accepted input sequence type is always clear.

The aligned: attribute, if true, specifies that all sequences in the input are expected to be aligned. If false, then the sequences are assumed to be unaligned, and are simply read into memory together for processing. We recommend always defining the aligned: attribute so that the nature of the sequence set is clearly defined.

If the features: attribute is set, the sequence input will include feature information either in the same file (if the sequence format supports it) or in a separate file (by default in GFF format).

If the entry: attribute is set, the sequence input will include the full original text of the input sequence or database entry.

The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a sequence input. The application must be able to accept a null value for this qualifier.

The nulldefault: attribute overrides the default name generation, and uses an empty string (no sequence input) as the default for programs where sequence input is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.

The minseqs: attribute specifies a minimum number of sequences to be read for each set. By default, a single sequence is acceptable.

Tree

The tree data type can be replaced by a simple input file in GUIs, with the user required to provide the correct data format.

The attributes define characteristics required for Phylip programs. The tree files accepted by ACD include all the formats read by Phylip, with automatic interconversion.

The size: attribute defines the number of trees in the input file. The default is 0, which means any number of trees. Some programs can only accept a single tree, so the value should be set to "1" for these.

The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a tree file.
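
For a program that can only accept a single tree, a definition along the following lines might be used. The qualifier name (intreefile) is an illustrative assumption.

```acd
tree: intreefile [
    information: "Phylip tree file"
    size: "1"             # accept exactly one tree
    nullok: "Y"           # -nointreefile runs without a tree file
]
```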

 
2.4.1.2.3 Selection lists

Formalised:

Data type   Attribute definition    Description                                    Default
list        minimum: integer        Minimum number of selections                   1
            maximum: integer        Maximum number of selections                   1
            button: Y/N             (Not used by ACD) Prefer checkboxes in GUI     N
            casesensitive: Y/N      Case sensitive                                 N
            header: string          Header description for list                    ""
            delimiter: string       Delimiter for parsing values                   ";"
            codedelimiter: string   Delimiter for parsing                          ":"
            values: string          Codes and values with delimiters               ""
selection   minimum: integer        Minimum number of selections                   1
            maximum: integer        Maximum number of selections                   1
            button: Y/N             (Not used by ACD) Prefer radiobuttons in GUI   N
            casesensitive: Y/N      Case sensitive matching                        N
            header: string          Header description for selection list          ""
            delimiter: string       Delimiter for parsing values                   ":"
            values: string          Values with delimiters                         ""

Table 4.3. Selection data types - attributes.

For both selection list types, the values that the user can choose from are defined in the values: attribute as a string, delimited by the character given by the delimiter: attribute (which defaults to the semi-colon ";"). For the list data type there is a second delimiter character (codedelimiter:) that separates the label from the value (it defaults to the colon ":"). The minimum: and maximum: attributes define the number of choices that this parameter can handle. The header: attribute holds the text that is displayed above the option list. The casesensitive: attribute indicates whether the options are case sensitive or not; in either case the value of the parameter will be exactly what the list value is. The button: attribute, which can be either Y(es) or N(o), is used by web front ends to indicate whether radio buttons, checkboxes or selection lists are to be used, or whether the list is simply displayed with a text entry box beneath it for entering the option.

List

The values: attribute contains the list of valid code names and values. The delimiter: and codedelimiter: attributes specify how to parse this string into individual list items.

The minimum: attribute specifies the minimum number of selections required. By default, 1 selection is required.

The maximum: attribute specifies the maximum number of selections allowed. By default, at most 1 selection is allowed, so with the default minimum: exactly 1 selection is required. A higher value allows multiple selections.

The header: attribute defines text to appear before the list is presented to the user. The information: attribute defines text to be used as a prompt after the list.

The delimiter: attribute specifies the character used in the values: string to separate list items.

The codedelimiter: attribute specifies the character used in the values: string to separate codes (names) and descriptions of list items.

The button: attribute suggests whether a list is best represented as checkboxes or radio buttons in an interface (value "Y") or as a pull-down list.

The casesensitive: attribute defines whether the input must match the exact case of the selection list item.

Example:

list: matrix [
     default: "blosum"   # default value
     minimum: 1 maximum: 1   # must select exactly 1
     header: "Comparison matrices" # printed before list
     values: "B:blosum, P:pam, I:id" # 3 valid values
     delimiter: ","      # delimiter default ";"
     codedelimiter: ":"  # label delimiter default ":"
     information: "Select one" # prompt after list
     button: Y       # use radio buttons rather than
                     # checkboxes in HTML,
                     # ignored by ACD
  ]

What you get is:

Comparison matrices

      B : blosum
      P : pam
      I : id

Select one [blosum] : PAM

Selection

The values: attribute contains the list of valid values. The delimiter: attribute specifies how to parse this string into individual selection list items.

The minimum: attribute specifies the minimum number of selections required. By default, 1 selection is required.

The maximum: attribute specifies the maximum number of selections allowed. By default, at most 1 selection is allowed, so with the default minimum: exactly 1 selection is required. A higher value allows multiple selections.

The header: attribute defines text to appear before the selection list is presented to the user. The information: attribute defines text to be used as a prompt after the list.

The delimiter: attribute specifies the character used in the values: string to separate list items.

The button: attribute suggests whether a selection list is best represented as checkboxes or radio buttons in an interface (value "Y") or as a pull-down list.

The casesensitive: attribute defines whether the input must match the exact case of the selection list item.

Example:

select: matrix [
       default: "blosum"  # default value
       minimum: "1" maximum: "1"  # must select exactly 1
       header: "Comparison matrices" # printed before list
       values: "blosum, pam, id" # valid values
       delimiter: ","     # delimiter default ";"
       information: "Select one" # prompt after list
       button: "Y"      # use radio buttons rather than
                      # checkboxes in HTML,
                      # ignored by ACD
  ]

What you get is:

Comparison matrices

      1 : blosum
      2 : pam
      3 : id

Select one [blosum] : PAM
 
2.4.1.2.4 Output

Formalised:

Data type

Attribute definition

Description

align

type: string

[P]rotein or [N]ucleotide
Default: ""

 

taglist: string

Extra tags to report
Default: ""

 

minseqs: integer

Minimum number of sequences
Default: 1

 

maxseqs: integer

Maximum number of sequences
Default: (INT_MAX)

 

multiple: Y/N

More than one alignment in one file
Default: N

 

nulldefault: Y/N

Defaults to 'no file'
Default: N

 

nullok: Y/N

Can accept a null filename as 'no file'
Default: N

featout

name: string

Default base file name (use of -ofname preferred)
Default: ""

 

extension: string

Default file extension (use of -offormat preferred)
Default: ""

 

type: string

Feature type (protein, nucleotide, etc.)
Default: ""

 

multiple: Y/N

Features for multiple sequences
Default: N

 

nulldefault: Y/N

Defaults to 'no file'
Default: N

 

nullok: Y/N

Can accept a null UFO as 'no output'
Default: N

outcodon

name: string

Default file name
Default: ""

 

extension: string

Default file extension
Default: ""

 

nulldefault: Y/N

Defaults to 'no file'
Default: N

 

nullok: Y/N

Can accept a null filename as 'no file'
Default: N

outcpdb

nulldefault: Y/N

Defaults to 'no file'
Default: N

 

nullok: Y/N

Can accept a null filename as 'no file'
Default: N

outdata

type: string

Data type
Default: ""

 

nulldefault: Y/N

Defaults to 'no file'
Default: N

 

nullok: Y/N

Can accept a null filename as 'no file'
Default: N

 

binary: Y/N

File contains binary data
Default: N

outdir

fullpath: Y/N

Require full path in value
Default: N

 

nulldefault: Y/N

Defaults to 'no file'
Default: N

 

nullok: Y/N

Can accept a null filename as 'no file'
Default: N

 

extension: string

Default file extension
Default: ""

 

binary: Y/N

Files contain binary data
Default: N

 

temporary: Y/N

Scratch directory for temporary files deleted on completion
Default: N

outdiscrete

nulldefault: Y/N

Defaults to 'no file'
Default: N

 

nullok: Y/N

Can accept a null filename as 'no file'
Default: N

outdistance

nulldefault: Y/N

Defaults to 'no file'
Default: N

 

nullok: Y/N

Can accept a null filename as 'no file'
Default: N

outfile

name: string

Default file name
Default: ""

 

extension: string

Default file extension
Default: ""

 

append: Y/N

Append to an existing file
Default: N

 

nulldefault: Y/N

Defaults to 'no file'
Default: N

 

nullok: Y/N

Can accept a null filename as 'no file'
Default: N

 

binary: Y/N

File contains binary data
Default: N

outfileall

name: string

Default file name
Default: ""

 

extension: string

Default file extension
Default: ""

 

nulldefault: Y/N

Defaults to 'no file'
Default: N

 

nullok: Y/N

Can accept a null filename as 'no file'
Default: N

 

binary: Y/N

Files contain binary data
Default: N

outfreq

nulldefault: Y/N

Defaults to 'no file'
Default: N

 

nullok: Y/N

Can accept a null filename as 'no file'
Default: N

outmatrix

nulldefault: Y/N

Defaults to 'no file'
Default: N

 

nullok: Y/N

Can accept a null filename as 'no file'
Default: N

outmatrixf

nulldefault: Y/N

Defaults to 'no file'
Default: N

 

nullok: Y/N

Can accept a null filename as 'no file'
Default: N

outproperties

nulldefault: Y/N

Defaults to 'no file'
Default: N

 

nullok: Y/N

Can accept a null filename as 'no file'
Default: N

outscop

n