Features may also explicitly or implicitly hold the name of the program or database that they are derived from, the sense (in a nucleic sequence), the score and many other pieces of information.
Feature Tables are groups of features.
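As a rough illustration of these two ideas, a feature and a feature table could be modelled as below. This is only a sketch: the class and field names (`Feature`, `FeatureTable`, `key`, `source`, `strand`, `score`) and the sample values are assumptions made for this example and are not part of EMBOSS.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Feature:
    """One annotated region of a sequence (field names are illustrative)."""
    key: str                      # feature type, e.g. "exon"
    start: int                    # first position of the feature
    end: int                      # last position of the feature
    source: Optional[str] = None  # program or database the feature came from
    strand: Optional[str] = None  # sense on a nucleic sequence: "+" or "-"
    score: Optional[float] = None # optional score assigned by the source


@dataclass
class FeatureTable:
    """A group of features that all refer to the same sequence."""
    seqname: str
    features: List[Feature] = field(default_factory=list)

    def by_key(self, key: str) -> List[Feature]:
        """Return the features whose key matches exactly."""
        return [f for f in self.features if f.key == key]


# Made-up example data.
table = FeatureTable("demo_seq")
table.features.append(
    Feature("exon", 10, 120, source="demo_predictor", strand="+", score=0.9)
)
table.features.append(Feature("intron", 121, 300, source="demo_predictor", strand="+"))
print(len(table.by_key("exon")))
```

Optional fields default to `None` because, as noted above, a feature may or may not carry its source, sense or score.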
If you only intend to look at the resulting features and not read them into any other programs, then it is still worth having a standard set of formats as you will very quickly get used to the look and feel of a format and be able to compare the features from different programs more easily.
Different programs may have different default feature formats. You may accept the default or choose your preferred format when you run the program.
The output types range from graphical displays of where restriction enzymes cut, to probabilities of the three states of a protein secondary structure prediction along a sequence, to rigidly defined text tables of the start and end positions of things like predicted exons or motif matches.
We will confine this document to describing the well-defined and flexible feature formats that have been developed for the major sequence databases (EMBL, GenBank, SwissProt, PIR) and for the input of features into the genome databases (GFF, acedb).
EMBOSS programs which write out in these feature formats will all obey the commands described below.
There are two ways feature tables can be stored: they can be part of a sequence file or database entry, or they can be in a file that does not contain the sequence that it refers to (a raw feature table).
When feature tables are held together with the sequence they refer to, the format is identical to the sequence format of the same name as the feature format. For example, EMBL sequence format is EMBL feature format.
Even when the feature table is not held in the same file as the sequence information, the format of the feature table is the format defined by the feature table definition of the equivalent sequence format. i.e. SwissProt feature table format is defined as part of the SwissProt sequence format definition.
Because most feature table definitions have a controlled vocabulary (i.e. there is a specified list of feature key names that can be used), you cannot edit feature tables to add in features with keys like 'PhD-motif-3'. If you edit the feature tables, you must stick to the allowed set of feature keys. See the documentation below.
The commands you can give to modify the behaviour of the programs with regard to feature formats differ depending on whether the features are included in a sequence file or database entry, or whether the features are in a file which is separate from the sequence that it refers to.
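The check implied by a controlled vocabulary can be sketched as follows. The key sets here are a small illustrative subset chosen for this example; the complete lists are set by each format's own definition, and the function name `invalid_keys` is hypothetical.

```python
# Illustrative subset only; the full key list is set by each format's definition.
ALLOWED_KEYS = {
    "embl": {"CDS", "exon", "intron", "misc_feature"},
    "swissprot": {"DOMAIN", "SIGNAL", "TRANSMEM"},
}


def invalid_keys(fmt, keys):
    """Return the keys, in their original order, that the format does not allow."""
    allowed = ALLOWED_KEYS[fmt]
    return [k for k in keys if k not in allowed]


print(invalid_keys("embl", ["exon", "PhD-motif-3", "CDS"]))
```

Running a check like this before saving an edited feature table would flag a key such as 'PhD-motif-3' that falls outside the vocabulary.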
Name | Comments / Documentation |
---|---|
embl em | The format used by the EMBL nucleic database. |
gff | The General Feature Format defined by the Sanger Centre |
swissprot swiss sw | The format used by the SWISSPROT protein database. The feature table keys are also defined |
pir | The format used by the PIR protein database. |
nbrf | Only available for input - the same as PIR format |
UFOs can be used to specify both the feature format and the feature file, on input or on output.
If no format is specified, then 'GFF' format is the default.
If the feature table is already a part of the sequence (which is generally the case when you are reading the sequence from a database), then the feature table will be read with no problem. If the feature table is in a separate file, you can force the application to read it in using the '-ufo' command-line qualifier, e.g. '-ufo gff:results.dat'.
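To make the shape of a UFO such as 'gff:results.dat' concrete, the sketch below splits one into a format name and a file name, falling back to 'gff' when no format is given. It is a simplified illustration, not EMBOSS's own parser: the function name `split_ufo` is hypothetical, the alias map is taken from the table of format names in this document, and the real UFO syntax may accept forms this sketch does not handle.

```python
# Format names and aliases as listed in this document's table of feature formats.
ALIASES = {
    "embl": "embl", "em": "embl",
    "gff": "gff",
    "swissprot": "swissprot", "swiss": "swissprot", "sw": "swissprot",
    "pir": "pir",
    "nbrf": "pir",  # input only; described as the same as PIR format
}


def split_ufo(ufo, default_format="gff"):
    """Split 'format:filename' into (format, filename).

    If there is no recognised format prefix, the whole string is treated
    as the file name and the default format is used.
    """
    prefix, sep, rest = ufo.partition(":")
    if sep and prefix.lower() in ALIASES:
        return ALIASES[prefix.lower()], rest
    return default_format, ufo


print(split_ufo("gff:results.dat"))
print(split_ufo("results.dat"))
```

Only a prefix that matches a known format name is treated as a format, so a file name that happens to contain a colon is left intact.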
The '-fformat' and '-fopenfile' qualifiers can be used together to specify the feature format and the feature file name individually instead of as part of a UFO.
Qualifier | Type | Description |
---|---|---|
-ufo | string | UFO features |
-fformat | string | features format |
-fopenfile | string | features file name |
Using '-ufo' or '-fopenfile' to read in a feature table will cause the new feature table to replace any existing feature table that is part of the sequence data.
If you wish to combine feature table files from various sources, then the easiest way is to concatenate the GFF format feature files into one file and to specify that file using '-ufo'.
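The concatenation step can be done with any tool that joins text files. The sketch below shows one way in Python. The helper name `concatenate_gff` and the demo file names and contents are made up for illustration, and keeping only the first '##gff-version' header line is an assumption of this sketch, not a documented EMBOSS requirement.

```python
import os
import tempfile


def concatenate_gff(paths, out_path):
    """Write the lines of every input file, in order, to out_path.

    Assumption: a '##gff-version' header is kept only the first time it is seen.
    """
    seen_version = False
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as handle:
                for line in handle:
                    if line.startswith("##gff-version"):
                        if seen_version:
                            continue
                        seen_version = True
                    if not line.endswith("\n"):
                        line += "\n"  # keep records on separate lines
                    out.write(line)


# Demo with two tiny, made-up feature files.
tmp = tempfile.mkdtemp()
a, b, merged = (os.path.join(tmp, n) for n in ("a.gff", "b.gff", "merged.gff"))
with open(a, "w") as fh:
    fh.write("##gff-version 2\nseq1\tprogA\texon\t10\t120\t.\t+\t.\n")
with open(b, "w") as fh:
    fh.write("##gff-version 2\nseq1\tprogB\tmotif\t40\t55\t7.5\t.\t.")
concatenate_gff([a, b], merged)
with open(merged) as fh:
    merged_lines = fh.read().splitlines()
print(len(merged_lines))
```

The merged file can then be given to a program with '-ufo', for example as 'gff:merged.gff'.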
This behaviour can be overridden by using the following command-line qualifiers. Even if a sequence format that is capable of holding a feature table has been specified, these qualifiers let you specify a separate output file and format for the features.
Qualifier | Type | Description |
---|---|---|
-oufo | string | UFO features |
-offormat | string | features format |
-ofname | string | features file name |
These command-line qualifiers change the behaviour of a 'features' input parameter.
Qualifier | Type | Description |
---|---|---|
-fformat | string | features format |
-fopenfile | string | features file name |
-fask | bool | prompt for begin/end/reverse |
-fbegin | integer | first base used |
-fend | integer | last base used, def=max length |
-freverse | bool | reverse (if DNA) |
The default output feature format is 'gff', but this can be changed to the required format using the command-line qualifier '-offormat' followed by the format name.
These command-line qualifiers change the behaviour of a 'featout' output parameter.
Qualifier | Type | Description |
---|---|---|
-offormat | string | output feature format |
-ofopenfile | string | features file name |
-ofextension | string | file name extension |
-ofname | string | base file name |
-ofsingle | bool | separate file for each entry |
-ofdirectory | bool | Output feature file directory |
You might find the program extractfeat useful for extracting the sequences of features.