Change Log
The SRS server at EMBL-EBI no longer serves the EMBL database! EBI's SRS server databases in server.srs have been updated to reflect their reduced service.
Reading large sequences is more efficient. Reference counted strings are used for output. Where gaps do not need to be replaced, a single copy of the sequence string is used for input, processing and output.
New sequence format iguspto supports a variant of the intelligenetics format with tolerance for format variants on input.
Calculation of isoelectric point has been updated to use the same data values as Expasy and the Open Bio packages. New data file Epkexpasy.dat holds the values used by Expasy.
The final position of the reverse strand is now correctly numbered in the output of sixpack and showseq.
Eukaryote join features in union were not correctly copied after subfeatures were implemented to hold exons. The union code now correctly relocates subfeatures.
Complex (join) feature positions were not relocated when the parent sequence was trimmed by start and end position. This was introduced when subfeatures were implemented, and is now corrected.
New option -methionine for transeq translates any start codon as methionine when a specific range is given (including 1 to end) and an alternative genetic code is specified.
Wildcard filenames were broken by the query language rewrite. The previous functionality is restored. Any query can use a wildcard filename with '*' or '?' characters. The order in which files are processed is determined by the operating system.
Dbxreport and dbxstat now support databases with a dbalias (alternative base name for the database files).
Restriction digest applications occasionally reported more than one identical match where several enzymes recognize the same target site. The testing of isoschizomers has been improved to catch these cases. In practice most runs are with only a few named enzymes with different sites.
Fragment lengths in restrict are now included as extra columns in the output, giving the fragments to the 5' and 3' side of each cut in the forward strand. Note that the output includes all possible cut sites, though it may be impossible for a double digest to physically cut at each of two closely spaced sites.
The -name option of restrict had no effect on report output and has been removed.
Cachedbfetch corrects bad EDAM references to EDAM_syntax: instead of EDAM_format: in the definitions returned by EBI's dbfetch and wsdbfetch servers.
Sequence identifiers now have characters removed that may confuse output file generation. Any forward slash or backslash (which could be interpreted as part of a host system path), comma, semicolon or colon is changed to an underscore.
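The identifier cleaning described above can be sketched in a few lines. This is only an illustration in Python, not the EMBOSS library code; the function name sanitize_identifier is hypothetical, and the character set is taken directly from the entry.

```python
import re

# Characters listed in the change log entry: "/", "\", ",", ";" and ":".
_UNSAFE = re.compile(r"[/\\,;:]")


def sanitize_identifier(identifier: str) -> str:
    """Replace each unsafe character with an underscore (hypothetical helper)."""
    return _UNSAFE.sub("_", identifier)


if __name__ == "__main__":
    # A made-up identifier containing every character that is replaced.
    print(sanitize_identifier("a/b\\c,d;e:f"))
```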
Sequence input now warns for bad sequence characters when the format is known. When auto-detecting the format the warnings are turned off so that failed formats can silently be ignored, but when reading further sequences from the same input file warnings are enabled. They can be disabled for individual format parsers by passing zero as the format code to seqAppendWarn.
New nibble (nib) format stores sequence data in half-byte binary compressed format. The format is available for input and output, but as a binary format can only be read from a file, not from a pipe.
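To show what "half-byte" storage means, here is a minimal Python sketch that packs two bases into each byte. The 4-bit code table is an assumption made for illustration, and the sketch omits any file header; the real nib format defines its own codes and layout.

```python
# Hypothetical 4-bit codes, chosen for illustration only.
CODES = {"T": 0, "C": 1, "A": 2, "G": 3, "N": 4}
BASES = {value: key for key, value in CODES.items()}


def pack_nibbles(seq: str) -> bytes:
    """Pack two bases per byte, with the first base in the high nibble."""
    out = bytearray()
    for i in range(0, len(seq), 2):
        high = CODES[seq[i]]
        # Pad an odd-length sequence with code 0 in the final low nibble.
        low = CODES[seq[i + 1]] if i + 1 < len(seq) else 0
        out.append((high << 4) | low)
    return bytes(out)


def unpack_nibbles(data: bytes, length: int) -> str:
    """Reverse pack_nibbles; 'length' discards any padding nibble."""
    bases = []
    for byte in data:
        bases.append(BASES[byte >> 4])
        bases.append(BASES[byte & 0x0F])
    return "".join(bases[:length])
```

The sequence length has to be stored separately because an odd-length sequence leaves one unused nibble. A binary layout like this is also why the format is read from a file rather than a text pipe.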
New GDE format for sequence input and output - a simple format with a #id prefix.
Support added for SwissProt OH (viral host) records.
New sequence input associated qualifier (available for all sequence inputs) -squick reads only the id, accession, description and sequence, saving unnecessary parsing of more complex input formats such as swissprot, embl and genbank.
String parsing objects are now reused rather than deleted to save memory reallocation in parsing input streams with a large number of entries. Input source code now uses reusable token objects cleared only when the program exits.
Acdpretty now correctly preserves in-line comments in ACD files.
Efficiency improvements in matching sets of characters in strings, especially in functions used for each entry in a large set of input sequences.
New applications xmlget and xmltext read XML data, for example from dbfetch:embl which offers emblxml format. Output can be as input or in reformatted versions.
QA tests of EMBASSY applications look in a test/data directory in the EMBASSY package as an alternative place for data files prefixed by TESTDATA:
Clustal omega data types added to knowntypes.standard file.
Ranges can use a syntax of start+len or start,+len to give the length rather than the end position. The end is calculated from the start and length and used internally. This syntax allows a closer fit to the command line of primer3_core in eprimer32 where ranges in the native application are always specified as start and length.
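A sketch of the start and length conversion, written in Python for illustration. It assumes 1-based inclusive positions, so the end is start + length - 1; the entry above does not state the exact arithmetic, and parse_start_length is a hypothetical name.

```python
import re

# Accepts "start+len" and "start,+len"; whitespace handling is an assumption.
_RANGE = re.compile(r"^\s*(\d+)\s*(?:,\s*)?\+\s*(\d+)\s*$")


def parse_start_length(text: str) -> tuple:
    """Return (start, end) for a start+length range string."""
    match = _RANGE.match(text)
    if match is None:
        raise ValueError("not a start+length range: %r" % text)
    start, length = int(match.group(1)), int(match.group(2))
    # Assumed convention: 1-based inclusive coordinates.
    return start, start + length - 1
```

Under that assumption, "100+20" and "100,+20" both describe positions 100 to 119.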
List file inputs now report an error if any text follows the first token on a line, unless it is a comment following a '#' character. Previous versions treated any remaining text as a comment and silently ignored it.
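The stricter list file rule can be sketched as follows. This Python fragment is an illustration of the behaviour described above, not the EMBOSS parser; it assumes that '#' never appears inside a query, and parse_listfile_line is a hypothetical name.

```python
def parse_listfile_line(line: str):
    """Return the single token on a line, or None for blank or comment-only lines."""
    # Everything after '#' is treated as a comment (assumption: no '#' in queries).
    content = line.split("#", 1)[0].strip()
    if not content:
        return None
    tokens = content.split()
    if len(tokens) > 1:
        # Earlier releases would have silently ignored the extra text.
        raise ValueError("unexpected text after first token: %r" % line)
    return tokens[0]
```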
New sequence format iguspto supports a multi-line IG format used by the US Patent Office. The multi-line descriptions are preserved only if EMBOSS reads and writes in this format. We can add the capability to any other multi-line input format where the original description lines should be preserved. Other formats treat descriptions as a single record to be wrapped where there is a maximum record length (e.g. in EMBL format).
Programs dreg and preg now only report sequences where a pattern match was found, which is the same behaviour as fuzznuc, fuzzpro and fuzztran.
New code added to handle xml datatype. Supports multiple named XML formats, using the DOM parsers to interpret data. Multiple XML input formats are supported, but on output, in the absence of a conversion method, the original XML is normally reported as plain "xml" format.
In database definitions, "example" is now a list attribute which can appear multiple times, allowing multiple example queries to be defined as separate records, with possible documentation following a '!' delimiter.
Showserver now scales the column headers better for long cache file names.
Showdb now displays the taxons, examples, and aliases defined for a database. Examples and aliases can be preceded by a count of the number of each. All columns are displayed with -full. Individual elements are controlled by -numtaxons, -taxscope (-taxonomy is a database type option), -examples, -numexamples, -aliases and -numaliases.
Showdb now displays a count of the number of fields in addition to the list of field names. New command line qualifier -numfields controls the display of the field count.
Showdb now displays all types defined for a database, separated by commas, but will only display a database once so that, for example, a protein and protfeatures database will appear in the protein database set first (if displayed). If only the features databases are displayed then it will appear with them.
Showdb no longer shows the access levels (id, query and all) by default for a database. New command line qualifier -access or the existing -full qualifier will show these values.
Entrez access was specific to sequence data retrieval. Entrez server retrievals can now automatically detect ID and accession fields and read text entries with textget where a text format is available.
Genbank-related protein formats Refseqp and Genpept are updated to process all record types. Genpept feature handling is updated to correct the handling of multiple locations by using subfeatures.
GenBank and Refseq formats now handle the full set of record types including common species names, reference details and comments.
Dbtell -full reports any alias names for a database after the definition.
Dbtell recognizes alias names for a database, reporting the master database definition and a comment describing where the alias is defined.
Dbtell -server reports the database definition for a server. All attributes are reported in the database definition, whether defined for the database or at the server level.
Servertell -full now reports the definitions of all databases for the server, including all aliases defined in the server definition file. Without -full an extra comment line in the output suggests running with -full for more detailed information.
Restrict output now sorts by the position closest to the start for matches on the reverse strand (for an asymmetric target site). This sort change can produce additional matches in the output of restover.
Embossversion is now set to fail with a message if the update information URL is unreachable.
HTTP and FTP error messages were simplified and blank lines removed.
The valgrind.pl script has a new qualifier -debug which runs the test with -debug on the command line.
Needle, needleall and water now fail with a "die" message if there is insufficient virtual memory to calculate the alignment between two long sequences.
Indexing with the dbx applications miscalculated the secondary page capacity when the secondary page size is less than the primary page size.
Ranges in a file can use a dash as a delimiter for the start and end positions in addition to white space.
For all data types, format names can be replaced by EDAM format term identifiers, for example 1927 for "embl". The format terms are defined in the source code. We will need to define aliases or use more complex queries if a format splits into a hierarchy but this is unlikely in most cases.
On FreeBSD systems embossversion source code has quotes corrected on the line that reports FreeBSDLF is defined.
On Windows (mEMBOSS) the user home directory is checked for the .embossrc file and .embossdata directory, using emboss.default for settings defined for all users.
Database definitions with multiple types and formats now check that there is at least one valid format defined for each type of data.
The qatest.pl script handles references to the user's home directory on Windows. "~/" is replaced with the user's home directory, with the full path or filename enclosed in quotes.
The qatest.pl script has a new qualifier -debug which runs the test with -debug on the command line. For ACD utility tests the application name is taken from the first command line parameter and will not match the debug file so these will give an error for an unknown .dbg file. For all other tests this is a simple way to obtain debug output for a problematic test result.
EMBOSS supports SOAP protocol access using the Apache axis2c library. We use version 1.6.0 for testing. Installation can be tricky on some systems. We are happy to help anyone who finds problems. A copy of the library is included in the initial 6.5.0.0 mEMBOSS build.
Date parsing for EMBL, GenBank, SwissProt, Refseq and related formats has been made more robust.
New application embossupdate checks for the availability of an updated EMBOSS distribution or patches from the EMBOSS website and FTP server. Embossupdate can be run at the end of a successful installation or reinstallation. We hope this will help our users to keep their versions up to date more easily.
Feature data can be read from PIR and GCG formatted databases.
EDAM is updated to release 1.1. EDAM is used to define EMBOSS and EMBASSY applications, to describe EMBOSS defined databases and entries in the DRCAT data resource catalogue. This is a prerelease from the EDAM team to ensure EMBOSS has the most recent set of terms.
Lists and tables now support very large numbers, requiring long integers (datatype ajulong) to represent the return values from ajListGetLength, ajTableGetLength and ajTableGetSize. Further extensions are planned in future releases.
Directory inputs now interpret ~/ or ~user/ in the user response in the same way as file inputs.
Application embossversion -full now reports the versions of all libraries, and all configuration settings used to compile EMBOSS, plus the sizes of standard data types.
Dbxfasta has a new format "idsv" which finds sequence version values if the accession number has a .number suffix.
Dbxflat creates a sequence version for UniProt entries using the accession number and the sequence version from the DT records.
Dbx indexing stores secondary reference file positions only if the database has more than one data file per entry. The entries file records the number of files in the database and can if needed store more than one reference file. Identifiers indexes can store more entries per page for databases with one file (embl, uniprot), but support reference files for gcg, pir and taxonomy indexing.
Dbx indexing supports separate caches for primary and secondary pages. Larger caches can reduce the number of physical reads and writes at the cost of a small increase in CPU time. The organism and description indexes for large databases can have terms that appear in a very large number of entries (e.g. 'protein' in UniProt or 'bacteria' in EMBL). Secondary cache sizes up to 100k can be used to try to reduce the physical page rewrites needed as these indexes grow.
Dbx indexing supports a smaller size for secondary index pages. These hold the lists of entry ids for indexed strings, and the file offsets for non-unique identifiers (e.g. secondary accession numbers). The environment variable EMBOSS_SECPAGESIZE defaults to 512, a quarter of the EMBOSS_PAGESIZE value of 2048. Resource definitions can specify field-specific secondary page sizes using, for example, accsecpagesize: "256".
Dbx indexing applications (dbxflat, dbxfasta, dbxgcg, dbxedam, dbxobo, dbxresource, dbxtax) secondary index files (e.g. keyword, taxonomy and description indexes) are more compact. The entry ids for each keyword are stored as a simple list unless more than one index page is needed. As most indexed tokens are in only a few entries this saves many pages while the index is being built. The compressed index size is also smaller.
Dbxflat, dbxfasta and dbxgcg now report index terms that exceed the maximum length (attributes idlen, acclen, deslen, orglen, keylen, svlen, gilen). Each term beyond the current maximum is reported. When the run is completed, the longest term length for each index field is reported so that excessively large values can be reduced.
Dbxflat, dbxfasta and dbxgcg have improved memory efficiency on large indexing runs. Many more internal data structures are reused in the parsers.
Window length options are renamed to -window consistently across all EMBOSS applications. The change applies to pepwindow and pepwindowall.
Multiple inputs to einverted gave inconsistent results as two internal variables were not reset for each new sequence.
Resource definitions for uniprot (swissresource) and embl (emblresource) are updated to allow the maximum size for database index keys. If the database contains longer values in future they will be truncated and the maximum size found by the parser will be reported by dbxflat.
New resource definitions chebiresource and sworesource are provided in emboss.standard to index ontologies with exceptionally large index keys.
Ontologies CHEBI, ECO, GO, PW, RO, SO are updated.
Ontology SWO is added. This is the software ontology, in its OBO format. Some identifiers are really URLs.
Sequence and other databases with an organism ('org') or taxonomy ('tax') index can restrict retrieval to one or more indexed organism names or any other indexed level in the taxonomy. Examples include EMBL or UniProt whether indexed locally with dbxflat or accessed through the EBI's SRS server as srs:embl or srs:uniprot. A new database attribute 'organisms' can be used to define one or more organisms or taxonomy levels to restrict data retrieval from the master index of the complete file. A value using EMBOSS query syntax of "rattus|mus" will allow data from both genera to be retrieved. Values can also be separated by tabs, commas ',' or semicolons ';'. As organisms can include spaces we chose not to allow space as a delimiter. The organisms attribute is implemented for method "emboss" and "srswww" to allow remote retrieval. We can implement organisms for other access methods if there is a demand from the user community.
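The delimiter rules for the organisms attribute can be sketched like this. The Python below is an illustration only; split_organisms is a hypothetical name, and the trimming of spaces around each name is an assumption.

```python
import re

# Delimiters named in the entry: '|', tab, ',' and ';'. Space is deliberately
# excluded because organism names can contain spaces.
_DELIMITERS = re.compile(r"[|\t,;]")


def split_organisms(value: str) -> list:
    """Split an organisms attribute value into individual names."""
    names = (name.strip() for name in _DELIMITERS.split(value))
    return [name for name in names if name]
```

With these rules "rattus|mus" gives two genera, while "Homo sapiens;Mus musculus" keeps each two-word species name intact.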
Ontology databases can combine more than one branch of an ontology in a single file. Examples include the Gene Ontology (GO) with namespaces for cellular_component, molecular_function and biological_process, and EDAM with data, format, identifier, operation and topic. A new database attribute 'namespace' can be used to define one or more namespaces to restrict data retrieval from the master index of the complete file. This is tricky for EDAM data which is in the data or identifier namespaces. A value using EMBOSS query syntax of "data|identifier" or spaced with "data identifier" will allow data from both namespaces to be retrieved. The namespace attribute is implemented for method "emboss" (how the ontologies are indexed in the distribution) and "srswww" to allow remote retrieval. We can implement namespace for other access methods if there is a demand from the user community.
EDAM release 1.0 is included. Major changes were needed to EMBOSS internals as the identifiers are all changed (different term ID number and different prefix). ACD files and the DRCAT data resource catalogue are updated with the nearest equivalent terms from EDAM 1.0.
Assembly data is now loaded a few records at a time using a new "loader" object. This allows very large files to be processed in chunks.
Variation data is now loaded a few records at a time using a new "loader" object. This allows very large files to be processed in chunks.
Support for BioPerl/Open-Bio OBDA flatfile indexes is included as database access method 'obda'. The indexing in BioPerl 1.6 is broken for EMBL as the semicolon is not removed from identifiers. The secondary index files have duplicated records. Both problems should be fixed in a future BioPerl release. Note also that OBDA indexing parses only the primary accession number so that other accessions are not retrievable from OBDA index files.
EMBL entries with a single (source) feature could ignore the feature.
Output files for fuzznuc, fuzzpro, fuzztran, dreg and preg included the pattern name and the pattern string in the last release. The output format is changed to remove the space between the pattern name and string so that parsers see the expected number of space-delimited fields in the output.
The query language parser has been rewritten to handle the new -iquery and -ioffset qualifiers. Badly formed queries may now produce different error messages.
Any input type that uses queries, with the exception of URL inputs, can use two new associated qualifiers. -ioffset is the initial non-zero offset when reading from a file or a URL. -iquery is the query field which can be applied to an FTP or HTTP URL or to any query in a list file. These names also apply to sequence and feature input where other qualifiers begin with 's' and 'f' respectively.
FTP and HTTP URLs can now be used directly as input queries for all data types in place of file names. EMBOSS automatically detects the ftp:// or http:// prefix and uses the appropriate protocol. Any query or offset is ignored as there is no way to distinguish these from a genuine part of the URL.
Patterns for fuzznuc, fuzzpro and fuzztran can include escaped codes to skip the expansion of ambiguity codes and look for them explicitly in the input. A backslash (shells may need two) before the code specifies an exact match, for example \S will only match S in the input.
Patterns for fuzznuc with ambiguity codes are now expanded to include the ambiguity code (and any overlapped ambiguity codes). For example, S matches [GCS] and B (not A) matches [TGCBSYK].
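The expansion rule can be reproduced from the standard IUPAC nucleotide codes: a pattern code matches every input code whose set of bases is contained in its own. The Python sketch below is an illustration of that rule, not the fuzznuc source, and expand_code is a hypothetical name.

```python
# Standard IUPAC nucleotide ambiguity codes and the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_code(code: str) -> set:
    """Return every code whose bases are a subset of those covered by 'code'."""
    target = set(IUPAC[code])
    return {c for c, bases in IUPAC.items() if set(bases) <= target}


if __name__ == "__main__":
    print("".join(sorted(expand_code("S"))))
    print("".join(sorted(expand_code("B"))))
```

This gives the same sets as the two examples in the entry: S expands to G, C and S, and B expands to T, G, C, B, S, Y and K.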
A new AJAX source file ajtagval.c handles general tag-value pairs of strings which have uses beyond feature internals.
Pepwheel can plot up to 5 sets of residues, with a total of "steps" at each level. Leucine zipper plots with a step of 7 and 2 turns required more residues to be visible. The updated pepwheel rescales the size of the inner wheel to allow more residues to be displayed.
Sequence and assembly reading in BAM format now always fails if no match was found in the first pass, as attempting to read again could loop with the same result when the file is rewound. Rereading is intended for text formats such as FASTA where the next entry may match.
Header files in AJAX and NUCLEUS have been cleaned to remove redundant references. A new include file ajlib.h includes the core set of ajdefine, ajarch, ajmem, ajmess, ajfmt and ajstr which were almost universally included. Applications are expected to use emboss.h as their only include, but references to ajax.h and emboss.h in the libraries are now all replaced with the minimally required set of include files.
The server.entrez file has been updated using a script serverentrez.pl which queries Eutils to obtain a list of database names and fields. An internal array is used to define the datatypes and formats for each database as these are defined only in a series of HTML tables in other pages.
Reading from the NCBI Entrez server failed. The cause was trimming newlines from a reference-counted string where the data returned has CR-LF format but only one character was removed.
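The kind of fix involved can be shown with a small Python sketch. It is not the AJAX string code, and trim_line_ending is a hypothetical name; the point is only that a CR-LF ending needs two characters removed.

```python
def trim_line_ending(line: str) -> str:
    """Remove one trailing line ending, whether CR-LF, LF or CR."""
    if line.endswith("\r\n"):
        return line[:-2]  # both characters, not just the final one
    if line.endswith("\n") or line.endswith("\r"):
        return line[:-1]
    return line
```

Removing only the last character of a CR-LF line leaves a stray carriage return at the end of the data, which is the kind of residue that can break later parsing.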
New xygraph output device support for datafile formats. "bedgraph" outputs in BedGraph format. "wig" outputs in Wiggle format.
The "sequence" attribute is implemented for xygraph outputs. If set true, the X-axis label defaults to the name of the first input and the source name used in datafile outputs is also the name of the first input.
Dottup and dotmatcher now have the first sequence on the X axis and the second on the Y axis. This follows standards for datafile output of graphical data which default to the X axis relating to the first input sequence.
Dbx index files from earlier releases defaulted to "secondary" indexes. The test for an index with no "Type" parameter defined now picks up the standard Identifier indexed fields (id, acc, sv and gi) correctly. The files were identified by field name, but the test was using the file extension.
Fuzznuc, fuzzpro, fuzztran, dreg and preg when searching with a regular expression found only the largest possible match at each start position. A new function in recent releases of the PCRE regular expression library supports searching for all matches using function ajRegExecallC instead of ajRegExecC. These applications can now find all overlapping matches to a pattern using a regular expression.
The PCRE library is updated to include the pcre_dfa_exec function. This is called by ajRegExecall and ajRegExecallC. The regular expression can be compiled as usual. The new calls set an internal value to the number of matches found, retrievable by ajRegGetMatches. Offsets (ajRegOffsetI) and substrings (ajRegSubI) return these matches, starting at zero which is the longest match (the same as in ajRegExec). Any shorter matches with the same start are stored in place of bracketed substrings.
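To illustrate what "all matches at a start position" means, here is a Python sketch that emulates the idea with the standard re module. Python has no equivalent of pcre_dfa_exec, so the sketch tests every possible end position instead; it is far slower than a DFA search and is not how the AJAX functions work internally. The name all_matches_at is hypothetical.

```python
import re


def all_matches_at(pattern: str, text: str, start: int) -> list:
    """Return every match of 'pattern' beginning at 'start', longest first."""
    regex = re.compile(pattern)
    matches = []
    # Try each candidate end position, from the end of the text downwards.
    for end in range(len(text), start - 1, -1):
        if regex.fullmatch(text, start, end):
            matches.append(text[start:end])
    return matches


if __name__ == "__main__":
    # An ordinary search reports only the longest match, "AAA".
    print(all_matches_at("A+", "AAAT", 0))
```

As in the description above, index zero of the result is the longest match and the shorter matches with the same start follow it.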
Prettyplot options are changed to remove dependencies on other options. Option -plurality (which depended on the sequence alignment weight or the number of input sequences) is now -ratio with a default of 0.5. This is exactly equivalent to the default -plurality value or half the total weight. Option -resbreak is replaced by -blocksperline with a default value of 1. This has the same default output as the -resbreak option which defaulted to the -residuesperline value.
All header files now have an @include comment block which includes the LGPL licence and RCS tags. Header files are commented in consistent sections. The C++ compile extern wrapper for C declarations is now a macro to avoid indentation issues in emacs and other editors.
All obsolete functions are moved to the end of source files and wrapped in an #ifdef AJ__COMPILE_DEPRECATED block. The configure option --enable-buildalldeprecated includes these functions in compilation. Functions described in the 6.2.0 books are included in a similar AJ__COMPILE_DEPRECATED_BOOK block and built with the --enable-buildbookdeprecated configure option.
Diffseq produced incorrect results when reporting an insertion in the second sequence. The error was introduced in release 6.0.0. It is fixed by defining a "between" location for the insert site in the first sequence, and by adding support for "between" features to diffseq and other report formats. A new constructor ajFeatNewBetween with one position makes creating such features easier.
New function ajListDrop removes a node from a list by searching for its address.
Test data includes a new EMBL data file syn.dat containing a circular sequence.
GFF3 input combines features with the same ID under a generated parent so that features can be linked as subfeatures and sorted together. These features are identified by the Flags attribute and excluded from GFF3 output.
GFF3 output is required to use different feature types for parent and child. This is broken by the annotated parent feature we need to represent EMBL/GenBank/DDBJ joins. For these, the parent has a new type of biological_region with a new featflag type=CDS (for example) so we can restore the correct internal representation when reading the GFF3 file.
A new sequence associated qualifier -scircular defines a sequence input as a circular molecule where this is not defined in the input format, for example EMBL/Genbank and GFF3 have the information but FASTA input does not. For feature input there is a new -fcircular qualifier. Any circular definition in a sequence format overrides this qualifier. Sequences with features are set circular if the feature table input is defined as circular.
GFF3 format has been corrected using the online GFF3 validator. Protein feature type names are corrected to use the current SO term name. Tags are converted to lower case on output and back to standard case on input, for example /EC_number in EMBL format, as GFF tags must start in lower case.
In GFF3 protein features now always use '.' for the strand. Previous releases could also write '+'. Both are acceptable as input.
GFF3 and GFF2 scores now use a general floating point format to write 4 significant figures (rather than 3 decimal places) to cope with very large and very small score values. Trailing zeroes after the decimal point are omitted in this format. A score of zero is written as a dot (missing value).
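The score formatting described above behaves like a C "%.4g" conversion with a special case for zero. The Python sketch below assumes that format specifier; the entry does not name the exact one used in the source, and format_gff_score is a hypothetical name.

```python
def format_gff_score(score: float) -> str:
    """Format a score to 4 significant figures, writing zero as a dot."""
    if score == 0:
        return "."  # GFF missing-value marker
    # The general format drops trailing zeroes and switches to exponent
    # notation for very large or very small values.
    return "%.4g" % score


if __name__ == "__main__":
    for value in (0.0, 12.5, 0.25, 123456.0):
        print(value, "->", format_gff_score(value))
```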
Sequence queries can use two alternative syntaxes for sequence ranges. Appending :start:end allows a syntax similar to DAS queries. Appending :start..end allows a syntax similar to EMBL/GenBank locations in other entries. Both can be followed by :r to reverse the sequence region.
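A sketch of how the two range suffixes might be separated from the rest of a query. This Python fragment is an illustration, not the EMBOSS query parser; the database and entry names in the example are hypothetical, and it assumes the reverse flag is written as a lower case :r.

```python
import re

# Matches ":start:end" or ":start..end", optionally followed by ":r".
_SUFFIX = re.compile(r":(\d+)(?::|\.\.)(\d+)(:r)?$")


def split_range_suffix(query: str):
    """Return (base_query, (start, end, reverse)) or (query, None)."""
    match = _SUFFIX.search(query)
    if match is None:
        return query, None
    base = query[:match.start()]
    start, end = int(match.group(1)), int(match.group(2))
    return base, (start, end, match.group(3) is not None)


if __name__ == "__main__":
    # "mydb:entry1" is a made-up database query.
    print(split_range_suffix("mydb:entry1:10..50:r"))
```

A real parser must also cope with entry names that are purely numeric, which this simple pattern cannot distinguish from a range.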
Sequences and reference sequences can be read from EMBL CON division entries by using the same database with an ACC (accession number) index to read the sequence fragments defined in the CO record(s).
New code added to handle reference sequences in ajrefseq* source files. The AjPRefseq object will hold large reference sequence data in managed memory buffers.
Database definitions can use a new attribute "special" to give a name=value definition for any attribute specific to one access method. The first instances are SpeciesIdentifier for ensemblgenomes databases, and tags for processing assembled entries in CON (constructed) entries in EMBL. ConDatabase is the database name used, ConField is the index field. By default CON entries use the ACC field of the same database.
Standardized all licensing references in the libraries to GNU Lesser GPL version 2.1. Added CVS keywords to record the CVS file version, and the date and user of the latest commit.
Microbial genomes in ensemblgenomes have an enumerated species code which must be included in a data retrieval request. The codes are temporarily added to the comment attribute of the databases in the server cache file. This will be replaced by a more complete solution in the next release.
The DRCAT.dat file has a new set of lines to handle Nucleic Acids Research classifications. A new NARCat line code is now separately parsed by dbxresource into the NAR category name and the URI.
Long tag values in GFF3 format could exceed limits in the regular expression. This is fixed by first testing for and replacing escaped quotes and then using a simpler expression to extract quoted string values.
When reading ranges from a file the strings were overwritten by the parser.
Application tcode results disagreed with the original publication. The calculation parameters have been corrected.
EDAM.obo is updated. 28 terms were added. Descriptions were updated and names changed.
Short descriptions of EMBOSS and EMBASSY applications have been updated to use consistent terminology and grammar rules.
Dbxflat failed to parse the organism ('org') field of a GenBank entry when another secondary field (keyword or description) was also parsed in the same run.
Dbxflat and dbiflat now use a separate parser for SwissProt format data files. Previous releases used the EMBL parser which failed to identify the first word in the specially formatted SwissProt description records. The change only affects the 'des' index field.
Reading ABI format failed to read the sample name field and machine name. The sample name is now correctly parsed. The sample name is used by EMBOSS as the sequence identifier.
Formats specified on the command line were ignored by database queries. Ignoring them was correct in previous releases, where only one format was permitted for a database, but from 6.4.0 a database may have multiple possible formats and the command line format must be honoured. Any format defined elsewhere on the command line is now used if there is no format in the query string.
ACD files are stricter in checking ambiguous qualifiers. Options that are also a short form of another qualifier now generate warnings. These can be turned off with the application attribute wrapper: "Y" where a third party command line is wrapped.
Showfeat had an option -type which was ambiguous. Changed the options so those with a match option (-typematch) have a show equivalent -typeshow to display the column.
Emma had options -dend and -slow which were short forms of other qualifiers. They are renamed -dendreuse and -slowalign. The old qualifier name will now give an "ambiguous qualifier" error message and report the new name.
Eprimer3 and eprimer32 had options -otm and -osize which were short forms of other qualifiers, and could cause confusion between optimum and oligo values. They are renamed -opttm and -optsize.
Helixturnhelix had an advanced option -sd which was a short form of sequence qualifier -sdbname. It is renamed to -sdvalue.
Prettyplot had an option -box which was a short form of other qualifiers. It has been renamed -doboxes to match the related qualifier -docolour.
Showserver had an option -server which was a short form of -serverversion (itself named to avoid a clash with -version). This option is now renamed -servername.
Supermatcher and wordfinder had an option -errorfile which was a longer form of the standard qualifier -error which can suppress the reporting of error messages. The -errorfile qualifiers are renamed -errfile.
Revseq added 'Reversed:' to the sequence description. For use cases where the original sequence description is preferred (e.g. FASTQ format formatted descriptions) a new -notag option retains the original description.
Cirdna prints text inside solid blocks invisibly. When text was printed outside the block, the text scaling was too small. The text scale is now adjusted for the radius and sequence length so that labels should be readable outside the box.
Fuzznuc, fuzzpro and fuzztran using a pattern file ignored the command line -mismatch qualifier for the first pattern. The default mismatch is now set to this value at the start of the pattern matching loop in the library.
qatest.pl which runs the QA tests now checks for a qatest.dat file in the EMBOSS source directory and additional qatest.dat files in the test subdirectory for all EMBASSY packages found under the source embassy/ directory. By providing individual qatest.dat files for each package we can simplify testing for a core distribution. Some of the older EMBASSY packages derived from domainatrix have cross-dependencies where one test uses the output of an application from another package. New AX and AY lines define foreign tests which are executed even where a single EMBASSY package has been specified with the -embassy=package qualifier on the command line.
DBXFLAT can index FASTQ format short read sequence files, allowing individual sequences to be rapidly retrieved by name.
Genpept format has changed since we last tested it. The LOCUS line is simpler. EMBOSS now supports GenPept as documented and distributed by NCBI.
Sequence in SAM format ignores the reference sequence name. Previous releases saved it as the accession number, but this is inappropriate as it is then reported as the identifier in EMBL format.
The -help output (and documentation) for align and report output types now includes the default format if defined in the ACD file.
New code added to handle variation data in ajvar* source files. The AjPVar object will hold genetic variation data from the Ensembl API and from VCF input files.
New access methods for URLs have been added as ajurlread.c and for URL output methods as ajurlwrite.c - supporting collecting and reporting of URLs as output. URLs are saved as an array of strings, intended to be reported as a set of links to the underlying data.
Sequence format "raw" now only reads binary files, which means it cannot be used for piped data. The change was needed to avoid accepting binary data where a file has a NULL and then no newline, for example ABI data files where the initial 'ABIF' could be read as a valid sequence.
Application tcode failed to plot results for more than one sequence. It also reported a plplot error when reading random non-coding input. It also failed to report the threshold lines when they were outside the range of observed scores.
Four new functions combine tables where the keys and values are of the same types. In each case the tables are resized to the larger of the hash array sizes, and then at each hash array position all keys in both tables are compared. The functions differ only in the actions taken when a match is or is not found. ajTableMergeAnd keeps all keys that are in both tables. ajTableMergeEor is the inverse, keeping only keys that are in exactly one table. ajTableMergeNot removes keys that are also in the second table. ajTableMergeOr adds keys from the second table that do not match. All remaining keys and values are deleted using the table's built-in destructor functions.
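The key-selection rules of the four merge functions can be sketched with plain Python dictionaries. This illustrates the set logic only, not the AJAX implementation: the function names are hypothetical stand-ins, and the sketch assumes that values for matching keys are taken from the first table, which the entry above does not state.

```python
def merge_and(first, second):
    """Keep keys present in both tables (values assumed to come from the first)."""
    return {k: v for k, v in first.items() if k in second}

def merge_eor(first, second):
    """Keep keys present in exactly one of the two tables."""
    out = {k: v for k, v in first.items() if k not in second}
    out.update({k: v for k, v in second.items() if k not in first})
    return out

def merge_not(first, second):
    """Remove from the first table any key that is also in the second."""
    return {k: v for k, v in first.items() if k not in second}

def merge_or(first, second):
    """Add keys from the second table that do not match a key in the first."""
    out = dict(first)
    for k, v in second.items():
        out.setdefault(k, v)
    return out

first = {"A": 1, "B": 2, "C": 3}
second = {"B": 20, "C": 30, "D": 40}
print(merge_and(first, second))
print(merge_eor(first, second))
print(merge_not(first, second))
print(merge_or(first, second))
```

Unlike the library functions, these helpers return new dictionaries and leave their inputs untouched; Python's garbage collector stands in for the destructor calls on discarded keys and values.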
Some data resource catalogue applications failed when run with the -debug option. Their debug calls have been updated.
New application dbtell reports the attributes for a database.
All messages written to the user are also logged to the debug file to help locate where they are generated when debugging.
Applications showfeat, extractfeat and coderet are updated to follow the new features /subfeatures data structures.
When using a simple numeric database identifier, the SV field is only searched if it is defined.
Access to local SRS databases created an invalid command line for getz with a stray '+' character needed only in the web version.
Nexus format input can now handle a missing taxlabels block by using the matrix block to read sequence names.
GFF3 tag names are automatically converted to lower case unless they match a known GFF3 "special" tag name.
GFF3 format has been rewritten to comply strictly with the GFF3 standard on the sequence ontology website. Characters are now escaped in tag values. The 'featflag' tag has been changed to convert the hex value into a readable list of flags, with some flags now inferred from the content of the GFF line. The GFF3 special tags (all starting with an upper-case letter) are now stored separately. The ID and Parent tags are used in post-processing to build subfeatures which are stored under the feature with an ID matching their first Parent tag.
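As a rough illustration of escaping in tag values, the sketch below percent-encodes the characters that the GFF3 specification reserves in the attributes column (semicolon, equals, ampersand, comma), together with tab, newline, carriage return, percent and other control characters. The function name is hypothetical, and the exact character set escaped by EMBOSS is an assumption based on the specification rather than on the EMBOSS source.

```python
# Characters with a reserved meaning in GFF3 attribute values, plus '%' itself.
RESERVED = set(";=&,%\t\n\r")

def escape_gff3_value(value):
    """Percent-encode reserved and control characters in a GFF3 tag value."""
    out = []
    for ch in value:
        code = ord(ch)
        if ch in RESERVED or code < 0x20 or code == 0x7F:
            out.append("%{:02X}".format(code))
        else:
            out.append(ch)  # everything else, including spaces, is kept as-is
    return "".join(out)

print(escape_gff3_value("kinase; EC=2.7.1.1, 50% active"))
```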
GFF3 input requires the optional EMBOSS type comment to identify a protein GFF3 file as there is currently no safe way to distinguish protein from nucleotide features using only the standard GFF3 format.
GFF3 sequence format failed to read files with additional ## comment records after the header block. These comments are now ignored.
Feature objects have been extended. A feature may now include a list of subfeatures. This is intended to allow exons to be stored under the feature to which they belong. With this new structure, sorting feature tables becomes easy as there is no need to match group tags and sort by ID. Features simply sort by their main (parent) feature, with the other subfeatures (exons) unseen by the sort algorithm.
Application restrict crashed when the enzyme list was empty. It reported invalid enzyme names, but not 'no enzyme name given'.
Reference-counted lists are enabled with the constructor ajListNewRef creating a reference-counted copy. Lists are only deleted when the reference count falls to zero.
Reference-counted tables are enabled with the constructor ajTableNewRef creating a reference-counted copy. Tables are only deleted when the reference count falls to zero.
Table code has been rewritten to automatically delete keys unless the table is created with a Const version of the constructor. All table constructors are renamed, with the older names retained as "deprecated" functions which do not delete keys or values. All EMBOSS code has been changed to use the new function names.
New functions ajTableMatch, ajTableMatchC and ajTableMatchS test whether a key is present in a table. They can be used where ajTableFetch is inadequate because the value may be NULL. Some code used ajTableFetchKey but this is intended only for case-insensitive keys.
Tables (AjPTable) have defined functions to hash and compare keys. Two new functions can be defined to delete keys and values. By default these are NULL and no keys or values are deleted. The functions can be ajMemFree to simply free memory, or more complex object destructors. As these require a void** argument (all keys and values are void* internally) wrappers are needed around object destructors. We recommend appending 'Void' to the standard destructor name and casting the void** argument to pass to the object-specific destructor.
Tables (AjPTable) can be resized using the ajTableResizeLen function. When adding to a table with ajTablePut the table is automatically resized when the number of entries exceeds an average of 8 per bucket.
Function ajMemFree now accepts a void** argument and sets the pointer to zero after freeing the memory. All EMBOSS code calls this through the AJFREE macro which is now safer to use as the pointer appears only once in the generated code.
Application digest conflicted with the name of a utility on some systems. It has been renamed to pepdigest.
In the emboss.standard and emboss.default files certain attributes can appear more than once if defined as type "ATTR_LIST" in the ajnam.c source file. These include a new attribute 'field:' defined once for each database query field, superseding the 'fields:' list of field names. The 'field:' attribute has a list of field names, with the first being the name preferred by EMBOSS and others acceptable on the command line. A '!' delimiter marks the end of the field names and the start of a free text description. This style of description is also allowed for other attributes, including 'taxon:' and the 'edam*:' attributes. The syntax is taken from the metadata in OBO format.
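A minimal sketch of how such an attribute value might be split, assuming the field names are separated by whitespace and the description is everything after the first '!'. The function name and the sample values are hypothetical; the real parsing is done in ajnam.c.

```python
def parse_described_list(value):
    """Split 'name1 name2 ! free text' into (names, description or None)."""
    names_part, sep, description = value.partition("!")
    names = names_part.split()
    return names, (description.strip() if sep else None)

# Hypothetical value for a 'field:' attribute.
names, description = parse_described_list("id identifier ! Entry identifier")
preferred = names[0]      # the name preferred by EMBOSS comes first
alternatives = names[1:]  # other names accepted on the command line
print(preferred, alternatives, description)
```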
Data retrieval using the HTTP protocol now checks for redirects in the header and replaces the file buffer with the results from the new URL. This allows EMBOSS to read outdated URLs for database access.
New trace functions ajTableFetchTrace and ajTablePutTrace help to debug adding new keys to a table.
New parsing function ajStrTokenNextParseDelimiters returns the delimiter string in addition to the token parsed from a string token handler.
Application einverted could report a bad alignment if the matched region reached the end of the search window. Matches which go beyond the search window are now ignored. This bug was reported with a very low threshold score and was unlikely to be noticed with the default settings.
Sequence format treecon failed if the only line of input started with a number. Failure to find a second record now simply returns false.
Tables can now use integer keys and values of four types - integer and long, signed and unsigned. The unsigned longs are used internally for emblcd index reading and for b+tree index creation.
Report output from pattern matching applications (fuzznuc, fuzzpro, fuzztran, dreg, preg) now includes the pattern as well as the pattern name in the '*pat' or 'Pattern_name' feature tag value.
New applications search the EDAM ontology by each of its query fields, with common options to restrict the results to one of the 7 EDAM namespaces. There are also new applications to look for EDAM terms with each of the 5 common relationships for EDAM data terms: has_input, has_output, is_identifier_of, is_format_of and is_source_of. The sixth relationship has_attribute is only used by the obsolete 'entity' namespace terms.
New application dbxresource indexes the data resource catalogue DRCAT.dat which is distributed with EMBOSS. Most fields in DRCAT are indexed. The EDAM and Taxon fields are used by other applications to search the EDAM and TAXON databases for terms which are in turn used to select DRCAT entries by taxon, data type, format, identifier and resource.
Any menu (list and selection ACD types) which allows all options to be selected now accepts "*" to select everything. This can be the default (e.g. for database index fields) or can be specified by the user with quotes to protect it from interpretation by the Unix shell.
Tokens indexed with the dbx* programs now have white space indexed as underscores. Any index files with spaces in the tokens need to be re-indexed. This applies to keyword and organism indexes.
New code added to handle short read assemblies in ajassem* source files. The AjPAssem object will hold large numbers of short reads in managed memory buffers.
New template for adding data types with specific formats for input and output and data access methods. These templates are stored in ajwxyz* source files with a script newdatatypes.pl to automatically create new, properly named, stub functions in the AJAX core and ajaxdb libraries.
Program nthseq now simply reports an error (not a fatal error) if too few sequences were read.
Feature input and output was in one large file. This has now been refactored with ajfeatdata.h for the data structures, ajfeatread.c for input formats, ajfeatwrite.c for output formats and remaining feature object handling code in ajfeat.c.
New access methods for text have been added as ajtextread.c and for text output methods as ajtextwrite.c - supporting text and (preserved) HTML and XML output. Text is saved as an array of strings, intended to be used as one per input record although storing the entire text in the first string is also possible.
Data queries have been made general. A new AjPQuery object handles queries for any datatype, storing a list of field names and queries, plus an operator (OR, AND, NOT, EOR, ELSE) for combining fields. Previous releases had a hard-coded search for "id or accession" which now uses the new query structure. Extensions to the query language will allow more complex combinations, and will allow any field to be defined for an external data resource (e.g. fields for an SRSWWW server).
All data reading access methods have been restructured. Methods that essentially return an open file with the pointer set to the start of an entry (which covers most of the original access methods) are moved to a new source file ajtextdb.c and use a new AjPTextin input object which is included within AjPSeqin for sequence input and AjPOboin for OBO term input. These functions are generalized for any input data in some text-based file format. Sequence access will first check for a text-based access method, and then for a sequence-specific method (e.g. ensembl). Other input datatypes can do the same. The code for OBO ontology terms will use the new text access methods. Code for access to other input data types (feature, alignment) will now be relatively easy to add. Text retrieval of data from a new list of data resources can also use these access methods.
Program einverted required at least one base between the halves of an inverted repeat. Blunt joins are now reported where previous versions reported a 2 base gap.
Error messages from database indexing now include the filename of the index file. This is useful when identifying the indexing operation where the problem occurred.
EMBOSS database index files are extended to mark numeric and string index pages. In previous releases all were marked as strings. Older index files remain valid for sequence retrieval, but not for the new dbxreport index analysis application.
New application dbxreport analyses the contents of an EMBOSS index, reporting the numbers of keys of various types, number of pages, and percent free space. It also checks that all pages in the index have been used and are linked to a higher page.
New application dbxedam is an extended version of dbxobo which also indexes EDAM-specific relationships between terms.
New application dbxobo indexes OBO format ontology files. Index fields are id, acc (alt_id records), name (name and synonym records), ns (namespace records), isa (is_a records pointing to the parent term) and des (def records).
EMBOSS database index files include an extra count value "fullcount" for the total number of words indexed. The "count" value is the number of unique terms (for example, words in descriptions or accession numbers).
EMBOSS database index files include an extra type value "Type" with the value "Identifier" for a simple primary identifier such as ID or accession, and "Secondary" for an index of secondary terms which points to the entry unique ID.
Database indexing application dbxfasta may corrupt index files with long words in the description index. Dbxfasta now checks the maximum word length, and as an added safeguard the indexing library code also checks and truncates any word longer than the maximum.
New application seqcount returns the number of sequences read. This simple application was requested on the EMBOSS mailing list to avoid complicated command line manipulations and unnecessary sequence output.
acdpretty now writes lines up to 75 characters wide. The width was restricted to 50 to allow space for in-line comments but this restricted the length of indented text too severely.
In emboss.default and the user's .embossrc file, variables are now resolved at read time, including the names of include files. This can simplify the configuration files for sites running more than one installation.
Patched: SAM format file entries with negative insert sizes are valid but were wrongly rejected.
Patched: BAM format misread the quality scores. An offset of 33 used to report values for debugging was incorrectly included in the stored values.
Configuration now uses autoheader and has less dependency on the libtool version.
seqret ensembl:human:ENST00000262160
seqret ensembl:human:ENST0000026216?
seqret ensembl:human:ENSE00001533831
showing that transcripts, translations and exons are retrievable and that partial queries are allowed. Example database definitions are given in the emboss.default.template file. Please read the note above those definitions regarding fair use of the public Ensembl servers.
'sql' is a new access method for networked SQL servers (MySQL or PostgreSQL). The server and database is described using the 'url' field. As for biomart (described below) the database definition must include definitions of new attributes 'sequence' (the sequence column) and 'identifier' (the column used in the query). Additional columns may be returned as description text if they are listed in the 'returns' attribute of the DB definition. An example definition is given in emboss.default.template.
tfextract has been updated to deal with multiple pattern lines and empty sequence lines.
Three automatic EMBOSS environment variables are added. EMBOSS_INSTALLDIRECTORY is the installation directory reported by embossversion -full, EMBOSS_BASEDIRECTORY is the base directory reported by embossversion -full, and EMBOSS_ROOTDIRECTORY is the root directory reported by embossversion -full. These are needed to allow the QA test database definitions to point to the test data for the current installation, and appear in the test/.embossrc file.
Validation of EMBL/GenBank feature tables has been updated by reading EMBL release 104 (June 2010) and allowing many feature qualifier non-standard values that appear in that release.
Biomart is a new access method for sequence databases. The database definition must include definitions of new attributes 'sequence' (the Biomart sequence attribute) and 'identifier' (the Biomart identifier attribute). Additional attributes may be returned as description text if they are listed in the 'returns' attribute of the DB definition. An example definition is given in emboss.default.template.
Database definitions have a new attribute serverversion which is used by SRSWWW access to choose the best way to retrieve data.
SRSWWW database access, for example from the EBI's srs.ebi.ac.uk server, had a problem processing queries returning more than 30 entries. This is now corrected by first asking the server for the number of entries and then accessing the data in chunks. This will unfortunately slow down SRSWWW access for single entries but was the only solution available after checking with EBI's SRS support team.
Infoseq has a new column "organism" which shows the species line from an EMBL or UniProt entry. In a future release this may be changed to show the standard name for the NCBI taxon identifier from an entry as the species definitions for these databases can be long with alternative names and possibly additional species.
Amino acid 280nm extinction coefficients in file Eamino.dat have been adjusted to match those of the Expasy 'protparam' tool. Pepstats now reports values with cysteine residues reduced and as cysteine bridges.
Database types, originally defined as simply "N" for nucleotide and "P" for protein, should now be named in full. The names are expanded automatically when reading the definitions in the emboss.default and .embossrc files. Expanding the types allows for new database types to be added in the near future.
EMBOSS can now read and write BAM (binary SAM) sequence files to extract all sequences and quality scores, for example to write them out in FASTQ format. Although BAM data can also be read through a pipe as standard input, in this case the format must be specified on the command line as it is not currently possible for EMBOSS to read a buffered text file as binary data.
Needle dynamic programming algorithm updated to allow adjacent gaps in opposite strands.
Rabin-Karp multi pattern search algorithm moved into the nucleus library. supermatcher application seed finding step updated to use Rabin-Karp multi-pattern search.
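For readers unfamiliar with the technique, the following is a textbook sketch of Rabin-Karp multi-pattern search for patterns of equal length, using a rolling hash and a hash-to-pattern lookup. It is not the nucleus library code; the function name, the base and the modulus are illustrative choices.

```python
def rabin_karp_multi(text, patterns, base=256, mod=1_000_000_007):
    """Return (position, pattern) hits in order of position.

    All patterns must share one length, as with fixed-size words.
    """
    patterns = set(patterns)
    lengths = {len(p) for p in patterns}
    if len(lengths) != 1:
        raise ValueError("patterns must all have the same length")
    k = lengths.pop()
    if k == 0 or len(text) < k:
        return []

    def full_hash(s):
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % mod
        return h

    # Map each hash value to the patterns that produce it.
    by_hash = {}
    for p in patterns:
        by_hash.setdefault(full_hash(p), []).append(p)

    high = pow(base, k - 1, mod)  # weight of the character leaving the window
    h = full_hash(text[:k])
    hits = []
    for i in range(len(text) - k + 1):
        if h in by_hash:
            window = text[i:i + k]
            # Verify by direct comparison to rule out hash collisions.
            hits.extend((i, p) for p in by_hash[h] if p == window)
        if i + k < len(text):
            # Roll the hash: drop text[i], append text[i + k].
            h = ((h - ord(text[i]) * high) * base + ord(text[i + k])) % mod
    return hits

print(rabin_karp_multi("ACGTACGTGA", ["ACGT", "GTGA", "TTTT"]))
```

The text is scanned once regardless of the number of patterns, which is what makes the approach attractive for a seed-finding step.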
Banded Smith-Waterman algorithm used by supermatcher and wordfinder applications has been revised, fixing a problem with occasional inconsistent alignments. Basic SAM format support has been added for these two applications as well as for the wordmatch application. Supermatcher treats the second sequence as the reference sequence, while wordfinder and wordmatch treat the first sequence as the reference sequence.
The acdvalid application now reads the EDAM (EMBRACE Data and Methods) ontology to validate EDAM references in relations attributes. All applications are expected to have at least one topic and at least one operation term. Other qualifiers can have any number of data terms.
New source file ajtax.c provides parsing and validation for the NCBI taxonomy in its .dmp file form. The parser reads all taxonomy data into memory. This takes up too much space for practical use, so is only intended for subsets. The parser will be reused to develop indexing applications to provide fast lookup of taxon identifiers.
New source file ajobo.c provides parsing and validation for OBO format ontology files. The parser includes strict warnings according to the OBO format documentation, but these can be turned off as in many cases the OBO foundry ontologies do not follow the exact standard. Examples include terms not in sorted order, and Typedef stanzas following Term stanzas, and dbxrefs to non-existent terms (e.g. GO:ma in the gene ontology to cite a curator).
Support for PDF and SVG graphic file output has been added. SVG requires no additional libraries. PDF support requires the libhpdf library (which, somewhat confusingly, is provided by the libharu project). EMBOSS will attempt to find the library and development files automatically and add PDF support (or not) appropriately. However, if libhpdf is in a non-standard place, a --with-hpdf=DIR configuration switch can be optionally used.
The output of showalign has changed. The reference sequence now appears at the top, if selected. The ticks and sequence position numbering are relative to a selected reference sequence. Gaps within the reference appear as '.' and are not counted in numbering. End gaps appear as '.' with 'V' and 'v' as the major and minor tick marks, and numbering from -1 before the start and from +1 after the end of the reference. The additional copy of the consensus is no longer reported.
When reading ABI trace files the quality scores can now be read. They are undefined in ABI files, but assumed to be phred scores. ABI files can have two sequences and sets of quality scores. The first is from the instrument base calling. The second is from a second base caller. Where two sets are found, EMBOSS now reads the second set.
Application nospace has a new -menu option to trim all, trailing, or excess whitespace.
Output type outfileall is obsolete (it is essentially an outfile) and has been deleted. No application was using it.
Input type filelist (comma-delimited list of filenames) now trims excess whitespace from the beginning and end of each filename.
Command line qualifiers with an '=' but no value now have a value of an empty string. Previous releases set the value to "=".
The file extension for directory, dirlist and outdir ACD datatypes is now a qualifier. This allows it to be defined as a default in the ACD file but also substituted by the user. An empty string means 'ignore the extension'. To specify 'no extension' a single space can be used as the value.
On the command line, for a parameter (with no qualifier name given) a single dot was used as a missing value in previous versions. This causes problems when specifying the current directory as a dot. On the command line an empty (missing) value must now be an empty quoted string '' or "".
Ampersands in application descriptions have been removed. They confuse HTML versions of documentation.
The QA test script qatest.pl has new options -simple to turn off messages when running with a local test file, and -with to cancel -without options.
Output redirected to a file can now use ajSysExecOutname functions to pass the filename to be used for standard output and possibly standard error. The filename is most usefully picked up from a new function ajAcdGetOutfileName which closes an ACD outfile and returns the name of the file. The file will be empty if simply opened, or will have existing contents if the append attribute is true in the ACD file.
The output from tfscan is now in report format, replacing the undefined text file produced in previous releases.
Where a new string is created by ajStrAssignS (the standard string copy functions) the reserved space for the string is enough to hold the current string value. In past releases the reserved memory was the same size as the reserved memory of the string being copied. This wasted memory where a large string had a short value, especially when copying records read from a buffered input file.
Sequence input formats now turn off buffering of input once they can no longer fail (for example, FASTA format after the header record will read everything until it finds another header).
Make ajaxdb code IPv6 compliant. Remove gethostbyname config check.
pcre, expat & zlib include files now install to separate subdirectories.
Showfeat failed to sort features with 'join' locations. The sorting is corrected. A future internal change will improve feature sorting in all cases.
Restriction mapping applications now process bad enzyme input files without crashing.
PNG graphics output had an unwanted blank margin that did not appear in other output formats. This is now turned off through plplot.
Prettyplot formatting is corrected to improve the centring of characters within boxes.
Restriction mapping applications no longer have an upper limit on the number of cuts.
Warning messages for EMBL format sequences created by ENSEMBL have been turned off.
Corrected references to the EMBL/GenBank feature table documentation in ACD files and web pages.
embossversion now reports the setting of debug options, and corrects variable name warnrange to acdwarnrange.
Any numeric ACD type (integer, float, range or array) with calculated values for the minimum or maximum attributes can potentially have an impossible range (maximum less than minimum) at run time. ACD processing now discovers these calculated values, and requires a definition for a new attribute 'failrange'. If this is defined true, a 'failmessage' attribute must also be defined to explain why the values are invalid (e.g. input sequence too short for the algorithm). If 'failrange' is false, a value for another new attribute 'trueminimum' must be set to define which of the minimum or maximum values is to be used as the only accepted value.
PNG graphics output had a plplot-defined margin limiting the available plot space. This is now removed, allowing applications such as prettyplot more space to display results.
Resource attribute identifier: is obsoleted. No code used it. It is no longer allowed in resource definitions.
Database attributes identifier: description: and command: are obsoleted. No code used them. They are no longer allowed in database definitions.
Fixed GFF2 and GFF3 feature formats to always have the start position less than the end position for features on the '-' strand.
Updated sequence format refseqp to handle features for proteins in the latest release of refseq protein.gpff files.
A new function ajDebugTest can be used to turn on/off specific debug calls. The only argument is a quoted string. A file .debugtest in the current directory or the user's home directory is read. This contains a list of tokens to be debugged, so ajDebugTest returns true if any of these tokens is passed in. Optionally, the name in .debugtest can be followed by a number which is the maximum number of times that token will be reported. ajDebugTest is intended for developers who use ajDebug calls that may be expensive or be excessively called.
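The token test described above can be sketched as follows. This is an illustration in Python, not the AJAX code: the class name is hypothetical, the file is represented by a list of lines, and the layout (one token per line, optionally followed by a whitespace-separated count) is an assumption. The sketch also assumes that a token stops being reported once its count is used up.

```python
class DebugTokens:
    """Hypothetical model of the .debugtest token list."""

    def __init__(self, lines):
        self.limits = {}  # token -> maximum number of reports, or None
        self.counts = {}  # token -> number of reports so far
        for line in lines:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            self.limits[parts[0]] = int(parts[1]) if len(parts) > 1 else None
            self.counts[parts[0]] = 0

    def test(self, token):
        """Return True if debug output for this token should be written."""
        if token not in self.limits:
            return False
        limit = self.limits[token]
        if limit is not None and self.counts[token] >= limit:
            return False
        self.counts[token] += 1
        return True

tokens = DebugTokens(["seqread 2", "tablemerge"])
print([tokens.test("seqread") for _ in range(3)])
print(tokens.test("tablemerge"), tokens.test("other"))
```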
Some attributes in ACD files may appear more than once. These include any relations: attribute (now being populated with references to the new EDAM ontology), the groups attribute for applications, the (currently unused) keywords attribute for applications, and the external attribute for applications.
Any external application must now be defined in the ACD file with an external: attribute in the application section. The string value has the name of the application as the first word, followed by a message to be printed if it is not found. When the ACD file is parsed, before any user prompts, the external applications are searched for by first looking for an environment variable EMBOSS_appname and then checking for an executable file in the current directory or in the path.
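The search order described above might look like the sketch below. It is a Python illustration rather than the ACD code: the function name is hypothetical, the variable is assumed to be spelled exactly as EMBOSS_ followed by the application name, and its value is assumed to be a path that is returned without further checks.

```python
import os
import shutil

def find_external(appname, environ=None):
    """Return a path for an external application, or None if not found."""
    environ = os.environ if environ is None else environ

    # 1. An EMBOSS_appname environment variable takes priority.
    override = environ.get("EMBOSS_" + appname)
    if override:
        return override

    # 2. An executable file in the current directory.
    candidate = os.path.join(os.getcwd(), appname)
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate

    # 3. An executable found on the search path.
    return shutil.which(appname, path=environ.get("PATH", os.defpath))

print(find_external("blastall", {"EMBOSS_blastall": "/opt/blast/bin/blastall", "PATH": ""}))
```

When nothing is found, a caller would print the message given after the application name in the external: attribute.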
All applications should be launched by using the name returned by the new ajAcdGetPathC or ajAcdGetPathS functions. This ensures the application has been found in ACD processing and any EMBOSS_appname variable has been tested.
The acdvalid utility now tests for duplicate attributes.
Format specifiers for strings and characters (%S, %s and %c) now have two flags U (e.g. %US) for uppercase and L for lower case output.
The configure.in and main package Makefile.am files handle --enable-devwarnings differently. For the imported libraries this level of warning message is turned off. Messages are still generated for warnings from the main EMBOSS libraries and applications.
The QA testing script qatest.pl has new options -nocheck to skip "make check" applications and -noembassy to skip EMBASSY packages.
Extractfeat failed to accept all features by default.
Extractfeat failed on reverse direction nucleotide features.
Coderet miscounted non-coding sequences in the output table.
Graphics devices now have improved and additional checks. 'tek' was rejected as an ambiguous match. 'das' is only valid for an xygraph - one based on sequence positions. On Windows (using mEMBOSS) the plplot version supports fewer devices and these are now excluded from selection.
The change to graphics library access makes the ajGraphInit call which registered graphics functions for use by ACD parsing redundant. In its place we need to register data access functions. As all applications make use of this, we now include this automatically in embInit so there is no longer a need for applications to make a separate call before invoking code (e.g. ACD parsing) that may require registration of functions.
The AJAX ACD code is now in a separate library. New core library functions store and retrieve ACD persistent data such as the program name, command line and list of inputs. As ACD is now linked separately from core AJAX and the graphics library, the callback mechanism for ajGraph functions to be called from ACD is no longer needed.
The database access code in ajseqdb.c has been moved to a separate higher level library. This is where we will insert code to access the new ensembl library functions in AJAX, and possible future data access libraries. A callback mechanism is used so that the embInit call automatically registers data access methods to make them available within the core library functions that read sequences. This allows ajSeqRead to remain in the core library while calling database access methods that in turn may invoke ensembl access.
The PCRE (perl-compatible regular expressions) code in AJAX has been updated to release 7.9 of PCRE. Previous releases were still at version 4.3. The code is standard PCRE code with the LINK_SIZE set to 4 bytes to allow matches in long sequences.
ACD files include relations attributes with text taken from terms in the EMBRACE EDAM ontology. These terms are also described in the knowntypes.standard file and are matched to the known types when validating ACD files.
EMBOSS now uses a more complete User-Agent string when communicating with HTTP servers.
FASTQ short read sequence formats now read and write faster using lookup tables to avoid calculations in the conversion of quality scores.
FASTQ short read formats have additional warning messages for bad or incomplete data.
All sequence input formats now recognize invalid partial entries at the end of the input data and report an error message. A notable exception is FASTA format where a partial entry is still a valid ID line - these will give errors for zero length sequence unless empty sequences are allowed.
Common output formats now write faster, using lightweight output functions to copy strings to the output file.
SwissProt output formats now wrap long OS lines.
Needle has been updated with end-gap penalties support, allowing complete global pairwise alignments. Three new options have been added; the endopen and endextend options are used to specify the gap opening and extension penalties for the end gaps, while the endweight option turns on/off weighting of the end gaps.
New application needleall for all against all global/overlap pairwise alignment of sequences in two multi-sequence files.
wordmatch updated for multi-sequence files using a modified version of the Rabin-Karp algorithm for multi-pattern search. Also added is a log file with statistical information on pattern matches. The updated wordmatch can, for example, be used for efficiently finding multiple patterns in large fastq files.
Application documentation has a new format HTML table for the command line options. This is excluded from the text documentation, where the format of the help output is improved.
Function names are standardized for ajcod.c, ajrange.c, ajtranslate.c, ajgraph.c and ajhist.c, and a few other functions are renamed. The old names continue to work as "deprecated" functions, although these will generate warning messages with the gcc compiler.
Infoseq option -version is renamed -seqversion to avoid a clash with the new global -version qualifier.
Three new "make check" applications entrailshtml, entrailsbook and entrailswiki generate tables of internal data in HTML, DocBook or WikiText formats. These are intended to update the website, books and Wiki with the latest internal details. The -tables qualifier specifies one or more tables to be printed. By default, all tables are produced. The book tables are sorted in format name order.
Alignment output included headers only for EMBOSS-specific formats. The headers have been dropped from the FASTA MARKX0 through MARKX10 formats to allow standard FASTA suite parsers to use the EMBOSS versions of these outputs.
Fastq-solexa sequence formats converted phred scores of 1 to Solexa scores of -6. They now convert to the limit of -5.
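The arithmetic behind this fix can be checked with a short Python sketch. It uses the standard phred-to-Solexa formula and assumes rounding to the nearest integer; the function name is illustrative. Unclamped, a phred score of 1 gives about -5.87, which rounds to -6, below the Solexa minimum of -5.

```python
import math

SOLEXA_MIN = -5  # lowest valid Solexa score


def phred_to_solexa(phred: int) -> int:
    """Convert a phred score to a Solexa score, clamped at the lower limit."""
    if phred <= 0:
        # The formula is undefined at phred 0, so use the limit directly.
        return SOLEXA_MIN
    raw = 10.0 * math.log10(10.0 ** (phred / 10.0) - 1.0)
    return max(SOLEXA_MIN, round(raw))


for p in (0, 1, 2, 10, 40):
    print(p, phred_to_solexa(p))
```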
Fastq-sanger sequence format incorrectly stopped when the quality scores started with a '@' (phred quality 31).
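The underlying difficulty is that '@' both starts a FASTQ record and is a valid Sanger quality character (ASCII 64 minus the offset 33 gives phred 31). One common way to avoid the ambiguity, sketched below in Python, is to read quality lines until their total length matches the sequence length instead of looking for the next '@'. This is an illustration of that general approach, not the EMBOSS parser.

```python
def read_fastq(lines):
    """Yield (name, sequence, quality) tuples from an iterable of text lines."""
    it = iter(lines)
    for line in it:
        line = line.rstrip("\r\n")
        if not line:
            continue                      # skip blank lines between records
        if not line.startswith("@"):
            raise ValueError(f"expected '@' header, got: {line!r}")
        name = line[1:]

        # Sequence lines run up to the '+' separator line.
        seq_parts = []
        for line in it:
            line = line.rstrip("\r\n")
            if line.startswith("+"):
                break
            seq_parts.append(line)
        else:
            raise ValueError(f"incomplete record: {name}")
        seq = "".join(seq_parts)

        # Quality is read by length, so a leading '@' is just another score.
        qual = ""
        while len(qual) < len(seq):
            try:
                qual += next(it).rstrip("\r\n")
            except StopIteration:
                raise ValueError(f"incomplete quality for: {name}") from None
        if len(qual) != len(seq):
            raise ValueError(f"quality length mismatch for: {name}")
        yield name, seq, qual


example = ["@r1\n", "ACGT\n", "+\n", "@III\n", "@r2\n", "GG\n", "+\n", "@@\n"]
for record in read_fastq(example):
    print(record)
```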
Intelligenetics sequence format now correctly ignores additional carriage control characters.
Genbank-like protein formats (genpept and refseqp) failed when reading more than one sequence. The input is now buffered when the format is automatically reassigned to a related parser.
The -help output now includes the one-line documentation string from the ACD file and the version number information reported by --version.
All applications have a -version (or --version) qualifier which will report the EMBOSS version number. For EMBASSY applications it will also report the EMBASSY package version number as "PACKAGE:version". All EMBASSY applications need to call embInitP with an additional parameter of VERSION which will be defined automatically by the configure.in template. If the "versionnumber" attribute is defined in the ACD file this will also be reported as the application version "programname:version".
The ACD application attribute "version:" is renamed "versionnumber:" to avoid a name clash with the new -version qualifier. We need to use the qualifier name "-version" for compatibility with other systems and applications, so the renaming of the attribute is unavoidable. We believe it was only used (as originally intended) for the definition of external applications by SoapLab.
New application showpep displays protein sequences. Showseq is now limited to nucleotide sequences. Many of the showseq options are not appropriate for proteins. Showpep makes the remaining showseq options available.
A new data structure AjPSeqXref holds details of cross-references between a sequence object and any other data resource. The cross-reference attributes include a type to indicate the source of the cross-reference, for example XREF_DR for a reference in a DR line from EMBL or Swiss-Prot. The other attributes are the database name and up to 4 identifiers (as in the Swiss-Prot DR line definition) and a start and end position where the source is a feature table entry.
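As a rough picture of the attributes listed above, here is a hypothetical Python analogue. The class and field names are invented for illustration and do not reflect the actual C structure layout. Only the attribute list (type, database name, up to 4 identifiers, start and end) comes from the description above.

```python
from dataclasses import dataclass, field
from typing import List

MAX_IDENTIFIERS = 4  # the description allows up to 4 identifiers


@dataclass
class SeqXref:
    """Hypothetical analogue of a sequence cross-reference record."""
    xref_type: str                      # source of the reference, e.g. "XREF_DR"
    database: str                       # name of the referenced database
    identifiers: List[str] = field(default_factory=list)
    start: int = 0                      # 0 means "not set" in this sketch only
    end: int = 0

    def __post_init__(self) -> None:
        if len(self.identifiers) > MAX_IDENTIFIERS:
            raise ValueError(f"at most {MAX_IDENTIFIERS} identifiers are allowed")


xref = SeqXref("XREF_DR", "EXAMPLEDB", ["ID1", "ID2"])
print(xref)
```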
When reading a sequence with an identifiable species, attempts are made to define the NCBI taxonomy identifier for the species. Possible sources include the OX line in Swiss-Prot, the taxon cross-reference in the EMBL/GenBank/DDBJ feature table (available only if the feature table is read) and the species name which can be matched to a set of common species obtained from NCBI.
Swissprot entry descriptions in FASTA output no longer have a trailing '.'. Where the source entry has the new Swiss-Prot DE line format the name is built from the recommended full name with other names in round brackets.
Binary files now consistently have null characters after strings to pad them to full length. Previous versions wrote whatever followed the NULL in the string object. The resulting files now look cleaner although any extra characters were always ignored when reading dbi index files.
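The padding rule can be illustrated with a few lines of Python. This is a sketch of the general idea of a fixed-width, null-padded field, assuming ASCII text. The field width and function names are arbitrary and unrelated to the real index file layout.

```python
def pad_field(text: str, width: int) -> bytes:
    """Encode text into a fixed-width field, filling the remainder with nulls."""
    data = text.encode("ascii")
    if len(data) >= width:
        raise ValueError("text does not fit with a terminating null")
    return data.ljust(width, b"\0")


def unpad_field(raw: bytes) -> str:
    """Read a field back: everything after the first null is ignored."""
    return raw.split(b"\0", 1)[0].decode("ascii")


packed = pad_field("AB", 5)
print(packed, unpad_field(packed))
```

A reader that stops at the first null gets the same string whether the trailing bytes are nulls or leftovers, which is why the older files were still readable.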
Test databases were updated on 24th June 2009.
Blank lines are ignored before any sequence input. This is to support the use of seqret to read data pasted into web forms where extra blank lines are often accidentally included.
FASTQ is now a valid sequence format and can be detected automatically. "fastq" format ignores all quality scores as there is no automatic and safe way to determine whether scores are for Sanger/phred or Illumina/Solexa quality. To read the quality scores we support formats "fastq-sanger" and "fastq-illumina". We also support "fastq-int" to read quality scores as integers. These scores are assumed to be Sanger quality. For Illumina quality scores out of range, a warning message is written once for each sequence. Sanger scores do not have out of range values as they allow the full set of quality characters, although high values (over 40) should only appear for contig consensus sequences.
MEGA format has been rewritten to support the file format used by MEGA 4. Title can be in mixed case. Format and Gene/domain command lines are processed. Multiple gene/domain files are read by EMBOSS as separate alignment sets by seqretsetall. This may change in a future release as MEGA4 processes them as one alignment with annotated gene regions. While EMBOSS has no annotation specific to alignments this is a reasonable compromise.
embossdata will now always return directory listings alphabetically.
A new ACD function replaces an attribute value with an EMBOSS or environment variable. The attribute syntax is (@value:VARNAME).
Infile datatypes in ACD have a new attribute directory: which defines the default directory to be searched. If the user specifies an explicit path the directory attribute is ignored.
Applications writing out multiple sets of sequences now correctly reset the sequence output. This only affected one test application in EMBOSS 6.0.1 (input type seqsetall and output type seqoutall).
Applications that use single letter qualifier names (for example the HMMERNEW wrappers for HMMER applications) can be confused if a single letter qualifier name matches uniquely an associated qualifier for a preceding command line qualifier. An additional check now ensures that a unique qualifier (for example -o) is correctly recognized.
Global alignments with needle in rare cases missed the optimal alignment of the first 2 residues. This was a bug introduced in 6.0.0.
When reading data using a launched application, including the SRS access method which launches "getz", closing the input without reading to the end caused the file close function to loop forever. Examples included nthseq and seqret -firstonly both of which stop reading when they have reached the nth or first sequence. File closing now only waits if the input has reached end of file, and has a timeout on the wait to break out of the loop.
Intelligenetics format sequence files with more than one sequence are now read correctly. Where the sequence ends with a number, intelligenetics format sequences can now be automatically detected.
Add -methylation option to restrict/restover/remap/showseq to simulate (e.g.) dam/dcm restriction enzyme knockouts.
remap now correctly reports restriction enzymes cutting a greater number of times than an optionally-supplied maximum value. The primary function of the application was unaffected.
showfeat has a new option -joinfeatures to display all exons on one line for a join feature location. In previous releases this was one of the -sort options. It is now possible to use -joinfeatures and to select a sort order.
Installing without X11 (using the --without-x option for ./configure) used "x11" as the default graphics device in some applications. These now use "png" (if available) or "ps".
needle and water with the -nobrief option repeated report header information on the longest and shortest similarity and identities because the previous header content was not cleared. This only affected results where there was more than one sequence as the second input.
In the EMBL/GenBank feature table the group() and one_of() operators are obsolete. They are automatically converted to order().
The command line syntax using the master qualifier name as a suffix (for example -sreverse_asequence) ignored the master qualifier name and set values for all matching inputs. This syntax is intended as a way for wrappers to better control the use of associated qualifiers, as it is cleaner than using a numeric suffix (-sreverse1 -sreverse2 etc.)
Using -sreverse on the command line could reverse protein sequences for inputs that can read more than one sequence (seqall, seqaset, seqsetall). -sreverse is now only set for nucleotide sequence inputs. Single sequence inputs correctly ignored the -sreverse value.
Multiple sequence sets can be read as input type seqsetall, but when this input was used for a single sequence set input (type seqset) all sequence sets were read. seqset input now stops after the first set (for example a PHYLIP or MSF alignment).
Genbank test data had incorrect format. The data was extracted from a set of test GCG databases and had spaces in the feature locations.
extractfeat now uses the new feature fetch functions and can retrieve features that include joins across entries.
Feature parsing functions are added to fetch sequences from other entries. These depend on reusing the USA of the original sequence, with the identifier of the external sequence inserted in place of the original. This is known to work for database references and flat files.
coderet was limited to EMBL/GenBank feature tables. It now processes any valid feature input including GFF files. The previous parsing functions are obsolete and have been removed as coderet was the only application calling them.
Very large pairwise alignments can fail to back trace through the alignment because of rounding error. The alignment and traceback functions now use double precision to maintain accuracy.
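A small Python experiment shows how this kind of failure can arise. It emulates single precision with the struct module and assumes a simplified traceback test of the form "cell minus step equals the predecessor". The real alignment code is more involved, and the names here are illustrative.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def can_trace_back(diagonal: float, step: float, single: bool) -> bool:
    """Store diagonal + step in a cell, then test if the step can be undone."""
    cell = diagonal + step
    if single:
        cell = to_float32(cell)
        return to_float32(cell - step) == diagonal
    return cell - step == diagonal


BIG_SCORE = 16777216.0  # 2**24: above this, single precision cannot resolve +1

print(can_trace_back(100.0, 1.0, single=True))
print(can_trace_back(BIG_SCORE, 1.0, single=True))
print(can_trace_back(BIG_SCORE, 1.0, single=False))
```

With a large enough cumulative score, the single-precision cell no longer records the last step, so the test that identifies the predecessor cell fails. The same values in double precision pass.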
pepwindow and pepwindowall missed the plot value for the last window in the sequence.
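Missing the final window is a classic off-by-one: a sequence of length n has n - w + 1 windows of length w. The Python sketch below computes a running window mean with that loop bound. It is a generic illustration, not the pepwindow code, and the function name is hypothetical.

```python
def window_means(values, window):
    """Return the mean of every full window, including the last one."""
    n = len(values)
    if window <= 0:
        raise ValueError("window must be positive")
    if window > n:
        return []
    total = sum(values[:window])
    means = [total / window]
    # Window start positions run from 1 to n - window inclusive.
    for start in range(1, n - window + 1):
        total += values[start + window - 1] - values[start - 1]
        means.append(total / window)
    return means


print(window_means([1, 2, 3, 4], 2))
```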
pepwindow and pepwindowall now process sequence ranges -sbegin and -send.
pepwindow and pepwindowall now default to a window length of 19, ideal for transmembrane regions. The old default of 7 was short and gave noisy results.
pepwindow and pepwindowall have an extra option -normalize to convert the amino acid data in the datafile to mean 0.0 and standard deviation 1.0. The default Kyte-Doolittle data is not normalized.
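The normalization can be sketched in Python as a conversion to z-scores. The values are the published Kyte-Doolittle hydropathy scale. The use of the population standard deviation is an assumption of this sketch (the option may use a different convention), and the function name is illustrative.

```python
from statistics import mean, pstdev

KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def normalize(scale: dict) -> dict:
    """Rescale values to mean 0.0 and standard deviation 1.0."""
    values = list(scale.values())
    mu = mean(values)
    sigma = pstdev(values)   # population SD: an assumption of this sketch
    if sigma == 0:
        raise ValueError("cannot normalize a constant scale")
    return {aa: (v - mu) / sigma for aa, v in scale.items()}


normalized = normalize(KYTE_DOOLITTLE)
print(round(normalized["I"], 3), round(normalized["R"], 3))
```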
The EMBL/Genbank feature table definitions have been updated to version 8.0 (October 2008). Sequence ontology terms are now available for all feature types except S_region for which no specific SO term exists. S_region is attached to an internal term derived from SO:0000301 as a placeholder.
Programs searching with regular expressions and patterns reported the pattern name with '1' added to the end. This was to support pattern and regular expression files with multiple patterns. When only one pattern is given on the command line the '1' is no longer added.
Programs searching with regular expressions (dreg and preg) missed overlapping matches to the pattern. The algorithm now steps forward one character from the start of the match and searches again. Some regular expressions with wildcards may produce a large number of overlapping matches especially in low-complexity regions.
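The stepping strategy described above can be reproduced with Python's re module, whose findall and finditer functions also skip overlapping matches by default. This sketch restarts the search one character after the start of each match. It reports one match per start position and is an illustration, not the dreg/preg code.

```python
import re


def find_overlapping(pattern: str, text: str) -> list:
    """Return (start, matched_text) for every start position that matches."""
    regex = re.compile(pattern)
    hits = []
    pos = 0
    while pos <= len(text):
        match = regex.search(text, pos)
        if match is None:
            break
        hits.append((match.start(), match.group()))
        pos = match.start() + 1   # step one character past the match start
    return hits


print(len(re.findall("AA", "AAAA")))      # non-overlapping search
print(find_overlapping("AA", "AAAA"))     # overlapping search
```

On the low-complexity input "AAAA" the non-overlapping search finds two matches and the stepping search finds three, which shows how match counts can grow in such regions.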
Protein sequences in GFF format now use GFF3 by default. For release 6.0.0 protein sequences were written in GFF2 while the GFF3 protein feature definitions were redefined using the Sequence Ontology. This process is now completed.
When a sequence is reversed by revseq the description is tagged with "Reversed: " so that the output and any sequence derived from it has a note of the history.
EMBL and GenBank formats when used to read multiple entries failed to reset the list of citations. Although the first set of citations was reported correctly, all other entries in the same run included the citation list from the first entry.
SwissProt/UniProt entries now preserve the complete entry content when read and rewritten. All feature types are preserved and feature lines wrap according to the widths in UniProt 14.8. Date lines are stored and written. Comments are stored in blocks. Database cross-references are stored in a list. The description lines are saved in the new SwissProt structure. Tests on a set of complex entries confirm that EMBOSS is able to read and write an exact copy of this sample set.
Protein feature keys now use the Sequence Ontology identifiers as internal names. This may change the way some feature keys are converted between data formats. Protein feature keys have been updated to correct some conversions, for example to distinguish between "coiled coil" from pepcoil and "random coil" from garnier output.
Fitch sequence format was only able to read a single sequence. EMBOSS can now read 'fitch' as a multiple sequence format.
Extractfeat now cleanly processes minscore and maxscore as limits on the score. By default any score is allowed if these are unchanged. Previous releases required minimum and maximum to be equal - or minimum greater than maximum - to permit any feature score.
New feature XML output format DASGFF. Feature output functions have a changed interface to pass the AjPFeattabOut object so that additional processing can handle the opening and closing of an XML output file.
New sequence output formats "dasdna" and "das" write DASDNA and DASSEQUENCE XML outputs. Sequence output functions have a new capability to define a Cleanup function to write the final lines of an XML output file. The AjPSeqout data structure already has the Count attribute needed to identify the first sequence so that the XML header can be written.
New environment variable EMBOSS_ACDFILENAME provides an alternative way to set the default output filename for EMBOSS applications. If set to true, the filename is used rather than the current behaviour of using the first sequence name as the default filename. When the filename is used the case of the name is preserved.
Corrected display of exon ranges in showseq. Exons now display in their original frame (all were displayed in frame 1 in earlier versions). Display of 3-letter amino acid names corrected (but we hope nobody is using 3-letter codes any more!)
Added create attribute for outdir datatype in ACD. If true, the output directory will be created if it does not already exist. The default is false: output directories must already exist, which is the behaviour in previous releases.
Added attribute aligned for datatype seqoutall in ACD files. Applications can write multiple sequences as a seqoutset (aligned or unaligned) and can also write seqoutall - writing sequences one at a time without first storing them as a set.
For phylogenetic applications (PHYLIPNEW) reading distance matrix files failed for some formats written by other applications. Distance matrix input now works for multiple matrices in square, upper-triangular and lower-triangular formats.
The PLPLOT graphics library uses 4 environment variables to allow local configuration. EMBOSS uses a local copy in libeplplot. For sites that have the native PLPLOT also in use we have renamed the environment variables to use the prefix EPLPLOT. This protects EMBOSS from any configuration set only for the local plplot. The variables are: EPLPLOT_BIN EPLPLOT_LIB EPLPLOT_TCL and EPLPLOT_HOME. Versions of EMBOSS up to 2.8.0 defined PLPLOT_LIB but this value is now automatically set and the environment variable is no longer needed.
Command line qualifiers are renamed where the first 5 characters are the same. These were:
  eprimer3: major revision of all options
  est2genome: -splice to -usesplice
  prettyplot: -boxcolval to -boxuse
  octanol: -*plot to -plot*
  showfeat: -match* to -*match; -source to -origin
  showpep: -match* to -*match
  showseq: -match* to -*match; -source to -origin
  vectorstrip: -vectorfile to -readfile; -linker* to -*linker
and similar changes for EMBASSY applications.
ACD processing now objects if two or more qualifiers are not unique in the first 6 characters. In a future release we would like to reduce this to a 5 character unique name. Several EMBASSY applications need to be modified to comply with this requirement.
MEMENEW updated for meme/mast version 4.0.0. ememe now produces fasta, html, text, xml and xsl outputs. A new variant, ememetext, produces only the text and fasta outputs.
DBX index file key deletion code added for ID/ACC/SV/KW/DE/TX indexes.
HTTP access now adds a User-Agent string with the EMBOSS version number so that servers can count the number of EMBOSS requests.
PDB model structures failed to generate a new name for each model. Duplicate sequence names are not ideal. The model number (from the MODEL record) is now appended to each sequence name in "pdb" and "pdbnuc" format. The "pdbseq" and "pdbnucseq" formats read a single copy of each sequence from the SEQRES records.
Added two new PDB formats to read nucleotide data. These are named "pdbnuc" and "pdbnucseq". They are not available by default, to avoid the problem of reading both protein and nucleotide sequence data from a structure file for an oligonucleotide binding protein.
Alignment outputs now include most of the multiple sequence alignment formats that EMBOSS can write. The functions for these are trivial to write. New functions can be added to use any existing sequence output format for alignments.
PDB entries can be read in two ways, with two named formats. Sequence format "pdb" reads the ATOM records. Sequence format "pdbseq" reads the SEQRES records. By default, only "pdb" format was used, and could crash on entries where the ATOM records were missing. Both formats now fail silently if no sequences are found. By default, "pdb" format is used first, and if that fails "pdbseq" will be tried.
The EMBOSS logfile (defined by variable EMBOSS_LOGFILE) now reports two extra values: the number of cpu seconds and the number of elapsed time seconds.
Extra stop codons in getorf for ORFs ending close to the end of the input sequence no longer appear.
For optional qualifiers (defined as "nullok" in the ACD file) the command line option -no(qualname) was causing output files to appear by resetting the value to an empty string, which in turn was converted to the default filename. Now -no(qualname) turns off any output file defined with nullok, and -(qualname) "" asks for an output file that is off by default and uses the default filename for it.
Report output has a new tail format that reports the total sequences and total sequence length read by the applications. The previous "Total_sequences" report was the number of sequences included in the report. This is renamed to "Reported_sequences". Where the number of hits was limited by the -rmaxseq or -rmaxall options, the number of unreported hits also appears. If the -rmaxall limit was exceeded, the report tail ends with "Maxhits_stop: Y". If the -rmaxseq limit is exceeded, the sequence report includes (as before) "HitLimit: max/total".
Refseq protein and Genpept now use a modified genbank format to avoid warnings for "aa" replacing "bp" on the LOCUS line and to provide better control over any other differences between nucleotide and protein entries. Genbank format automatically calls refseqp format if a LOCUS line has "aa".
Swissprot output was missing a '.' at the end of the organism line.
vectorstrip failed if the user neither provided a filename for the -vectorsfile option nor specified -novectorfile to turn off file reading. The ACD file is changed so that a vectorsfile is required if -vectorfile is true, and a check has been added to the code to catch the problem if the ACD interface changes in future.
Allow user-defined -carboxyl parameter for iep.
jaspscan now allows multiple sequences to be scanned.
New application aligncopy reads a set of aligned sequences and prints a report in one of the standard alignment formats that can accept the same number of sequences. Pairwise alignment formats can only be used if the input has exactly two sequences.
New application aligncopypair reads a set of aligned sequences and prints a report for each pair of aligned sequences in one of the standard alignment formats.
New application featreport reads a sequence and a feature table, and writes a report in any of the standard report formats.
New application featcopy reads and writes a feature table to convert feature formats.
New applications maskambignuc and maskambigprot replace ambiguity characters in nucleotide sequences with 'N' and in protein sequences with 'X'.
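The masking idea can be sketched in Python with translation tables. The sets of ambiguity codes used here (IUPAC nucleotide codes other than N, and B, Z, J for proteins) are assumptions of this sketch, as is the choice to leave gap characters untouched and to write the mask character in upper case. The applications themselves may differ on these points.

```python
NUC_AMBIGUITY = "RYSWKMBDHV"   # assumed set: IUPAC codes other than A, C, G, T, U, N
PROT_AMBIGUITY = "BZJ"         # assumed set: D/N, E/Q and I/L ambiguity codes


def _mask_table(codes: str, mask: str) -> dict:
    """Build a translation table mapping both cases of each code to the mask."""
    both_cases = codes.upper() + codes.lower()
    return str.maketrans(both_cases, mask * len(both_cases))


_NUC_TABLE = _mask_table(NUC_AMBIGUITY, "N")
_PROT_TABLE = _mask_table(PROT_AMBIGUITY, "X")


def mask_ambig_nuc(seq: str) -> str:
    """Replace nucleotide ambiguity codes with 'N'."""
    return seq.translate(_NUC_TABLE)


def mask_ambig_prot(seq: str) -> str:
    """Replace protein ambiguity codes with 'X'."""
    return seq.translate(_PROT_TABLE)


print(mask_ambig_nuc("ACGTRYKN-"))
print(mask_ambig_prot("MKBZJX"))
```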
New application consambig reports an alignment consensus sequence using ambiguity characters. The intended use cases are sequencing reads and SNP reporting.
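A consensus of this kind can be sketched in Python using the standard IUPAC nucleotide ambiguity codes. How gaps and existing ambiguity codes in the input are handled is an assumption of this sketch: only A, C, G and T are counted, and a column with none of them is written as '-'. It is not the consambig implementation.

```python
# Standard IUPAC codes, keyed by the sorted set of bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B", "ACGT": "N",
}


def ambiguity_consensus(aligned: list) -> str:
    """Return one IUPAC code per alignment column."""
    if len({len(s) for s in aligned}) > 1:
        raise ValueError("aligned sequences must have equal length")
    consensus = []
    for column in zip(*aligned):
        bases = {ch.upper() for ch in column} & set("ACGT")
        key = "".join(sorted(bases))
        consensus.append(IUPAC[key] if key else "-")
    return "".join(consensus)


print(ambiguity_consensus(["ACGT", "AGGT", "ATGC"]))
```

For SNP reporting, a column where reads show both A and G is written as R, so the consensus records the variation where a majority-rule consensus would hide it.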
New application sizeseq sorts sequences in ascending or descending order of length. This is a port of the application seqsort from the domsearch EMBASSY package.
New application skipredundant uses pairwise sequence matches to exclude sequences that are similar from an input set. This is a modified version of the application seqnr from the domsearch EMBASSY package.
New applications provide utility functions for former GCG users: nohtml removes HTML tags, notab replaces tabs with spaces, nospace removes all whitespace from a file, skipspace removes extra whitespace from a file.
Older EMBOSS applications can now generate a warning message stating that they are marked as 'obsolete' with an explanation and an indication of alternative programs in EMBOSS or in an EMBASSY package. This warning can be turned off by defining environment variable EMBOSS_WARNOBSOLETE with a value of "N" or by defining the same variable in the emboss.defaults or ~/.embossrc files. We will begin to mark applications as 'obsolete' in future releases.
A new EMBASSY package "myembossdemo" contains the demonstration applications demoalign, demofeatures, demolist, demoreport, demosequence, demostring, demostringnew and demotable that illustrate how to use EMBOSS data types in your own applications. The myembossdemo package allows novice developers to try simple EMBOSS programming. The myemboss package is available for adding your own applications. The demo applications are no longer distributed with the main EMBOSS package. They were not installed and were only built with the "make check" option.
Application short descriptions have been revised. The maximum length of application one-line descriptions is increased from 60 to 70 characters, so the descriptions are easier to write. Output from wossname can now be 90 characters wide. Interfaces that use the description in menus may need to allow some extra space.
Function names in ajfile.c have been standardized. Old names are still accepted but are marked as "deprecated" and will generate warnings with the gcc compiler (see ajstr below). Other compilers will see no difference. New source files ajfiledata.c and ajfileio.c have been added. The buffered file data structures are renamed internally to be more consistent (AjPFileBuff to AjPFilebuff).
notseq was unable to search for IDs containing '|' characters. It uses string matching (not regular expressions), and these characters are valid in NCBI-style FASTA files read with the "pearson" format, which accepts the whole ID string without parsing.
The sequence alignment code has been updated. Sequence alignments with low gap penalties failed to allow two gaps (one in each sequence) without a match in between. The embAlign functions are now simplified. Scores are returned by the PathCalc functions. The Walk functions that walk through the path and return the aligned sequences are faster and need fewer parameters. Profile alignments occasionally duplicated residues in the sequence around gap positions. Fast alignments around a limited width include additional residues at each end and require an offset rather than separate start positions. The offset is the difference between the two start positions used in 5.0.0 and earlier releases.
Eprimer3 citations are corrected in the help text (from the ACD file) and in the documentation. The citation errors were traced to the original primer3_core documentation which has now been corrected.
Wordmatch could confuse overlapping matches. It occasionally extended the wrong match and missed a corresponding new match.
Seqmatchall results were correct with the default output format which reports match positions, but gave incorrect results with some other local alignment formats that include the sequence. Seqmatchall now stores alignments in the same way as other local alignment applications, and the alignment internals are corrected to ensure other applications will not have the same problem.
Emma officially supported clustalw 1.83. Issues with clustalw 2.0 are now resolved and this version is supported if clustalw2 is installed. Emma executes an application called clustalw (not clustalw2), so version 2.0 must be installed under this name, or an environment variable EMBOSS_CLUSTALW needs to be defined to point to the executable clustalw2 file.
Sequence format "selex" allows invalid sequence data files to be accepted as input. Selex format is still available but is no longer included in the formats that can be automatically detected. When reading selex format data, users need to put "-sformat selex" on the command line, or specify "selex::" at the front of the USA. See the HMMER (old version EMBASSY package) documentation for examples. HMMERNEW (recommended) examples use Stockholm format and so are unchanged.
Program dbxfasta now defaults to a filename of "*.fasta". The previous default "*.dat" is not commonly used for FASTA format databases.
Program msbar block mutations were 1 longer than the specified block, and msbar could crash if the block size was fixed (minimum and maximum block sizes the same). This off-by-one error is now corrected.
In GenBank output format, multiple line KEYWORD sections were not formatted correctly.
ACD list and select values (the menus that appear in the user prompt) can now have ACD variables. Although useful for local application development these are not used in EMBOSS distributed ACD files because the variables are difficult for web and GUI interfaces to resolve when presenting the menu text.
List and Table internal data structures are now cached so that creating and deleting temporary lists and tables is more efficient.
In emboss.default database definitions the filename and exclude values can be delimited by spaces, commas or semicolons. Previous releases used only spaces. Parsing is now consistent with the fields definition which allowed all the above characters.
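The delimiter rule can be illustrated with a short Python sketch that splits a value on any run of spaces, commas or semicolons. It mirrors the behaviour described above rather than the actual parsing code, and the function name and example value are hypothetical.

```python
import re


def split_names(value: str) -> list:
    """Split a list value on runs of spaces, commas or semicolons."""
    return [name for name in re.split(r"[ ,;]+", value.strip()) if name]


print(split_names("a.dat, b.dat;c.dat  d.dat"))
```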
Protein sequences with pyrrolysine ('O') had 'O' converted to a gap because this was a gap character in early versions of Phylip. This was patched in 5.0.0 to allow 'O' in UniProt release 13. The gap character is upper case only, so 'o' was correctly read as pyrrolysine.
Wordfinder used the same descriptions for two pairs of qualifiers. The descriptions are changed to make their meaning clear in commandline help and in web interfaces.
New function ajTimeDiff returns the difference in seconds between two time values.
Profiling tests showed that file reading and string handling can be made faster. String handling called functions many levels deep. Making this code inline and using macro versions improved performance for applications (e.g. database indexing) that use many string calls. File input requires each input line to be copied. Using copy-by-reference (ajStrAssignRef) often makes this more efficient. Existing macros now test for undefined strings: MAJSTRGETLEN, MAJSTRGETPTR, MAJSTRGETRES and MAJSTRGETUSE. New macros are added for string handling: MAJSTRDEL, MAJSTRGETUNIQUESTR, MAJSTRCMPC and MAJSTRCMPS.
Memory management includes new macros AJCRESIZE0 and AJRESIZE0, which provide resize functions that guarantee new memory is set to zero. The functions must be given the original allocated size.
Using the GNU C run-time library, calls to mcheck and mprobe are available to test for memory corruption by examining the bytes before and after an address allocated by malloc. This can be turned on for any application, including Unix commands, with the environment variable MALLOC_CHECK_ which has values 0, 1, 2 or 3. 1 writes to standard error when a problem is found, 2 aborts the program, 3 does both and 0 ignores errors. No recompilation is needed for this simple method. EMBOSS now has a ./configure option --enable-mprobe which enables two new functions. ajMemProbe, passed an address from malloc (AJNEW0, AJCNEW0, etc.), tests the bytes before and after and reports any errors. The advantage of using ajMemProbe rather than mprobe is that a macro MAJMEMPROBE also reports the file and line number where it was called. To avoid large numbers of messages (when code has problems) a limit can be set with ajMemCheckSetLimit after which the program will exit. Note that --enable-mprobe is incompatible with using valgrind to test for memory leaks, as mprobe and mcheck have to look at illegal bytes before and after allocated memory blocks. Memory checking is turned on by a call to mcheck, passing the function ajMemCheck, in ajnam.c before the first memory allocation. If any program calls malloc before calling embInit or embInitP this call will fail and issue a warning (if compiled with --enable-mprobe). A special call ajStrProbe tests any string with mprobe. Special calls ajListProbe and ajListProbeData test lists and their contents. For more details see http://www.gnu.org/software/libc/manual/
Protein sequences from the Staden package were read as nucleotide because they were missing information on the ID line to identify EMBL or SWISSPROT format. The sequences are now tested and correctly typed.
Wordcount now accepts protein sequences as input. Previous releases only allowed nucleotide sequences.
Wordfinder options had the same information prompt. These have been changed from "limit" to "minimum" and "maximum" to make their function clear.
Prompting for values from the user now includes a test for standard input in use as an input file. If standard input is open, the default response is accepted and a message is written to the user. This is to avoid problems with command lines that use "stdin" as an input and do not include -auto.
The acdpretty utility can now preserve comments in ACD files. Comments are maintained in blocks with blank lines before and after. Inline comments are started in column 50 unless they are exceptionally long. Comments themselves have white space cleaned up but otherwise are not reformatted.
A new function ajAcdGetValueDefault is added to return the default value of an ACD qualifier. This can be combined with ajAcdIsUserdefined in wrappers to test for values changed by the user.
Infile qualifiers in ACD have a new attribute "trydefault" which allows the default filename to fail. Any filename provided by the user has to exist. This was added to support the behaviour of the MIRA EMBASSY package. To allow an infile to fail, the attribute "nullok" must also be set to "Y".
Applications which produce an output file or graphics often created an empty output file when the plot was selected. The ACD files have been corrected to only create the file if it will be written to. Applications changed are charge, dan, freak, hmoment, iep and tcode.
Whichdb only writes to its output file if -get is false. With -get it creates sequences. The outfile is no longer created when whichdb is in -get mode.
String functions are corrected so that "Case" in the name always means case-insensitive, implemented by converting to upper case. Some functions were defined the wrong way round, with "Case" for the case-sensitive form.
GFF3 format is now the default feature output.
A new function ajFeatIsCds identifies protein coding nucleotide features (CDS) using the SO identifier. A new function ajFeattagIsNote identifies feature tags that are for the default feature tag.
Protein features now use the new Sequence Ontology terms defined by BioSapiens. These are not yet accepted by GFF3 validators. The new SO identifiers are added to protein feature definitions and used internally.
Feature format definitions (the Efeatures and Etags files) now allow #include references to other files. This allows a standard EMBL and Swissprot feature table definition to be included by the internal and GFF definitions. Redefinitions are allowed using + and - prefixes to add and remove tags for existing feature types.
GFF3 format feature (and report) output is added.
A new application "density" has been added. This reports the A+C+G+T and AT+GC densities of nucleic acid sequences within an adjustable sliding window. Plots of A+C+G+T or AT+GC are optionally produced.
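As an illustration of the sliding-window idea only (this is not the code used by density), the sketch below computes per-base and AT/GC fractions for each window. The function name, the one-base step and the 1-based start positions are assumptions made for the example.

```python
from collections import Counter

def window_densities(seq, window):
    """Return (start, base_fractions, at_fraction, gc_fraction) per window.

    Hypothetical helper: the window advances one base at a time and the
    start position is reported 1-based.
    """
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1):
        counts = Counter(seq[start:start + window])
        # Counter returns 0 for bases that do not occur in the window
        base = {b: counts[b] / window for b in "ACGT"}
        at = base["A"] + base["T"]
        gc = base["G"] + base["C"]
        results.append((start + 1, base, at, gc))
    return results
```

A larger window smooths the profile; a smaller one shows local composition changes more sharply.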
Molecular weight programs (e.g. digest, mowse) now have a -mono switch to allow use of monoisotopic weights. By default, average molecular weights are used.
The Eamino.dat format has changed. Molecular weight information has been removed and put in its own Emolwt.dat file. This latter now allows specification of average and monoisotopic weights. Values for hydrogen and oxygen are specified as well as the amino acid weights.
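The difference between average and monoisotopic weights can be sketched as follows. The tables hold commonly quoted approximate values for two residues and for hydrogen and oxygen; they are not read from Emolwt.dat, and the function name and the mono argument are hypothetical stand-ins for the -mono switch.

```python
# Approximate textbook values, NOT taken from Emolwt.dat.
AVERAGE_RESIDUE = {"G": 57.0519, "A": 71.0788}
MONO_RESIDUE = {"G": 57.02146, "A": 71.03711}
AVERAGE_ELEMENT = {"H": 1.00794, "O": 15.9994}
MONO_ELEMENT = {"H": 1.007825, "O": 15.994915}

def peptide_mass(seq, mono=False):
    """Sum residue weights and add one water (2 H + O) for the termini."""
    residues = MONO_RESIDUE if mono else AVERAGE_RESIDUE
    elements = MONO_ELEMENT if mono else AVERAGE_ELEMENT
    water = 2 * elements["H"] + elements["O"]
    return sum(residues[res] for res in seq.upper()) + water
```

Keeping hydrogen and oxygen as separate entries mirrors the entry above, where they are specified alongside the amino acid weights.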
The library representation of amino acid property information has been changed. The EmbPropTable global table has been removed and replaced with EmbPPropAmino and EmbPPropMolwt objects.
Pepcoil now produces a report (replacing a text output) in "motif" format. The default is changed to not report non coiled-coil regions as they are hard to distinguish in this format.
The "motif" report format is extended to allow two score positions marked with "*" and "+" and labelled internally as "pos" and "pos2". No application uses pos2 (it was added for pepcoil, but both score maximum positions are always the same).
A new function ajAcdIsUserdefined allows wrappers to test which qualifiers have values changed by the user so that they can use shorter command lines to launch the wrapped application.
jaspscan application added. Scans sequences for transcription factors using the JASPAR matrices.
jaspextract application added to move the JASPAR matrices into the EMBOSS data area subdirectories.
Alignment format "trace", used to display internal data content, is renamed to "debug" to be consistent with other formats. A "debug" format is added for feature output.
Application documentation has been updated to remove obsolete references to EMBL database identifiers. These are replaced with the correct accession numbers.
Two new entries have been added to the "tembl" test EMBL database for use in the QA tests.
Report output now checks the sequence and feature table type. If the sequence is not a valid protein, protein-only formats (pir, swiss) will fail with an error message. Similarly, if the sequence is not a valid nucleotide sequence then nucleotide-only formats (embl, genbank) will fail with an error message.
Garnier now uses the correct SwissProt and internal feature keys for protein secondary structure. The results will appear much better for example as a swissprot feature table. This required rewriting of the internals by recoding the secondary structure features with a "garnier" tag replacing the previous "helix", "sheet", "turns" and "coil" tags. The default output is unchanged. The results in other report formats will be changed.
Silent no longer reports the "Dir" column. This is replaced by the new "Strand" column which reports "+" for a forward feature and "-" for a reverse feature.
The following programs have changed default report output, with the strand included for nucleotide sequences: equicktandem, etandem, fuzznuc, fuzztran, recoder, restrict, silent, tcode, twofeat. The strand column can be removed with the new command line associated qualifier -norstrandshow.
Reports for nucleotide sequences have confusing ways to represent the start and end positions for features on the complementary strand. A strand column has been added to these reports, controlled by a new -rstrandshow qualifier and attribute. By default the strand is shown for all nucleotide reports (see a list of changed program outputs above). The start position is always lower than the end position for features on the complementary strand indicating the region that should be reversed. In past releases the seqtable report format (fuzznuc, dreg, dan) confusingly reversed start and end positions to indicate the unreported strand. For all report formats (nametable, table) the start and end positions are now consistent with nucleotide feature formats (gff, embl, genbank).
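The convention described above (start never greater than end, with the strand carried in its own column) can be sketched as a small conversion from the old swapped-position style. The function name is hypothetical and this is not the EMBOSS report code.

```python
def normalise_feature(start, end):
    """Return (start, end, strand) with start <= end.

    Assumes the old convention in which a start greater than the end
    signalled a feature on the complementary strand.
    """
    if start > end:
        return end, start, "-"
    return start, end, "+"
```

A parser written against the new reports can then rely on the strand column alone and never needs to compare start and end.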
Reports from dreg incorrectly reported sequences reversed with the -sreverse qualifier.
Report headers now include the text "(Reversed)" when the input sequence(s) are reverse complemented.
Phylogenetic trees in newick format are now parsed into internal trees and converted back for use by Phylip. This allows us to read other tree formats and pass them to Phylip (e.g. Nexus).
Some ACD data types did not allow the input to be NULL because extra tests were carried out on the results. These are all cleaned up and tested so that they can safely be set to nullok and missing in local applications.
New sequence reading formats for PDB files. By default the ATOM records are used (format "pdb"). An alternative format "pdbseq" will read the SEQRES records which give the original sequence. The ATOM records give the sequence determined from the structure.
Improved the help text for the -stdout and -filter options to explain output files are written to standard output. Some users expected graphics output (from plplot) to be controlled.
Extractalign is a new application to extract regions from a sequence alignment in the same way extractseq extracts regions from single sequences.
The MRS server in Nijmegen changed its syntax just before our release. A new database access method "MRS3" supports the main MRS3 server. We have very little documentation on the changed URL query syntax. Access by ID appears to work at this stage. The database URL is defined as http://mrs.cmbi.ru.nl/mrs-3/plain.do and the plain text output is now defined in the URL. The database names have all changed on the server. At present the same server appears to still support the old MRS access method with the URL http://mrs.cmbi.ru.nl/mrs/cgi-bin/mrs.cgi
ACD parsing now allows square brackets within quoted strings.
Functions for lists and tables have been renamed to new standard naming conventions. Some source files remain to be standardized after the release, most importantly ajfile, ajfeat and some remaining ajseq source files.
Warning messages are available for sequence formats that do not allow additional characters. The environment variable EMBOSS_SEQWARN needs to be set to "Y" to enable warnings. For example, EMBL format allows numbers in the sequence records. Fasta and related formats now warn for any characters that are not whitespace and not known sequence characters. These warnings are controlled by an environment variable so they can be disabled (or enabled) for specific installations and/or wrappers. We expect many cut-and-paste inputs can generate warnings. EMBOSS will normally silently remove non-sequence characters.
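A minimal sketch of the behaviour described above: non-sequence characters are always removed, and a warning is written only when EMBOSS_SEQWARN is "Y". The allowed character set, the function name and the message text are assumptions for the example, not the EMBOSS implementation.

```python
import os
import sys

def clean_sequence(text, allowed="ABCDEFGHIJKLMNOPQRSTUVWXYZ*-", env=None):
    """Strip whitespace and unknown characters; warn only if enabled.

    'allowed' is an assumed character set; 'env' lets callers (and tests)
    supply a mapping in place of the process environment.
    """
    env = os.environ if env is None else env
    warn = env.get("EMBOSS_SEQWARN") == "Y"
    kept, rejected = [], []
    for ch in text:
        if ch.isspace():
            continue                      # whitespace never triggers a warning
        if ch.upper() in allowed:
            kept.append(ch)
        else:
            rejected.append(ch)           # silently dropped unless warn is set
    if warn and rejected:
        print("Warning: removed characters: " + "".join(rejected), file=sys.stderr)
    return "".join(kept), rejected
```

Putting the switch in the environment rather than on the command line lets a site or wrapper turn warnings on without changing every call.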
Regular expression pattern file names (for dreg and preg) were converted to upper case if the ACD file required the patterns to be upper case.
The EMBOSS commandline now accepts gnu-style syntax with --qualifier (we allow one or two '-' characters). Users who tried this syntax were confused because EMBOSS treated --qualifier as a parameter. In many cases it was used as the output filename, which would give no error message but make it hard to find the output.
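The one-or-two-dash rule can be sketched as below. This is an illustration only: the function name is hypothetical, and treating anything that does not begin with a letter after the dashes (for example a negative number) as a parameter is an assumption, not documented EMBOSS behaviour.

```python
def qualifier_name(arg):
    """Return the qualifier name for '-name' or '--name', else None.

    None means the argument is treated as a parameter value.
    """
    for prefix in ("--", "-"):            # test the longer prefix first
        if arg.startswith(prefix):
            name = arg[len(prefix):]
            # assumed rule: a qualifier name must start with a letter
            return name if name[:1].isalpha() else None
    return None
```

With this rule "--outfile" and "-outfile" name the same qualifier, so a GNU-style command line no longer ends up supplying "--outfile" as an output filename.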
Antigenic now accepts any protein sequence as input (earlier versions did not allow ambiguity codes). B and Z are treated as weighted averages of D/N and E/Q. All others are converted to X and treated as a weighted average of all values. The data table used has no information for selenocysteine or pyrrolysine.
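The weighted averages for ambiguity codes can be sketched as follows. The property values and frequency weights are supplied by the caller; the function name is hypothetical and no values from the EMBOSS data files are reproduced here.

```python
def weighted_average(values, weights, residues):
    """Frequency-weighted mean of a per-residue property.

    values  : property value for each residue code
    weights : relative frequency for each residue code
    residues: the codes to combine, e.g. "DN" for B or "EQ" for Z
    """
    total = sum(weights[r] for r in residues)
    return sum(values[r] * weights[r] for r in residues) / total
```

For example, a value for B would come from weighted_average(values, weights, "DN"), and a value for X from the same call over every residue in the table.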
Dottup is corrected to plot only the selected sequence range. The plot lines were 1 residue too long (only noticeable on very short sequences).
Distance matrix data can now read multiple distance matrices from a single input file. This is used by three programs (fneighbor, ffitch and fkitsch) in the phylipnew EMBASSY package.
Discrete states input now correctly defaults to all non-space characters if no characters attribute is given in the ACD file. This was the intention, but two programs (fpars and fdiscboot) were instead accepting only 0 and 1. Other phylip programs have their discrete state character set specified in the ACD file.
A new function ajSystemOut calls a system command, and redirects standard output to a named file.
Function names are standardized for the ajsys, ajtime and ajutil functions.
New function ajStrTableFreeKey frees only the key from tables where the value is a constant.
Error messages from reading badly formatted comparison matrix files are improved to report the line and the token that failed to parse.
Test data has been updated. EMBL and SwissProt entries are updated to the latest versions of these entries. Swnew entries are now a selection from the SpTrEmbl subset in UniProt. The wormpep database is obsolete. We do not have current data for the gb directory which contained GCG reformatted genbank entries.
NBRF (or PIR) format failed to read some entries from SRSWWW servers because the sequence ID does not match if the protein is a fragment.
Efficiency of building large strings is greatly improved by doubling the reserved space each time the end is reached. This speeds up the reading of all long sequences.
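The effect of doubling the reserved space can be seen in a small sketch. The class below is a stand-in written for this illustration (it is not the AJAX string code); it counts how often the buffer has to be reallocated.

```python
class GrowableBuffer:
    """Byte buffer that doubles its reserved space when it fills up."""

    def __init__(self, reserve=16):
        self.data = bytearray(reserve)    # reserved space (reserve must be > 0)
        self.length = 0                   # bytes actually in use
        self.reallocations = 0

    def append(self, chunk):
        needed = self.length + len(chunk)
        if needed > len(self.data):
            new_size = len(self.data)
            while new_size < needed:      # keep doubling until the chunk fits
                new_size *= 2
            new_data = bytearray(new_size)
            new_data[:self.length] = self.data[:self.length]
            self.data = new_data
            self.reallocations += 1
        self.data[self.length:needed] = chunk
        self.length = needed

    def value(self):
        return bytes(self.data[:self.length])
```

Appending n bytes one at a time needs only about log2(n) reallocations with doubling, instead of a number of reallocations proportional to n when the space grows by a fixed amount each time.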
String function ajStrFmtWrap to wrap strings for output now respects newlines in the original string. A new function ajStrFmtWrapAt prefers to wrap at a selected character, for example ',' for author lists.
Sequence objects are extended to include the full set of fields defined in EMBL, Genbank and UniProt database entries. The "embl" "genbank" and "swissprot" formats now read and write all fields, so that entries will be rewritten exactly as in the originals except for a few minor corrections (extra spaces in feature tables are removed). We cannot guarantee that information is preserved when writing out in a different format. For example, EMBL and Genbank formats do not contain the same information.
GIF graphics output added where the gd library is a recent enough version to provide support.
The plplot graphics library has been updated to 5.7.2. New files are disptab.h and pldll.h. File gd.c replaces file gdpng.c and needed one change for FREETYPE.
Infoseq can now optionally display the database name.
The acdvalid utility warns about qualifier names that do not fit the standard naming convention. The messages now include a suggested valid name, for example an input file called -sites will be suggested as -sitesfile.
Sequence output in EMBL and SWISS formats now defaults to the new format of the databases from 2006. The previous formats are still available as "emblold" and "swissold". As sequence input, "embl" and "swiss" formats will read both versions of the files.
Function ajTableRemove deletes an entry in a table, but only returns the value. This is replaced by ajTableRemoveKey which also returns the original key. The caller now owns both the value and the key, and is responsible for deleting them. ajTableRemove is now declared obsolete and will be removed from a future release.
Infoseq by default uses columns with fixed width, but this fails to delimit long sequence names (for example, long file names and paths). Two changes make this better. Infoseq now inserts a space in column-delimited output (the default) when a string fills the whole column. It is also now possible to specify a tab as delimiter with -nocolumn -delimiter "\t" to return to 3.0.0 behaviour. This was needed for the W2H interface and maybe some other wrappers.
Renamed libplplot to libeplplot and plplot headers are now installed to include/eplplot. This avoids collisions with later versions of plplot.
Bugfix 1: graphics output failed to reset the title correctly in some applications. Prettyplot and banana badly rescale the output from the second page of multipage output. Abiview produced additional blank pages with only the title. Abiview also had bugs in display when the user changed the window size or asked for separate plots for each trace.
A new ACD attribute outputmodifier: "Y" identifies qualifiers that cause the kinds of output changes that can break parsers. An obvious example is the -html qualifier on many of the utility programs. This attribute is a warning to wrapper developers and maintainers that they may want to fix the value of this qualifier and not allow users to change it. In some cases (as with toggle qualifiers) it may be useful to wrap each possible value separately. For example, tfm can run as an HTML version (-html) and a text version (-nohtml -nomore).
Backtranseq now keeps stop positions in the sequence and replaces them with the most common stop codon. Previous releases converted stops to 'X' and back translated them as 'NNN'.
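The stop-codon handling can be sketched as below. This is not the backtranseq source: the function names are hypothetical, the three stop codons listed are those of the standard genetic code, and the codon counts and preferred-codon table are supplied by the caller.

```python
def most_common_stop(codon_counts, stops=("TAA", "TAG", "TGA")):
    """Pick the stop codon with the highest count in a codon usage table."""
    return max(stops, key=lambda codon: codon_counts.get(codon, 0))

def back_translate(protein, best_codon, codon_counts):
    """Back-translate a protein, keeping '*' as the most common stop codon.

    Residues missing from best_codon fall back to 'NNN' in this sketch.
    """
    stop = most_common_stop(codon_counts)
    return "".join(stop if aa == "*" else best_codon.get(aa, "NNN")
                   for aa in protein)
```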
Reading sequences in NBRF (or PIR) format now only removes one '*' from the end, allowing protein sequences to end with a stop codon.
Reading NBRF format sequences in FASTA format was retaining a ';' in front of the sequence ID. This is now fixed.
Pattern files and regular expression files now use the -pformat and -pname associated qualifiers which were ignored when they first appeared in 4.0.0. Pattern file formats are "fasta" for the original format in 4.0.0 with FASTA style identifiers, and "simple" for files with a single pattern on each line. The format defaults to testing the first character for a '>'. The pattern name is used to set a name of "name1", "name2" and so on if no name is in the FASTA file. By default patterns are called pattern1, regular expressions are called "regex1".
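The format detection and default naming described above can be sketched as follows. The function and its arguments are hypothetical stand-ins for -pformat and -pname; numbering unnamed patterns by their position in the file and joining multi-line patterns are assumptions made for the example.

```python
def read_patterns(text, pformat=None, pname="pattern"):
    """Return a list of (name, pattern) pairs from pattern file text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if pformat is None:
        # default: test the first character for '>'
        pformat = "fasta" if text.lstrip().startswith(">") else "simple"
    patterns = []
    if pformat == "simple":
        for ln in lines:                  # one pattern per line
            patterns.append((f"{pname}{len(patterns) + 1}", ln))
        return patterns
    name, body = "", []
    for ln in lines + [">"]:              # sentinel header flushes the last record
        if ln.startswith(">"):
            if body:
                patterns.append((name or f"{pname}{len(patterns) + 1}",
                                 "".join(body)))
            words = ln[1:].split()
            name, body = (words[0] if words else ""), []
        else:
            body.append(ln)
    return patterns
```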
Added a new function to read from a buffered file and trim newlines. It was not needed before because input functions were doing their own trimming.
Valgrind memory leak tests now cover all QA tests. The command line is captured and used to generate test cases. Script valgrind.pl knows about the few cases that need input files copied and preprocesses them by name. A few tests can be flagged as ignored. This is intended for tests known to run for a very long time under valgrind. Memory leaks are fixed for all programs in the main EMBOSS package and for the most used ones in the EMBASSY packages.
A new environment variable ACDCOMMANDLINELOG takes a filename as its value. This saves the command line equivalent of a program run, converting user responses to prompts into their command line equivalents. A number of bugs in command line saving for report headers were identified and fixed.
Two string functions had their names reversed. ajStrRemoveWhite is to remove all white space from a string, ajStrRemoveWhiteExcess is to remove white space from the ends and replace internal whitespace with single spaces. When function names were standardized these names were reversed. As function calls were converted automatically EMBOSS code worked as before, but developers will notice that the functions do not behave as expected. This is now corrected, and all existing calls in the EMBOSS code have been checked and converted.
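The intended behaviour of the two functions, as described above, can be pictured with these Python equivalents. The names are illustrative only; the real functions are C calls in the AJAX library and modify the string in place.

```python
def remove_white(text):
    """Like ajStrRemoveWhite as described: drop all white space."""
    return "".join(text.split())

def remove_white_excess(text):
    """Like ajStrRemoveWhiteExcess as described: trim the ends and
    collapse internal runs of white space to single spaces."""
    return " ".join(text.split())
```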
Showseq with a sequence end position now stops output at the end of the user-specified range. Previous releases printed the whole of the line with the last base/residue.
SRS servers use "gid" as the field name for GI numbers. The field name has been changed to allow GI searches with local SRS and remote SRSWWW access to Genbank.
A new configure option for developers --enable-devwarnings turns on many more warning messages from the gcc compiler. Not all warnings are useful - the less useful gcc options are documented (and commented out) in the configure.in file devwarnings section. Warnings include missing function prototypes, signed/unsigned comparisons, potential loss of precision in casts, use of global names (index for example) as variables.
Function names in ajseqwrite.c have been standardized. Old names are still accepted but are marked as "deprecated" and will generate warnings with the gcc compiler (see ajstr below). Other compilers will see no difference.
Edialign is a new application, a port of the DIALIGN2 program by B. Morgenstern, using an ACD file written by Guy Bottu. It takes as input nucleic acid or protein sequences and produces as output a multiple sequence alignment. The sequences need not be similar over their complete length, since the program constructs alignments from gapfree pairs of similar segments of the sequences.
Wordfinder is a new application to find word-based matches of limited size. It is based on code from supermatcher. The inputs are reversed so the query sequence set (unaligned) is compared to a streamed database of sequences. (Supermatcher should perhaps have its inputs in this order too). Limits are provided for the length of the word match and the length of the alignment. The default gap penalties are also increased to limit the gaps allowed in alignment.
Word-based algorithms found too many matches where both sequences contain runs of X (protein) or N (nucleotide). These are now ignored when building the word table.
Word-based algorithms complained if a sequence was shorter than the wordsize. This was a problem for database searches with some short sequences present. They now run silently and simply return no word matches.
The EMBL format sequence entry parser was able to read swissprot sequence data, but not the feature table. Efficiency improvements to set the sequence type to nucleotide for EMBL entries showed that swissprot entries were being read by the EMBL parser. A test for swissprot protein information on the ID line should redirect these entries to the swissprot parser. In previous releases the sequence type was not set, so there was no problem with the sequence type - although feature lines may not have been readable from swissprot format flat files. Database definitions specify the swiss or embl format so they are not affected.
Large sequences were running very slowly. This was traced to the way sequence types are tested using regular expressions processed by calls to the PCRE library. These calls were replaced by simple string functions as they are only testing that a sequence is entirely composed of characters from an allowed set. An additional speedup was achieved by defining only upper case characters as required (almost halving the number of tests) and testing the upper case version of the sequence characters.
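The replacement test amounts to a membership check against an upper-case-only character set. A sketch, with an assumed nucleotide set (the IUPAC codes plus a gap character) rather than the set EMBOSS actually defines:

```python
# Assumed allowed set: IUPAC nucleotide codes plus '-' for gaps.
NUCLEOTIDE_CHARS = frozenset("ACGTUNRYSWKMBDHV-")

def is_all_from(seq, allowed_upper):
    """True if every character, once upper-cased, is in the allowed set.

    Only upper-case characters need to be listed, which roughly halves
    the size of the set compared with listing both cases.
    """
    return all(ch in allowed_upper for ch in seq.upper())
```

No regular expression engine is involved: each character costs a single set lookup.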
Sequence translation in the reverse direction adds extra amino acids for partial codons. In the forward direction the overhang was miscalculated so these codons were missed. No users have complained, probably because in most cases they are translated as 'X' (it needs a 4-base wobble in the code to convert the first 2 bases of a codon into a single amino acid).
Sequence translation was relatively slow, at least on very large sequences. Profiling with gprof indicated some changes to reduce the number of string handling calls (each was very fast, but there was a very large number of calls). The internal tables were resized (from 15 elements to 16) for more efficient mapping.
Parsing NCBI format ID lines saves the database. This is available for writing NCBI formatted output ID lines, but is not to be used in reporting the USA.
Added "refseq" as a sequence and feature format. Initially a simple alias of GenBank but we may let them diverge later.
REFSEQ entries have their own idea of what a ProteinID in the feature table looks like, as they use REFSEQP protein IDs. Validation now allows the third character to be an underscore.
Large numbers of database files could make the dbi indexing programs (dbiflat, dbifasta, dbigcg, dbiblast) fail at the sort merge stage when the index files are combined. The sort merge is now in 2 steps to limit the number of open files required in the system sort utility.
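The two-step merge can be sketched with in-memory lists standing in for sorted index files. The function name and the max_open parameter are hypothetical, and the real indexing programs use the system sort utility rather than Python.

```python
import heapq

def merge_in_two_steps(sorted_runs, max_open):
    """Merge sorted runs without ever merging more than max_open at once
    in the first step."""
    # Step 1: merge the runs in groups of at most max_open.
    intermediate = [
        list(heapq.merge(*sorted_runs[i:i + max_open]))
        for i in range(0, len(sorted_runs), max_open)
    ]
    # Step 2: merge the (far fewer) intermediate results.
    return list(heapq.merge(*intermediate))
```

With N input files and a group size of max_open, the second step handles roughly N / max_open inputs, which is what keeps the number of simultaneously open files bounded.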
Added a script emblsplit.pl to split EMBL and UniProt database files into 2Gbyte chunks.
The -sid qualifier now overwrites the sequence id if used. The -sid value will be used for creating the output filename and for reporting the sequence identifier in output files. For more than one sequence as input currently the same ID is used. We may change this in future to generate new IDs from this base name.
New sequence format gifasta is the same as "ncbi" but uses the GI number as the identifier. Because the output is the same for both formats we have to require -sformat gifasta to be on the commandline. The default for such files will remain "ncbi" as the automatically processed format. On output if there is no GI number a dummy value of "000000" is currently used.
coderet now writes non-coding sequence to a new output file.
New feature function ajFeatLocMark marks selected features as lower case. Used by coderet to report non-coding regions.
The help output now correctly reports output sequence default filenames.
Phylip input distance matrices now allow integer values to be treated as reals, although there is a possible confusion over integer replicate values so the use of a trailing ".0" is strongly recommended.
Sequences with NCBI deflines and no ID after the final "|" were using the version part of the seqversion ("1" from "AB123456.1") instead of the "AB123456" part to set the ID.
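The corrected behaviour can be sketched as below. The function name and the example deflines are made up for illustration, and the sketch handles only the simple "|"-separated layout, not every NCBI syntax variant.

```python
def id_from_defline(defline):
    """Return the ID for an NCBI-style defline (assumed non-empty).

    If nothing follows the final '|', fall back to the accession part of
    the seqversion ('AB123456' from 'AB123456.1'), not the version.
    """
    fields = defline.lstrip(">").split()[0].split("|")
    if fields[-1]:
        return fields[-1]                 # an ID is present after the last '|'
    accession, _, _version = fields[-2].partition(".")
    return accession
```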
Graph titles were not standard on the general "graph" type output, but are consistent for xygraph outputs. A new attribute gdesc defines a prefix for graph titles which can be appended to by the calling program, usually with a description of the input (sequence USA, input filename). A new call ajGraphSetTitlePlus defines the text to add to the gdesc as "[gdesc] of [text]". All graphs were standardized except pepinfo which has 10 subplot titles already in the intended format. This will be corrected later to have standard main titles and shorter subplot titles.
The version of plplot we use has a bug in calculating character sizes where the origin in user units is not the default of (0,0). This has been fixed in the plgchrW and plstrlW functions in the copy that is included with EMBOSS.
Dreg and preg ignored sequence begin and end positions. Both programs now use the embpatlist function calls to process sequence ranges.
Fuzznuc, fuzzpro and fuzztran lost the ability to use the sequence begin and end positions when we switched to pattern lists. This has been restored in the pattern list processing code.
The logfile caused a file close error if it was read only (because it had not been successfully opened). Opening the logfile now tests the file is writable and ignores logging for a read-only file.
More case-sensitive sequence comparison and matching functions added to be consistent about providing both versions.
A few sequence databases have no accession number. For these a new database attribute hasaccession: "N" in emboss.default prevents EMBOSS trying to search the ACC field in addition to the ID field.
A few databases with duplicate IDs should be treated as case-sensitive. The original example was a pdbprot database, containing FASTA format sequences of individual chains from PDB entries. In PDB, the entry itself is a 4-character string, and the chain is a single character A through Z. When an entry has more than 26 chains, the next 26 are labelled a through z. Pdbprot appends these as _A, _B, etc. PDBPROT is available from some public SRS servers - see the official list at http://downloads.lionbio.co.uk/publicsrs.html. This is resolved by adding a new database attribute caseidmatch in emboss.default. A value of "Y" will force EMBOSS to exactly match the case of the whole ID. This is done by post-processing and rejecting entries with an ID that fails to match.
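The post-processing step can be sketched as a filter over the entries returned by a case-insensitive lookup. The function name is hypothetical; only the caseidmatch attribute name comes from the entry above.

```python
def filter_by_case(query_id, entry_ids, caseidmatch=False):
    """Keep entries whose ID matches the query.

    Without caseidmatch the comparison ignores case; with it, entries
    whose ID differs in case anywhere are rejected.
    """
    if caseidmatch:
        return [e for e in entry_ids if e == query_id]
    return [e for e in entry_ids if e.upper() == query_id.upper()]
```

For a pdbprot-style database this is what separates chain "_A" from chain "_a" of the same PDB entry.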
The run date included in report output has changed format to have the day first and to lose the leading zero when the day is 1st to 9th of the month.
Program cpgplot can run on more than one input sequence, but the plot failed on the second sequence. Fixing this required adding a new function ajGraphDataReplaceI to replace the 1st, 2nd 3rd, etc. subgraph. Some memory cleanup was also added to remove the replaced graph data objects.
Programs pepwindow and pepwindowall can now process any protein sequence. In previous versions pepwindow was restricted to pureprotein (no ambiguity codes) while pepwindowall accepted any protein sequence (it has to handle gaps) but was using a score of zero for unknown amino acid residues. Changed so that missing amino acid values can be filled in using Dayhoff frequency weighted averages for B, J and Z and an overall average for X, J and O.
Program octanol can accept any protein sequence. Interpolated values are used for B, Z and J. An average over all values is used for X and also for O and U where there is no data. Interpolations and averages used the Dayhoff amino acid frequencies.
Program iep can accept any protein sequence. Ambiguity codes B and Z are resolved by converting to the carboxylic acid (D or E) or amide (N or Q) according to the Dayhoff amino acid frequencies, giving a consistent value for any input protein.
Sequence set type testing was checking whether the seqset is defined as protein but ignoring the type of the first sequence. This is now fixed.
Program tfm looked in the obsolete install directory with the -html option. It now finds the EMBASSY package name from the installed ACD file and then finds the installed HTML file. If EMBOSS has not been installed, it will also search the original source files.
Modified NCBI/FASTA format to preserve the database name from the NCBI style ID. The database name is reported in one of the many and varied NCBI syntax variants, depending on whether there is a version or accession number, and whether there is an EMBOSS database name also involved (for example, an entry in a file indexed with dbxfasta or dbifasta).
Modified "pearson" sequence format to keep the FASTA file ID complete. For historical reasons GCG-style dbname:id syntax was still having the db part trimmed. This will still be trimmed from fasta or ncbi format.
The report for digest has Cterm and Nterm columns capitalized to match the rest of the report. Sequence ranges now give correct cterm and nterm results.
The list file Cut.index for codon usage tables was changed to remove old file names (commented out list at the end) and to remove underscores from the species names.
Programs water, needle, merger and prophet calculate an internal path size from the lengths of the input sequences. For sequences that are too long, a fatal error is produced. But if the sequences are extremely long, the test failed and the program gave a segmentation fault. This fix tests in a different way that will catch all cases. (added as a fix to 4.0.0)
The new MRS access method used a general search. This gave strange results when the ID or accession appeared in any other entry. It appears that MRS can search for id or accession only. This worked on the main MRS server at least. (added as a fix to 4.0.0)
New database access methods MRS and DBFETCH need to be explicitly turned on so that showdb can report them. (added as a fix to 4.0.0)
When deleting the last line of buffered input, failed to reset the pointer to the last buffered line. This only affected debug traces. Unfortunately, the ajFileBuffClear function does call the debug trace. In practice we have only seen this bug when processing sequence data in EMBL format from an MRS server. (added as a fix to 4.0.0)
Pattern and regular expression searches failed to correctly reverse a nucleotide sequence. The change is to use ajSeqReverseForce (always reverses the sequence provided) instead of ajSeqReverseDo (which only reverses if the reverse flag is set). (added as a fix to 4.0.0)
Reports in list format failed to write a usable USA for "asis" sequence input, and incorrectly reported reverse strand nucleotide features. (added as a fix to 4.0.0)
The list files Matrices.nucleotide, Matrices.protein and Matrices.proteinstructure now have comment headers explaining their format.
Fixed issues with nucleotide features in the reverse direction in reports. The start/end positions were stored the wrong way around and then reversed again when reported in one of the report formats. However, reporting as EMBL features showed the incorrect storage. ajFeatNewII now checks start/end and reverses the feature if start is greater than end. ajFeatNewIIRev sets the reverse strand and also checks that the start position is greater than (or equal to) the end position. (added as a fix to 4.0.0)
To reduce the size of very large reports, for example when fuzznuc or fuzzpro run over very large databases, new qualifiers are added to report output. -rmaxseq gives the maximum hits for any one sequence, -maxall gives the total maximum number of hits. The report tail contains a record of the number of hits reported and found. The qualifiers are intended for web interfaces to control the maximum output they need to report. When the maximum hits figure is reached, ajReportWrite returns false so that programs can terminate at that point. (added as a fix to 4.0.0)
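The interaction of the per-sequence and overall limits can be sketched as below. The class is a stand-in for illustration, not the ajReport code; the wording of the tail line is invented, and only the idea that the write call returns false once the overall maximum is reached comes from the entry above.

```python
class HitReport:
    """Collect hits subject to a per-sequence and an overall maximum."""

    def __init__(self, maxseq=None, maxall=None):
        self.maxseq, self.maxall = maxseq, maxall
        self.found = 0                    # every hit seen, reported or not
        self.reported = []

    def write(self, seqname, hits):
        """Record hits for one sequence; return False once maxall is reached."""
        self.found += len(hits)
        keep = hits if self.maxseq is None else hits[:self.maxseq]
        if self.maxall is not None:
            keep = keep[:max(0, self.maxall - len(self.reported))]
        self.reported.extend((seqname, h) for h in keep)
        return self.maxall is None or len(self.reported) < self.maxall

    def tail(self):
        # hypothetical tail text
        return f"Reported hits: {len(self.reported)} of {self.found} found"
```

A calling program can loop over its input sequences and stop as soon as write returns False, which is how a web interface would cap the size of its output.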
Reports now write a header and tail when closed, to make sure that all programs will write something to the report file. The default header contains the command line provenance, the tail contains the number of sequences and hits. (added as a fix to 4.0.0)
The format of the knowntypes.standard file in the emboss/acd directory has changed to list the knowntype first, then the datatype and finally the description. The file should be sorted by knowntype, and any description should not end in "file" so that file and directory prompts can be generated.
Standard prompts can be generated from the knowntype for files, directories and other data types. This can reduce the need for special information: attributes, but to help those who maintain parsers and wrappers we will try to keep an information string in the ACD file to match the prompt generated by EMBOSS. Acdvalid will report cases where the information string does not match the generated prompt. There may be a few cases where two inputs or outputs of the same knowntype are needed.
The output produced by -help provides more information about associated qualifiers than the HTML table view (from acdtable) which is included in the HTML documentation in the distribution. However, there is also a lot of extra information in the acdtable output on the default values and the allowed values for each qualifier. The -help output is now expanded to include all the information provided by the acdtable view. A benefit of this is that we can now remove the badly formatted acdtable from the text version of the documentation. This is used by tfm so the output of the tfm program will now be easier to read.
The default prompts for input and output files have been very simple for the first 10 years. EMBOSS now has a "known type" defined for all files in ACD. The known type is now included in the automatically generated prompt for input and output files. To help in this process, the known type should not have the word "file" at the end. This will be added automatically in the prompt.
Printing with conversion type %g could write extra zeros where the decimal point was stripped. In C, %g conversion removes trailing zeros and the decimal point if nothing remains after it. The AJAX print conversion functions added extra zeros at start of the output to extend the result up to the expected width.
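The expected %g behaviour can be checked with Python's printf-style formatting, which follows the C rules for this conversion. The helper below is only a convenience for the example and has nothing to do with the AJAX print functions themselves.

```python
def format_g(value, width=0):
    """Format with %g and an optional minimum field width.

    Trailing zeros are removed, and the decimal point goes too when
    nothing remains after it; padding to the width uses spaces.
    """
    return "%*g" % (width, value)
```

So 2.50 is written as "2.5" and 3.0 as "3"; when a width is given the result is padded on the left with spaces, not extended with extra zeros.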
Prophet modified to use an "align:" ACD definition rather than an "outfile:". A bug which was mixing up the name of the profile with the name of the sequence has been fixed.
Simple XML DOM added. This has no additional library dependencies. This is a preliminary step in producing (revisiting) XML graphics output etc.
EMBL/Genbank have agreed to add a new amino acid code 'O' for pyrrolysine. O has been added to EMBOSS checking for protein sequence data, and to the existing data files that contain 'U' (selenocysteine). IUPAC/IUBMB has accepted the use of O for protein sequences. This means that any alphabetic text is now a valid protein sequence. There are 20 naturally occurring amino acids, plus 'X' (unknown), 'B' and 'Z' ('D' or 'N' and 'E' or 'Q' for analysis of complete digests), 'J' ('I' or 'L' in mass spectrometry), 'U' (selenocysteine) and 'O' (pyrrolysine). There is a small complication - older versions of phylip sometimes use 'O' as a gap character. EMBOSS will still allow this in nucleotide sequences.
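The claim that any alphabetic text is now a valid protein sequence can be checked by counting the codes. The sketch below is illustrative only (the names are hypothetical and gap or stop characters are not considered); it is not the EMBOSS type-checking code.

```python
import string

# The 20 standard amino acid codes.
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")

# The additional codes listed in the entry above.
EXTRA = {
    "X": "unknown",
    "B": "D or N",
    "Z": "E or Q",
    "J": "I or L",
    "U": "selenocysteine",
    "O": "pyrrolysine",
}

PROTEIN_CODES = STANDARD | set(EXTRA)


def is_protein_text(seq):
    """True if seq is non-empty and every character is a protein code."""
    return bool(seq) and all(ch.upper() in PROTEIN_CODES for ch in seq)
```

The 20 standard codes plus the 6 extra codes cover all 26 letters, which is why alphabetic text alone can no longer rule out a protein sequence.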
New sequence access method "mrs" uses CMBI's "Maarten's Retrieval System" http://mrs.cmbi.ru.nl/mrs/cgi-bin/mrs.cgi to query databases by ID or accession.
New sequence access method "dbfetch" uses the EBI's dbfetch REST services http://www.ebi.ac.uk/cgi-bin/dbfetch to query databases by ID or accession.
iep changed to allow users to specify number of modified (uncharged) lysines and intrachain disulphide bridges. This includes extensions to embIep functions to include the two new parameters. These updates were provided by Clemens Broger of F.Hofmann-La Roche Ltd.
Changes to splitter and union by Kim Rutherford (Artemis maintainer at the Sanger Institute) allow features to be preserved for nucleotide sequences. The default operation of both programs is unchanged.
Regular expression pattern lists are accepted by dreg and preg. The output reports include pattern names which default to regex1, regex2, and so on. The "regex" prefix can be set using the new associated qualifier -pname on the command line.
Prosite pattern lists are accepted by fuzznuc, fuzzpro and fuzztran. The output reports include pattern names which default to pattern1, pattern2, and so on. The "pattern" prefix can be set using the new associated qualifier -pname on the command line.
Regular expressions have the same syntax as the new pattern datatype - they can be in a file, with pattern names, and have a qualifier -pname to set the name for a pattern. Regular expressions also have a type defined in ACD which can be nucleotide (e.g. for dreg), protein (e.g. for preg) and string for general patterns. Function ajAcdGetRegexSingle will read a single regular expression. ajAcdGetRegex now reads a list of regular expressions.
New ACD pattern type reads a PROSITE style pattern, or @filename where filename contains patterns with names in FASTA format. Patterns in the file are concatenated if on multiple lines. The file may also contain mismatch=n after the ID to set the number of mismatches for a pattern. Patterns also have associated qualifiers -pmismatch and -pname for the pattern on the commandline or all patterns in the file.
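A minimal sketch of how such a pattern file could be read is given below. It is not the ajPat implementation: the function name, the dictionary layout and the exact placement of mismatch=n on the header line are assumptions made for illustration.

```python
import re


def parse_pattern_file(text, default_mismatch=0, prefix="pattern"):
    """Read FASTA-style named patterns; lines of one pattern are concatenated."""
    patterns = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            name = header.split()[0] if header else ""
            # Optional per-pattern mismatch count after the ID (assumed layout).
            found = re.search(r"mismatch=(\d+)", header)
            current = {
                # Unnamed patterns get prefix1, prefix2, ... as in the reports.
                "name": name or f"{prefix}{len(patterns) + 1}",
                "mismatch": int(found.group(1)) if found else default_mismatch,
                "pattern": "",
            }
            patterns.append(current)
        elif current is not None:
            current["pattern"] += line
    return patterns
```

Here default_mismatch and prefix stand in for the values that -pmismatch and -pname would supply for patterns that do not set their own.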
Pattern processing is changed to use lists of patterns, as submitted by Henrikki Almusa of Medicel in Helsinki. This is implemented as new ACD data type "pattern", which required some NUCLEUS embPat functions and data types to be moved to AJAX ajPat so that they can be called from ajacd.c.
"a2m" alignment format (which is just fasta) is now supported in ACD.
New EMBASSY MEME package containing "wrapper" applications providing an EMBOSS-style interface to the applications in the original MEME package version 3.0.14 developed by Timothy L. Bailey. The package is fully documented.
New EMBASSY HMMER package contains "wrapper" applications providing an EMBOSS-style interface to the applications in the original HMMER package version 2.3.2 developed by Sean Eddy. The package is fully documented.
ACD dirlist: order of list of files is now system-independent.
fuzztran: now always generates an output file, even if there is no data.
coderet: now writes any permutation of cds, mrna and protein sequence output to separate files. Output file formats may be set independently and have the default file extensions of "cds", "mrna" and "prot".
oddcomp: New ACD option to set the window size equal to length of the current protein. Code cleaned up.
Restrict: alphabetic sorting fixed in the case where -limit is specified.
Digest changed to add ragging option. Original code was contributed by Gregoire R Thomas.
infoseq: code largely rewritten. Two new advanced ACD options to specify output using a user-defined delimiter or in columns. Output much cleaner, e.g. columns are aligned.
Digest changed to read a sequence stream (earlier versions read only one sequence). Code for this was contributed by Henrikki Almusa of Medicel in Finland.
Two new programs makenucseq and makeprotseq have been submitted by Henrikki Almusa of Medicel in Finland. They create sets of random sequences. Sequence composition can be specified by a codon usage file or by pepstats output.
New format "swissnew", with aliases "swnew" and "swissprotnew", added. UniProt has announced future changes to the UniProt entry format, which is still called "swiss" in EMBOSS. The ID line has "Reviewed" and "Unreviewed" in place of "STANDARD" and "PRELIMINARY", and no longer has the "PRT;" placeholder for the EMBL format "division" - now obsolete as EMBL has changed this part of their ID line in the latest release. In EMBOSS 4.0.0 we replace "STANDARD" with "Unreviewed" as more appropriate to entries that come from FASTA files and other sources.
Programs which analyze nucleotide features now call ajFeatGet functions in most places. In previous releases, some of these programs used the internal feature data structures directly.
GFF format feature files are designed for nucleotide sequences. EMBOSS supports the use of GFF for protein sequence.
Feature keys (to use the EMBL/Genbank feature table term) are now defined with external names for each format and a list of internal names to be used by EMBOSS. This greatly simplified the conversion of SwissProt and PIR feature tables. The internal table also has a list of aliases. The internal aliases for nucleotide features are as far as possible identifiers from the Sequence Ontology SOFA (feature annotation) subset. In a few cases, where multiple EMBL/Genbank terms map to a single SOFA term, new terms have been added to extend the SOFA name uniquely (we simply append the EMBL/Genbank feature key).
MSF format files with more than 5000 sequences were truncated on input - only the first 5000 names were being read. This limit has been removed. As "emma" uses MSF format for the clustalw run it launches, this problem limited emma to 5000 output sequences in previous releases.
The EMBL database has changed its ID line. The new line has semicolons after each token, the primary accession instead of the ID (there is no ID in the new EMBL format), and the sequence version as a number. Internally in EMBOSS we continue to build the accnum.n style sequence version. We expect most other packages will take some time to change EMBL formats, so for output this is called "emblnew" format. As input, "embl" format will accept both the old and new style entries. For database indexing, dbiflat and dbxflat will read old and new formats as "embl" by looking for SV on the ID line. EMBL and EMBLNEW format output is also improved by wrapping long DE lines.
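The detection rule described above (look for SV on the ID line) can be sketched as follows. The function name is hypothetical and the sample ID lines in the usage note are illustrative; this is not the dbiflat or dbxflat code.

```python
def parse_embl_id(line):
    """Return (name, seqversion) from an old or new style EMBL ID line.

    seqversion is in accnum.n style for new-style lines and None for old-style
    lines, where the version is given on a separate SV line.
    """
    body = line[2:].strip().rstrip(".")
    tokens = [tok.strip() for tok in body.split(";")]
    if len(tokens) > 1 and tokens[1].startswith("SV "):
        # New style: primary accession first, then "SV n" as the second token.
        accession = tokens[0]
        version = tokens[1].split()[1]
        return accession, f"{accession}.{version}"
    # Old style: the first word is the entry name (ID).
    return tokens[0].split()[0], None
```

For example, "ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP." is treated as new style and gives the version X56734.1, while "ID   X56734     standard; RNA; PLN; 1859 BP." is treated as old style.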
Wossname will now search for each word in a phrase used as the search text. By default, all words must match. A new qualifier -noallmatch tells wossname to match any word in the search. Partial word matches are accepted so "restrict" will match "restriction". The search term is also compared to the groups and keywords attributes in the ACD file. A new qualifier -showkey will report the keywords to help explain why applications were matched.
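The matching rule can be sketched as below. This is a simplified illustration, not the wossname code: the function name is hypothetical, and a partial word match is assumed to mean a case-insensitive substring match against the combined description, groups and keywords text.

```python
def matches(search_text, app_text, allmatch=True):
    """Test each word of search_text against app_text.

    allmatch=True requires every word to match (the default behaviour);
    allmatch=False accepts any single match (as with -noallmatch).
    """
    haystack = app_text.lower()
    hits = [word in haystack for word in search_text.lower().split()]
    return all(hits) if allmatch else any(hits)


if __name__ == "__main__":
    text = "Find restriction enzyme cleavage sites"
    print(matches("restrict sites", text))           # both words found
    print(matches("restrict protein", text))         # 'protein' is missing
    print(matches("restrict protein", text, False))  # any word is enough
```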
All ACD files have a new application attribute keywords: which provides keywords to search for in addition to the groups. This is intended for keywords which are hard to include correctly in the short description. A file keywords.standard is provided with a list of all keywords. This is for use by utilities searching programs by keyword, which will be expected to check the groups and keywords attributes in a single query.
Reading a sequence of type "any" sets the sequence type to nucleotide by default. Any x or X ambiguity codes will be converted to 'n' or 'N' to avoid confusion in programs that will convert a second nucleotide sequence (alignment programs, for example). X is allowed as an unknown character in nucleotide sequences (and N is also allowed as 'any base').
Stockholm and Selex sequence formats, used mainly by the HMMER and HMMERNEW embassy packages, have been corrected for a few cases where automatic format detection generated errors.
Function names in ajseq.c have been standardized. Old names are still accepted but are marked as "deprecated" and will generate warnings with the gcc compiler (see ajstr below). Other compilers will see no difference.
Further correction to reversed sequence numbering for local alignments from water and supermatcher. For these local alignments all reversed alignments were ending at "1" because the end offset was not calculated correctly. Matcher called a different function to set sequence positions and reported correct positions.
For alignments with a line of gaps, adjusted the numbering to report the last sequence position instead of the next at the start of the line.
Program einverted output is changed to include the sequence ID and the program input is changed to process more than one sequence as input. The change to the output format was needed to indicate which sequence is reported. The program is also speeded up by not dynamically resizing the internal arrays used to hold sequence positions.
Added additional information to "entrails" output (entrails is built by "make check" and displays internal data to assist developers of wrappers and interfaces). The output now includes application attributes and reports definitions which are aliases (with -full on the commandline).
Added -mincount option to wordcount to report only words occurring a given number of times. The default of 1 does not change the previous results.
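A small sketch of the effect of a minimum count follows. It assumes that -mincount means "at least this many occurrences" and counts overlapping words; the function name is hypothetical and this is not the wordcount source.

```python
from collections import Counter


def word_counts(seq, wordsize, mincount=1):
    """Count overlapping words of length wordsize, keeping those seen >= mincount times."""
    counts = Counter(seq[i:i + wordsize] for i in range(len(seq) - wordsize + 1))
    return {word: n for word, n in counts.items() if n >= mincount}
```

With mincount left at 1 every word found is reported, so existing results are unchanged.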
Oddcomp had a number of bugs. A window size equal to the sequence length resulted in no hits. The word size was used before reading the input file. A match in the last possible window was missed.
Biosed modified to specify a position so it can be used to edit A to L in position 2 (for example) in a single sequence or throughout an alignment. Normal use is unchanged. If there is demand, the target could be changed from a string to a pattern.
Clustal sequence format output is now version 1.83 with 60 bases/residues per line. Previous EMBOSS releases reported it as 1.4 and printed 50 bases/residues per line.
The tmap program had an upper limit of 6000 residues and 300 sequences. All fixed size arrays were made dynamic. The length limit was exceeded by one of our users.
GCG formatted databases were found to have split entries into more than 1000 chunks - for example human chromosome 7 in a TPA (third party annotation) entry in EMBL. A regular expression is now used to check for any number of subsequences in GCG data.
ajSysStrTok and ajSysStrTokR changed to match the behaviour of the C run time library function strtok. Both now keep their internal pointer at the first delimiter after the matched token. This only changes the result if the delimiter set is changed on the next call.
Another code cleanup is the addition of Exit functions to all AJAX and NUCLEUS source files that could still have static memory allocated when a program ends. We aim to clean up memory for all the standard memory tests in test/memtest.dat. This includes creating a new function acdReset which resets the state of ACD processing so that a new ACD file could, in theory, be read once a program has completed. All programs need to call the embExit function at the end to call the NUCLEUS and AJAX cleanup functions. Some of these functions will also log memory usage statistics if debugging is turned on (-debug on the command line).
We are working through all the library code making standard function names. Old function names will be retained at least until release 4.0.0. They are marked with the __deprecated flag, which causes the gcc compiler to report all uses of the old name. Other compilers are not affected. The first set to be processed is in ajstr.c (string and character functions).
Sequence reading from website URLs now defaults to HTTP 1.1, with chunked blocks of data. A bug in processing small (single line) chunks was fixed.
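For readers unfamiliar with chunked transfer, the sketch below decodes a chunked HTTP 1.1 body, including small single-line chunks. It is a simplified illustration (no trailer handling or error recovery), not the AJAX reader, and the function name is hypothetical.

```python
def decode_chunked(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked message body into the original data."""
    out = bytearray()
    pos = 0
    while True:
        # Each chunk starts with its size in hexadecimal, ended by CRLF.
        eol = body.index(b"\r\n", pos)
        size_field = body[pos:eol].split(b";")[0]  # drop any chunk extensions
        size = int(size_field.decode("ascii"), 16)
        if size == 0:
            break  # a zero-length chunk marks the end of the body
        start = eol + 2
        out += body[start:start + size]
        pos = start + size + 2  # skip the data and the CRLF that follows it
    return bytes(out)
```

A chunk may hold only a single short line, or may end in the middle of a line, so a reader has to reassemble lines from the decoded data rather than from the individual chunks.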
Report and alignment output now includes the full commandline used to run the program, with any replies to prompts included.
Excel report format includes a column for Strand to indicate sequences on the reverse strand. The strand column is + for a forward feature (all protein features are forward) or - for a reverse direction feature.
New sequence type gapstopprotein for proteins with gaps and internal stops.
Translation functions in ajax/ajtranslate.c have been cleaned up.
New program backtranambig to backtranslate as most ambiguous codons.
Phylip sequence format can now read sets of alignments with blank lines in between. Such formats were produced by the new fseqboot program and used by the new phylip programs and seqsetall in ACD.
The list of graph devices produced when an invalid device (or '?') is given now lists only the unique devices (those defined differently in the plplot library code) with alternative names (xwindows for x11, for example) added in brackets. Specifying an ambiguous device used to accept the first match found; now an error message is given.
Prettyplot and cons were producing different consensus sequences. Comparison of the results showed two problems. Cons was missing consensus characters because of an error in calculating the plurality (since fixed in prettyplot, but the library function used by cons had not been corrected). Prettyplot was missing consensus characters for a different reason - prettyplot has a "collision detection" feature to skip consensus characters for positions where more than one amino acid or base is valid as a consensus character. This was turned on by default, when the ACD file clearly states it should be turned off. In fixing both bugs the two programs will give the same consensus, except for cases where collisions occur - in these cases prettyplot may not select the same character as cons, where both are equally valid.
Programs that write sequences need to call ajSeqWriteClose before they exit. This forces output from sequence formats that save up sequences in memory and write at the end. An example is MSF, which has to wait for all sequences in order to calculate the file checksum.
Functions that process directories now skip the '.' and '..' directories so that '*' wildcards will work correctly.
Prettyplot has been revised. A debugging commandline option has been removed. String commandline options have been changed to array and select types for better validation with the same user responses. Colours are now corrected for proteins - in version 3.0.0 and earlier the colours depended on the column order in the matrix. Nucleotide colours follow the ABI base colours used in abiview. The examples in the documentation showed no boxes because of low sequence weights in the MSF format input data. The weights have been updated to give the 'expected' results.
All programs now store the command line needed to recreate the run. The result is logged by the database indexing programs, and will be added to other program outputs in a future release. The command line includes all non-default responses to prompts by the user.
dbiflat, dbifasta, dbigcg and dbiblast set the system sort to use normal "C" sort order. On systems where the locale is set to a language other than English, sort can have strange behaviour. In particular, the underscore character fails to sort in the correct place so that indexing SwissProt/UniProt or RefSeq entries fails to put certain entries in the correct sort position for retrieval. There is now no need to set LC_ALL=C locally, although this is good practice whenever sort is used.
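The "C" sort order is plain byte order, which the sketch below reproduces for some made-up entry names. It only illustrates where the underscore falls in that order; it does not call the system sort, and the helper name is hypothetical.

```python
def c_sort(names):
    """Sort names in byte order, as the system sort does with LC_ALL=C."""
    return sorted(names, key=lambda name: name.encode("ascii"))


if __name__ == "__main__":
    # Illustrative names only. In byte order digits come first, then uppercase
    # letters, then '_' (0x5F), then lowercase letters.
    for name in c_sort(["AB_HUMAN", "ABC_HUMAN", "ab_human", "AB1_HUMAN"]):
        print(name)
```

A locale-aware sort may order these names differently, and a lookup that assumes byte order would then fail to find some entries.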
Gap penalty qualifiers were standardized for all programs.
water, needle and other alignment programs occasionally could report suboptimal alignments (off by the gap extension penalty score). The reported alignments were correct, but rearranging the gaps could give a slightly higher score. Matcher and stretcher use different alignment functions and were unaffected.
Cpgplot no longer has a -shift option to speed processing on long sequences. The output was broken. We will restore it if there is demand.
Two new variables added for developers using the MYEMBOSS package to write their own EMBOSS programs. EMBOSS_MYEMBOSSROOT (the same will work for other EMBASSY packages) points to the location of the ACD files for an EMBASSY package which is not installed - as would be the case for an ordinary user developing and maintaining their own code using MYEMBOSS. This requires the use of embInitP rather than embInit to pass the package name - something all EMBASSY programs should (and will) do. The second variable is EMBOSS_ACDUTILROOT and is required so that utilities such as acdvalid can also find the ACD files. Utilities acdvalid, acdc, acdhelp, acdtable and acdpretty use embInit as they know nothing about any package name.
Sequence sets (seqset and seqsetall) have a new ACD attribute "aligned" which is true or false. If true, the sequences will be extended with gaps and passed to the application as a full alignment. It is assumed that they are already aligned. If false, the application needs all sequences in memory but has no need for aligned input. The aligned attribute is required (to help ACD parsers) so acdvalid will object if it is not found.
embossdata now requires a filename, or an empty string to search for all files. If no filename is given, it will prompt for one with a default of an empty string.
acdvalid now tests the order in which sections appear in the ACD file. The order must be: input, required, additional, advanced, output. There are already constraints on which ACD data types can appear in each section. All existing ACD files passed this test. If any external ACD files have a problem the acdvalid tests can be revised.
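The ordering rule can be expressed as a simple rank check. The sketch below is not the acdvalid code; the function name is hypothetical and it assumes that sections may be omitted as long as those present keep the required order.

```python
SECTION_ORDER = ["input", "required", "additional", "advanced", "output"]


def sections_in_order(sections):
    """True if the section names appear in the required relative order.

    Raises ValueError for a section name that is not in SECTION_ORDER.
    """
    ranks = [SECTION_ORDER.index(name) for name in sections]
    return ranks == sorted(ranks)
```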
Sequence format "experiment" is now correctly the Staden package experiment file format. The description is taken from the "EX" experiment description line. EMBL line types (including features) are allowed in this format and are supported if used before the sequence. The accuracy values are read and stored (one per base, using the highest base value if all 4 bases have individual numbers) and written. These values could possibly be passed to primer3, for example.
Staden and GCG input formats can now parse out comments from anywhere in the sequence records.
Nexus and nexusnon output formats now correctly report the datatype for protein alignments.
Documentation of the @data datatype header tags updated on the developers webpages.
Coderet reports the number of CDS, mRNA and translation sequences to an output file. Requested for easier tracing of inputs that gave no sequences.
Nbrf (pir) input can now read from an SRSWWW server. The problem was that SRS reports an extra ">P1;seqid" header before the sequence. Now if there is no sequence, a duplicate header (one with the same ID) can be skipped.
Clustal output format no longer writes in blocks of 10.
Clustal and other multiple sequence formats were unable to return single named sequences. Fixed for all such formats.
Phylip3 output renamed phylipnon for compatibility with other formats. The phylip3 name is retained for back compatibility. The header for phylip non-interleaved format is corrected to that accepted by phylip 3.6 (no need for YF on the header line, and correct number of sequences). Documentation of these formats (for seqret and general format documentation) has been updated.
Programs chips, cusp, prettyseq and showtran used a codon usage table as input only to define the genetic code (amino acids for each codon) for the table they produce. This is no longer needed as a new AjPCod constructor ajCodNewCode can be given a genetic code (default 0 to use the standard code) and will set the amino acid data.
The ajCodClear function now clears all data, including the amino acid assignments, for use in reading multiple codon usage formats. A new function ajCodClearData clears only the data and other values, and leaves the amino acid assignments in case other applications may make the same assumptions.
Codon usage input filenames can now be used to set the output filename. The codcmp program for example will no longer default to "outfile.codcmp" for output. However, this can cause unexpected results when a codon usage table and a sequence are read in, so codon usage filenames are only used if no other input file (or sequence, or feature table, or other input type) has been read. This is done by passing a "reset" boolean when setting the saved first input file name so that other inputs can overwrite a name defined by a codon usage input. A remaining side effect is that if the first input is stdin (for example with -filter on) then a second input file can now set the default for output. The recommendation for anyone developing wrappers is to always explicitly set the output filenames if there is a need to know the name for a specific output.
Codon usage tables support multiple formats. All can be read automatically. EMBOSS will now, for example, accept native GCG codon usage tables including those used by the codonusage and transterm databases. The format can be specified for "codon" input by a -format qualifier. Outcodon is now used as an ACD datatype for writing codon usage tables, and has a -oformat qualifier. A new application codcopy can inter-convert the codon usage table formats. The default codon usage table format is called "emboss" and includes structured comments to identify the species, database release, database division, number of CDSs and codons, and GC content. These values are calculated or searched for in the text within a file for other formats.
In the emboss.default and .embossrc files the same name can be used for variables, databases, and resources. In previous versions a single table was used and name clashes could occur. This becomes an issue with the increasing use of resource definitions.
Colours for abiview set to the ABI standard colours.
Sequence types explicitly set in source code for cons, sixpack and backtranseq. GCG format output was showing nucleotide instead of protein sequence type.
Correction to reversed sequence numbering for local alignments from water.
Profile analysis with gprof indicates that the regular expressions (and the PCRE library) are very inefficient. Wildcards in regular expressions lead to millions of recursive calls to the match function. Although they are very readable for code maintenance, we replaced them for EMBL sequence and feature reading to get about a 4-fold speedup. Profile analysis will continue up to version 3.0.0.
Feature table updated for nucleotide sequences to EMBL/GenBank/DDBJ version 6.2. A few obsoleted qualifiers.
tranalign now allows for the proteins to have Methionine residues at the start which now match a START codon in the corresponding nucleic acid sequence.
diffseq has a new option '-global' which makes it treat the whole of the sequences as regions to be aligned, rather than the default which looks for the longest region of overlap and only reports differences within that overlapping region. This new option is useful when looking at protein and mRNA sequences which are expected to align over their whole length.
Alignment output issues resolved. Specifying begin and end of input sequences now works for all alignment formats. Markx formats have been rewritten as the original code we used has nasty dependencies on global variables which we struggled to reproduce for all cases. The rewritten code is much simpler. Note that the gap penalty reported by markx10 format is the EMBOSS penalty. Markx10 as used in the FASTA package subtracts the gap extension penalty from the gap penalty ... and adds it back when calculating.
transeq failed to check sequence ranges in list files correctly. It was only using the range from the first sequence if the USA included a start and end. The range is now reset for each sequence.
remap (and other programs that display translations) had problems with masking ORFs (using strange characters instead of '0'), caused by bad calls to an AJAX function.
Entrez added as an access method. Sequence format must be genbank. Server URL is hard-coded at NCBI (for now). Works by finding GIs (GenInfo Identifiers) that match the query, and then retrieving them one at a time. This is still a prototype - more work is needed. Note that apparently Entrez cannot retrieve by LOCUS (id).
Seqhound added as an access method. Sequence format must be genbank. Needs a URL to find the server. Works by finding GIs (GenInfo Identifiers) that match the query, and then retrieving them one at a time. This is still a prototype - more work is needed. Some Entrez error conditions are less graceful in SeqHound. Des and Key searches are turned off until SeqHound adds indexing for these. Org searches work, but require the numeric taxon ID. This is not friendly, so we are looking for a way to get the taxid from the species or genus.
Direct access databases now support exclude wildcards. The syntax is as for emblcd indexing, but only files listed in filename are included.
Database names must be letters, numbers and underscores only. Reading emboss.default and .embossrc now generates a warning message for any bad database name. Bad names were ignored by USA processing, leading to confusing results.
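The naming rule is easy to test before adding a definition to emboss.default or .embossrc. The check below is a sketch with a hypothetical function name, not the EMBOSS parser, and the names used in the usage note are made up.

```python
import re


def is_valid_dbname(name):
    """True if name contains only letters, digits and underscores."""
    return re.fullmatch(r"[A-Za-z0-9_]+", name) is not None
```

For example, a name such as embl_est passes, while one containing a hyphen or a dot would trigger the warning.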
seqretsplit has a new -feature option (as for seqret).
noreturn can write files for PC or Mac file systems using a new -system qualifier.
FASTA format sequence files with a sequence ID starting P1; were assumed to be PIR format. These can now be read as FASTA, assuming that PIR format has already been tested for.
Sequences with zero length were accepted. Sequences must now have a length of at least 1. Some user scripts could create FASTA format files with no sequence, or with the sequence on the ID line. These can crash many programs, including a core dump from clustalw (through emma).
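Scripts that generate FASTA files can check for the two problem cases before running EMBOSS. The sketch below reports records with no sequence lines, which also covers a sequence written on the ID line; the function name is hypothetical and this is not the EMBOSS reader.

```python
def empty_fasta_records(text):
    """Return the IDs of FASTA records that have no sequence lines."""
    empty = []
    name = None
    length = 0
    for line in text.splitlines():
        if line.startswith(">"):
            if name is not None and length == 0:
                empty.append(name)
            parts = line[1:].split()
            name = parts[0] if parts else ""
            length = 0
        elif name is not None:
            length += len(line.strip())
    # Check the final record in the file.
    if name is not None and length == 0:
        empty.append(name)
    return empty
```

Anything after the ID on the header line is treated as description, so a record such as ">c ACGT" with nothing on the following lines is counted as empty.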
Added a calculated attribute "haslengths" to (phylogenetic) tree input in ACD for use in phylipnew interfaces.
Wossname and seealso have a new commandline option -showembassy which defines one embassy package to be shown. The main use is in finding applications when automatically building the documentation, but end users and interface builders may find some uses for this option too.
Added an "embassy" string attribute to the application in ACD so that wossname can find whether an application is in EMBASSY or not. Wossname was depending on the source directory, but could not distinguish between EMBOSS and EMBASSY ACD files once they were installed.
The EFUNC and EDATA databases have been enhanced to provide better views and links within SRS. The new versions are available at both HGMP and EBI. In future, EBI will probably become the sole site (as HGMP/RFCGR is closing in 2005).
The official EMBOSS website has moved to emboss.sourceforge.net which includes redefining links in applications and major modifications to the scripts which maintain the application web pages. The sourceforge web pages are now committed to CVS under doc/sourceforge. The pages on sourceforge itself can only be modified by registering at sourceforge and joining the emboss project.
ajListMapRead and ajListstrMapRead functions for read-only lists. As an added check, the functions these call for each element have a different prototype.
ajStrStr function now returns const, as do various 'Get' functions. The few cases where a true char* is needed must now call ajStrStrMod with the AjPStr passed by reference so that we can check it is being modified. All calls to ajStrStr in EMBOSS and most EMBASSY packages have been resolved to remove compiler warning messages. ajStrFix also needs the AjPStr passed by reference.
tfm -html now gives full path to image files.
Remove need for the definition of PLPLOT_LIB.
Add configuration for cygwin dlls.
Allow filenames of the form drive:/filename for cygwin.
Fixes for list files with sequence ranges in the USAs. The sequence input object is now reset during list processing.
Sequence sets with begin and end positions are now automatically trimmed on input. This applies for example to list input with ranges in the USAs for programs such as polydot which were previously reporting the entire sequence.
graph output now has the default title including the date in dd-mmm-yy format instead of the unreadable dd/mm/yy format.
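The difference between the two date styles is shown in the sketch below. The month abbreviations and their capitalisation are assumptions; the exact text written in the graph title may differ, and the function name is hypothetical.

```python
import datetime

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def graph_date(day):
    """Format a date as dd-mmm-yy, for example 04-Mar-05."""
    return f"{day.day:02d}-{MONTHS[day.month - 1]}-{day.year % 100:02d}"


if __name__ == "__main__":
    day = datetime.date(2005, 3, 4)
    print(graph_date(day))            # the month is spelled out
    print(day.strftime("%d/%m/%y"))   # 04/03/05 could be read as 3 April
```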
Align output for seqmatchall (like wordmatch). The algorithm is not maintaining the sequence accession and description information. They may be restored in a future update.
infoalign now also displays the weight of the sequences in the alignment. This can be turned off using '-noweight'.
New output types in ACD for all input data types, including those for phylogenetics and protein structure data. Initially these are a new AjPOutfile type with a defined format (fixed until any of them has a choice).
Programs that produce graphics or text (outfile) output now by default will not create the outfile if there is a graph (done by setting the nullok attribute of the outfile).
Acdvalid now checks for incomplete ACD types and attributes.
trimest now has the option '-toplower' which changes the poly-A tail to lower-case instead of cutting it off.
new ACD attribute 'relation' added to all ACD types. This will hold some information about how output data types relate to inputs and parameters. The syntax of the string is not yet clear. Running of EMBOSS programs will not be affected - the relation string is defined for web services and related wrappers to maintain provenance better.
New ACD function oneof added, syntax is @($(var)=={a,b,c}) to test for a choice of menu options. Intended to clean up some ACD files - but they are already clean so it may not be useful. At some stage the unused ACD functions should be declared obsolete for simplicity (and efficiency). We will leave the code in place, but remove them from the list of functions tested.
acdvalid now tests the knowntype attribute for strings. ACD files have been cleaned up to give knowntypes for all strings (defined in knowntypes.standard) or to convert strings to datafile or other ACD types as appropriate.
showfeat now has the qualifier '-annotation'. This allows you to add your own brief annotations of regions on the displayed figure.
remap now has the option '-frame' which allows you to specify a list of the frames to be translated and displayed.
Major cleanup of @data documentation. Added @datatype for typedef data types (e.g. AjBool). Checking all have attributes, and all attribute names and types match. Comments in the code are moved to the @attr documentation. Added an @cc documentation line for comments.
Eprimer3 has been changed so that it runs a separate child process of primer3_core for every sequence. This is to cure a problem seen when more than about 23 sequences were input, in which there was some blocking contention between the input and output streams.
Major cleanup of ACD files to match acdvalid standards. Featout qualifiers are now -outfeat, which means all output start with -out but it does clash with -outfile so -outf is not always usable as an abbreviation.
Options for emma have been cleaned up. -insist is no longer used (use -sprotein instead) and -slowfast is now a simple boolean -slow. Both changes lead to a much cleaner ACD file.
Options for eprimer3 have been cleaned up. New options -primer (true) and -hybridprobe (false) make the dependencies far simpler. The default task is now 1 (same as the old zero) and the -hybridprobe option is needed to calculate the hybridization probes. This removes a lot of dependencies on tasks 1 and 4 (hybridprobe) and not-task-4 (primer).
New AjPDir to hold directory path and default extension. Intended for domainatrix applications. This requires changing ajAcdGetDirectory to return an AjPDir and providing ajAcdGetDirectoryName to return the path as a string. Several programs were changed to reflect this changed call.
New ACD type outdirectory for a directory to which files will be written. Must have a knowntype describing the files that will appear there. Expected qualifier name is -outdir.
compseq now has the option '-calcfreq'. This makes it calculate the expected frequencies of the words in the sequences from the observed frequencies of the single bases or residues in those sequences.
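The '-calcfreq' calculation can be sketched as follows. This is a minimal illustration, not the compseq code: it assumes the letters of a word are independent, so the expected frequency of a word is the product of the observed frequencies of its letters. The function name is hypothetical.

```python
from collections import Counter


def expected_word_frequencies(sequence, words):
    """Expected frequency of each word from observed single-letter frequencies.

    Assumption: letters are independent, so the expected frequency of a
    word is the product of the observed frequencies of its letters.
    """
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty sequence")
    letter_freq = {letter: n / total for letter, n in counts.items()}

    expected = {}
    for word in words:
        freq = 1.0
        for letter in word.upper():
            # A letter never seen in the input has frequency 0.
            freq *= letter_freq.get(letter, 0.0)
        expected[word] = freq
    return expected
```

For the sequence AACG the single base frequencies are A 0.5, C 0.25 and G 0.25, so under this assumption the word AC has an expected frequency of 0.125.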
HTML data from remote sites is becoming more complex. EMBOSS now makes a first pass to look for a single preformatted block and accepts this as the data (thus avoiding horrors such as the Entrez headers and javascript which NCBI's search service includes). At the same time, an old fix to patch SRS 6.1.0 output has been removed as this clashed with the new code.
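The first pass described above can be sketched like this. It is only an illustration of the idea in Python, not the EMBOSS C code. Returning the page unchanged when there is not exactly one preformatted block is an assumption, and the function name is hypothetical.

```python
import re

# Matches <pre> or <pre attr="..."> but not tags such as <prefix>.
PRE_BLOCK = re.compile(r"<pre(?:\s[^>]*)?>(.*?)</pre>", re.IGNORECASE | re.DOTALL)


def extract_single_pre(html):
    """Return the body of the only <pre> block, or the input unchanged.

    Assumption: if the page has zero or several <pre> blocks the caller
    carries on with the full text.
    """
    blocks = PRE_BLOCK.findall(html)
    if len(blocks) == 1:
        return blocks[0]
    return html
```

With a page that wraps one entry in a single preformatted block, only the entry text is passed on, and any surrounding headers and javascript are dropped.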
Optional outputs have a new behaviour. With nulldefault defined, an output is, by default, turned off and will return a NULL value to the calling program if nullok is set. Setting the value to "" on the command line will now ask for the standard filename to be generated. The "missing" attribute, if defined, allows simply -qualname on the commandline to request the default filename, although care must be taken to avoid anything following the qualifier appearing to be a filename. This means the qualifier must be last on the commandline, or must be followed by another qualifier.
Indexing programs dbifasta and dbiflat no longer store the source directory in the division.lkp file - directory is specified in the database definition. This was only done originally to share index files with "efetch" at the Sanger Centre. With index files and data files in the same directory (as for efetch) it is not needed.
All ACD files revised for new acdvalid checks.
New ACD section "additional" added for qualifiers with additional:"Y" defined. These have been put in the "advanced" section until now. Acdvalid checks that these qualifiers are in the appropriate section.
Acdvalid now checks that qualifiers are in the expected section. All input qualifiers (including cfile and datafile) are now in the input section, all output qualifiers are in the output section. All (remaining) standard, additional and advanced qualifiers are in the "required" "additional" and "advanced" sections.
New ACD type "toggle" added. This is the same as "boolean" but is allowed in any section by "acdvalid" checks. Toggle is to be used for ACD qualifiers that "toggle" (turn on or off) other qualifiers. An example in many ACD files would be "-plot".
Cirdna and lindna now dynamically allocate memory. For simplicity they do still have an upper limit for the number of groups and labels per group, but no longer have static arrays.
tfm accepts the PAGER environment variable. It can be overridden by EMBOSS_PAGER.
Fix for HTTP 1.1 lines for MacOSX added (Cedric Rossi).
The home directory ~/.embossrc file can be turned off with "setenv EMBOSS_RCHOME N". This was added for cleaner QA tests but may have other uses.
Report format output added (by Henrikki Almusa) for dreg, preg, recoder and silent.
pestfind renamed to epestfind and handling of terminal water residue adjusted.
Align formats: Added "tcoffee" as a valid -aformat which writes a T-Coffee library file suitable for input as -in=Lfilename to T-Coffee.
Pepstats: added molar extinction coefficient and extinction coefficient at 1mg/ml for A280.
Nexus format sequence input added, with new functions to parse all standard nexus files. Later releases will accept nexus format for other input data.
Jackknifer, Mega, Treecon, Mase and Fitch formats parsed, at least in their EMBOSS output forms.
Underscores are allowed in accession numbers and sequence versions to handle REFSEQ fasta format entries.
New function ajRegPre returns the original string before the regular expression match.
New function ajStrArrayDel deletes a string array.
New function ajListstrToArrayApp appends strings in a list to the end of a string array.
Sequence input changes: Allow '?' as a valid character (it has been seen in phylip sequences) for 'unknown' and convert to X for protein (or any) and 'N' for nucleotide. Note that this can give an X or N depending on whether the program accepts nucleotide only or any sequence. We may find a cleaner fix, but it would depend on knowing the sequence type.
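A small sketch of the '?' conversion rule, written in Python for illustration only. The function name and the boolean flag are hypothetical. The flag stands for whether the program accepts nucleotide input only.

```python
def convert_unknown(sequence, nucleotide_only):
    """Replace '?' with 'N' for nucleotide-only input, otherwise with 'X'.

    Mirrors the rule in the entry above: programs that accept protein or
    any sequence type get 'X', nucleotide-only programs get 'N'.
    """
    replacement = "N" if nucleotide_only else "X"
    return sequence.replace("?", replacement)
```

The same input string therefore gives AC?T as ACNT for a nucleotide-only program and ACXT for a program that accepts any sequence.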
Added binding factor output to tfscan plus option to specify a custom data file
Removed the Henry Spencer regular expression libraries. There were a few calls to the ajPosReg functions, but only to test it worked the same way as ajReg. Added a case-insensitive ajRegComp and ajRegCompC (which the ajPosReg functions had) using PCRE. Farewell, Henry. You were a great servant to EMBOSS.
Water S-W alignment program no longer truncates some matches
Vector arithmetic added to ajax library.
Compilation now uses large file handling by default. To disable use --disable-large when configuring. An effect is to make the default size of ajlongs 64 bits.
Pepstats modified to allow multiple sequences
Major (well, obvious impact on ACD authors) ACD change - the "required" attribute is renamed "standard" and the "optional" attribute is renamed "additional". They have exactly the same functions as before. The change is to (hopefully) make their meaning more obvious to those developing ACD parsers and wrappers for EMBOSS. ACD attribute "standardtype" clashed with "standard" and is renamed "knowntype".
ACD attributes have been added for applications and for all ACD types to make wrappers easier to control. These new attributes are specifically for SoapLab from EBI, and need not have any impact on other wrappers (SoapLab uses ACD to define non-EMBOSS applications and needs extra attributes to define some additional properties).
pepinfo now writes to a file with a standard output filename of (sequenceid).pepinfo instead of pepinfo.out
Completed the standardization of ACD definitions, using "acdvalid" to remove all errors and allowing only selected and hard to avoid warnings to remain. The warnings are for calculated "required" or "optional" definitions (simple true/false relations to another boolean are accepted). In particular: all essential inputs and outputs are parameters, with standardtype defined. Non-essential inputs and outputs have the nullok attribute set. Information strings are defined only where there is no standard prompt.
The definition of AjPStr and other "pointers to structs" is causing strange problems in specifying "const" for structs that are unchanged by function calls. In summary, it appears (for all compilers we tried) that "const" only knows it is for a pointer if it can see the "*" in the type. This means, for example, that "const AjPStr" failed but "const AjOStr*" worked. With "const" if it knows it is a pointer, it makes the data structure constant. Otherwise it makes the pointer itself constant, the equivalent of "AjOStr* const". We fixed this by changing AjPStr to be a #define of AjOStr*. This has the advantages that most code is unaffected and that const now works as expected. The only code changes we needed are lines with multiple AjPStr definitions (which is anyway deprecated), for example "AjPStr astr, bstr" which clearly fail when you think about the #define (astr is an AjPStr, but bstr is now an AjOStr and will give strange compiler errors). We may change this again to define a separate const data type for each struct, but probably the #define is a good solution and we expect to stay with it.
PCRE is now the library of choice for regular expressions. This allows the full Perl regular expression syntax, and was very easy to integrate. Regular expressions are used internally for parsing and for manipulating strings such as file and directory names, and also for matching by programs such as dreg and preg.
The previous Henry Spencer library functions are renamed from ajReg to ajHsReg. The Posix version of the Henry Spencer library remains available as ajPosReg but may be removed as it was not used by the EMBOSS distribution, and PCRE can provide the same or higher functionality.
acdpretty now writes the name of the output file to standard output. For example "Created seqret.acdpretty".
The ACD qualifiers -acdpretty -acdtable and -acdlog are removed. Programs acdpretty and acdtable do the first two tasks (in the same way as before). To turn on the acdlog file, use environment variable EMBOSS_ACDLOG.
Graphs can now use "-graph data" to produce files compatible with the Staden package's spin2 and spin GUIs. This makes some ACD options obsolete, especially the various -data and -outfile combinations. Banana already wrote an output file which caused some confusion in these options. The outfile and the graph are both produced by default, but have the nullok attribute and can be turned off with -nooutfile or -nograph on the command line.
graph and xygraph output can now be optional - the ACD files can have a nullok: "Y" attribute which allows -nograph on the command line.
In ACD files alternatives for protein and nucleotide input are common. Added an automatic variable $(acdprotein) which is defined as the calculated ".protein" attribute of the first input sequence(s). The value will be "Y" or "N". Acdvalid will check that this is how proteins are tested, so the original "$(asequence.protein)" syntax will become obsolete. The intention is that any wrappers can use this to make protein and nucleotide versions of the ACD file, and in general to use only simple boolean tests in calculated ACD values.
Added wait call to wait for a piped command to complete before reading data (needed for listfile input with many piped reads, for example getz calls from SRS databases).
Corrected Jemboss for displaying emma & prettyplot forms
Corrected display of recognition sequence for restrict -solofragment
Standardtype attribute added for filelist in ACD
Datafile for mwfilter changed from string to datafile ACD type.
A new test application acdvalid will check for deprecated ACD syntax and report errors for something that should be fixed, or warnings for something still to be clearly defined. None of these "errors" will stop an ACD file from working correctly, but they do cause confusion to the authors and maintainers of wrappers, GUIs, and so on.
Sequence types are extended to include new types for programs that can handle selenocysteine.
Sequence types are simplified so that input can be converted to the specified type. Gaps can be removed, and unsupported characters can be converted to X for protein or N for nucleotide. A few applications may be unable to handle any ambiguity (pureprotein, puredna, etc.) and will require correct input. To make it safe to run a program over (for example) swissprot or embl, such programs should read single sequences only, or be converted to support ambiguity codes. This may take a little time. banana, octanol and pepwindow already read single sequences. In need of attention are hmoment and iep.
In ACD files a new application attribute "external" is added where a third-party tool is needed. Examples include clustalw (emma) and primer3_core (eprimer3 and primers).
ACD definitions for feature and featout now have a "type" attribute. The feature output type defaults to the sequence type, as for sequence output. Feature types are "protein" or "nucleotide" or "any".
ACD sections now have "information" instead of merely "info" for consistency.
Boundary fix for ajStrMask
Tightened up on reporting of isoschizomer groups in 'showseq -limit' and 'remap -limit'.
Added embPatRestrictPreferred.
Added -individual option to RESTRICT. This gives the fragment lengths produced when each named RE of the set that can cut the sequence is used on its own. Results are added to the tail section of the report.
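The fragment lengths for single-enzyme digests can be sketched as below. This is an illustration under stated assumptions, not the restrict code: the sequence is linear, a cut position p means the cut falls after base p, and the function names and cut positions are hypothetical.

```python
def fragment_lengths(seq_length, cut_positions):
    """Fragment lengths for a linear sequence cut after each given base.

    Assumption: position p means the cut falls between base p and p + 1.
    Positions outside 1..seq_length-1 cannot split the sequence and are
    ignored.
    """
    cuts = sorted({p for p in cut_positions if 0 < p < seq_length})
    bounds = [0] + cuts + [seq_length]
    return [end - start for start, end in zip(bounds, bounds[1:])]


def individual_digests(seq_length, cuts_by_enzyme):
    """Digest with each enzyme on its own, skipping enzymes that do not cut."""
    return {
        enzyme: fragment_lengths(seq_length, cuts)
        for enzyme, cuts in cuts_by_enzyme.items()
        if cuts
    }
```

For a 100 base sequence with one enzyme cutting after bases 30 and 70 and a second cutting after base 50, the two individual digests give fragments of 30, 40, 30 and 50, 50.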
Added a -equivalences option (on by default) to rebaseextract. This option calculates an embossre.equ file using RE prototypes in the withrefm file.
A guide to the EMBASSY package domainatrix (domainatrix.doc) has been added to /emboss/emboss/doc/manuals
Extractfeat now has the -describe qualifier to allow it to add the value of selected tags to the Description line of the output sequence.
Revseq can now read in gapped nucleic acid sequences.
Removed old corba code in preparation for adding corba server as an embassy package.
Simplified error messages for sequence reading, and corrected handling of a bad USA as the first in a list file.
Padded temporary filename for emma to avoid clustalw bug with short input filename (this will not work in all cases and a corrected clustalw should be used nevertheless).
-help output modified to align all the qualifiers
acdpretty output revised to resolve to full names
Complete overhaul of all ACD error conditions. Parsing and command line validation messages are now all used, and all tested in the qatest suite. These tests used bad ACD files in the test/acd directory.
whichdb failed to report error messages. They are now turned on - and most of the common errors are reported with less verbosity.
TCODE application added. Calculates the TESTCODE statistic.
Eprimer3 now reports the primer positions using the coordinates of the original sequence when -sbegin and -send are used to specify a sub-sequence to consider. The input ranges, such as the -exclude and -target ranges are always given using the positions from the original sequence.
tfm looks for documentation in EMBOSS_DOCROOT (an environment variable, or defined in emboss.default), then in the install directory, and finally the original build directory.
In some cases, EMBOSS programs could terminate with an exit status of 255 (-1). Terminating with a "Die:" message exits with status 1. All exit calls now use either 0 (success) or the standard library EXIT_FAILURE value (usually 1).
All report output fields have a new attribute (and qualifier) rscoreshow which defaults to "Y". Setting rscoreshow: "N" will remove the score from the output, except for GFF where it is required, and SRS format where it can be kept for use in standard parsers. The aim is to exclude the score value from applications that have no scoring method (restrict for example). For these, putting -rscore on the command line will override the ACD file and display the score.
Showseq and showfeat both now have the qualifier '-stricttags'. By default if any tag/value pair in a feature matches the specified tag and value, then all the tags/value pairs of that feature will be displayed. If '-stricttags' is set to be true, then only those tag/value pairs in a feature that match the specified tag and value will be displayed.
Megamerger now has the qualifier '-prefer' which makes it use the first sequence to create the merged sequence whenever there is a mismatch between the two sequences.
Sirna now has the qualifier '-context' which writes the first two bases (in brackets) of the 23 base target region.
Maskseq and maskfeat now both have the qualifier '-tolower' which will change the masked regions to lower-case characters instead of replacing them with a mask character.
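The difference between the two masking styles can be sketched as follows. This is a Python illustration only. The function name, the 1-based inclusive region convention and the default mask character 'X' are assumptions for the example.

```python
def mask_regions(sequence, regions, tolower=False, maskchar="X"):
    """Mask the given regions of a sequence.

    regions holds (start, end) pairs, assumed 1-based and inclusive.
    With tolower=True the masked residues are lower-cased instead of
    being replaced by maskchar.
    """
    chars = list(sequence)
    for start, end in regions:
        for i in range(max(start - 1, 0), min(end, len(chars))):
            chars[i] = chars[i].lower() if tolower else maskchar
    return "".join(chars)
```

Masking bases 3 to 5 of ACGTACGT gives ACXXXCGT with a mask character and ACgtaCGT with lower-casing, so the original residues stay readable in the second form.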
ACD parsing internals are rewritten to find and report errors more cleanly and to make the syntax stricter for other ACD parsers used by (for example) GUI developers.
Sequence output types now have a 'type:' attribute which defaults to the type of the first input sequence. For most applications this is good enough as a default. For those which add gaps or translate DNA to protein (or vice versa) a 'type:' attribute will be needed. This is to improve support for automated workflow building by more strongly typing input and output data.
acdpretty now wraps long lines of ACD definitions, splitting at any lone backslash (which defines a newline for -help output) or at whitespace. Attributes and sections are indented by 2 spaces.
Until now, the ACD file syntax has allowed name=value syntax and the use of {} () and even <> for quoted strings just in case they needed both ' and " characters. These are now removed. We believe no ACD files were using this syntax.
valgrind.pl is a new addition to the script directory that runs valgrind memory leak tests under Linux. The tests are a copy of those in purify.pl - they may one day move to a separate file.
EMBOSS feature output now copies (where available) the name of the input sequence as the filename, so filenames match more closely to the sequence output. For example, "seqret -feat tembl:paamir" will now create 2 files called paamir.fasta and paamir.gff where the feature file previously was called 'unknown.gff'.
EMBOSS feature output defaults (as before) to GFF format, but the default format can now be set by variable EMBOSS_OUTFEATFORMAT
All EMBOSS output files now have a default output directory (required by some webservices implementations that run in the 'wrong' default directory). Variable EMBOSS_OUTDIRECTORY if set becomes the default output directory for outfile, align, report, graph, sequence and feature output.
The output directory can also be set from the command line (or as an ACD attribute) using the associated qualifier -odirectory (outfile), -rdirectory (report) -adirectory (align) -gdirectory (Graph and graphxy) -osdirectory (sequence) or -ofdirectory (featout).
The "g*" attributes for graph and graphxy in ACD have been deleted as they have the same name (and function) as existing associated qualifiers - and can still be used with these names in ACD files. Duplicate ACD attribute and associated qualifier functions exist in many ACD types, but usually have different names and so are left for compatibility purposes.
emboss.default and ~/.embossrc configuration files now have extensive error messages reporting filename and line number. showdb has additional validation for all database definitions. Environment variable EMBOSS_NAMVALID (boolean) turns this on for all programs.
ajnam.c has debugging turned on by environment variable EMBOSS_NAMDEBUG (boolean). This processing (of emboss.default and ~/.embossrc) happens before command line option -debug has taken effect. The output goes to standard error.
Function ajFmtVPrintS is a previously missing complement to ajFmtPrintS
EMBL/Genbank feature tables updated to FTv5.0
SwissProt feature table '<' '>' and '?' location modifiers are now handled correctly.
Added new applications acdlog, acdpretty and acdtable. Run like acdc they provide the same functions as the command line options -acdlog -acdpretty and "-acdtable -help" These -acd options are now obsolete and will be removed in a future release to clean up the ACD interface.
Transeq now has the option '-clean' that converts all '*' characters to 'X's. This may be useful because not all programs accept protein sequences containing '*' characters.
Showdb now can display the presence of any of the extra sv, des, org, and key search fields that can be used to index and search in databases.
Added twofeat - Finds neighbouring pairs of features in sequences.
Extractfeat - added option (-featinname) to include the name of the feature as part of the ID name of the sequence that is written out.
Added sirna - designs siRNA probes in mRNA.
Sigcleave sorts results highest score first.
Helixturnhelix sorts results highest score first and reports the score position as an integer.
Added pestfind.
Moved the following programs into the "domainatrix" embassy package:
contacts, domainer, fraggle, hetparse, hmmgen, interface, pdbparse, pdbtosp, profgen, scopalign, scopnr, scopparse, scoprep, scopreso, scopseqs, seqalign, seqnr, seqsearch, seqsort, seqwords, siggen, sigplot, sigscan
Palindrome no longer reports palindromes that are only composed of N's.
Msbar can now check that the result doesn't match a set of other input sequences. For example you could specify that it doesn't match the input sequence or a set of previously produced mutation results.
Getorf reporting of circular genome positions tidied up - it now reports positions starting in the range 1 to the sequence length and indicates if the ORF goes through the breakpoint. A clear indication of when ORFs are in the reverse sense has been added.
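A rough sketch of the circular coordinate handling, in Python for illustration. It assumes an ORF is described by a start position and a length, and that the end is also wrapped into the range 1 to the sequence length. How getorf itself prints the end position is not stated here, so that part is an assumption, as is the function name.

```python
def circular_orf_position(start, orf_length, seq_length):
    """Map an ORF on a circular sequence to positions in 1..seq_length.

    Returns (start, end, crosses_breakpoint). Assumption: the end is
    wrapped the same way as the start, and the flag records whether the
    ORF runs through the join between the last and first base.
    """
    if orf_length < 1 or orf_length > seq_length:
        raise ValueError("ORF length must be between 1 and the sequence length")
    wrapped_start = (start - 1) % seq_length + 1
    last = wrapped_start + orf_length - 1
    crosses = last > seq_length
    wrapped_end = (last - 1) % seq_length + 1
    return wrapped_start, wrapped_end, crosses
```

On a 100 base circular sequence, a 12 base ORF starting at base 95 is reported here as 95 to 6 with the breakpoint flag set.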
Pasteseq now behaves correctly when -sask2, -sbegin2 or -send2 are used.
Whichdb has a new option -showall to show which databases are being searched, for use where searches hang. The order of searching is undefined - it depends on the order in which databases are returned from the internal table, which is unrelated to the order in which they were defined.
Wordmatch alignments save the entire sequence but use part only. Fixed all alignment formats to work with these by adding a SubOffset attribute.
Duplicate IDs fix. The database indexing programs skipped duplicate IDs but did not reset the size of the entryname index file so some queries could fail to find the later IDs in the databases. Duplicate IDs are illegal for -nosystemsort (no easy way to correct because entry numbers are stored internally). For the default case duplicate IDs are merged even if they are different. REFSEQ is the main problem area.
Writing data files used EMBOSS_DATA, or by default the install directory. Earlier versions, if not installed, could write to the source tree emboss/data directory. Fixed to continue if there is no install data directory, and to check EMBOSS_DATA (if defined) is a real directory.
Sigcleave options pval and nval are now hardcoded. They depend on the weight matrix size - which is hardcoded as 15 in the ACD file and is not checked in the program. They were introduced in EGCG in 1988 but never used because no other weight matrix length was tried.
"fasta" format now uses the "ncbi" parser, so both formats report "fasta" as the format. "pearson" is the old "fasta" format for a few cases (empty IDs for example) there ncbi parsing fails completely.
SPLITTER changed to match documentation. Old behaviour is now selectable by using the -addoverlap command line option.
Configuration modifications. --without-x works. Removed odd but harmless -I definitions. PNG detection improved.
Corrected EMBLCD index searching for queries that start with a wildcard. For example, tembl-key:?* should search for all entries that have a keyword (key:* is regarded as 'all entries'). Entries with no keyword (in PIR's pir4.ref file for example) will be ignored.
Updated source code docs for EFUNC and EDATA. Corrected all bad headers. efunc.out has no errors. efunc.check only reports 'missing headers' for duplicated function names (#ifdef code) which is a known 'feature'.
Updated source code to fix most lines over 80 bytes.
Calculated ACD attributes now QA tested. Feature attributes will be correctly set, although none are used in the ACD files at present.
purify.pl has a new option -block=n where n is a number from 1 upwards. 1 runs the first 10 tests, 2 runs the next 10 (blocksize=10 is hardcoded for now).
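The block numbering can be sketched as below. This is a Python illustration of the arithmetic only, not a port of purify.pl. The function name is hypothetical.

```python
def block_slice(tests, block, blocksize=10):
    """Return the tests that belong to a 1-based block number.

    Block 1 is the first blocksize tests, block 2 the next blocksize,
    and so on. The last block may be shorter.
    """
    if block < 1:
        raise ValueError("block numbers start at 1")
    first = (block - 1) * blocksize
    return tests[first:first + blocksize]
```

With 25 tests, block 3 holds only tests 21 to 25.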
Cleaned up string position code. Inspections showed ajStrPos and related functions gave results from 0 to length of a string. This caused confusion in many other functions and applications. These functions are now static strPos functions because only ajstr.c had calls to them (though the ajStrPos versions are still available). All calls were checked for positions out of range. As a result, many calls to ajStrAssSub and ajStrCut were fixed. ajStrInsertC requires a value from 0 to length (start position to insert can be before or after the string, or any position in between). Fixed by passing length+1 to strPosII.
Added a function ajUtilCatch for use in debugging with gdb. When a nasty special case occurs, call ajUtilCatch and make it a breakpoint in gdb. The resulting backtrace will give the call stack and all variable values.
Cleaned up code for chunked HTTP input. Added a new variable EMBOSS_HTTPVERSION which defaults to 1.0 (so HTTP is not chunked) and a DB attribute httpversion. This must be a floating point number, and is included in the HTTP header to specify the HTTP protocol version to be used. There is no check in the code to change behaviour for different versions. This is used in the SRSWWW and URL access methods.
Added check to qatest.pl to report any EMBOSS (rather than EMBASSY) applications for which there is no defined test. The EMBASSY test uses wossname results, checked against the names of ACD files in the source tree, as qatest always runs in the test/qa directory.
Allowed sequences as values for EMBL rpt_unit feature qualifiers because so many entries have them. They are illegal according to the Version 4.0 (current) feature table document.
Allow ? before from and to feature locations in SwissProt. For now, these are ignored, though we could add something to hold them for accurate output.
Added modified Harrison solubility probability to PEPSTATS
ACD attributes now have descriptions in the ajacd.c code which are reported by 'entrails'. All ACD attributes have been checked by inspection of the code to note those which are used/unused by ACD. The ACD "type" attribute for files is renamed "standardtype" to reflect its intended use to note standard file types for linking applications. Sequences and alignments still have a "type" attribute for protein or dna sequence types.
Aaindexextract (new) reads the AAINDEX database and writes each entry to data/AAINDEX directory. New function ajFileDataDirNew to read data files from a named directory. New ACD datafile attribute 'directory' passed to ajFileDataDirNew. AAINDEX directory defined for pepwindow and pepwindowall.
Palindrome can now read in multiple sequences
Palindrome now does not print a '|' in an alignment where there is a mismatched pair of bases.
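The match line between the two arms can be sketched as follows. This is a Python illustration, not the palindrome code. It assumes the second arm is already written so that each base sits under its partner, and the function name is hypothetical.

```python
COMPLEMENT = {"a": "t", "c": "g", "g": "c", "t": "a"}


def match_line(upper_arm, lower_arm):
    """Build the line drawn between the two arms of a palindrome.

    A '|' is written only where the two bases are complementary.
    Mismatched pairs get a space.
    """
    marks = []
    for upper, lower in zip(upper_arm.lower(), lower_arm.lower()):
        marks.append("|" if COMPLEMENT.get(upper) == lower else " ")
    return "".join(marks)
```

For the arms acgt and tgaa the third pair (g with a) is a mismatch, so the line is two bars, a space, and a bar.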
Added filelist datatype to ACD
Mwcontam program added. Displays molecular weights that are common across a set of files.
Showfeat - added '-sort join' to display joined features on one line.
Diffseq - don't give summary of SNPs if the sequences are proteins.
Inclusion of stat64 and readdir64 for offsetbits=64 (ajfile.c and ajsys.c)
Workaround for broken Solaris readdir64_r (jembossctl)
Infoseq can now optionally display GI and Sequence Version numbers.
Notseq can now read in a file of sequence names.
Added '-alternative' qualifier to transeq to allow reverse frame translations to be done using the codons counted from the start of the reversed sequence, rather than, by default, using the codons of the corresponding forward frame.
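The two ways of choosing reverse frame codons can be compared with a short sketch. This is a Python illustration under stated assumptions, not the transeq code: only complete codons are listed, frames are given as 1 to 3 for the reverse strand, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def codons(seq, offset):
    """Complete codons of seq, reading from the given 0-based offset."""
    usable = seq[offset:]
    return [usable[i:i + 3] for i in range(0, len(usable) - 2, 3)]


def reverse_frame_codons(seq, frame, alternative=False):
    """Codons read in reverse frame 1, 2 or 3.

    Default: take the codons of the matching forward frame and read each
    one on the opposite strand. Alternative: count codons from the start
    of the reverse-complemented sequence.
    """
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2 or 3")
    if alternative:
        return codons(reverse_complement(seq), frame - 1)
    forward = codons(seq, frame - 1)
    return [reverse_complement(codon) for codon in reversed(forward)]
```

For the 7 base sequence ATGCCCA the default reverse frame 1 reads GGG, CAT (the forward codons ATG, CCC on the other strand), while the alternative reading starts at the end of the sequence and gives TGG, GCA. The two only agree when the sequence length is a multiple of three.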
Added the qualifier '-join' to the program extractfeat. If '-join' is set then joined features, such as 'CDS' and 'mRNA' are output as a single concatenated sequence.
Changed the default output filename from 'stdout' to a file for the following:
infoalign megamerger merger showalign showfeat showseq textsearch
Lindna/cirdna can now draw filled boxes and the user can change the text size on the command-line. They can also read and display complete genomic sequences.
Major new revision of protein structure applications - w/o full documentation.
New applications have been added:
pdbparse.c / acd scopseqs.c / acd scopnr.c / acd seqsearch.c / acd seqwords.c / acd seqalign.c / acd hetparse.c / acd scopreso.c / acd scoprep.c / acd profgen.c / acd funky.c / acd hmmgen.c / acd fraggle.c / acd
Some applications have been deleted:
scope.c / acd nrscope.c / acd psiblasts.c / acd swissparse.c / acd alignwrap.c / acd dichet.c / acd
The deleted applications have been replaced as follows:
coordnew   --> pdbparse (coordnew was deleted a while back)
scope      --> scopparse
nrscope    --> scopnr
psiblasts  --> seqsearch
swissparse --> seqwords
alignwrap  --> seqalign
New versions of code have been committed:
pdbparse.c / acd domainer.c / acd contacts.c / acd interface.c / acd pdbtosp.c / acd scopparse.c / acd scopreso.c / acd scopseqs.c / acd scopnr.c / acd scoprep.c / acd scopalign.c / acd seqsearch.c / acd seqwords.c / acd seqsort.c / acd seqnr.c / acd seqalign.c / acd siggen.c / acd sigscan.c / acd sigplot.c / acd hetparse.c / acd profgen.c / acd funky.c / acd hmmgen.c / acd
Plus ajxyz.c / ajxyz.h
Short summaries of the applications are as follows:
pdbparse  - Parses pdb files and writes cleaned-up protein coordinate files.
domainer  - Reads protein coordinate files and writes domains coordinate files.
contacts  - Reads coordinate files and writes files of intra-chain residue-residue contact data.
interface - Reads coordinate files and writes files of inter-chain residue-residue contact data.
pdbtosp   - Convert raw swissprot:pdb equivalence file to embl-like format.
scopparse - Converts raw scop classification files to a file in embl-like format.
scopreso  - Removes low resolution domains from a scop classification file.
scopseqs  - Adds pdb and swissprot sequence records to a scop classification file.
scopnr    - Removes redundant domains from a scop classification file.
scoprep   - Reorder scop classification file so that the representative structure of each family is given first.
scopalign - Generate alignments for families in a scop classification file by using STAMP.
seqsearch - Generate files of hits for families in a scop classification file by using PSI-BLAST with seed alignments.
seqwords  - Generate files of hits for scop families by searching swissprot with keywords.
seqsort   - Reads multiple files of hits and writes a non-ambiguous file of hits (scop families file) plus a validation file.
seqnr     - Removes redundant hits from a scop families file.
seqalign  - Generate extended alignments for families in a scop families file by using CLUSTALW with seed alignments.
siggen    - Generates a sparse protein signature from an alignment and residue contact data.
sigscan   - Scans a signature against swissprot and writes a signature hits file.
sigplot   - Reads a signature hits file and validation file and generates gnuplot data files of signature performance.
profgen   - Generates various profiles for each alignment in a directory.
hmmgen    - Generates a hidden Markov model for each alignment in a directory.
hetparse  - Converts raw dictionary of heterogen groups to a file in embl-like format.
funky     - Reads clean coordinate files and writes file of protein-heterogen contact data.
Updated "make check" program entrails. Corrected sequence format reports, added report and alignment formats and database access methods.
Added scripts/logreport1.pl to report EMBOSS usage from the logfile. Takes the logfile name on the command line. Reports total use, most active user, and total user count.
Extractseq now only reads one sequence as input.
Fixed error reading multiple databases
Fixed MacOSX reading of incomplete sequence files
Fixed indexing of REFSEQ
New Jemboss authorizing server code. This uses a new set-uid program (jembossctl) to perform tasks as the user.
New alignment output format "match" for wordmatch, reports the length, sequence names, and range in each sequence.
emboss.default.template has been changed to include the new SRSWWW access method and the fields definitions for the test databases.
In dbiblast, renamed the -filename option -filenames to match the other dbi indexing programs, and because wildcard filenames are supported.
Removed the -staden option for the dbi indexing programs. This had no effect (it was originally included to rename files as division.lookup for use by internal utilities at the Sanger Centre).
In qatest.pl test script, added test for missing expected file. Only seen for obsolete secondary output files, no tests were passing that should have failed.
Script (scripts/dbilist.pl) to report the contents of EMBLCD database indices created by dbiflat, dbigcg, dbifasta or dbiblast.
Proxy HTTP access for remote servers. Define EMBOSS_PROXY as an environment variable, or in emboss.default. Can also be set for any database as proxy: "hostname:port" or overridden with proxy: ":" to use a local server for a database. This is used by both the URL and SRSWWW access methods.
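The proxy value rules can be sketched as below. This is a Python illustration of the "hostname:port" and ":" conventions only, not the EMBOSS code. The function names and the example host are hypothetical, and treating a database setting as taking priority over the global one is inferred from the word "overridden" in the entry above.

```python
def parse_proxy(value):
    """Parse a proxy setting of the form "hostname:port".

    Returns (hostname, port), or None when the value is ":" (or empty),
    which means no proxy should be used.
    """
    if value in ("", ":"):
        return None
    host, sep, port = value.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError("proxy must look like hostname:port")
    return host, int(port)


def effective_proxy(database_proxy, global_proxy):
    """Pick the proxy for one database.

    Assumption: a proxy defined for the database overrides the global
    setting, so ":" on the database turns the proxy off for it.
    """
    if database_proxy is not None:
        return parse_proxy(database_proxy)
    if global_proxy is not None:
        return parse_proxy(global_proxy)
    return None
```

A database on a local server can then carry proxy: ":" while all other databases keep using the global proxy.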
New ajListUnique function to remove duplicate nodes in a list.
New embxyz.c / .h embXyzSeqsetNRRange functions added
Report format "table" is the default for several applications. In this format, the sequence USA has been removed because it already appears in the sequence header part of the report. A new format "-rformat nametable" will produce the previous report output for users who are relying on parsing it.
Output files defined with the "nullok" attribute in ACD are not created unless requested. The file name and extension are ignored. It is possible to add a new associated qualifier to control this behaviour, but its use may be confusing with more than one output file.
Precision attribute for report score (default is 3). Other floating point report values are written as strings by the original application so their precision is defined in the code. The score is a float, as part of the internal (GFF) feature structure. A zero value produces an integer score (strictly, it uses %.0f as the format). Set precision for etandem, fuzznuc, fuzzpro, fuzztran, patmatdb, patmatmotifs (integer scores) and restrict (no score)
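The effect of the precision attribute can be shown with a small formatting sketch. It is written in Python for illustration only; format_score is a hypothetical name, and the only behaviour taken from the entry above is the default of 3 and the use of a "%.0f" style format for a zero precision.

```python
def format_score(score: float, precision: int = 3) -> str:
    """Format a report score with a fixed number of decimal places.

    Mirrors a C format of "%.<precision>f"; a precision of 0 corresponds
    to "%.0f", so the score prints as an integer.
    """
    return f"{score:.{precision}f}"


print(format_score(12.5))      # default precision of 3
print(format_score(12.0, 0))   # integer-style score
```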
Report output for equicktandem and etandem, with -origfile to write the original output format for sites (Sanger for example) who still require it. By default, the origfile output file is not created.
Report output for patmatdb and patmatmotifs. For patmatmotifs the prosite documentation appears in the report footer, with the addition of the motif name and the number of matches in the sequence.
Report headers and footers automatically trim last newline.
Reports in -rformat SeqTable right-align numbers.
Report output for marscan (-rformat GFF by default)
Report output for fuzztran (-rformat table with the translation included as a report field). Using -rformat seqtable with fuzztran now also shows the original DNA sequence.
Report output for fuzznuc and fuzzpro (-rformat SeqTable by default)
New report qualifiers -raccshow to include accession in header and -rdesshow to include description in header
Two access methods "file" and "offset" were defined as valid in database definitions, but are really reserved for simple file reading. They are removed from the database access methods list.
Two access methods "cmd" and "nbrf" are obsolete (cmd was never implemented, nbrf is replaced by gcg which includes a query mechanism). Both are removed from the database access methods list, and the source code is commented out.
SRS, SRSFASTA and SRSWWW database access can read all entries. This is not recommended for SRSWWW access because it will read everything into memory - all of EMBL for example - then strip out HTML tags before reading. For SRS it is not recommended because "methodall: direct" is faster. For SRSFASTA it is necessary because using SRSFASTA implies EMBOSS does not read the original data format. However, not implementing an "all" search left a gap in the SRS access methods which would generate a bad SRS command line or URL.
NBRF sequence reading trims last character only if it is '*' to catch cases where SRS reports the sequence as 'plain'
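The trimming rule above amounts to a conditional strip of the final character. A minimal Python sketch follows; the function name is hypothetical and this is not the ajseqread.c code.

```python
def trim_nbrf_terminator(sequence: str) -> str:
    """Drop the final character only when it is the NBRF '*' terminator."""
    if sequence.endswith("*"):
        return sequence[:-1]
    # No terminator (for example a 'plain' sequence from SRS): keep as is.
    return sequence
```

A sequence without the terminator is returned unchanged, so its last residue is no longer lost.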
GCG database text has the spaces in ". ." strings removed.
Database entry text and sequence saved for binary formats (GCG, BLAST) for use by entret and other applications
Dbiblast indices with split databases (formatdb -v) fixed for reading all entries (was only reading the first file)
Dbiblast and dbigcg indices support exclude and file definitions to create database subsets
Database include and file definitions can use the simple filename. In some cases the full path was used. Database files are checked both with and without the directory path for back-compatibility.
SRSWWW access method created to query a remote web server. This is preferred to using URL access because SRS queries can be built.
Sequence objects include the SeqVersion, Keyword list and Taxonomy list.
The GI number is read as an alternative SeqVersion where it is available (GenBank and some NCBI formats). The GI number is reported in GenBank format if available, but the GenBank VERSION line may have only the SeqVersion if, for example, the sequence was read from an EMBL entry. "sv" queries check both the SeqVersion and GI number.
Accession numbers have a strict definition, which covers the old and new EMBL/GenBank format, SwissProt, PIR, and REFSEQ (NM_nnnnnn). Earlier versions would accept any "accession number" in some sequence formats, especially NCBI format.
SeqVersion (EMBL SV line, GenBank VERSION line) is used in preference to accession number where available. Can also be read in FASTA and NCBI formats. Where only the SeqVersion is available, the accession number is generated.
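The entry above does not say how the accession number is generated. One plausible reading, shown here purely as an assumption, is that the version suffix after the "." is dropped. The function name is hypothetical.

```python
def accession_from_seqversion(seqversion: str) -> str:
    """Derive an accession number from a SeqVersion such as 'X51872.1'.

    Assumption for illustration: the accession is the text before the
    first '.', and a value with no '.' is returned unchanged.
    """
    accession, _, _version = seqversion.partition(".")
    return accession
```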
USA queries implement searches by SV, DES, ORG and KEY. These work with SRS access methods (SRS, SRSFASTA, SRSWWW) by building SRS queries, and with direct access (simple file reading) by testing the sequence object.
Key and Org queries are for full keywords (including spaces) and for each level of the taxonomy.
Des queries, if the access method does not provide its own mechanism (that is, it has no index of its own), are applied to words within the description. Words start with a letter or number, and end with a letter or number. SRS typically does the same, but allows a single quote at the end. This catches words such as 3' and 5' but is a problem with some quoted text.
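The word rule for description queries can be approximated with a regular expression. The sketch below is Python written for illustration; the entry does not say which characters may appear inside a word, so allowing any non-space character in the middle is an assumption, as are the names des_words and srs_style.

```python
import re

# A word starts and ends with a letter or digit. Assumption: any
# non-space characters may appear in between.
WORD = r"[A-Za-z0-9](?:\S*[A-Za-z0-9])?"


def des_words(description: str, srs_style: bool = False) -> list:
    """Split a description into query words.

    With srs_style=True a single trailing quote is kept, as SRS does.
    """
    pattern = WORD + ("'?" if srs_style else "")
    return re.findall(pattern, description)


print(des_words("E.coli lacZ gene, 3' end"))
print(des_words("E.coli lacZ gene, 3' end", srs_style=True))
print(des_words("the 'early' promoter", srs_style=True))
```

The last call shows the quoted-text problem: the closing quote of 'early' is kept as part of the word.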
Queries for ID ACC SV DES ORG and KEY are valid for all file access methods, including URL, external, cmd, app, file and by default any new method added. If the internal query data is not flagged by the access method (to show the database has been queried) the sequence object is automatically tested.
Missing description, keyword, organism, or seqversion fields cause queries to fail if they are used on inappropriate data.
Dbiflat, dbigcg, dbifasta and dbiblast can index the new fields. All fields are available in dbiflat and dbigcg. The sv and des fields are available in dbifasta and dbiblast. If any specific formats make it possible to parse the org (or key) field they can be added as new formats.
The new EMBLCD index files are named as follows: des for the descriptions (no obvious standard name), seqvn for the seqversion (no obvious standard name), keyword for keywords (EMBLCD distribution name) and taxon for organism (EMBLCD distribution name). The EMBLCD distribution also included a freetext index which is similar to the SRS alltext search so we did not use the name for the description index.
We are working through the EMBLCD format documentation to make EMBOSS indices more compatible. For example, all tokens in the TRG index files should have trailing spaces. We use a NULL to mark the end of the string.
EMBLCD index files now expand to fit the longest token, including the entryname index which was limited to 12 characters (only one site reported a problem with this in dbifasta with long ID names).
A new qualifier -maxindex sets an upper limit (25 is recommended) to limit the size of all index files. Currently this applies to all indices. We can add separate maxima for each field if needed. We expect very few sites to use the extra index fields as SRS is a simpler alternative.
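The sizing rule in the two entries above (fields grow to the longest token, capped by -maxindex) can be sketched as follows. This is illustrative Python with a hypothetical function name; how tokens longer than the cap are stored is not stated in these notes and is not modelled here.

```python
from typing import Iterable, Optional


def index_token_width(tokens: Iterable[str],
                      maxindex: Optional[int] = None) -> int:
    """Return the field width for one index file.

    The width grows to fit the longest token, unless a maxindex
    upper limit is given, in which case it is capped at that value.
    """
    longest = max((len(token) for token in tokens), default=0)
    if maxindex is not None:
        return min(longest, maxindex)
    return longest
```

With no limit the width follows the data, so long ID names no longer hit the old 12-character entryname restriction.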
New database definition token 'fields' with a list of indexed fields can be set to 'sv des org key' for SRS databases.
USAs check the query field against the database 'fields' definition. ID and ACC are always allowed. dbname:name still searches ID and ACC (no change from previous version)
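The field check can be pictured as a simple membership test against the database 'fields' definition. The Python below is a sketch with hypothetical names; the only details taken from the entries above are the space-separated fields list and the rule that ID and ACC are always allowed.

```python
ALWAYS_ALLOWED = {"id", "acc"}


def field_allowed(field: str, db_fields: str) -> bool:
    """Check a USA query field against a database 'fields' definition.

    db_fields is the space-separated list from the database definition,
    for example "sv des org key".
    """
    name = field.lower()
    return name in ALWAYS_ALLOWED or name in db_fields.lower().split()
```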
USAs with a filename can include the new query fields. The syntax is filename:field:query for example empro.dat:id:eclaci (the extended syntax is because empro.dat-id:eclaci looks like a filename ending in -id)
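The filename:field:query form above can be split with a few lines of code. The sketch below is Python for illustration and covers only this three-part form; the names FileQuery and parse_file_usa are hypothetical, and real USA parsing handles other forms (such as dbname:name) that are not modelled here.

```python
from typing import NamedTuple, Optional


class FileQuery(NamedTuple):
    filename: str
    field: Optional[str]
    query: Optional[str]


# Query fields named in these release notes.
QUERY_FIELDS = {"id", "acc", "sv", "des", "org", "key"}


def parse_file_usa(usa: str) -> FileQuery:
    """Split 'filename:field:query'; anything else is treated as a filename."""
    parts = usa.split(":")
    if len(parts) == 3 and parts[1].lower() in QUERY_FIELDS:
        return FileQuery(parts[0], parts[1].lower(), parts[2])
    return FileQuery(usa, None, None)


print(parse_file_usa("empro.dat:id:eclaci"))
print(parse_file_usa("empro.dat-id"))
```

The second call shows why the colon form is needed: "empro.dat-id" is indistinguishable from an ordinary filename.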
Application 'tranalign' added. This aligns nucleic coding regions based on a set of aligned proteins.
Est2genome fixed for large alignments (over 40Mbase for est * genomic sequence length).
Sequence reading for ABI files fixed (and selex files tested).
Genbank feature input working.
Pepinfo PNG output larger to make the text readable (only affects PNG output).
Empty sequence file input fails gracefully.
Empty sequence input fails gracefully (and only needs one ^D from stdin).
Seqretall, seqretallfeat and seqretset moved to 'make check'. Seqret has all the functionality of the above.
Fix for NBRF accession number reading (ajseqread.c).
Whichdb program added.
Fix for dbifasta and wormpep.
Fix for problem reading plain format sequences by primer3.
Primer3 renamed eprimer3 to avoid conflicts with the Whitehead Institute's Primer3 version 3.0.6.
Transeq's '-frame' can have a list of values, as: '-frame=1,2,3'.
Non-existent files in lists are again ignored.
Various wildcard database search fixes.
ESIM4 added as an embassy package.
New applications: Biosed, Contacts, Dichet, Psiblasts, Scopalign, Sigscan, Siggen.
Configure tidy.
Alignment report fixes.
Jemboss.
More formats for reports and alignments.
Release of HMMER as an embassy package.
DBIGCG bugfix
New feature table handling etc.
Fix emboss.default.template problem
New applications showalign and embossversion.
Prophet fixed.
New applications distmat and cai.
New applications charge and degapseq.
Bug fixes of marscan, getorf and garnier
New applications scope, nrscope, domainer.
Initial large file model support.
New applications abiview and recode.
Linked list and string iterator code rewritten.
New application coderet.
Corba test routines
New application entret.
GCG output style changed.
Fixed -slower & -supper input options for multiple sequences
Further mods for seqed files.
Rewrite of profile core routines.
Added %id, %sim and fasta output to needle and water.
Now reads GCG seqed mangled files.
Phylip output fixed.
Numerous minor changes.
RedHat Linux 7.0 fpos_t fix
New application cons.
URL access handles new SRS6.07* format.
Library and applications leak-free.
Error messages made less daunting.
dbigcg changes for genbank.
Memory leaks plugged.
Added blast multi-volume support for database indexing.
More gui hints in ACD files.
LinuxPPC support added.
dbigcg changes for embl database in GCG format.
Changes to graphics data output for GUIs.
New application emowse.
tfm corrected.
HTML documentation corrected.
More GUI work.
Changes to graphics data output for GUIs.
Minor library changes.
New application silent
Indexing filenamelen fix.
Modification to diffseq.
New applications vectorstrip and diffseq.