cons calculates a consensus sequence from a multiple sequence alignment. To obtain the consensus, the sequence weights and a scoring matrix are used to calculate a score for each amino acid residue or nucleotide at each position in the alignment. The highest scoring residue goes into the consensus sequence if the score is higher than a user-specified "plurality" value; otherwise there is no consensus at that position.
The score at each position in the alignment is calculated as follows. The residue (or nucleotide) i in an alignment column is compared to all other residues (j) in the same column. The score for i is the sum, over all residues j other than i itself, of score(ij)*weight(j), where score(ij) is taken from a nucleotide or protein scoring matrix (see the -datafile qualifier) and weight(j) is the weight of sequence j, as given in the alignment file.
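The column scoring described above can be sketched in Python. This is an illustrative sketch only, not the cons source code: the function name `column_scores`, the dictionary-based matrix and the toy match/mismatch values are assumptions made for the example, scores are accumulated per residue type, and gap characters are not handled.

```python
from collections import defaultdict

# Hypothetical toy matrix: +5 for a match, -4 for a mismatch.
# A real run would read the values from the matrix named by -datafile.
TOY_MATRIX = {(a, b): (5 if a == b else -4) for a in "ACGT" for b in "ACGT"}


def column_scores(column, weights, matrix):
    """Score each residue type in one alignment column.

    column  -- one residue per sequence, e.g. ['A', 'A', 'C']
    weights -- per-sequence weights, in the same order as column
    matrix  -- dict mapping (residue_i, residue_j) to a score
    """
    scores = defaultdict(float)
    for i, res_i in enumerate(column):
        for j, res_j in enumerate(column):
            if i == j:
                continue  # a sequence is not compared with itself
            # score(ij) * weight(j), accumulated for the residue type of i
            scores[res_i] += matrix[(res_i, res_j)] * weights[j]
    return dict(scores)


if __name__ == "__main__":
    print(column_scores(["A", "A", "C"], [1.0, 1.0, 1.0], TOY_MATRIX))
```

With three equally weighted sequences and the column A, A, C, the two A residues each score 5 - 4 = 1, giving 2 for A, while C scores -4 - 4 = -8.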
The highest scoring type of residue is then found in the column. If the number of "positive matches" (see below) for this residue is greater than the "plurality value" (see below), then this residue is the consensus residue. Otherwise there is no consensus for that position and an 'n' (nucleotide sequence alignment) or an 'x' (protein sequence alignment) character is written to the consensus sequence.
The positive matches for a residue i are calculated as being the sum of the corresponding sequence weights for all the residues that increase the score of residue i (i.e. that have a positive score). The "plurality" qualifier sets the cut-off for the number of positive matches (weighted) below which there is no consensus.
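A minimal Python sketch of the positive-match count and the plurality test follows. The function names, the toy matrix and the choice of 'n' as the no-consensus character are assumptions for illustration. The sketch also assumes that sequences carrying the top-scoring residue itself count as positive matches, and that the comparison is strictly "greater than", as stated above.

```python
# Hypothetical toy matrix: +5 for a match, -4 for a mismatch.
TOY_MATRIX = {(a, b): (5 if a == b else -4) for a in "ACGT" for b in "ACGT"}


def positive_matches(best, column, weights, matrix):
    """Sum the weights of sequences whose residue scores positively against best."""
    return sum(w for res, w in zip(column, weights) if matrix[(best, res)] > 0)


def consensus_residue(best, column, weights, matrix, plurality, no_consensus="n"):
    """Return best if its positive matches exceed plurality, else no_consensus."""
    matches = positive_matches(best, column, weights, matrix)
    return best if matches > plurality else no_consensus


if __name__ == "__main__":
    column, weights = ["A", "A", "C"], [1.0, 1.0, 1.0]
    default_plurality = sum(weights) / 2  # half the total sequence weight
    print(consensus_residue("A", column, weights, TOY_MATRIX, default_plurality))
```

In this example the positive matches for A total 2.0, which exceeds the default plurality of 1.5, so A is the consensus residue. Raising the plurality to 2.0 would leave the position without a consensus.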
% cons
Create a consensus sequence from a multiple alignment
Input (aligned) sequence set: dna.msf
output sequence [dna.fasta]: aligned.cons
Standard (Mandatory) qualifiers:
  [-sequence] seqset File containing a sequence alignment.
  [-outseq] seqout [
|Standard (Mandatory) qualifiers||Allowed values||Default|
|[-sequence]||File containing a sequence alignment.||Readable set of sequences||Required|
|[-outseq]||Sequence filename and optional format (output USA)||Writeable sequence||<*>.format|
|Additional (Optional) qualifiers||Allowed values||Default|
|-datafile||This is the scoring matrix file used when comparing sequences. By default it is the file 'EBLOSUM62' (for proteins) or the file 'EDNAFULL' (for nucleic sequences). These files are found in the 'data' directory of the EMBOSS installation.||Comparison matrix file in EMBOSS data path||EBLOSUM62 for protein, EDNAFULL for DNA|
|-plurality||Set a cut-off for the number of positive matches below which there is no consensus. The default plurality is taken as half the total weight of all the sequences in the alignment.||Any numeric value||Half the total sequence weighting|
|-identity||Sets the required number of identities at a site for it to give a consensus at that position. If this is set to the number of sequences in the alignment, only columns of identities contribute to the consensus.||Integer 0 or more||0|
|-setcase||Sets the threshold for the positive matches above which the consensus is in upper-case and below which the consensus is in lower-case.||Any numeric value||@( $(sequence.totweight) / 2)|
|-name||Name of the consensus sequence||Any string is accepted||An empty string is accepted|
|Advanced (Unprompted) qualifiers||Allowed values||Default|
!!NA_MULTIPLE_ALIGNMENT

dna.msf MSF: 120 Type: N January 01, 1776 12:00 Check: 3196 ..

Name: MSFM1 Len: 120 Check: 8587 Weight: 1.00
Name: MSFM2 Len: 120 Check: 6178 Weight: 1.00
Name: MSFM3 Len: 120 Check: 8431 Weight: 1.00

//

MSFM1 ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTACGT ACGTACGTAC
MSFM2 ACGTACGTAC GTACGTACGT ....ACGTAC GTACGTACGT ACGTACGTAC
MSFM3 ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTACGT CGTACGTACG

MSFM1 GTACGTACGT ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTACGT
MSFM2 GTACGTACGT ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTACGT
MSFM3 TACGTACGTA CGTACGTACG TACGTACGTA ACGTACGTAC GTACGTACGT

MSFM1 ACGTACGTAC GTACGTACGT
MSFM2 ACGTACGTTG CAACGTACGT
MSFM3 ACGTACGTAC GTACGTACGT
>EMBOSS_001
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
cons uses the standard set of scoring matrix data files in the EMBOSS data directory.
EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA.
To see the available EMBOSS data files, run:
% embossdata -showall
To fetch one of the data files (for example 'Exxx.dat') into your current directory for you to inspect or modify, run:
% embossdata -fetch -file Exxx.dat
Users can provide their own data files in their own directories. Project-specific files can be put in the current directory or, for tidier directory listings, in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory or, again, in a subdirectory called ".embossdata".
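The directories are searched in the following order:
. (your current directory)
.embossdata (under your current directory)
~/ (your home directory)
~/.embossdata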
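A lookup of this kind might be sketched in Python as shown here. The sketch assumes the search order is the current directory, its ".embossdata" subdirectory, the home directory, then its ".embossdata" subdirectory. The helper name `find_data_file` is hypothetical, and the fallback to the standard EMBOSS data directory is not modelled.

```python
from pathlib import Path


def find_data_file(name, cwd=None, home=None):
    """Return the first matching user data file, or None if there is none.

    Assumed search order (an illustration, not the EMBOSS implementation):
    current directory, ./.embossdata, home directory, ~/.embossdata.
    """
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    home = Path(home) if home is not None else Path.home()
    candidates = [cwd, cwd / ".embossdata", home, home / ".embossdata"]
    for directory in candidates:
        path = directory / name
        if path.is_file():
            return path
    return None


if __name__ == "__main__":
    # 'Exxx.dat' is the placeholder file name used in the text above.
    print(find_data_file("Exxx.dat"))
```

Passing `cwd` and `home` explicitly keeps the function easy to exercise against temporary directories.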
The "identity" qualifier provides an additional constraint to "plurality" when determining a consensus residue at an alignment site. "identity" sets the required number of identities at a site for it to be included in the consensus. If, for example, this is set to the number of sequences in the alignment, then only a site with the same residue in all sequences would be included in the consensus.
The "setcase" qualifier sets the threshold for the positive matches above which the consensus residue is given in upper-case and below which it is in lower-case.
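The effect of these two qualifiers on a residue that has already passed the plurality test can be sketched in Python. The function name `apply_identity_and_case` and the 'n' no-consensus character are assumptions for the example. The text does not say how a value exactly equal to the "setcase" threshold is treated, so the sketch arbitrarily writes it in lower-case.

```python
def apply_identity_and_case(residue, column, matches, identity, setcase,
                            no_consensus="n"):
    """Apply the identity constraint, then choose the case of the residue.

    residue  -- candidate consensus residue for this column
    column   -- one residue per sequence
    matches  -- weighted positive matches already computed for residue
    identity -- required number of sequences carrying exactly this residue
    setcase  -- positive-match threshold for writing the residue in upper-case
    """
    identical = sum(1 for res in column if res == residue)
    if identical < identity:
        return no_consensus  # too few identities at this site
    return residue.upper() if matches > setcase else residue.lower()


if __name__ == "__main__":
    column = ["A", "A", "C"]
    print(apply_identity_and_case("A", column, 2.0, identity=0, setcase=1.5))
    print(apply_identity_and_case("A", column, 2.0, identity=3, setcase=1.5))
```

In the first call the residue is accepted and written in upper-case. In the second, "identity" is set to the number of sequences, so the mixed column gives no consensus.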
|consambig||Create an ambiguous consensus sequence from a multiple alignment|
|megamerger||Merge two large overlapping DNA sequences|
|merger||Merge two overlapping sequences|