There are many different programs in EMBOSS that do many different types of alignments. Some of these programs have been incorporated into EMBOSS from pre-existing programs and some were specially written for it.
The resulting assortment of programs was starting to produce alignment output in a variety of different formats. It was decided that from EMBOSS version 2.2 onwards there should be a set of standard alignment formats.
One notable set of alignment formats (markx0, markx1, markx2, markx3, markx10) is derived from the programs written by Bill Pearson, known as the FASTA suite because the FASTA programs were a major part of it. (The FASTA sequence format was devised by Bill Pearson for use by this software suite.) The A2M alignment format is also a FASTA format in which gap characters in sequences are permitted.
If you only intend to look at the resulting alignments and not read them into any other programs, then it is still worth having a standard set of formats as you will very quickly get used to the look and feel of a format and be able to compare the alignments from different programs more easily.
It is often convenient to have different alignment formats produced by the same program for different purposes. Depending on what you wish to do with the result, it may be better to have a human-readable alignment for publication, or an alignment that is also a standard sequence format and can be read by another program for further analysis.
Different programs have different default alignment formats. You may accept the default or choose your preferred format when you run the program.
The exception to this rule is msf format which uses '.' as the gap character inside the sequences and '~' as the gap character at the terminal ends of the alignment. The reason for this inconsistency is that MSF is a multiple sequence alignment format that was not defined by EMBOSS but by another package.
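The MSF gap convention described above can be illustrated with a minimal Python sketch. The function name `to_msf_gaps` is hypothetical, and the use of '-' as the incoming gap character is an assumption made for the example; this is not EMBOSS code.

```python
def to_msf_gaps(aligned: str, gap: str = "-") -> str:
    """Rewrite gaps in one aligned sequence using the MSF convention.

    Terminal gaps (leading or trailing runs) become '~';
    gaps inside the sequence become '.'.
    The input gap character '-' is an assumption for this sketch.
    """
    without_lead = aligned.lstrip(gap)
    lead = len(aligned) - len(without_lead)      # number of leading gaps
    core = without_lead.rstrip(gap)
    trail = len(without_lead) - len(core)        # number of trailing gaps
    return "~" * lead + core.replace(gap, ".") + "~" * trail


print(to_msf_gaps("--ACD-EF---"))  # ~~ACD.EF~~~
```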
    ########################################
    # Program: demoalign
    # Rundate: Thu Jan 17 09:30:08 2002
    # Report_file: stdout
    ########################################
    #=======================================
    #
    # Aligned_sequences: 4
    # 1: IXI_234
    # 2: IXI_235
    # 3: IXI_236
    # 4: IXI_237
    # Matrix: EBLOSUM62
    # Gap_penalty: 9
    # Extend_penalty: -1
    #
    # Length: 131
    # Identity: 95/131 (72.5%)
    # Similarity: 127/131 (96.9%)
    # Gaps: 25/131 (19.1%)
    #
    #
    #=======================================
There is also a block of information at the end of the alignment for summary information. This is used by a few programs e.g. merger.
# Length: 131
This is the length of the alignment, including any gaps that have been introduced to construct the alignment.
# Identity: 95/131 (72.5%)
This is a count of the number of positions over the length of the alignment where all of the residues or bases at that position are identical.
It is followed by '/131' - the length of the alignment and '(72.5%)' - the percentage of positions in the alignment where there are identities.
# Similarity: 127/131 (96.9%)
This is a count of the number of positions over the length of the alignment where >= 51% of the residues or bases at that position are similar.
Any two residues or bases are defined as similar when they have positive comparisons (as defined by the comparison matrix being used in the alignment algorithm).
It is followed by '/131' - the length of the alignment and '(96.9%)' - the percentage of positions in the alignment where there are similarities.
Note that the identity and similarity percentages can add up to more than 100%. This is because the count of similar positions includes the count of identical positions; if residues are identical, they must also be similar.
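As an illustration of how these count lines fit together, the following Python sketch reads lines of the form `# Name: count/length (percent%)` and recomputes each percentage. The regular expression and the function name `parse_counts` are assumptions made for this example, not part of EMBOSS.

```python
import re

# Sample header lines copied from the demoalign example in this document.
HEADER = """\
# Length: 131
# Identity: 95/131 (72.5%)
# Similarity: 127/131 (96.9%)
"""

# Matches '# Name: count/length (percent%)'; the Length line has no '/'
# and is therefore skipped.
COUNT_LINE = re.compile(r"^#\s*(\w+):\s*(\d+)/(\d+)\s*\(\s*([\d.]+)%\)")


def parse_counts(text):
    """Return {name: (count, length, percent)} for each count line."""
    counts = {}
    for line in text.splitlines():
        match = COUNT_LINE.match(line)
        if match:
            name, count, length, pct = match.groups()
            counts[name] = (int(count), int(length), float(pct))
    return counts


counts = parse_counts(HEADER)
for name, (count, length, pct) in counts.items():
    # The reported percentage is count / length, to one decimal place.
    print(name, pct, round(100.0 * count / length, 1))

# Identical positions are a subset of similar positions, so the two
# percentages may sum to more than 100.
print(counts["Identity"][0] <= counts["Similarity"][0])
```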
# Gaps: 25/131 (19.1%)
This is a count of the number of positions over the length of the alignment where there are one or more sequences with a gap.
It is followed by '/131' - the length of the alignment and '(19.1%)' - the percentage of positions in the alignment where there are gaps.
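The Length, Identity and Gaps definitions can be sketched in Python as follows. This is a simplified illustration, not the EMBOSS implementation: the function names are hypothetical, the set of gap characters is an assumption, and a column that contains a gap is assumed never to count as an identity. Similarity is omitted because it depends on the comparison matrix.

```python
def column_stats(seqs, gap_chars="-.~"):
    """Count alignment length, identical columns and gapped columns.

    seqs: aligned sequences of equal length.
    gap_chars: characters treated as gaps (an assumption for this sketch).
    """
    if not seqs or len({len(s) for s in seqs}) != 1:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    identity = 0
    gaps = 0
    for column in zip(*seqs):
        if any(c in gap_chars for c in column):
            gaps += 1            # one or more sequences have a gap here
        elif len({c.upper() for c in column}) == 1:
            identity += 1        # every residue in the column is the same
    return {"length": len(seqs[0]), "identity": identity, "gaps": gaps}


def format_count(count, length):
    """Format a count in the 'count/length (percent%)' style of the header."""
    return f"{count}/{length} ({100.0 * count / length:.1f}%)"


aligned = ["ACD-EFGH", "ACDWEFGH", "ACD-EYGH"]
stats = column_stats(aligned)
print("# Length:", stats["length"])
print("# Identity:", format_count(stats["identity"], stats["length"]))
print("# Gaps:", format_count(stats["gaps"], stats["length"]))
```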
# Score: 100.0
This is the score used by the program that calculated the alignment to determine which is the best possible alignment to report.
The algorithm that was used to derive the score is not part of the alignment formatting routines. You should see documentation about the relevant algorithm to see how the score is derived.
In general the markup line uses a space for a mismatch or a gap, '.' for any small positive score, ':' for a similarity which scores more than 1.0, and '|' for an identity where both sequences have the same residue regardless of its score ('W' matching 'W' scores much more than 'L' matching 'L' because a conserved tryptophan is more significant than a conserved leucine).
The 'markx' set of alignment formats (produced by the FASTA suite of programs written by Bill Pearson) use '.' for similarity and ':' for an identity. The '|' character is not used. This was a design decision by Bill Pearson when he wrote the FASTA programs.
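The two markup conventions can be compared with a short Python sketch. The scores in `TOY_SCORES` are hypothetical values chosen for illustration; a real program would look them up in the full comparison matrix. The function names and the `style` parameter are also assumptions for this example.

```python
# Hypothetical scores for illustration only, not a real comparison matrix.
TOY_SCORES = {
    ("L", "I"): 2.0,
    ("S", "A"): 1.0,
    ("L", "W"): -2.0,
}


def pair_score(a, b):
    """Look up a symmetric score; unknown mismatches default to -1.0."""
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), -1.0))


def markup_line(seq1, seq2, style="pair", gap="-"):
    """Build the markup line for two aligned sequences.

    style="pair":  ' ' mismatch or gap, '.' small positive score,
                   ':' score above 1.0, '|' identity.
    style="markx": ' ' mismatch or gap, '.' similarity, ':' identity.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for a, b in zip(seq1, seq2):
        if a == gap or b == gap:
            marks.append(" ")
        elif a == b:
            marks.append("|" if style == "pair" else ":")
        else:
            score = pair_score(a, b)
            if score <= 0:
                marks.append(" ")
            elif style == "markx" or score <= 1.0:
                marks.append(".")
            else:
                marks.append(":")
    return "".join(marks)


seq1, seq2 = "WLS-L", "WIAKW"
print(repr(markup_line(seq1, seq2, style="pair")))
print(repr(markup_line(seq1, seq2, style="markx")))
```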
Some of the alignment formats can cope with an unlimited number of sequences, while others are only for pairs of sequences.
The format names in these tables link to examples of the sequences.
Name | Comments |
---|---|
unknown / multiple / simple | These are synonyms for simple format. This format displays the sequence names, positions and sequences and then puts the markup line underneath the sequences. When only two sequences are being aligned, the format is changed to that produced by pair. |
fasta / a2m | This is just the standard Fasta sequence format with gaps, where many sequences are concatenated one after the other. |
msf | This is just the standard MSF sequence format. |
trace | This is a special verbose format for use in debugging. It is not intended for normal users. |
srs | This shows the sequence ID name, the sequence position, the sequence and the sequence position for each line. |
Name | Comments |
---|---|
pair | This is the default format used when there are only 2 sequences. When simple format is selected but there are only 2 sequences, then this format is used. The sequences have the markup line between them. |
markx0 | This is the standard default output format used by Bill Pearson's suite of FASTA programs. |
markx1 | This is an alternative output format used by Bill Pearson's suite of FASTA programs in which identities are not marked. Instead conservative replacements are denoted by 'x' and non-conservative substitutions by 'X'. |
markx2 | This is an alternative output format used by Bill Pearson's suite of FASTA programs in which the residues in the second sequence are only shown if they are different from the first. |
markx3 | This is an alternative output format used by Bill Pearson's suite of FASTA programs in which the aligned sequences are displayed in FASTA sequence format. These can be used to build a primitive multiple alignment. |
markx10 | This is an alternative output format used by Bill Pearson's suite of FASTA programs in which the aligned sequences are displayed in FASTA sequence format and the sequence length, alignment start and stop information is given in lines starting with a ';' character just after the title line for each sequence. It is intended to be easily parsed by other programs. |
srspair | This is very similar in style to pair format. |
score | This does not display the sequence alignment. It shows only the names of the sequences, the length of the alignment and the score. |
You are not restricted to these default formats. You can use any of the multiple-sequence formats if you have a multiple alignment, and you can also use any of the pairwise formats if you have exactly two aligned sequences.
You specify the required format by putting the qualifier -aformat followed by the format name on the command line, for example:
water -aformat msf
Qualifier | Type | Description |
---|---|---|
-aformat | string | alignment format |
-aopenfile | string | alignment file name |
-aextension | string | file name extension |
-aname | string | base file name |
-awidth | int | alignment width |
-ausashow | bool | show the full USA in the alignment |
-adirectory | bool | alignment file output directory |

Of these, -awidth and -ausashow are likely to be the most useful.
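When EMBOSS programs are called from a script, the alignment qualifiers can be assembled programmatically. The following Python sketch only builds the argument list and does not run anything. The helper name `alignment_qualifiers` is hypothetical, the set of accepted names is taken from the format tables in this document, and a real command would also need the program's own input and output qualifiers.

```python
# Format names as listed in the tables of this document.
KNOWN_FORMATS = {
    "unknown", "multiple", "simple", "fasta", "a2m", "msf", "trace", "srs",
    "pair", "markx0", "markx1", "markx2", "markx3", "markx10",
    "srspair", "score",
}


def alignment_qualifiers(aformat=None, awidth=None):
    """Return a list of command-line arguments for the given settings.

    Only -aformat and -awidth are handled in this sketch.
    """
    args = []
    if aformat is not None:
        if aformat not in KNOWN_FORMATS:
            raise ValueError(f"unrecognised alignment format: {aformat}")
        args += ["-aformat", aformat]
    if awidth is not None:
        if int(awidth) <= 0:
            raise ValueError("alignment width must be positive")
        args += ["-awidth", str(int(awidth))]
    return args


command = ["water"] + alignment_qualifiers(aformat="msf", awidth=60)
print(" ".join(command))  # water -aformat msf -awidth 60
```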