Alignment Formats

When an EMBOSS program aligns two or more sequences, it writes the resulting alignment to a file.

There are many different programs in EMBOSS that do many different types of alignments. Some of these programs have been incorporated into EMBOSS from pre-existing programs and some were specially written for it.

The resulting assortment of programs was starting to produce alignment output in a variety of different formats, so it was decided that from EMBOSS version 2.2 onwards there should be a set of standard alignment formats.

One notable set of alignment formats (markx0, markx1, markx2, markx3, markx10) is derived from the programs written by Bill Pearson. These programs are known as the FASTA suite, because the FASTA programs were a major part of them. (The FASTA sequence format was devised by Bill Pearson for use by this software suite.) The A2M alignment format is also a FASTA format in which gap characters in sequences are permitted.


Why have standard formats?

Standardising on a set of formats enables programs to be written that can read in the results from many different programs.

Even if you only intend to look at the resulting alignments and not read them into any other programs, a standard set of formats is still worthwhile. You will quickly get used to the look and feel of a format and be able to compare the alignments from different programs more easily.

It is often convenient to have different alignment formats produced by the same program for different purposes. Depending on what you wish to do with the result, it may be better to have a human-readable alignment for publication, or an alignment that is also a standard sequence format which can be read in by another program for further analysis.

Different programs will have different default alignment formats. You may accept the default or choose your preferred format when you run the program.


What are the formats?

Gaps in sequences

In all EMBOSS alignment formats, gaps that have been introduced into the sequences to make them align are indicated by the '-' character.

The exception to this rule is msf format which uses '.' as the gap character inside the sequences and '~' as the gap character at the terminal ends of the alignment. The reason for this inconsistency is that MSF is a multiple sequence alignment format that was not defined by EMBOSS but by another package.
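
As a small illustration of the two gap conventions described above, the following Python sketch rewrites MSF-style gap characters as '-'. The function name normalise_gaps is hypothetical and is not part of EMBOSS; it simply assumes that '.' and '~' only ever appear as gap characters in the aligned sequence.

```python
def normalise_gaps(aligned_seq: str) -> str:
    """Replace the MSF gap characters '.' and '~' with '-'."""
    return aligned_seq.replace(".", "-").replace("~", "-")


# Terminal gaps ('~') and internal gaps ('.') both become '-'.
print(normalise_gaps("~~TSPASIRPP..AGPSS~~"))
```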

Head and tail of the format

The majority of the alignment formats (except those that are also standard sequence formats, like fasta or MSF) have a block of information at the start of the alignment describing the program, date, output filename, ID names of the sequences and some of the parameters and statistics of the alignment.


########################################
# Program:  demoalign
# Rundate:  Thu Jan 17 09:30:08 2002
# Report_file: stdout
########################################
#=======================================
#
# Aligned_sequences: 4
# 1: IXI_234
# 2: IXI_235
# 3: IXI_236
# 4: IXI_237
# Matrix: EBLOSUM62
# Gap_penalty: 9
# Extend_penalty: -1
#
# Length: 131
# Identity:      95/131 (72.5%)
# Similarity:   127/131 (96.9%)
# Gaps:          25/131 (19.1%)
#
#
#=======================================   

There is also a block of information at the end of the alignment for summary information. This is used by a few programs e.g. merger.
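
Because the header lines follow a regular '# Name: value' pattern, they are easy to read with a short script. The Python sketch below is one possible approach, not an EMBOSS tool: the function name parse_header is hypothetical, and it assumes that field names consist only of letters and underscores, as in the example above. Numbered lines such as '# 1: IXI_234' are deliberately skipped.

```python
import re

# A shortened copy of the example header shown above.
HEADER = """\
########################################
# Program:  demoalign
# Rundate:  Thu Jan 17 09:30:08 2002
########################################
#=======================================
#
# Aligned_sequences: 4
# 1: IXI_234
# Matrix: EBLOSUM62
# Length: 131
# Identity:      95/131 (72.5%)
#
#=======================================
"""


def parse_header(text: str) -> dict:
    """Collect '# Name: value' lines into a dictionary of strings."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"#\s*([A-Za-z_]+):\s*(.*\S)", line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


fields = parse_header(HEADER)
print(fields["Program"], fields["Length"], fields["Identity"])
```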

Length

The header block contains a line similar to:

# Length: 131

This is the length of the alignment, including any gaps that have been introduced to construct the alignment.

Identity

The header block contains a line similar to:

# Identity:      95/131 (72.5%)

This is a count of the number of positions over the length of the alignment where all of the residues or bases at that position are identical.

It is followed by '/131' - the length of the alignment and '(72.5%)' - the percentage of positions in the alignment where there are identities.
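
The identity count described above can be sketched in a few lines of Python. This is only an illustration of the definition, not the EMBOSS implementation: the function name identity is hypothetical, and the sketch assumes that a column containing a gap ('-') is never counted as an identity.

```python
def identity(rows):
    """Return (count, length, percent) of columns where all residues match."""
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("aligned sequences must all have the same length")
    count = sum(
        1
        for column in zip(*rows)
        if "-" not in column and len(set(column)) == 1
    )
    return count, length, 100.0 * count / length


# Three toy aligned sequences of length 5.
count, length, percent = identity(["ACG-T", "ACGAT", "ACCAT"])
print(f"# Identity: {count}/{length} ({percent:.1f}%)")
```

The same arithmetic reproduces the figure in the example header: 95 identical positions out of 131 is 72.5% when rounded to one decimal place.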

Similarity

The header block contains a line similar to:

# Similarity:   127/131 (96.9%)

This is a count of the number of positions over the length of the alignment where >= 51% of the residues or bases at that position are similar.

Any two residues or bases are defined as similar when they have positive comparisons (as defined by the comparison matrix being used in the alignment algorithm).

It is followed by '/131' - the length of the alignment and '(96.9%)' - the percentage of positions in the alignment where there are similarities.

Note that the identity and similarity percentages can add up to more than 100% (here 72.5% + 96.9%). This is because the count of similar positions includes the count of identical positions; if residues are identical, they must also be similar.

Gaps

The header block contains a line similar to:

# Gaps:          25/131 (19.1%)

This is a count of the number of positions over the length of the alignment where there are one or more sequences with a gap.

It is followed by '/131' - the length of the alignment and '(19.1%)' - the percentage of positions in the alignment where there are gaps.
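
The gap count can be illustrated in the same way. In this Python sketch (a hypothetical helper named gap_positions, not EMBOSS code), a position is counted once if at least one sequence has a gap there, however many sequences are gapped.

```python
def gap_positions(rows, gap_chars="-"):
    """Return (count, length, percent) of columns containing at least one gap."""
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("aligned sequences must all have the same length")
    count = sum(
        1
        for column in zip(*rows)
        if any(residue in gap_chars for residue in column)
    )
    return count, length, 100.0 * count / length


# Three toy aligned sequences of length 6.
count, length, percent = gap_positions(["AC--GT", "ACG-GT", "A-GTGT"])
print(f"# Gaps: {count}/{length} ({percent:.1f}%)")
```

Passing gap_chars=".~" would apply the same count to sequences that use the msf gap characters.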

Score

The header block may contain a line similar to:

# Score: 100.0

This is the score used by the program that calculated the alignment to determine which is the best possible alignment to report.

The algorithm that was used to derive the score is not part of the alignment formatting routines. You should see documentation about the relevant algorithm to see how the score is derived.

Markup Line

The markup line is the line commonly placed between the two sequences of a pairwise alignment, or at the bottom of an alignment of 3 or more sequences, that shows where sequences are mismatched, gapped, identical or similar.

In general the markup line uses a space for a mismatch or a gap, '.' for any small positive score, ':' for a similarity which scores more than 1.0, and '|' for an identity where both sequences have the same residue regardless of its score ('W' matching 'W' scores much more than 'L' matching 'L' because a conserved tryptophan is more significant than a conserved leucine).
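
The general rule above can be sketched for a pairwise alignment as follows. This is a minimal Python illustration under stated assumptions, not the EMBOSS routine: markup_line and toy_score are hypothetical names, the scores in TOY_SCORES are illustrative values standing in for a real comparison matrix, and "small positive score" is taken to mean greater than 0 and at most 1.0.

```python
def markup_line(seq_a, seq_b, score):
    """Build a markup line: ' ' mismatch or gap, '.' small positive score,
    ':' score above 1.0, '|' identity."""
    marks = []
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            marks.append(" ")   # gap in either sequence
        elif a == b:
            marks.append("|")   # identity, whatever its score
        else:
            s = score(a, b)
            if s > 1.0:
                marks.append(":")
            elif s > 0.0:
                marks.append(".")
            else:
                marks.append(" ")
    return "".join(marks)


# Illustrative pair scores only; a real program would use its comparison matrix.
TOY_SCORES = {frozenset("IV"): 3.0, frozenset("ST"): 1.0}


def toy_score(a, b):
    """Look up an unordered residue pair, defaulting to a negative score."""
    return TOY_SCORES.get(frozenset((a, b)), -1.0)


seq_a = "WIS-L"
seq_b = "WVTAK"
print(seq_a)
print(markup_line(seq_a, seq_b, toy_score))
print(seq_b)
```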

The 'markx' set of alignment formats (produced by the FASTA suite of programs written by Bill Pearson) use '.' for similarity and ':' for an identity. The '|' character is not used. This was a design decision by Bill Pearson when he wrote the FASTA programs.

Format names

The formats have been given names that correspond to the names of existing alignment styles or programs.

Some of the alignment formats can cope with an unlimited number of sequences, while others are only for pairs of sequences.

The format names in these tables link to examples of the sequences.

Multiple sequence formats

Name                         Comments
unknown, multiple, simple    These are synonyms for simple format. This format displays the sequence names, positions and sequences and then puts the markup line underneath the sequences. When only two sequences are being aligned then the format is changed to that produced by pair.
fasta / a2m                  This is just the standard Fasta sequence format with gaps, where many sequences are concatenated one after the other.
msf                          This is just the standard MSF sequence format.
trace                        This is a special verbose format for use in debugging. It is not intended for normal users.
srs                          This shows the sequence ID name, the sequence position, the sequence and the sequence position for each line.

Pair-wise sequence formats

Name       Comments
pair       This is the default format used when there are only 2 sequences. When simple format is selected but there are only 2 sequences, then this format is used. The sequences have the markup line between them.
markx0     This is the standard default output format used by Bill Pearson's suite of FASTA programs.
markx1     This is an alternative output format used by Bill Pearson's suite of FASTA programs in which identities are not marked. Instead conservative replacements are denoted by 'x' and non-conservative substitutions by 'X'.
markx2     This is an alternative output format used by Bill Pearson's suite of FASTA programs in which the residues in the second sequence are only shown if they are different from the first.
markx3     This is an alternative output format used by Bill Pearson's suite of FASTA programs in which the aligned sequences are displayed in FASTA sequence format. These can be used to build a primitive multiple alignment.
markx10    This is an alternative output format used by Bill Pearson's suite of FASTA programs in which the aligned sequences are displayed in FASTA sequence format and the sequence length, alignment start and stop information is given in lines starting with a ';' character just after the title line for each sequence. It is intended to be easily parsed by other programs.
srspair    This is very similar in style to pair format.
score      This does not display the sequence alignment. It shows only the names of the sequences, the length of the alignment and the score.

Changing the format

Each program that writes an alignment has a default alignment format defined for that program. This format is usually simple for multiple alignments and pair for pair-wise alignments, but programs derived from Bill Pearson's FASTA suite usually default to markx0.

You are not restricted to these default formats. You can use any multiple sequence format if you have a multiple alignment, and you can use any format, multiple or pair-wise, if you have two aligned sequences.

You specify the required format by putting the qualifier -aformat followed by the format name on the command line, for example:

water -aformat msf

Command-line qualifiers

There are other command-line qualifiers that change the behaviour of the alignment output.

  -aformat            string     alignment format
  -aopenfile          string     alignment file name
  -aextension         string     file name extension
  -aname              string     base file name
  -awidth             int        alignment width
  -ausashow           bool       show the full USA in the alignment
  -adirectory         string     alignment file output directory

Of these, -awidth and -ausashow are likely to be the most useful.
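
When EMBOSS programs are driven from a script, the alignment qualifiers can be assembled as an argument list. The Python sketch below only builds the list and does not run anything. The helper name alignment_qualifiers is hypothetical, and the sketch assumes that giving a boolean qualifier such as -ausashow on its own switches it on; the program's other required arguments are not shown.

```python
def alignment_qualifiers(aformat=None, awidth=None, ausashow=False):
    """Build a list of alignment-output qualifiers for an EMBOSS command line."""
    args = []
    if aformat is not None:
        args += ["-aformat", aformat]
    if awidth is not None:
        args += ["-awidth", str(awidth)]
    if ausashow:
        # Assumption: a bare boolean qualifier turns the option on.
        args.append("-ausashow")
    return args


# Qualifiers that would be appended to a command such as 'water'.
command = ["water"] + alignment_qualifiers(aformat="msf", awidth=60, ausashow=True)
print(" ".join(command))
```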