EMBASSY: CLUSTALOMEGA: eomega

eomega

Wiki

The master copies of EMBOSS documentation are available at http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.

Please help by correcting and extending the Wiki pages.

Function

Multiple sequence alignment (ClustalO wrapper)

Description

eomega is a wrapper to clustalo. It takes a set of unaligned sequences and produces an output alignment.

Clustal-Omega (clustalo) is a general-purpose multiple sequence alignment (MSA) program for proteins. It produces high-quality MSAs and can handle datasets of hundreds of thousands of sequences in reasonable time.

In its current form, Clustal-Omega aligns protein sequences only, not DNA/RNA sequences. DNA/RNA support is envisioned for a future version.

Algorithm

Clustal-Omega uses HMMs for the alignment engine, based on the HHalign package from Johannes Söding [1]. Guide trees are optionally made using mBed [2], which can cluster very large numbers of sequences in O(N*log(N)) time. Multiple alignment then proceeds by aligning progressively larger alignments using HHalign, following the clustering given by the guide tree.
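The progressive step can be illustrated with a toy traversal. The sketch below is not Clustal-Omega code: the nested-tuple guide tree, the function name merge_order, and the tree topology are all assumptions made for illustration. It only shows the order in which sub-alignments would be merged when a guide tree is followed from the leaves to the root.

```python
def merge_order(tree):
    """List the pairwise merges implied by a binary guide tree.

    `tree` is a nested 2-tuple whose leaves are sequence names (a
    hypothetical representation, not a Clustal-Omega file format).
    Each step is (left_members, right_members): the two groups that
    would be aligned to each other at that point.
    """
    steps = []

    def visit(node):
        if isinstance(node, str):      # leaf: a single sequence
            return (node,)
        left, right = node
        left_members = visit(left)     # everything below is merged first
        right_members = visit(right)
        steps.append((left_members, right_members))
        return left_members + right_members

    visit(tree)
    return steps


# Hypothetical topology for five of the globins used on this page.
guide_tree = ((("HBB_HUMAN", "HBB_HORSE"), ("HBA_HUMAN", "HBA_HORSE")), "MYG_PHYCA")

for left, right in merge_order(guide_tree):
    print(len(left), "+", len(right), "sequences:", left, "with", right)
```

Each printed step corresponds to one profile-profile alignment in a real run; the later steps align larger groups, which is the "larger and larger alignments" behaviour described above.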

Usage

Here is a sample session with eomega:

% eomega globins.fasta
Multiple sequence alignment (ClustalO wrapper)
(aligned) output sequence set [globins.aln]:
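For scripted use, the same run can be assembled as an argument list. This is a minimal sketch, assuming eomega is installed and on the PATH; the helper name build_eomega_command and its keyword arguments are hypothetical, and only qualifiers listed on this page (-sequences, -outseq, -auto, -mbed, -maxiterations) are used.

```python
import shlex


def build_eomega_command(sequences, outseq, mbed=False, maxiterations=None):
    """Build an argv list for eomega (hypothetical helper, not part of EMBOSS)."""
    # -auto turns off prompts, so the run does not wait for user input.
    cmd = ["eomega", "-sequences", sequences, "-outseq", outseq, "-auto"]
    if mbed:
        cmd.append("-mbed")            # toggle qualifier, default is N
    if maxiterations is not None:
        cmd += ["-maxiterations", str(maxiterations)]
    return cmd


cmd = build_eomega_command("globins.fasta", "globins.aln", mbed=True, maxiterations=1)
print(shlex.join(cmd))

# To execute it (requires the CLUSTALOMEGA EMBASSY package to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems with file names that contain spaces.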

Command line arguments

Multiple sequence alignment (ClustalO wrapper)
Version: EMBOSS:6.5.0.0

   Standard (Mandatory) qualifiers:
  [-sequences]         seqset     File containing sequences to align
  [-outseq]            seqoutset  [.] Sequence set filename
                                  and optional format (output USA)

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers:
   -indist             infile     Pairwise distance matrix input file (skips
                                  distance computation)
   -inguide            infile     Guide tree input file (skips distance
                                  computation and guide tree clustering step)
   -dealign            toggle     [N] Dealign input sequences
   -mbed               toggle     [N] Fast, Mbed-like clustering for guide
                                  tree calculation
   -intermbed          toggle     [N] Fast, Mbed-like clustering for guide
                                  tree calculation
   -maxiterations      integer    [0] Number of (combined guide tree/HMM)
                                  iterations (Integer from 0 to 2000000000)
   -maxgiterations     integer    [2000000000] Maximum guide tree iterations
                                  (Integer from 0 to 2000000000)
   -maxhiterations     integer    [2000000000] Maximum number of HMM
                                  iterations (Integer from 0 to 2000000000)
   -maxseqs            integer    [2000000000] Maximum number of sequences
                                  (Integer from 2 to 2000000000)
   -maxlenseq          integer    [2000000000] Maximum length of sequence
                                  (Integer from 1 to 2000000000)
   -self               toggle     [N] Set options automatically (might
                                  overwrite some options)
   -outdist            outfile    [*.eomega] Pairwise distance matrix output
                                  file
   -outguide           outfile    [*.eomega] Guide tree output file

   Associated qualifiers:

   "-sequences" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -scircular1         boolean    Sequence is circular
   -sformat1           string     Input sequence format
   -iquery1            string     Input query fields or ID list
   -ioffset1           integer    Input start position offset
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outseq" associated qualifiers
   -osformat2          string     Output seq format
   -osextension2       string     File name extension
   -osname2            string     Base file name
   -osdirectory2       string     Output directory
   -osdbname2          string     Database name to add
   -ossingle2          boolean    Separate file for each entry
   -oufo2              string     UFO features
   -offormat2          string     Features format
   -ofname2            string     Features file name
   -ofdirectory2       string     Output directory

   "-outdist" associated qualifiers
   -odirectory         string     Output directory

   "-outguide" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit

Qualifier	Type	Description	Allowed values	Default
Standard (Mandatory) qualifiers
[-sequences] (Parameter 1)	seqset	File containing sequences to align	Readable set of sequences	Required
[-outseq] (Parameter 2)	seqoutset	Sequence set filename and optional format (output USA)	Writeable sequences	<*>.format
Additional (Optional) qualifiers
(none)
Advanced (Unprompted) qualifiers
-indist	infile	Pairwise distance matrix input file (skips distance computation)	Input file	Required
-inguide	infile	Guide tree input file (skips distance computation and guide tree clustering step)	Input file	Required
-dealign	toggle	Dealign input sequences	Toggle value Yes/No	No
-mbed	toggle	Fast, Mbed-like clustering for guide tree calculation	Toggle value Yes/No	No
-intermbed	toggle	Fast, Mbed-like clustering for guide tree calculation	Toggle value Yes/No	No
-maxiterations	integer	Number of (combined guide tree/HMM) iterations	Integer from 0 to 2000000000	0
-maxgiterations	integer	Maximum guide tree iterations	Integer from 0 to 2000000000	2000000000
-maxhiterations	integer	Maximum number of HMM iterations	Integer from 0 to 2000000000	2000000000
-maxseqs	integer	Maximum number of sequences	Integer from 2 to 2000000000	2000000000
-maxlenseq	integer	Maximum length of sequence	Integer from 1 to 2000000000	2000000000
-self	toggle	Set options automatically (might overwrite some options)	Toggle value Yes/No	No
-outdist	outfile	Pairwise distance matrix output file	Output file	<*>.eomega
-outguide	outfile	Guide tree output file	Output file	<*>.eomega
Associated qualifiers
"-sequences" associated seqset qualifiers
-sbegin1 -sbegin_sequences	integer	Start of each sequence to be used	Any integer value	0
-send1 -send_sequences	integer	End of each sequence to be used	Any integer value	0
-sreverse1 -sreverse_sequences	boolean	Reverse (if DNA)	Boolean value Yes/No	N
-sask1 -sask_sequences	boolean	Ask for begin/end/reverse	Boolean value Yes/No	N
-snucleotide1 -snucleotide_sequences	boolean	Sequence is nucleotide	Boolean value Yes/No	N
-sprotein1 -sprotein_sequences	boolean	Sequence is protein	Boolean value Yes/No	N
-slower1 -slower_sequences	boolean	Make lower case	Boolean value Yes/No	N
-supper1 -supper_sequences	boolean	Make upper case	Boolean value Yes/No	N
-scircular1 -scircular_sequences	boolean	Sequence is circular	Boolean value Yes/No	N
-sformat1 -sformat_sequences	string	Input sequence format	Any string
-iquery1 -iquery_sequences	string	Input query fields or ID list	Any string
-ioffset1 -ioffset_sequences	integer	Input start position offset	Any integer value	0
-sdbname1 -sdbname_sequences	string	Database name	Any string
-sid1 -sid_sequences	string	Entryname	Any string
-ufo1 -ufo_sequences	string	UFO features	Any string
-fformat1 -fformat_sequences	string	Features format	Any string
-fopenfile1 -fopenfile_sequences	string	Features file name	Any string
"-outseq" associated seqoutset qualifiers
-osformat2 -osformat_outseq	string	Output seq format	Any string
-osextension2 -osextension_outseq	string	File name extension	Any string
-osname2 -osname_outseq	string	Base file name	Any string
-osdirectory2 -osdirectory_outseq	string	Output directory	Any string
-osdbname2 -osdbname_outseq	string	Database name to add	Any string
-ossingle2 -ossingle_outseq	boolean	Separate file for each entry	Boolean value Yes/No	N
-oufo2 -oufo_outseq	string	UFO features	Any string
-offormat2 -offormat_outseq	string	Features format	Any string
-ofname2 -ofname_outseq	string	Features file name	Any string
-ofdirectory2 -ofdirectory_outseq	string	Output directory	Any string
"-outdist" associated outfile qualifiers
-odirectory	string	Output directory	Any string
"-outguide" associated outfile qualifiers
-odirectory	string	Output directory	Any string
General qualifiers
-auto	boolean	Turn off prompts	Boolean value Yes/No	N
-stdout	boolean	Write first file to standard output	Boolean value Yes/No	N
-filter	boolean	Read first file from standard input, write first file to standard output	Boolean value Yes/No	N
-options	boolean	Prompt for standard and additional values	Boolean value Yes/No	N
-debug	boolean	Write debug output to program.dbg	Boolean value Yes/No	N
-verbose	boolean	Report some/full command line options	Boolean value Yes/No	Y
-help	boolean	Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose	Boolean value Yes/No	N
-warning	boolean	Report warnings	Boolean value Yes/No	Y
-error	boolean	Report errors	Boolean value Yes/No	Y
-fatal	boolean	Report fatal errors	Boolean value Yes/No	Y
-die	boolean	Report dying program messages	Boolean value Yes/No	Y
-version	boolean	Report version number and exit	Boolean value Yes/No	N
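The integer qualifiers above have documented ranges, which a calling script may want to check before launching a run. The following is a sketch only: the ranges are copied from the table above, while the dictionary name INTEGER_RANGES and the function check_integer_qualifiers are hypothetical helpers, not part of EMBOSS.

```python
# Allowed (minimum, maximum) values, copied from the qualifier table.
INTEGER_RANGES = {
    "maxiterations": (0, 2_000_000_000),
    "maxgiterations": (0, 2_000_000_000),
    "maxhiterations": (0, 2_000_000_000),
    "maxseqs": (2, 2_000_000_000),
    "maxlenseq": (1, 2_000_000_000),
}


def check_integer_qualifiers(values):
    """Return a list of problems found in {qualifier_name: value}."""
    problems = []
    for name, value in values.items():
        if name not in INTEGER_RANGES:
            problems.append(f"-{name}: not an integer qualifier listed for eomega")
            continue
        low, high = INTEGER_RANGES[name]
        if not low <= value <= high:
            problems.append(f"-{name}: {value} is outside {low}..{high}")
    return problems


for problem in check_integer_qualifiers({"maxseqs": 1, "maxiterations": 3}):
    print(problem)
```

An empty list means every supplied value lies inside its documented range.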

Input file format

eomega reads a set of unaligned sequences and optional distance and guide tree files.

Input files for usage example

File: globins.fasta

>HBB_HUMAN Sw:Hbb_Human => HBB_HUMAN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>HBB_HORSE Sw:Hbb_Horse => HBB_HORSE
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKV
KAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLVVVLARHFGK
DFTPELQASYQKVVAGVANALAHKYH
>HBA_HUMAN Sw:Hba_Human => HBA_HUMAN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>HBA_HORSE Sw:Hba_Horse => HBA_HORSE
VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHGK
KVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPA
VHASLDKFLSSVSTVLTSKYR
>MYG_PHYCA Sw:Myg_Phyca => MYG_PHYCA
VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED
LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP
GDFGADAQGAMNKALELFRKDIAAKYKELGYQG
>GLB5_PETMA Sw:Glb5_Petma => GLB5_PETMA
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
ADQLKKSADVRWHAERIINAVNDAVASMDDTEKMSMKLRDLSGKHAKSFQVDPQYFKVLA
AVIADTVAAGDAGFEKLMSMICILLRSAY
>LGB2_LUPLU Sw:Lgb2_Luplu => LGB2_LUPLU
GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSEVPQNNPEL
QAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVADAHFPVVKEAILKTIKE
VVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA
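A file in this layout can be read with a few lines of code. The reader below is a minimal sketch, not the EMBOSS parser: the function name read_fasta is hypothetical, the first word after ">" is taken as the sequence name, and the example records are shortened prefixes of two of the globins above.

```python
def read_fasta(text):
    """Parse FASTA text into {name: sequence}, joining wrapped lines."""
    chunks = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]   # first word of the header line
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


# Shortened, re-wrapped excerpts of the example input (illustration only).
example = """\
>HBB_HUMAN Sw:Hbb_Human => HBB_HUMAN
VHLTPEEKSAVTALWGKVNV
DEVGGEALGR
>HBA_HUMAN Sw:Hba_Human => HBA_HUMAN
VLSPADKTNVKAAWGKVGAH
"""

sequences = read_fasta(example)
# The -maxseqs range starts at 2, so fewer than two sequences cannot be aligned.
print(len(sequences) >= 2, {name: len(seq) for name, seq in sequences.items()})
```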

Output file format

eomega writes alignments in the default Clustal-Omega output format, which is aligned FASTA with gaps shown as "-" (see the example below).

Output files for usage example

File: globins.aln

>HBB_HUMAN Sw:Hbb_Human => HBB_HUMAN
--------VHLTPEEKSAVTALWGKVNV--DEVGGEALGRLLVVYPWTQRFFESFGDLST
PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLK--G---TFATLSELHCDKLHVDPENFRL
LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
>HBB_HORSE Sw:Hbb_Horse => HBB_HORSE
--------VQLSGEEKAAVLALWDKVNE--EEVGGEALGRLLVVYPWTQRFFDSFGDLSN
PGAVMGNPKVKAHGKKVLHSFGEGVHHLDNLK--G---TFAALSELHCDKLHVDPENFRL
LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------
>HBA_HUMAN Sw:Hba_Human => HBA_HUMAN
---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDL---
---SHGSAQVKGHGKKVADALTNAVAHVDDMP--N---ALSALSDLHAHKLRVDPVNFKL
LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
>HBA_HORSE Sw:Hba_Horse => HBA_HORSE
---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDL---
---SHGSAQVKAHGKKVGDALTLAVGHLDDLP--G---ALSNLSDLHAHKLRVDPVNFKL
LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------
>MYG_PHYCA Sw:Myg_Phyca => MYG_PHYCA
---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
EAEMKASEDLKKHGVTVLTALGAILKKKGHHE--A---ELKPLAQSHATKHKIPIKYLEF
ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
>GLB5_PETMA Sw:Glb5_Petma => GLB5_PETMA
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
ADQLKKSADVRWHAERIINAVNDAVASMDDTE--KMSMKLRDLSGKHAKSFQVDPQYFKV
LAAVIADTVAA---------GDAGFEKLMSMICILLRSAY-------
>LGB2_LUPLU Sw:Lgb2_Luplu => LGB2_LUPLU
--------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE
--VPQNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGV-ADAHFPV
VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---
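Two properties of such an alignment are easy to verify: every row has the same length, and removing the gap characters recovers the input sequence. The check below is a sketch under those assumptions; the function name alignment_is_consistent and the toy sequences are hypothetical and are not taken from the globin example.

```python
def alignment_is_consistent(aligned, originals):
    """Check an alignment given as {name: gapped_sequence}.

    Returns True when both mappings hold the same names, all aligned
    rows share one length, and stripping '-' from each row gives back
    the corresponding unaligned sequence.
    """
    if aligned.keys() != originals.keys():
        return False
    if len({len(row) for row in aligned.values()}) > 1:
        return False
    return all(
        aligned[name].replace("-", "") == originals[name]
        for name in originals
    )


# Toy data for illustration only.
originals = {"seq1": "VLSPADK", "seq2": "VLSADK", "seq3": "LSPADKT"}
aligned = {"seq1": "VLSPADK-", "seq2": "VLS-ADK-", "seq3": "-LSPADKT"}

print(alignment_is_consistent(aligned, originals))
```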

Data files

None.

Notes

None.

References

[1] Söding J. (2005). Protein homology detection by HMM-HMM comparison. Bioinformatics, 21(7), 951-960.

[2] Blackshields G, Sievers F, Shi W, Wilm A, Higgins DG. Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithms Mol Biol. 2010 May 14;5:21.

[3] http://www.genetics.wustl.edu/eddy/software/#squid

[4] Wilbur and Lipman, 1983; PMID 6572363

[5] Thompson JD, Higgins DG, Gibson TJ. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673-4680.

[6] Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG. (2007). Clustal W and Clustal X version 2.0. Bioinformatics, 23, 2947-2948.

[7] Kimura M (1980). "A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences". Journal of Molecular Evolution 16: 111–120.

[8] Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792-1797.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.
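Because the exit status is documented as always 0, a calling script cannot rely on it to detect a failed run. One possible workaround, shown here as an assumption rather than an EMBOSS recommendation, is to inspect the output file afterwards. The function name output_looks_written is hypothetical.

```python
def output_looks_written(path):
    """Return True if `path` is readable and holds at least one FASTA header."""
    try:
        with open(path) as handle:
            return any(line.startswith(">") for line in handle)
    except OSError:
        # Missing or unreadable file: treat the run as unsuccessful.
        return False


print(output_looks_written("globins.aln"))
```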

Known bugs

None.

See also

Program name	Description
edialign	Local multiple alignment of sequences
emma	Multiple sequence alignment (ClustalW wrapper)
eomegapp	Profile with profile (ClustalO wrapper)
eomegash	Sequence with HMM (ClustalO wrapper)
eomegasp	Sequence with profile (ClustalO wrapper)
infoalign	Display basic information about a multiple sequence alignment
mse	Multiple sequence editor
plotcon	Plot conservation of a sequence alignment
prettyplot	Draw a sequence alignment with pretty formatting
showalign	Display a multiple sequence alignment in pretty format
tranalign	Generate an alignment of nucleic coding regions from aligned proteins

Author(s)

Alan Bleasby
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

History

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments

None
