needleall

 

Wiki

The master copies of EMBOSS documentation are available at http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.

Please help by correcting and extending the Wiki pages.

Function

Many-to-many pairwise alignments of two sequence sets

Description

needleall reads a set of input sequences and compares them all to one or more sequences, writing their optimal global sequence alignments to file. It uses the Needleman-Wunsch alignment algorithm to find the optimum alignment (including gaps) of two sequences along their entire length. The algorithm uses a dynamic programming method to ensure the alignment is optimum, by exploring all possible alignments and choosing the best. A scoring matrix is read that contains values for every possible residue or nucleotide match. Needleall finds the alignment with the maximum possible score where the score of an alignment is equal to the sum of the matches taken from the scoring matrix, minus penalties arising from opening and extending gaps in the aligned sequences. The substitution matrix and gap opening and extension penalties are user-specified.

Algorithm

The Needleman-Wunsch algorithm is a member of the class of algorithms that can calculate the best score and alignment of two sequences on the order of mn steps, where n and m are the sequence lengths. These dynamic programming algorithms were first developed for protein sequence comparison by Needleman and Wunsch, though similar methods were independently devised during the late 1960s and early 1970s for use in the fields of speech processing and computer science.

An important problem is the treatment of gaps, i.e., spaces inserted to optimise the alignment score. A penalty is subtracted from the score for each gap opened (the 'gap open' penalty), and a further penalty is subtracted for each additional position by which that gap is extended (the 'gap extension' penalty). Typically, the cost of extending a gap is set to be 5-10 times lower than the cost for opening a gap.

Penalty for a gap of n positions is calculated using the following formula:

gap opening penalty + (n - 1) * gap extension penalty
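
As a worked illustration of the formula above, here is a minimal Python sketch. The function name gap_penalty is hypothetical and not part of EMBOSS; the default values are simply the -gapopen and -gapextend defaults listed in this document.

```python
def gap_penalty(n, gap_open=10.0, gap_extend=0.5):
    """Penalty for a single gap of n positions (n >= 1).

    Implements: gap opening penalty + (n - 1) * gap extension penalty.
    The name and signature are illustrative only, not an EMBOSS API.
    """
    if n < 1:
        raise ValueError("a gap must span at least one position")
    return gap_open + (n - 1) * gap_extend


if __name__ == "__main__":
    # A one-position gap costs only the opening penalty.
    print(gap_penalty(1))  # 10.0
    # A five-position gap costs 10.0 + 4 * 0.5.
    print(gap_penalty(5))  # 12.0
```

With these defaults, one gap of five positions (12.0) is much cheaper than five separate one-position gaps (50.0), which is why a low extension penalty favours a few long gaps.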

In a Needleman-Wunsch global alignment, the entire length of each sequence is aligned. The sequences might be partially overlapping, or one sequence might be aligned entirely internally to the other. By default (-endweight is off), there is no penalty for the hanging ends of the overlap. In bioinformatics, it is usually reasonable to assume that the sequences are incomplete, so there should be no penalty for failing to align the missing bases.
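
The dynamic programming idea and the open/extend gap scheme described in this section can be sketched in a few lines of Python. This is not the needleall source code. It is a simplified illustration under stated assumptions: a flat match/mismatch score stands in for a real matrix such as EDNAFULL, only the score is returned (no traceback), and end gaps are penalised like any other gap, which differs from the default behaviour described above. The function name and parameters are hypothetical.

```python
def global_alignment_score(a, b, match=5.0, mismatch=-4.0,
                           gap_open=10.0, gap_extend=0.5):
    """Best global alignment score with open/extend gap penalties.

    Three tables are filled in O(len(a) * len(b)) steps:
      M[i][j]: best score with a[i-1] aligned to b[j-1]
      X[i][j]: best score with a[i-1] aligned to a gap
      Y[i][j]: best score with b[j-1] aligned to a gap
    Simplification: end gaps are charged like internal gaps.
    """
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    M = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    X = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    Y = [[neg_inf] * (m + 1) for _ in range(n + 1)]

    M[0][0] = 0.0
    # A leading gap of k positions costs gap_open + (k - 1) * gap_extend.
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + (j - 1) * gap_extend)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Either open a new gap or extend the one already open.
            X[i][j] = max(M[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - gap_open)
            Y[i][j] = max(M[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - gap_open)

    return max(M[n][m], X[n][m], Y[n][m])


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "ACGT"))
    print(global_alignment_score("ACGT", "AGT"))
```

Keeping separate tables for "in a gap" and "not in a gap" is what lets the recurrence charge the opening penalty once per gap and the smaller extension penalty for every further position.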

Usage

Here is a sample session with needleall:


% needleall -minscore 40 -stdout -auto ../data/test1_illumina.fastq 

Illumina_DpnII_Gex_PCR_Primer_2 FC12044_91407_8_200_406_24 45 (41.0)
Illumina_NlaIII_Gex_PCR_Primer_2 FC12044_91407_8_200_406_24 45 (41.0)
Illumina_Small_RNA_PCR_Primer_2 FC12044_91407_8_200_406_24 45 (41.0)
Illumina_DpnII_Gex_Adapters1_1 FC12044_91407_8_200_106_131 35 (40.5)
Illumina_Paired_End_DNA_Adapters1_1 FC12044_91407_8_200_57_85 35 (41.0)
Illumina_DpnII_Gex_Adapters1_1 FC12044_91407_8_200_154_436 31 (42.0)
Illumina_Genomic_DNA_PCR_Primers1_1 FC12044_91407_8_200_83_511 64 (42.0)
Illumina_Paired_End_DNA_PCR_Primers1_1 FC12044_91407_8_200_83_511 64 (42.0)
Illumina_DpnII_Gex_Adapters1_2 FC12044_91407_8_200_303_427 33 (40.5)
Illumina_DpnII_Gex_PCR_Primer_2 FC12044_91407_8_200_303_427 51 (40.5)
Illumina_DpnII_Gex_sequencing_primer FC12044_91407_8_200_303_427 38 (44.5)
Illumina_NlaIII_Gex_Adapters1_2 FC12044_91407_8_200_303_427 36 (40.5)
Illumina_NlaIII_Gex_PCR_Primer_2 FC12044_91407_8_200_303_427 51 (40.5)
Illumina_NlaIII_Gex_sequencing_primer FC12044_91407_8_200_303_427 39 (40.5)
Illumina_Small_RNA_5p_Adapter FC12044_91407_8_200_303_427 33 (40.5)
Illumina_Small_RNA_PCR_Primer_2 FC12044_91407_8_200_303_427 51 (40.5)
Illumina_Small_RNA_sequencing_primer FC12044_91407_8_200_303_427 38 (44.5)
Illumina_Paired_End_DNA_Adapters1_1 FC12044_91407_8_200_553_135 33 (44.5)
Illumina_DpnII_Gex_PCR_Primer_2 FC12044_91407_8_200_139_74 51 (46.0)
Illumina_DpnII_Gex_sequencing_primer FC12044_91407_8_200_139_74 38 (42.0)
Illumina_NlaIII_Gex_PCR_Primer_2 FC12044_91407_8_200_139_74 51 (46.0)
Illumina_Small_RNA_PCR_Primer_2 FC12044_91407_8_200_139_74 51 (46.0)
Illumina_Small_RNA_sequencing_primer FC12044_91407_8_200_139_74 38 (42.0)

#---------------------------------------
#---------------------------------------
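
Output like the listing above is easy to post-process. The following Python sketch parses such lines, assuming each result line holds the two sequence names, an integer (taken here to be the alignment length) and the score in parentheses. That column interpretation and the helper name parse_score_lines are assumptions made for illustration. Lines that do not match, such as the '#---' separators, are skipped.

```python
import re

# Two names, an integer, then a score in parentheses.
SCORE_LINE = re.compile(
    r"^(\S+)\s+(\S+)\s+(\d+)\s+\(([-+]?\d+(?:\.\d+)?)\)\s*$"
)


def parse_score_lines(text):
    """Return a list of (name_a, name_b, length, score) tuples."""
    hits = []
    for line in text.splitlines():
        found = SCORE_LINE.match(line.strip())
        if found:
            name_a, name_b, length, score = found.groups()
            hits.append((name_a, name_b, int(length), float(score)))
    return hits


if __name__ == "__main__":
    sample = """\
Illumina_DpnII_Gex_PCR_Primer_2 FC12044_91407_8_200_406_24 45 (41.0)
Illumina_DpnII_Gex_sequencing_primer FC12044_91407_8_200_303_427 38 (44.5)

#---------------------------------------
"""
    for hit in parse_score_lines(sample):
        print(hit)
```

The resulting tuples can then be sorted or grouped by read name, for example to find the best-scoring adapter for each read.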


Command line arguments

Many-to-many pairwise alignments of two sequence sets
Version: EMBOSS:6.6.0.0

   Standard (Mandatory) qualifiers:
  [-asequence]         seqset     Sequence set filename and optional format,
                                  or reference (input USA)
  [-bsequence]         seqall     Sequence(s) filename and optional format, or
                                  reference (input USA)
   -gapopen            float      [10.0 for any sequence] The gap open penalty
                                  is the score taken away when a gap is
                                  created. The best value depends on the
                                  choice of comparison matrix. The default
                                  value assumes you are using the EBLOSUM62
                                  matrix for protein sequences, and the
                                  EDNAFULL matrix for nucleotide sequences.
                                  (Floating point number from 1.0 to 100.0)
   -gapextend          float      [0.5 for any sequence] The gap extension,
                                  penalty is added to the standard gap penalty
                                  for each base or residue in the gap. This
                                  is how long gaps are penalized. Usually you
                                  will expect a few long gaps rather than many
                                  short gaps, so the gap extension penalty
                                  should be lower than the gap penalty. An
                                  exception is where one or both sequences are
                                  single reads with possible sequencing
                                  errors in which case you would expect many
                                  single base gaps. You can get this result by
                                  setting the gap open penalty to zero (or
                                  very low) and using the gap extension
                                  penalty to control gap scoring. (Floating
                                  point number from 0.0 to 10.0)
  [-outfile]           align      [*.needleall] Output alignment file name
                                  (default -aformat score)

   Additional (Optional) qualifiers:
   -datafile           matrixf    [EBLOSUM62 for protein, EDNAFULL for DNA]
                                  This is the scoring matrix file used when
                                  comparing sequences. By default it is the
                                  file 'EBLOSUM62' (for proteins) or the file
                                  'EDNAFULL' (for nucleic sequences). These
                                  files are found in the 'data' directory of
                                  the EMBOSS installation.
   -endweight          boolean    [N] Apply end gap penalties.
   -endopen            float      [10.0 for any sequence] The end gap open
                                  penalty is the score taken away when an end
                                  gap is created. The best value depends on
                                  the choice of comparison matrix. The default
                                  value assumes you are using the EBLOSUM62
                                  matrix for protein sequences, and the
                                  EDNAFULL matrix for nucleotide sequences.
                                  (Floating point number from 1.0 to 100.0)
   -endextend          float      [0.5 for any sequence] The end gap
                                  extension, penalty is added to the end gap
                                  penalty for each base or residue in the end
                                  gap. (Floating point number from 0.0 to
                                  10.0)
   -minscore           float      [1.0 for any sequence] Minimum alignment
                                  score to report an alignment. (Floating
                                  point number from -10.0 to 100.0)
   -errfile            outfile    [needleall.error] Error file to be written
                                  to

   Advanced (Unprompted) qualifiers:
   -[no]brief          boolean    [Y] Brief identity and similarity

   Associated qualifiers:

   "-asequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -scircular1         boolean    Sequence is circular
   -squick1            boolean    Read id and sequence only
   -sformat1           string     Input sequence format
   -iquery1            string     Input query fields or ID list
   -ioffset1           integer    Input start position offset
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-bsequence" associated qualifiers
   -sbegin2            integer    Start of each sequence to be used
   -send2              integer    End of each sequence to be used
   -sreverse2          boolean    Reverse (if DNA)
   -sask2              boolean    Ask for begin/end/reverse
   -snucleotide2       boolean    Sequence is nucleotide
   -sprotein2          boolean    Sequence is protein
   -slower2            boolean    Make lower case
   -supper2            boolean    Make upper case
   -scircular2         boolean    Sequence is circular
   -squick2            boolean    Read id and sequence only
   -sformat2           string     Input sequence format
   -iquery2            string     Input query fields or ID list
   -ioffset2           integer    Input start position offset
   -sdbname2           string     Database name
   -sid2               string     Entryname
   -ufo2               string     UFO features
   -fformat2           string     Features format
   -fopenfile2         string     Features file name

   "-outfile" associated qualifiers
   -aformat3           string     Alignment format
   -aextension3        string     File name extension
   -adirectory3        string     Output directory
   -aname3             string     Base file name
   -awidth3            integer    Alignment width
   -aaccshow3          boolean    Show accession number in the header
   -adesshow3          boolean    Show description in the header
   -ausashow3          boolean    Show the full USA in the alignment
   -aglobal3           boolean    Show the full sequence in alignment

   "-errfile" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit

Qualifier | Type | Description | Allowed values | Default

Standard (Mandatory) qualifiers
[-asequence] (Parameter 1) | seqset | Sequence set filename and optional format, or reference (input USA) | Readable set of sequences | Required
[-bsequence] (Parameter 2) | seqall | Sequence(s) filename and optional format, or reference (input USA) | Readable sequence(s) | Required
-gapopen | float | The gap open penalty is the score taken away when a gap is created. The best value depends on the choice of comparison matrix. The default value assumes you are using the EBLOSUM62 matrix for protein sequences, and the EDNAFULL matrix for nucleotide sequences. | Floating point number from 1.0 to 100.0 | 10.0 for any sequence
-gapextend | float | The gap extension penalty is added to the standard gap penalty for each base or residue in the gap. This is how long gaps are penalized. Usually you will expect a few long gaps rather than many short gaps, so the gap extension penalty should be lower than the gap penalty. An exception is where one or both sequences are single reads with possible sequencing errors, in which case you would expect many single base gaps. You can get this result by setting the gap open penalty to zero (or very low) and using the gap extension penalty to control gap scoring. | Floating point number from 0.0 to 10.0 | 0.5 for any sequence
[-outfile] (Parameter 3) | align | Output alignment file name | (default -aformat score) | <*>.needleall

Additional (Optional) qualifiers
-datafile | matrixf | This is the scoring matrix file used when comparing sequences. By default it is the file 'EBLOSUM62' (for proteins) or the file 'EDNAFULL' (for nucleic sequences). These files are found in the 'data' directory of the EMBOSS installation. | Comparison matrix file in EMBOSS data path | EBLOSUM62 for protein, EDNAFULL for DNA
-endweight | boolean | Apply end gap penalties. | Boolean value Yes/No | No
-endopen | float | The end gap open penalty is the score taken away when an end gap is created. The best value depends on the choice of comparison matrix. The default value assumes you are using the EBLOSUM62 matrix for protein sequences, and the EDNAFULL matrix for nucleotide sequences. | Floating point number from 1.0 to 100.0 | 10.0 for any sequence
-endextend | float | The end gap extension penalty is added to the end gap penalty for each base or residue in the end gap. | Floating point number from 0.0 to 10.0 | 0.5 for any sequence
-minscore | float | Minimum alignment score to report an alignment. | Floating point number from -10.0 to 100.0 | 1.0 for any sequence
-errfile | outfile | Error file to be written to | Output file | needleall.error

Advanced (Unprompted) qualifiers
-[no]brief | boolean | Brief identity and similarity | Boolean value Yes/No | Yes

Associated qualifiers

"-asequence" associated seqset qualifiers
-sbegin1, -sbegin_asequence | integer | Start of each sequence to be used | Any integer value | 0
-send1, -send_asequence | integer | End of each sequence to be used | Any integer value | 0
-sreverse1, -sreverse_asequence | boolean | Reverse (if DNA) | Boolean value Yes/No | N
-sask1, -sask_asequence | boolean | Ask for begin/end/reverse | Boolean value Yes/No | N
-snucleotide1, -snucleotide_asequence | boolean | Sequence is nucleotide | Boolean value Yes/No | N
-sprotein1, -sprotein_asequence | boolean | Sequence is protein | Boolean value Yes/No | N
-slower1, -slower_asequence | boolean | Make lower case | Boolean value Yes/No | N
-supper1, -supper_asequence | boolean | Make upper case | Boolean value Yes/No | N
-scircular1, -scircular_asequence | boolean | Sequence is circular | Boolean value Yes/No | N
-squick1, -squick_asequence | boolean | Read id and sequence only | Boolean value Yes/No | N
-sformat1, -sformat_asequence | string | Input sequence format | Any string |
-iquery1, -iquery_asequence | string | Input query fields or ID list | Any string |
-ioffset1, -ioffset_asequence | integer | Input start position offset | Any integer value | 0
-sdbname1, -sdbname_asequence | string | Database name | Any string |
-sid1, -sid_asequence | string | Entryname | Any string |
-ufo1, -ufo_asequence | string | UFO features | Any string |
-fformat1, -fformat_asequence | string | Features format | Any string |
-fopenfile1, -fopenfile_asequence | string | Features file name | Any string |

"-bsequence" associated seqall qualifiers
-sbegin2, -sbegin_bsequence | integer | Start of each sequence to be used | Any integer value | 0
-send2, -send_bsequence | integer | End of each sequence to be used | Any integer value | 0
-sreverse2, -sreverse_bsequence | boolean | Reverse (if DNA) | Boolean value Yes/No | N
-sask2, -sask_bsequence | boolean | Ask for begin/end/reverse | Boolean value Yes/No | N
-snucleotide2, -snucleotide_bsequence | boolean | Sequence is nucleotide | Boolean value Yes/No | N
-sprotein2, -sprotein_bsequence | boolean | Sequence is protein | Boolean value Yes/No | N
-slower2, -slower_bsequence | boolean | Make lower case | Boolean value Yes/No | N
-supper2, -supper_bsequence | boolean | Make upper case | Boolean value Yes/No | N
-scircular2, -scircular_bsequence | boolean | Sequence is circular | Boolean value Yes/No | N
-squick2, -squick_bsequence | boolean | Read id and sequence only | Boolean value Yes/No | N
-sformat2, -sformat_bsequence | string | Input sequence format | Any string |
-iquery2, -iquery_bsequence | string | Input query fields or ID list | Any string |
-ioffset2, -ioffset_bsequence | integer | Input start position offset | Any integer value | 0
-sdbname2, -sdbname_bsequence | string | Database name | Any string |
-sid2, -sid_bsequence | string | Entryname | Any string |
-ufo2, -ufo_bsequence | string | UFO features | Any string |
-fformat2, -fformat_bsequence | string | Features format | Any string |
-fopenfile2, -fopenfile_bsequence | string | Features file name | Any string |

"-outfile" associated align qualifiers
-aformat3, -aformat_outfile | string | Alignment format | Any string | score
-aextension3, -aextension_outfile | string | File name extension | Any string |
-adirectory3, -adirectory_outfile | string | Output directory | Any string |
-aname3, -aname_outfile | string | Base file name | Any string |
-awidth3, -awidth_outfile | integer | Alignment width | Any integer value | 0
-aaccshow3, -aaccshow_outfile | boolean | Show accession number in the header | Boolean value Yes/No | N
-adesshow3, -adesshow_outfile | boolean | Show description in the header | Boolean value Yes/No | N
-ausashow3, -ausashow_outfile | boolean | Show the full USA in the alignment | Boolean value Yes/No | N
-aglobal3, -aglobal_outfile | boolean | Show the full sequence in alignment | Boolean value Yes/No | Y

"-errfile" associated outfile qualifiers
-odirectory | string | Output directory | Any string |

General qualifiers
-auto | boolean | Turn off prompts | Boolean value Yes/No | N
-stdout | boolean | Write first file to standard output | Boolean value Yes/No | N
-filter | boolean | Read first file from standard input, write first file to standard output | Boolean value Yes/No | N
-options | boolean | Prompt for standard and additional values | Boolean value Yes/No | N
-debug | boolean | Write debug output to program.dbg | Boolean value Yes/No | N
-verbose | boolean | Report some/full command line options | Boolean value Yes/No | Y
-help | boolean | Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose | Boolean value Yes/No | N
-warning | boolean | Report warnings | Boolean value Yes/No | Y
-error | boolean | Report errors | Boolean value Yes/No | Y
-fatal | boolean | Report fatal errors | Boolean value Yes/No | Y
-die | boolean | Report dying program messages | Boolean value Yes/No | Y
-version | boolean | Report version number and exit | Boolean value Yes/No | N
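
When needleall is driven from a script, it can help to assemble the command line programmatically and check values against the ranges documented above before running anything. The Python sketch below only builds an argument list. The helper name build_needleall_command is hypothetical, and it uses only qualifiers listed in this document (-asequence, -bsequence, -gapopen, -gapextend, -minscore, -outfile, -errfile, -auto).

```python
def build_needleall_command(asequence, bsequence, outfile,
                            gapopen=10.0, gapextend=0.5, minscore=1.0,
                            errfile="needleall.error"):
    """Return an argument list for needleall (illustrative helper only).

    Values are checked against the ranges given in the qualifier table.
    """
    if not 1.0 <= gapopen <= 100.0:
        raise ValueError("-gapopen must be between 1.0 and 100.0")
    if not 0.0 <= gapextend <= 10.0:
        raise ValueError("-gapextend must be between 0.0 and 10.0")
    if not -10.0 <= minscore <= 100.0:
        raise ValueError("-minscore must be between -10.0 and 100.0")
    return [
        "needleall",
        "-asequence", asequence,
        "-bsequence", bsequence,
        "-gapopen", str(gapopen),
        "-gapextend", str(gapextend),
        "-minscore", str(minscore),
        "-outfile", outfile,
        "-errfile", errfile,
        "-auto",  # turn off interactive prompts
    ]


if __name__ == "__main__":
    cmd = build_needleall_command(
        "illumina_adapter_primer.fa",
        "test1_illumina.fastq",
        "adapters.needleall",  # hypothetical output file name
        minscore=40.0,
    )
    print(" ".join(cmd))
```

On a system where needleall is installed and on the PATH, a list built this way could be passed to Python's subprocess.run(cmd, check=True). Passing arguments as a list avoids shell quoting problems with unusual file names.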

Input file format

needleall reads two nucleotide or protein sequence inputs. Each input can contain one or more sequences. All sequences in the first input are aligned to all sequences in the second input.

The input is a standard EMBOSS sequence query (also known as a 'USA').

Major sequence database sources defined as standard in EMBOSS installations include srs:embl, srs:uniprot and ensembl.

Data can also be read from sequence output in any supported format written by an EMBOSS or third-party application.

The input format can be specified by using the command-line qualifier -sformat xxx, where 'xxx' is replaced by the name of the required format. The available format names are: gff (gff3), gff2, embl (em), genbank (gb, refseq), ddbj, refseqp, pir (nbrf), swissprot (swiss, sw), dasgff and debug.

See: http://emboss.sf.net/docs/themes/SequenceFormats.html for further information on sequence formats.

Input files for usage example

File: illumina_adapter_primer.fa

>Illumina_Genomici_DNA_Adapters1_1
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG
>Illumina_Genomic_DNA_Adapters1_2
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Illumina_Genomic_DNA_PCR_Primers1_1
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Illumina_Genomic_DNA_PCR_Primers1_2
CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT
>Illumina_Genomic_DNA_sequencing_primer
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Illumina_Paired_End_DNA_Adapters1_1
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
>Illumina_Paired_End_DNA_Adapters1_2
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Illumina_Paired_End_DNA_PCR_Primers1_1
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Illumina_Paired_End_DNA_PCR_Primers1_2
CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
>Illumina_Paired_End_DNA_sequencing_primer_1
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Illumina_Paired_End_DNA_sequencing_primer_2
CGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
>Illumina_DpnII_Gex_Adapters1_1
GATCGTCGGACTGTAGAACTCTGAAC
>Illumina_DpnII_Gex_Adapters1_2
ACAGGTTCAGAGTTCTACAGTCCGAC
>Illumina_DpnII_Gex_Adapters2_1
CAAGCAGAAGACGGCATACGA
>Illumina_DpnII_Gex_Adapters2_2
TCGTATGCCGTCTTCTGCTTG
>Illumina_DpnII_Gex_PCR_Primer_1
CAAGCAGAAGACGGCATACGA
>Illumina_DpnII_Gex_PCR_Primer_2
AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA
>Illumina_DpnII_Gex_sequencing_primer
CGACAGGTTCAGAGTTCTACAGTCCGACGATC
>Illumina_NlaIII_Gex_Adapters1_1
TCGGACTGTAGAACTCTGAAC
>Illumina_NlaIII_Gex_Adapters1_2
ACAGGTTCAGAGTTCTACAGTCCGACATG
>Illumina_NlaIII_Gex_Adapters2_1
CAAGCAGAAGACGGCATACGANN
>Illumina_NlaIII_Gex_Adapters2_2
TCGTATGCCGTCTTCTGCTTG
>Illumina_NlaIII_Gex_PCR_Primer_1
CAAGCAGAAGACGGCATACGA
>Illumina_NlaIII_Gex_PCR_Primer_2
AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA
>Illumina_NlaIII_Gex_sequencing_primer
CCGACAGGTTCAGAGTTCTACAGTCCGACATG
>Illumina_Small_RNA_RT_Primer
CAAGCAGAAGACGGCATACGA
>Illumina_Small_RNA_5p_Adapter
GTTCAGAGTTCTACAGTCCGACGATC
>Illumina_Small_RNA_3p_Adapter
TCGTATGCCGTCTTCTGCTTGT
>Illumina_Small_RNA_PCR_Primer_1
CAAGCAGAAGACGGCATACGA
>Illumina_Small_RNA_PCR_Primer_2
AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA
>Illumina_Small_RNA_sequencing_primer
CGACAGGTTCAGAGTTCTACAGTCCGACGATC

File: test1_illumina.fastq

@FC12044_91407_8_200_406_24
GTTAGCTCCCACCTTAAGATGTTTA
+FC12044_91407_8_200_406_24
SXXTXXXXXXXXXTTSUXSSXKTMQ
@FC12044_91407_8_200_720_610
CTCTGTGGCACCCCATCCCTCACTT
+FC12044_91407_8_200_720_610
OXXXXXXXXXXXXXXXXXTSXQTXU
@FC12044_91407_8_200_345_133
GATTTTTTAACAATAAACGTACATA
+FC12044_91407_8_200_345_133
OQTOOSFORTFFFIIOFFFFFFFFF
@FC12044_91407_8_200_106_131
GTTGCCCAGGCTCGTCTTGAACTCC
+FC12044_91407_8_200_106_131
XXXXXXXXXXXXXXSXXXXISTXQS
@FC12044_91407_8_200_916_471
TGATTGAAGGTAGGGTAGCATACTG
+FC12044_91407_8_200_916_471
XXXXXXXXXXXXXXXUXXUSXXTXW
@FC12044_91407_8_200_57_85
GCTCCAATAGCGCAGAGGAAACCTG
+FC12044_91407_8_200_57_85
XFXMXSXXSXXXOSQROOSROFQIQ
@FC12044_91407_8_200_10_437
GCTGCTTGGGAGGCTGAGGCAGGAG
+FC12044_91407_8_200_10_437
USXSXXXXXXUXXXSXQXXUQXXKS
@FC12044_91407_8_200_154_436
AGACCTTTGGATACAATGAACGACT
+FC12044_91407_8_200_154_436
MKKMQTSRXMSQTOMRFOOIFFFFF
@FC12044_91407_8_200_336_64
AGGGAATTTTAGAGGAGGGCTGCCG
+FC12044_91407_8_200_336_64
STQMOSXSXSQXQXXKXXXKFXFFK
@FC12044_91407_8_200_620_233
TCTCCATGTTGGTCAGGCTGGTCTC
+FC12044_91407_8_200_620_233
XXXXXXXXXXXXXXXXXXXXXSXSW
@FC12044_91407_8_200_902_349
TGAACGTCGAGACGCAAGGCCCGCC
+FC12044_91407_8_200_902_349
XMXSSXMXXSXQSXTSQXFKSKTOF
@FC12044_91407_8_200_40_618
CTGTCCCCACGGCGGGGGGGCCTGG
+FC12044_91407_8_200_40_618
TXXXXSXXXXXXXXXXXXXRKFOXS
@FC12044_91407_8_200_83_511
GATGTACTCTTACACCCAGACTTTG
+FC12044_91407_8_200_83_511
SOXXXXXUXXXXXXQKQKKROOQSU
@FC12044_91407_8_200_76_246
TCAAGGGTGGATCTTGGCTCCCAGT
+FC12044_91407_8_200_76_246
XTXTUXXXXXRXXXTXXSUXSRFXQ
@FC12044_91407_8_200_303_427
TTGCGACAGAGTTTTGCTCTTGTCC
+FC12044_91407_8_200_303_427
XXQROXXXXIXFQXXXOIQSSXUFF
@FC12044_91407_8_200_31_299
TCTGCTCCAGCTCCAAGACGCCGCC
+FC12044_91407_8_200_31_299
XRXTSXXXRXXSXQQOXQTSQSXKQ
@FC12044_91407_8_200_553_135
TACGGAGCCGCGGGCGGGAAAGGCG
+FC12044_91407_8_200_553_135
XSQQXXXXXXXXXXSXXMFFQXTKU
@FC12044_91407_8_200_139_74
CCTCCCAGGTTCAAGCGATTATCCT
+FC12044_91407_8_200_139_74
RMXUSXTXXQXXQUXXXSQISISSO
@FC12044_91407_8_200_108_33
GTCATGGCGGCCCGCGCGGGGAGCG
+FC12044_91407_8_200_108_33
OOOSSXXSXXOMKMOFMKFOKFFFF
@FC12044_91407_8_200_980_965
ACAGTGGGTTCTTAAAGAAGAGTCG
+FC12044_91407_8_200_980_965
TOSSRXXXSSMSXMOMXIRXOXFFS
@FC12044_91407_8_200_981_857
AACGAGGGGCGCGACTTGACCTTGG
+FC12044_91407_8_200_981_857
RXMSSXXXXSXQXQXFSXQFQKMXS
@FC12044_91407_8_200_8_865
TTTCCCACCCCAGGAAGCCTTGGAC
+FC12044_91407_8_200_8_865
XXXFKOROMKOORMIMRIIKKORFF
@FC12044_91407_8_200_292_484
TCAGCCTCCGTGCCCAGCCCACTCC
+FC12044_91407_8_200_292_484
XQXOSXXXXXUXXXXIXXXXQTOXF
@FC12044_91407_8_200_675_16
CTCGGGAGGCTGAGGCAGGGGGGTT
+FC12044_91407_8_200_675_16
OXTXXXSXXQXXOXXKMXXMXOKQF
@FC12044_91407_8_200_285_136
CCAAATCTTGAATTGTAGCTCCCCT
+FC12044_91407_8_200_285_136
OSXOQXXXXXSXXUXXTXXXXTRMS

Output file format

The output is a standard EMBOSS alignment file.

The results can be output in one of several styles by using the command-line qualifier -aformat xxx, where 'xxx' is replaced by the name of the required format. Some of the alignment formats can cope with an unlimited number of sequences, while others are only for pairs of sequences.

The available multiple alignment format names are: multiple, simple, fasta, msf, clustal, mega, meganon, nexus, nexusnon, phylip, phylipnon, selex, treecon, tcoffee, debug, srs.

The available pairwise alignment format names are: pair, markx0, markx1, markx2, markx3, markx10, match, sam, bam, score, srspair.

See: http://emboss.sf.net/docs/themes/AlignFormats.html for further information on alignment formats.

Output files for usage example

File: needleall.error

Alignment score (21.5) is less than minimum score(40.0) for sequences Illumina_Genomici_DNA_Adapters1_1 vs FC12044_91407_8_200_406_24
Alignment score (24.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_Adapters1_2 vs FC12044_91407_8_200_406_24
Alignment score (31.0) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_PCR_Primers1_1 vs FC12044_91407_8_200_406_24
Alignment score (25.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_PCR_Primers1_2 vs FC12044_91407_8_200_406_24
Alignment score (24.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_sequencing_primer vs FC12044_91407_8_200_406_24
Alignment score (16.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_Adapters1_1 vs FC12044_91407_8_200_406_24
Alignment score (24.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_Adapters1_2 vs FC12044_91407_8_200_406_24
Alignment score (31.0) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_PCR_Primers1_1 vs FC12044_91407_8_200_406_24
Alignment score (21.0) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_PCR_Primers1_2 vs FC12044_91407_8_200_406_24
Alignment score (24.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_sequencing_primer_1 vs FC12044_91407_8_200_406_24
Alignment score (21.0) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_sequencing_primer_2 vs FC12044_91407_8_200_406_24
Alignment score (14.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters1_1 vs FC12044_91407_8_200_406_24
Alignment score (24.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters1_2 vs FC12044_91407_8_200_406_24
Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters2_1 vs FC12044_91407_8_200_406_24
Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters2_2 vs FC12044_91407_8_200_406_24
Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_PCR_Primer_1 vs FC12044_91407_8_200_406_24
Alignment score (23.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_sequencing_primer vs FC12044_91407_8_200_406_24
Alignment score (12.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters1_1 vs FC12044_91407_8_200_406_24
Alignment score (27.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters1_2 vs FC12044_91407_8_200_406_24
Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters2_1 vs FC12044_91407_8_200_406_24
Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters2_2 vs FC12044_91407_8_200_406_24
Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_PCR_Primer_1 vs FC12044_91407_8_200_406_24
Alignment score (27.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_sequencing_primer vs FC12044_91407_8_200_406_24
Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_RT_Primer vs FC12044_91407_8_200_406_24
Alignment score (23.5) is less than minimum score(40.0) for sequences Illumina_Small_RNA_5p_Adapter vs FC12044_91407_8_200_406_24
Alignment score (13.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_3p_Adapter vs FC12044_91407_8_200_406_24
Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_PCR_Primer_1 vs FC12044_91407_8_200_406_24
Alignment score (23.5) is less than minimum score(40.0) for sequences Illumina_Small_RNA_sequencing_primer vs FC12044_91407_8_200_406_24
Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_Genomici_DNA_Adapters1_1 vs FC12044_91407_8_200_720_610
Alignment score (31.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_Adapters1_2 vs FC12044_91407_8_200_720_610
Alignment score (31.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_PCR_Primers1_1 vs FC12044_91407_8_200_720_610
Alignment score (20.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_PCR_Primers1_2 vs FC12044_91407_8_200_720_610
Alignment score (31.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_sequencing_primer vs FC12044_91407_8_200_720_610
Alignment score (0.0) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_Adapters1_1 vs FC12044_91407_8_200_720_610
Alignment score (31.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_Adapters1_2 vs FC12044_91407_8_200_720_610
Alignment score (31.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_PCR_Primers1_1 vs FC12044_91407_8_200_720_610
Alignment score (33.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_PCR_Primers1_2 vs FC12044_91407_8_200_720_610
Alignment score (31.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_sequencing_primer_1 vs FC12044_91407_8_200_720_610
Alignment score (33.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_sequencing_primer_2 vs FC12044_91407_8_200_720_610
Alignment score (20.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters1_1 vs FC12044_91407_8_200_720_610
Alignment score (9.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters1_2 vs FC12044_91407_8_200_720_610
Alignment score (11.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters2_1 vs FC12044_91407_8_200_720_610
Alignment score (15.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters2_2 vs FC12044_91407_8_200_720_610
Alignment score (11.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_PCR_Primer_1 vs FC12044_91407_8_200_720_610
Alignment score (10.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_PCR_Primer_2 vs FC12044_91407_8_200_720_610
Alignment score (15.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_sequencing_primer vs FC12044_91407_8_200_720_610
Alignment score (20.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters1_1 vs FC12044_91407_8_200_720_610
Alignment score (9.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters1_2 vs FC12044_91407_8_200_720_610
Alignment score (7.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters2_1 vs FC12044_91407_8_200_720_610
Alignment score (15.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters2_2 vs FC12044_91407_8_200_720_610


  [Part of this file has been deleted for brevity]

Alignment score (13.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters1_1 vs FC12044_91407_8_200_675_16
Alignment score (17.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters1_2 vs FC12044_91407_8_200_675_16
Alignment score (13.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters2_1 vs FC12044_91407_8_200_675_16
Alignment score (11.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters2_2 vs FC12044_91407_8_200_675_16
Alignment score (13.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_PCR_Primer_1 vs FC12044_91407_8_200_675_16
Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_PCR_Primer_2 vs FC12044_91407_8_200_675_16
Alignment score (22.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_sequencing_primer vs FC12044_91407_8_200_675_16
Alignment score (13.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters1_1 vs FC12044_91407_8_200_675_16
Alignment score (17.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters1_2 vs FC12044_91407_8_200_675_16
Alignment score (13.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters2_1 vs FC12044_91407_8_200_675_16
Alignment score (11.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters2_2 vs FC12044_91407_8_200_675_16
Alignment score (13.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_PCR_Primer_1 vs FC12044_91407_8_200_675_16
Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_PCR_Primer_2 vs FC12044_91407_8_200_675_16
Alignment score (21.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_sequencing_primer vs FC12044_91407_8_200_675_16
Alignment score (13.5) is less than minimum score(40.0) for sequences Illumina_Small_RNA_RT_Primer vs FC12044_91407_8_200_675_16
Alignment score (15.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_5p_Adapter vs FC12044_91407_8_200_675_16
Alignment score (7.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_3p_Adapter vs FC12044_91407_8_200_675_16
Alignment score (13.5) is less than minimum score(40.0) for sequences Illumina_Small_RNA_PCR_Primer_1 vs FC12044_91407_8_200_675_16
Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_Small_RNA_PCR_Primer_2 vs FC12044_91407_8_200_675_16
Alignment score (22.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_sequencing_primer vs FC12044_91407_8_200_675_16
Alignment score (21.0) is less than minimum score(40.0) for sequences Illumina_Genomici_DNA_Adapters1_1 vs FC12044_91407_8_200_285_136
Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_Adapters1_2 vs FC12044_91407_8_200_285_136
Alignment score (30.0) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_PCR_Primers1_1 vs FC12044_91407_8_200_285_136
Alignment score (16.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_PCR_Primers1_2 vs FC12044_91407_8_200_285_136
Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_sequencing_primer vs FC12044_91407_8_200_285_136
Alignment score (7.0) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_Adapters1_1 vs FC12044_91407_8_200_285_136
Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_Adapters1_2 vs FC12044_91407_8_200_285_136
Alignment score (30.0) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_PCR_Primers1_1 vs FC12044_91407_8_200_285_136
Alignment score (21.0) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_PCR_Primers1_2 vs FC12044_91407_8_200_285_136
Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_sequencing_primer_1 vs FC12044_91407_8_200_285_136
Alignment score (18.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_sequencing_primer_2 vs FC12044_91407_8_200_285_136
Alignment score (27.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters1_1 vs FC12044_91407_8_200_285_136
Alignment score (13.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters1_2 vs FC12044_91407_8_200_285_136
Alignment score (6.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters2_1 vs FC12044_91407_8_200_285_136
Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters2_2 vs FC12044_91407_8_200_285_136
Alignment score (6.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_PCR_Primer_1 vs FC12044_91407_8_200_285_136
Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_PCR_Primer_2 vs FC12044_91407_8_200_285_136
Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_sequencing_primer vs FC12044_91407_8_200_285_136
Alignment score (26.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters1_1 vs FC12044_91407_8_200_285_136
Alignment score (14.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters1_2 vs FC12044_91407_8_200_285_136
Alignment score (2.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters2_1 vs FC12044_91407_8_200_285_136
Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters2_2 vs FC12044_91407_8_200_285_136
Alignment score (6.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_PCR_Primer_1 vs FC12044_91407_8_200_285_136
Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_PCR_Primer_2 vs FC12044_91407_8_200_285_136
Alignment score (15.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_sequencing_primer vs FC12044_91407_8_200_285_136
Alignment score (6.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_RT_Primer vs FC12044_91407_8_200_285_136
Alignment score (15.5) is less than minimum score(40.0) for sequences Illumina_Small_RNA_5p_Adapter vs FC12044_91407_8_200_285_136
Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_Small_RNA_3p_Adapter vs FC12044_91407_8_200_285_136
Alignment score (6.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_PCR_Primer_1 vs FC12044_91407_8_200_285_136
Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_PCR_Primer_2 vs FC12044_91407_8_200_285_136
Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_Small_RNA_sequencing_primer vs FC12044_91407_8_200_285_136
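
Each line of the error file above follows the same fixed pattern, so it can be post-processed by a script. The following is a minimal sketch in Python, assuming the line format is exactly as shown in the sample; the function name parse_error_line and the field names seq_a and seq_b are hypothetical and not part of EMBOSS.

```python
import re

# Pattern inferred from the sample lines above. Note there is no space
# between "score" and "(" in the "minimum score(40.0)" part.
LINE_RE = re.compile(
    r"^Alignment score \((?P<score>-?\d+(?:\.\d+)?)\) is less than "
    r"minimum score\((?P<minscore>-?\d+(?:\.\d+)?)\) for sequences "
    r"(?P<seq_a>\S+) vs (?P<seq_b>\S+)$"
)


def parse_error_line(line):
    """Return (score, minscore, seq_a, seq_b), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return (
        float(match.group("score")),
        float(match.group("minscore")),
        match.group("seq_a"),
        match.group("seq_b"),
    )


sample = (
    "Alignment score (33.5) is less than minimum score(40.0) for sequences "
    "Illumina_Paired_End_DNA_PCR_Primers1_2 vs FC12044_91407_8_200_720_610"
)
print(parse_error_line(sample))
```

Lines that do not match the pattern (blank lines, for example) yield None, so a caller can skip them.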

The Identity: is the percentage of identical matches between the two sequences over the reported aligned region (including any gaps in the length).

The Similarity: is the percentage of similar matches (identical residues plus substitutions that score positively in the scoring matrix) between the two sequences over the reported aligned region (including any gaps in the length).
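
As an illustration of how the length of the aligned region, gaps included, enters the percentage, here is a minimal Python sketch of the identity calculation. It assumes the two aligned sequences are given as equal-length strings with '-' as the gap character and compares residues case-insensitively; the function name percent_identity is hypothetical, and this is not the EMBOSS source code.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical columns over the full alignment length (gaps included)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    # Count columns where both sequences carry the same residue (ignoring case);
    # a column containing a gap never counts as identical.
    identical = sum(
        1
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if a == b and a != gap
    )
    # The denominator is the whole aligned length, so gap columns lower the percentage.
    return 100.0 * identical / len(aligned_a)


# 7 identical columns out of 9 aligned columns (one gap, one mismatch).
print(round(percent_identity("ACGT-ACGT", "ACGTTACCT"), 1))  # 77.8
```

A similarity percentage would use the same denominator and differ only in which columns are counted.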

Data files

For protein sequences, EBLOSUM62 is used for the substitution matrix. For nucleotide sequences, EDNAFULL is used. Other matrices can be specified.

EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA.

To see the available EMBOSS data files, run:

% embossdata -showall

To fetch one of the data files (for example 'Exxx.dat') into your current directory for you to inspect or modify, run:


% embossdata -fetch -file Exxx.dat

Users can provide their own data files in their own directories. Project specific files can be put in the current directory, or for tidier directory listings in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory, or again in a subdirectory called ".embossdata".

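The directories are searched in the following order:

  * . (your current directory)
  * .embossdata (under your current directory)
  * ~/ (your home directory)
  * ~/.embossdata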
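
A minimal Python sketch of such a lookup follows. It assumes the user directories are tried in the order in which the paragraph above introduces them (current directory, its ".embossdata" subdirectory, home directory, its ".embossdata" subdirectory) and that the first match wins. The function names are hypothetical, and the fallback to the standard EMBOSS data directory is not modelled.

```python
from pathlib import Path


def candidate_dirs(cwd, home):
    """User directories to try, in the assumed search order."""
    cwd, home = Path(cwd), Path(home)
    return [cwd, cwd / ".embossdata", home, home / ".embossdata"]


def find_data_file(name, cwd, home):
    """Return the first existing copy of the named data file, or None."""
    for directory in candidate_dirs(cwd, home):
        path = directory / name
        if path.is_file():
            return path
    return None


# 'Exxx.dat' is the placeholder file name used in the text above.
print(find_data_file("Exxx.dat", Path.cwd(), Path.home()))
```

Under this assumption, a project-specific copy in the current directory takes precedence over a copy kept in the home directory.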

Notes

needleall is a true implementation of the Needleman-Wunsch algorithm and so produces a full path matrix. It therefore cannot be used with genome sized sequences unless you've a lot of memory and a lot of time.
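
To show why a full path matrix grows with the product of the sequence lengths, here is a textbook Needleman-Wunsch score matrix in Python. It is a simplified sketch, not the needleall implementation: it assumes fixed match and mismatch scores instead of a substitution matrix and a single linear gap penalty instead of separate gap opening and extension penalties. All names and default values are illustrative.

```python
def nw_score_matrix(seq_a, seq_b, match=1.0, mismatch=-1.0, gap=2.0):
    """Fill the full (len(seq_a)+1) x (len(seq_b)+1) global alignment score matrix."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    matrix = [[0.0] * cols for _ in range(rows)]
    # First column and first row: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        matrix[i][0] = -gap * i
    for j in range(1, cols):
        matrix[0][j] = -gap * j
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            matrix[i][j] = max(
                matrix[i - 1][j - 1] + sub,  # align the two residues
                matrix[i - 1][j] - gap,      # gap in seq_b
                matrix[i][j - 1] - gap,      # gap in seq_a
            )
    return matrix


m = nw_score_matrix("ACGT", "AGT")
print(len(m), len(m[0]))  # 5 4
print(m[-1][-1])          # 1.0 (three matches minus one gap)
```

Every cell is stored, so both the memory and the number of steps scale with the product of the two lengths.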

References

  1. Needleman, S. B. and Wunsch, C. D. (1970) J. Mol. Biol. 48, 443-453.
  2. Kruskal, J. B. (1983) An overview of sequence comparison. In D. Sankoff and J. B. Kruskal (eds.), Time warps, string edits and macromolecules: the theory and practice of sequence comparison, pp. 1-44. Addison Wesley.

Warnings

needleall is for aligning pairs of sequences over their entire length. This works best with closely related sequences. If you use needleall to align very distantly related sequences, it will produce a result, but much of the alignment may have little or no biological significance.

A true Needleman-Wunsch implementation like needleall needs memory proportional to the product of the sequence lengths. For two sequences of length 10,000,000 and 1,000 it therefore needs memory proportional to 10,000,000,000 matrix cells. Two arrays of this size are produced, one of ints and one of floats, so multiply that figure by 8 to get the memory usage in bytes (about 80 GB in this example). That doesn't include other overheads. Therefore only use needleall, like water and needle, for accurate alignment of reasonably short sequences.
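
The arithmetic above can be checked with a short Python sketch. It assumes 4 bytes per int and 4 bytes per float, which is what the factor of 8 implies, and it ignores the other overheads; the function name path_matrix_bytes is hypothetical.

```python
def path_matrix_bytes(len_a, len_b, bytes_per_cell=8):
    """Rough memory for the two full-size arrays (one int and one float per cell)."""
    return len_a * len_b * bytes_per_cell


estimate = path_matrix_bytes(10_000_000, 1_000)
print(estimate)        # 80000000000
print(estimate / 1e9)  # 80.0 (gigabytes)
```

By the same estimate, two sequences of 10,000 residues each need about 0.8 GB.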

The first input sequence set is loaded completely into memory. When comparing large numbers (or lengths) of sequences, the smallest set should be the first input to make the most efficient use of memory.

If you run out of memory, try using stretcher instead.

Diagnostic Error Messages

Uncaught exception
 Assertion failed
 raised at ajmem.c:xxx

This probably means you have run out of memory. Try using stretcher if this happens.

Exit status

0 upon successful completion.

Known bugs

None.

See also

Program name   Description
est2genome     Align EST sequences to genomic DNA sequence
needle         Needleman-Wunsch global alignment of two sequences
stretcher      Needleman-Wunsch rapid global alignment of two sequences

When you want an alignment that covers the whole length of two sequences, use needle.

When you are trying to find the best region of similarity between two sequences, use water.

stretcher is a more suitable program to use to find global alignments of very long sequences.

Author(s)

Mahmut Uludag
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

Please report all bugs to the EMBOSS bug team (emboss-bug © emboss.open-bio.org) not to the original author.

History

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments

None