helixturnhelix

 

Wiki

The master copies of EMBOSS documentation are available at http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.

Please help by correcting and extending the Wiki pages.

Function

Identify nucleic acid-binding motifs in protein sequences

Description

helixturnhelix uses the method of Dodd and Egan to identify helix-turn-helix nucleic acid binding motifs in an input protein sequence. The output is a standard EMBOSS report file describing the location, size and score of any putative motifs.

Usage

Here is a sample session with helixturnhelix


% helixturnhelix 
Identify nucleic acid-binding motifs in protein sequences
Input protein sequence(s): tsw:laci_ecoli
Output report [laci_ecoli.hth]: 

Go to the input files for this example
Go to the output files for this example

Command line arguments

   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     Protein sequence(s) filename and optional
                                  format, or reference (input USA)
  [-outfile]           report     [*.helixturnhelix] Output report file name

   Additional (Optional) qualifiers:
   -mean               float      [238.71] Mean value (Number from 1.000 to
                                  10000.000)
   -sd                 float      [293.61] Standard Deviation value (Number
                                  from 1.000 to 10000.000)
   -minsd              float      [2.5] Minimum SD (Number from 0.000 to
                                  100.000)
   -eightyseven        boolean    Use the old (1987) weight data

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outfile" associated qualifiers
   -rformat2           string     Report format
   -rname2             string     Base file name
   -rextension2        string     File name extension
   -rdirectory2        string     Output directory
   -raccshow2          boolean    Show accession number in the report
   -rdesshow2          boolean    Show description in the report
   -rscoreshow2        boolean    Show the score in the report
   -rstrandshow2       boolean    Show the nucleotide strand in the report
   -rusashow2          boolean    Show the full USA in the report
   -rmaxall2           integer    Maximum total hits to report
   -rmaxseq2           integer    Maximum hits to report for one sequence

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Standard (Mandatory) qualifiers Allowed values Default
[-sequence]
(Parameter 1)
Protein sequence(s) filename and optional format, or reference (input USA) Readable sequence(s) Required
[-outfile]
(Parameter 2)
Output report file name Report output file <*>.helixturnhelix
Additional (Optional) qualifiers Allowed values Default
-mean Mean value Number from 1.000 to 10000.000 238.71
-sd Standard Deviation value Number from 1.000 to 10000.000 293.61
-minsd Minimum SD Number from 0.000 to 100.000 2.5
-eightyseven Use the old (1987) weight data Boolean value Yes/No No
Advanced (Unprompted) qualifiers Allowed values Default
(none)

Input file format

helixturnhelix reads one or more protein sequence USAs.

Input files for usage example

'tsw:laci_ecoli' is a sequence entry in the example protein database 'tsw'

Database entry: tsw:laci_ecoli

ID   LACI_ECOLI              Reviewed;         360 AA.
AC   P03023; O09196; P71309; Q2MC79; Q47338;
DT   21-JUL-1986, integrated into UniProtKB/Swiss-Prot.
DT   19-JUL-2003, sequence version 3.
DT   16-JUN-2009, entry version 105.
DE   RecName: Full=Lactose operon repressor;
GN   Name=lacI; OrderedLocusNames=b0345, JW0336;
OS   Escherichia coli (strain K12).
OC   Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
OC   Enterobacteriaceae; Escherichia.
OX   NCBI_TaxID=83333;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA].
RX   MEDLINE=78246991; PubMed=355891; DOI=10.1038/274765a0;
RA   Farabaugh P.J.;
RT   "Sequence of the lacI gene.";
RL   Nature 274:765-769(1978).
RN   [2]
RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA].
RA   Chen J., Matthews K.K.S.M.;
RL   Submitted (MAY-1991) to the EMBL/GenBank/DDBJ databases.
RN   [3]
RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA].
RA   Marsh S.;
RL   Submitted (JAN-1997) to the EMBL/GenBank/DDBJ databases.
RN   [4]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=K12 / MG1655 / ATCC 47076;
RA   Chung E., Allen E., Araujo R., Aparicio A.M., Davis K., Duncan M.,
RA   Federspiel N., Hyman R., Kalman S., Komp C., Kurdi O., Lew H., Lin D.,
RA   Namath A., Oefner P., Roberts D., Schramm S., Davis R.W.;
RT   "Sequence of minutes 4-25 of Escherichia coli.";
RL   Submitted (JAN-1997) to the EMBL/GenBank/DDBJ databases.
RN   [5]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=K12 / MG1655 / ATCC 47076;
RX   MEDLINE=97426617; PubMed=9278503; DOI=10.1126/science.277.5331.1453;
RA   Blattner F.R., Plunkett G. III, Bloch C.A., Perna N.T., Burland V.,
RA   Riley M., Collado-Vides J., Glasner J.D., Rode C.K., Mayhew G.F.,
RA   Gregor J., Davis N.W., Kirkpatrick H.A., Goeden M.A., Rose D.J.,
RA   Mau B., Shao Y.;
RT   "The complete genome sequence of Escherichia coli K-12.";
RL   Science 277:1453-1474(1997).
RN   [6]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=K12 / W3110 / ATCC 27325 / DSM 5911;
RX   PubMed=16738553; DOI=10.1038/msb4100049;
RA   Hayashi K., Morooka N., Yamamoto Y., Fujita K., Isono K., Choi S.,
RA   Ohtsubo E., Baba T., Wanner B.L., Mori H., Horiuchi T.;
RT   "Highly accurate genome sequences of Escherichia coli K-12 strains


  [Part of this file has been deleted for brevity]

FT   CHAIN         1    360       Lactose operon repressor.
FT                                /FTId=PRO_0000107963.
FT   DOMAIN        1     58       HTH lacI-type.
FT   DNA_BIND      6     25       H-T-H motif.
FT   VARIANT     282    282       Y -> D (in T41 mutant).
FT   MUTAGEN      17     17       Y->H: Broadening of specificity.
FT   MUTAGEN      22     22       R->N: Recognizes an operator variant.
FT   CONFLICT    286    286       L -> S (in Ref. 1, 4 and 7).
FT   HELIX         6     11
FT   TURN         12     14
FT   HELIX        17     24
FT   HELIX        33     45
FT   HELIX        51     56
FT   STRAND       63     69
FT   HELIX        74     89
FT   STRAND       93     98
FT   STRAND      101    103
FT   HELIX       104    115
FT   TURN        116    118
FT   STRAND      122    126
FT   HELIX       130    139
FT   TURN        140    142
FT   STRAND      145    150
FT   STRAND      154    156
FT   STRAND      158    161
FT   HELIX       163    177
FT   STRAND      181    186
FT   HELIX       192    207
FT   STRAND      213    217
FT   HELIX       222    234
FT   STRAND      240    246
FT   HELIX       247    259
FT   TURN        265    267
FT   STRAND      268    271
FT   HELIX       277    281
FT   STRAND      282    284
FT   STRAND      287    290
FT   HELIX       293    308
FT   STRAND      314    319
FT   STRAND      322    324
FT   STRAND      334    338
FT   HELIX       343    353
FT   HELIX       354    356
SQ   SEQUENCE   360 AA;  38590 MW;  347A8DEE92D736CB CRC64;
     MKPVTLYDVA EYAGVSYQTV SRVVNQASHV SAKTREKVEA AMAELNYIPN RVAQQLAGKQ
     SLLIGVATSS LALHAPSQIV AAIKSRADQL GASVVVSMVE RSGVEACKAA VHNLLAQRVS
     GLIINYPLDD QDAIAVEAAC TNVPALFLDV SDQTPINSII FSHEDGTRLG VEHLVALGHQ
     QIALLAGPLS SVSARLRLAG WHKYLTRNQI QPIAEREGDW SAMSGFQQTM QMLNEGIVPT
     AMLVANDQMA LGAMRAITES GLRVGADISV VGYDDTEDSS CYIPPLTTIK QDFRLLGQTS
     VDRLLQLSQG QAVKGNQLLP VSLVKRKTTL APNTQTASPR ALADSLMQLA RQVSRLESGQ
//

Output file format

The output is a standard EMBOSS report file.

The results can be output in one of several styles by using the command-line qualifier -rformat xxx, where 'xxx' is replaced by the name of the required format. The available format names are: embl, genbank, gff, pir, swiss, trace, listfile, dbmotif, diffseq, excel, feattable, motif, regions, seqtable, simple, srs, table, tagseq

See: http://emboss.sf.net/docs/themes/ReportFormats.html for further information on report formats.

By default helixturnhelix writes a 'motif' report file.

Output files for usage example

File: laci_ecoli.hth

########################################
# Program: helixturnhelix
# Rundate: Tue 15 Jul 2008 12:00:00
# Commandline: helixturnhelix
#    -sequence tsw:laci_ecoli
# Report_format: motif
# Report_file: laci_ecoli.hth
########################################

#=======================================
#
# Sequence: LACI_ECOLI     from: 1   to: 360
# HitCount: 1
#
# Hits above +2.50 SD (972.73)
#
#=======================================

Maximum_score_at at "*"

(1) Score 2160.000 length 22 at residues 4->25
           *
 Sequence: VTLYDVAEYAGVSYQTVSRVVN
           |                    |
           4                    25
 Maximum_score_at: 4
 Standard_deviations: 6.54


#---------------------------------------
#---------------------------------------

#---------------------------------------
# Total_sequences: 1
# Total_length: 360
# Reported_sequences: 1
# Reported_hitcount: 1
#---------------------------------------

Data files

EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA.

To see the available EMBOSS data files, run:

% embossdata -showall

To fetch one of the data files (for example 'Exxx.dat') into your current directory for you to inspect or modify, run:


% embossdata -fetch -file Exxx.dat

Users can provide their own data files in their own directories. Project specific files can be put in the current directory, or for tidier directory listings in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory, or again in a subdirectory called ".embossdata".

The directories are searched in the following order:

The data files are stored in the standard EMBOSS data directory. The names are:

The old (1987) data has a motif length of 20 residues, whilst the default data (Ehth.dat) has a motif length of 22 residues.

With care these can be replaced to suit your data sets. If the files are placed in the following directories they will be used in preference to the files in the EMBOSS distribution data directory:

Here is the default file:

# Amino acid counts for 91 Helix-turn-helix (presumed) protein motifs
# from Dodd IB and Egan JB (1990) Nucl. Acids. Res. 18:5019-5026.
#
Sample: 91 aligned sequences
#
# R  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 Total Exp
# - -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- ----- ---
  A  2  1  3 14 10 12 75  6 15  9  1  1  4  3  8 15  4  4  4 11  0 10   212 995
  C  0  0  1  1  0  0  0  0  0  3  3  1  1  0  0  0  0  0  0  1  0  3    14 106
  D  0  1  0  1 14  0  0 14  1  0  5  0  1  2  0  0  0  0  1  1  0  2    43 556
  E  4  5  0 11 26  0  0 16  9  3  3  0  3 12 13  0  0  2  0  1 13  6   127 669
  F  4  0  4  0  0  4  0  1  0 10  0  0  0  0  1  0  0  1  1  1 22  0    49 358
  G  9  7  1  4  0  0  8  0  0  0 50  0  6  0  7  1  0  3  1  1  0  4   102 761
  H  4  3  1  1  2  0  0  3  2  0  5  0  3  3  0  2  0  2  4  5  0  2    42 225
  I 10  0 13  3  2 15  0  4  9  4  0 17  0  2  0  1 31  1  4  8 16  1   141 583
  K  4  4  6 11 12  1  1 14 11  0  5  2  2  7  2  1  0  5  8  4  5 15   120 516
  L 16  1 17  0  1 35  0  3 12 31  0 22  0  2  1  1 22  1  1 12 20  0   198 954
  M  7  0  2  1  1  1  0  0  5  7  1 10  0  0  2  0  2  0  0  2  0  1    42 275
  N  0  8  0  1  0  0  0  2  1  1 14  0  8  1  4  2  0  4  9  0  0 11    66 383
  P  1  6  0  1  0  0  0  0  0  0  0  0  3 13  7  0  0  0  0  0  0  3    34 403
  Q  2  1 21  9 11  0  0  9  8  0  0  2  1 17  7 12  0  3 12  5  3  9   132 437
  R  9 10 14  9  5  0  1 16 10  0  1  0  1 17  8  7  0 17 28  3  0 16   172 609
  S  2 17  0  8  4  1  6  1  2  2  3  0 37  1 25  5  0 29  3  0  1  5   152 552
  T  6 24  3 12  1  5  0  2  2  4  0  5 20  4  3 39  0  4  1  0  4  3   142 512
  V  7  3  1  1  2 16  0  0  2 12  0 29  0  5  3  3 32  0  7  8  7  0   138 724
  W  2  0  0  0  0  0  0  0  0  1  0  1  0  0  0  0  0  0  2 21  0  0    27 105
  Y  2  0  4  3  0  1  0  0  2  4  0  1  1  2  0  2  0 15  5  7  0  0    49 267

Notes

The helix-turn-helix protein structural motif was originally identified as the DNA-binding domain of phage repressors. One alpha-helix lies in the wide groove of DNA; the other lies at an angle across DNA. The motif is commonly involved in binding DNA. The motif is of fundamental biological importance and is found in most proteins that regulate gene expression. It is formed by of two alpha-helices joined by a short turn.

References

  1. Dodd I.B., Egan J.B. (1987) "Systematic method for the detection of potential lambda cro-like DNA-binding regions in proteins." J. Mol. Biol. 194: 557-564.
  2. Dodd I.B., Egan J.B. (1990) "Improved detection of helix-turn-helix DNA-binding motifs in protein sequences." Nucleic Acids Res. 18: 5019-5026.

Warnings

The program will warn you if the data file is not mathematically accurate.

Diagnostic Error Messages

None.

Exit status

It exits with status 0 unless an error is reported.

Known bugs

None.

See also

Program name Description
antigenic Finds antigenic sites in proteins
digest Reports on protein proteolytic enzyme or reagent cleavage sites
epestfind Finds PEST motifs as potential proteolytic cleavage sites
fuzzpro Search for patterns in protein sequences
fuzztran Search for patterns in protein sequences (translated)
garnier Predicts protein secondary structure using GOR method
hmoment Calculate and plot hydrophobic moment for protein sequence(s)
oddcomp Identify proteins with specified sequence word composition
patmatdb Searches protein sequences with a sequence motif
patmatmotifs Scan a protein sequence with motifs from the PROSITE database
pepcoil Predicts coiled coil regions in protein sequences
pepnet Draw a helical net for a protein sequence
pepwheel Draw a helical wheel diagram for a protein sequence
preg Regular expression search of protein sequence(s)
pscan Scans protein sequence(s) with fingerprints from the PRINTS database
sigcleave Reports on signal cleavage sites in a protein sequence
tmap Predict and plot transmembrane segments in protein sequences

Author(s)

Alan Bleasby (ajb © ebi.ac.uk)
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

Original program "HELIXTURNHELIX" (EGCG 1990) by Peter Rice (pmr © ebi.ac.uk)
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

History

Completed 11th March 1999

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments

None