SEQWORDS documentation


 

CONTENTS

1.0 SUMMARY
2.0 INPUTS & OUTPUTS
3.0 INPUT FILE FORMAT
4.0 OUTPUT FILE FORMAT
5.0 DATA FILES
6.0 USAGE
7.0 KNOWN BUGS & WARNINGS
8.0 NOTES
9.0 DESCRIPTION
10.0 ALGORITHM
11.0 RELATED APPLICATIONS
12.0 DIAGNOSTIC ERROR MESSAGES
13.0 AUTHORS
14.0 REFERENCES



1.0 SUMMARY

Generates DHF files (domain hits files) of database hits (sequences) from a keyword search of UniProt (Swissprot), using keywords taken from a keywords file.


2.0 INPUTS & OUTPUTS

SEQWORDS searches a swissprot-format sequence database with keywords taken from a keywords file, and writes a DHF file (domain hits file) of sequences whose swissprot entries contain at least one of the keywords. If an entry contains a keyword in a domain record of the feature table, the sequence of that domain is written to the output file; otherwise the entire sequence is written. The user specifies the names of the swissprot-format sequence database (input), the keywords file (input) and the DHF file (output).
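
The selection rule above (domain sequence if the keyword is in a domain record, whole sequence otherwise) can be sketched in Python. This is only an illustration, not the SEQWORDS source: the function name select_hits is hypothetical, and the sketch assumes case-insensitive substring matching and an old-style feature line of the form "FT   DOMAIN   start   end   description". The real program may match differently.

```python
def select_hits(entry_lines, sequence, keywords):
    """Return (start, end, subsequence) tuples for one entry, or [] if nothing matches."""
    lowered = [kw.lower() for kw in keywords]
    domain_hits = []
    matched = False
    for line in entry_lines:
        # Assumption: a keyword matches when it occurs anywhere in the line,
        # ignoring case.
        if not any(kw in line.lower() for kw in lowered):
            continue
        matched = True
        fields = line.split()
        # Assumed feature-table layout: FT  DOMAIN  <start>  <end>  <description>
        if (len(fields) >= 4 and fields[0] == "FT" and fields[1] == "DOMAIN"
                and fields[2].isdigit() and fields[3].isdigit()):
            start, end = int(fields[2]), int(fields[3])
            domain_hits.append((start, end, sequence[start - 1:end]))
    if domain_hits:
        return domain_hits  # keyword in a domain record: domain sequence only
    if matched:
        return [(1, len(sequence), sequence)]  # keyword elsewhere: whole sequence
    return []


# Made-up entry, for illustration only.
entry = [
    "DE   MALATE DEHYDROGENASE.",
    "FT   DOMAIN        1     10       Lactate dehydrogenase-like.",
]
seq = "MKVAVLGAAGGIGQALALLL"
print(select_hits(entry, seq, ["Lactate dehydrogenase"]))
print(select_hits(entry, seq, ["Malate dehydrogenase"]))
```

In this sketch a match in a domain record takes precedence over matches elsewhere in the entry, which is one reading of the rule stated above.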


3.0 INPUT FILE FORMAT

The keywords file (below) contains lists of keywords specific to a number of SCOP or CATH nodes, e.g. families and superfamilies. Each list of keywords is given after a block of SCOP or CATH classification records. For family-specific search terms, the block must contain CL, FO, SF and FA records (see below); for superfamily-specific terms, only the CL, FO and SF records should be specified. Each keyword must be given on its own line after the record code 'TE'. Each block of classification records and search terms must be terminated by the record '//' (the file should also end with this record).
It is possible to provide search terms above the level of superfamily, for example fold- and class-specific search terms for SCOP, by using only the CL and FO records as appropriate. However, text searches of swissprot for members of SCOP folds and classes are unlikely to produce specific or meaningful results.
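
As a rough illustration of the layout described above, the following Python sketch reads keywords-file text into one dictionary per block. It is not part of SEQWORDS: the function name parse_keywords_file and the returned structure are assumptions made for this example, and it assumes each record is a two-letter code followed by spaces and a value.

```python
def parse_keywords_file(text):
    """Split keywords-file text into blocks of classification records and terms."""
    blocks = []
    block = {"records": {}, "terms": []}
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line or line.startswith("XX"):
            continue  # blank lines and 'XX' spacer records carry no data
        if line.startswith("//"):
            blocks.append(block)  # '//' closes the current block
            block = {"records": {}, "terms": []}
            continue
        code, _, value = line.partition(" ")
        if code == "TE":
            block["terms"].append(value.strip())  # one keyword per TE line
        else:
            block["records"][code] = value.strip()  # TY, CL, FO, SF, FA
    return blocks


example = """TY   SCOP
XX
CL   Alpha and beta proteins (a/b)
XX
FO   NAD(P)-binding Rossmann-fold domains
XX
SF   NAD(P)-binding Rossmann-fold domains
XX
FA   Lactate & malate dehydrogenases, N-terminal domain
XX
TE   NAD(P)-binding Rossmann-fold
TE   Lactate & malate dehydrogenases
TE   Lactate dehydrogenase
TE   Malate dehydrogenase
//
"""
blocks = parse_keywords_file(example)
print(blocks[0]["records"]["FA"])
print(blocks[0]["terms"])
```

A block without an FA record would then correspond to superfamily-specific terms, and one with only CL and FO records to fold-specific terms.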

Input files for usage example

File: seqwords.terms

TY   SCOP
XX
CL   Alpha and beta proteins (a/b)
XX
FO   NAD(P)-binding Rossmann-fold domains
XX
SF   NAD(P)-binding Rossmann-fold domains
XX
FA   Lactate & malate dehydrogenases, N-terminal domain
XX
TE   NAD(P)-binding Rossmann-fold
TE   Lactate & malate dehydrogenases
TE   Lactate dehydrogenase
TE   Malate dehydrogenase
//

File: seqwords.seq

ID   ACEA_ECOLI     STANDARD;      PRT;   434 AA.
AC   P05313;
DT   01-NOV-1988 (Rel. 09, Created)
DT   01-NOV-1988 (Rel. 09, Last sequence update)
DT   15-DEC-1998 (Rel. 37, Last annotation update)
DE   ISOCITRATE LYASE (EC 4.1.3.1) (ISOCITRASE) (ISOCITRATASE) (ICL).
GN   ACEA OR ICL.
OS   Escherichia coli.
OC   Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
OC   Escherichia.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=K12;
RX   MEDLINE; 89083515.
RA   Byrne C.R., Stokes H.W., Ward K.A.;
RT   "Nucleotide sequence of the aceB gene encoding malate synthase A in
RT   Escherichia coli.";
RL   Nucleic Acids Res. 16:10924-10924(1988).
RN   [2]
RP   SEQUENCE FROM N.A.
RC   STRAIN=K12;
RX   MEDLINE; 88262573.
RA   Rieul C., Bleicher F., Duclos B., Cortay J.-C., Cozzone A.J.;
RT   "Nucleotide sequence of the aceA gene coding for isocitrate lyase in
RT   Escherichia coli.";
RL   Nucleic Acids Res. 16:5689-5689(1988).
RN   [3]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 89008064.
RA   Matsuoka M., McFadden B.A.;
RT   "Isolation, hyperexpression, and sequencing of the aceA gene encoding
RT   isocitrate lyase in Escherichia coli.";
RL   J. Bacteriol. 170:4528-4536(1988).
RN   [4]
RP   SEQUENCE FROM N.A.
RC   STRAIN=K12 / MG1655;
RX   MEDLINE; 94089392.
RA   Blattner F.R., Burland V.D., Plunkett G. III, Sofia H.J.,
RA   Daniels D.L.;
RT   "Analysis of the Escherichia coli genome. IV. DNA sequence of the
RT   region from 89.2 to 92.8 minutes.";
RL   Nucleic Acids Res. 21:5408-5417(1993).
RN   [5]
RP   SEQUENCE OF 293-434 FROM N.A.
RX   MEDLINE; 88227861.
RA   Klumpp D.J., Plank D.W., Bowdin L.J., Stueland C.S., Chung T.,
RA   Laporte D.C.;
RT   "Nucleotide sequence of aceK, the gene encoding isocitrate
RT   dehydrogenase kinase/phosphatase.";
RL   J. Bacteriol. 170:2763-2769(1988).


  [Part of this file has been deleted for brevity]

FT   CONFLICT     70     70       A -> R (IN REF. 2).
FT   CONFLICT     80     80       A -> R (IN REF. 1 AND 2).
FT   CONFLICT    116    116       I -> N (IN REF. 2).
FT   CONFLICT    144    144       F -> L (IN REF. 1).
FT   CONFLICT    305    312       LGEEFVNK -> WAKSSLISN (IN REF. 2).
FT   CONFLICT    307    307       E -> Q (IN REF. 1).
FT   STRAND        2      6
FT   TURN          7      9
FT   HELIX        11     23
FT   TURN         26     27
FT   STRAND       28     33
FT   TURN         37     38
FT   HELIX        39     47
FT   TURN         48     48
FT   STRAND       53     58
FT   HELIX        64     67
FT   TURN         68     69
FT   STRAND       72     75
FT   TURN         83     84
FT   HELIX        87    108
FT   TURN        110    111
FT   STRAND      113    116
FT   HELIX       121    134
FT   TURN        135    136
FT   TURN        140    141
FT   STRAND      143    145
FT   HELIX       148    162
FT   TURN        163    163
FT   HELIX       166    168
FT   STRAND      173    175
FT   TURN        179    181
FT   STRAND      182    184
FT   HELIX       186    188
FT   TURN        190    191
FT   HELIX       196    217
FT   TURN        218    219
FT   HELIX       225    242
FT   TURN        243    244
FT   STRAND      248    255
FT   STRAND      263    271
FT   TURN        272    273
FT   STRAND      274    278
FT   HELIX       286    311
SQ   SEQUENCE   312 AA;  32337 MW;  17741A3B5AD068BA CRC64;
     MKVAVLGAAG GIGQALALLL KTQLPSGSEL SLYDIAPVTP GVAVDLSHIP TAVKIKGFSG
     EDATPALEGA DVVLISAGVA RKPGMDRSDL FNVNAGIVKN LVQQVAKTCP KACIGIITNP
     VNTTVAIAAE VLKKAGVYDK NKLFGVTTLD IIRSNTFVAE LKGKQPGEVE VPVIGGHSGV
     TILPLLSQVP GVSFTEQEVA DLTKRIQNAG TEVVEAKAGG GSATLSMGQA AARFGLSLVR
     ALQGEQGVVE CAYVEGDGQY ARFFSQPLLL GKNGVEERKS IGTLSAFEQN ALEGMLDTLK
     KDIALGEEFV NK
//




4.0 OUTPUT FILE FORMAT

DHF file (domain hits file)
The format of the DHF file (domain hits file) of hit sequences generated by SEQWORDS (Figure 1) is described fully in the SEQSEARCH documentation and is only summarised here. The file contains two lines per hit: the first is a description of the hit in 17 text tokens delimited by '^', and the second contains the protein sequence.
The first 4 tokens refer to the hit (sequence) itself.
The next 9 tokens refer to the domain family, superfamily etc. for which the terms were defined (in the keywords file).
The final 4 tokens refer to the hit, specifically information about the search result.
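
A minimal Python sketch of reading this two-line layout is given here. It only splits the description line on '^' and groups the tokens 4 + 9 + 4 as described above; it does not name the individual tokens, which are defined in the SEQSEARCH documentation. The function names are hypothetical, the header is the one from the usage example in this document (it splits into 17 tokens), and the sequence is shortened for the illustration.

```python
def split_dhf_header(header):
    """Split a DHF description line into (hit, node, search) token groups."""
    tokens = header.lstrip("> ").split("^")
    return tokens[:4], tokens[4:13], tokens[13:]


def read_dhf(text):
    """Yield ((hit, node, search), sequence) pairs from DHF text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Records are pairs of lines: description first, sequence second.
    for header, sequence in zip(lines[0::2], lines[1::2]):
        yield split_dhf_header(header), sequence


dhf_text = (
    "> Q60150^.^1^312^SCOP^.^0^Alpha and beta proteins (a/b)^.^.^"
    "NAD(P)-binding Rossmann-fold domains^"
    "NAD(P)-binding Rossmann-fold domains^"
    "Lactate & malate dehydrogenases, N-terminal domain^"
    "KEYWORD^0.00^0.000e+00^0.000e+00\n"
    "MKVAVLGAAGGIGQALALLL\n"  # sequence shortened for this example
)
records = list(read_dhf(dhf_text))
(hit, node, search), sequence = records[0]
print(hit)
print(search)
```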

Output files for usage example

File: seqwords.dhf

> Q60150^.^1^312^SCOP^.^0^Alpha and beta proteins (a/b)^.^.^NAD(P)-binding Rossmann-fold domains^NAD(P)-binding Rossmann-fold domains^Lactate & malate dehydrogenases, N-terminal domain^KEYWORD^0.00^0.000e+00^0.000e+00
MKVAVLGAAGGIGQALALLLKTQLPSGSELSLYDIAPVTPGVAVDLSHIPTAVKIKGFSGEDATPALEGADVVLISAGVARKPGMDRSDLFNVNAGIVKNLVQQVAKTCPKACIGIITNPVNTTVAIAAEVLKKAGVYDKNKLFGVTTLDIIRSNTFVAELKGKQPGEVEVPVIGGHSGVTILPLLSQVPGVSFTEQEVADLTKRIQNAGTEVVEAKAGGGSATLSMGQAAARFGLSLVRALQGEQGVVECAYVEGDGQYARFFSQPLLLGKNGVEERKSIGTLSAFEQNALEGMLDTLKKDIALGEEFVNK




5.0 DATA FILES

SEQWORDS does not use a data file.


6.0 USAGE

Generate DHF files from keyword search of UniProt.
Version: EMBOSS:6.6.0.0

   Standard (Mandatory) qualifiers:
  [-keyfile]           infile     This option specifies the name of keywords
                                  file (input). This contains a list of
                                  keywords specific to a number of SCOP or
                                  CATH families and superfamilies used by
                                  SEQWORDS to search a sequence database.
  [-spfile]            infile     This option specifies the name of the
                                  sequence database (input) to search.
  [-outfile]           outfile    [test.hits] This option specifies the name
                                  of the DHF file (domain hits file) (output).
                                  A 'domain hits file' contains database hits
                                  (sequences) with domain classification
                                  information, in the DHF format (FASTA-like).
                                  The hits are relatives to a SCOP or CATH
                                  family (or other node in the structural
                                  hierarchies) and are found from a search of
                                  a sequence database. Files containing hits
                                  retrieved by PSIBLAST are generated by using
                                  SEQSEARCH, hits retrieved by a sparse
                                  protein signatare by using SIGSCAN or
                                  various types of HMM and profile by using
                                  LIBSCAN.

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-outfile" associated qualifiers
   -odirectory3        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit

6.1 COMMAND LINE ARGUMENTS

Standard (Mandatory) qualifiers

  [-keyfile] (Parameter 1)
    Type:           infile
    Description:    This option specifies the name of the keywords file
                    (input). This contains a list of keywords specific to a
                    number of SCOP or CATH families and superfamilies used by
                    SEQWORDS to search a sequence database.
    Allowed values: Input file
    Default:        Required

  [-spfile] (Parameter 2)
    Type:           infile
    Description:    This option specifies the name of the sequence database
                    (input) to search.
    Allowed values: Input file
    Default:        Required

  [-outfile] (Parameter 3)
    Type:           outfile
    Description:    This option specifies the name of the DHF file (domain
                    hits file) (output). A 'domain hits file' contains
                    database hits (sequences) with domain classification
                    information, in the DHF format (FASTA-like). The hits are
                    relatives of a SCOP or CATH family (or other node in the
                    structural hierarchies) and are found from a search of a
                    sequence database. Files containing hits retrieved by
                    PSIBLAST are generated by using SEQSEARCH, hits retrieved
                    by a sparse protein signature by using SIGSCAN, or
                    various types of HMM and profile by using LIBSCAN.
    Allowed values: Output file
    Default:        test.hits

Additional (Optional) qualifiers: (none)

Advanced (Unprompted) qualifiers: (none)

Associated qualifiers

  "-outfile" associated outfile qualifiers

  -odirectory3, -odirectory_outfile
    Type:           string
    Description:    Output directory
    Allowed values: Any string

General qualifiers (all of type boolean; allowed values Yes/No)

  Qualifier  Default  Description
  -auto      N        Turn off prompts
  -stdout    N        Write first file to standard output
  -filter    N        Read first file from standard input, write first file
                      to standard output
  -options   N        Prompt for standard and additional values
  -debug     N        Write debug output to program.dbg
  -verbose   Y        Report some/full command line options
  -help      N        Report command line options and exit. More information
                      on associated and general qualifiers can be found with
                      -help -verbose
  -warning   Y        Report warnings
  -error     Y        Report errors
  -fatal     Y        Report fatal errors
  -die       Y        Report dying program messages
  -version   N        Report version number and exit

6.2 EXAMPLE SESSION

An example of interactive use of SEQWORDS is shown below.


% seqwords 
Generate DHF files from keyword search of UniProt.
Keywords file: seqwords.terms
Swissprot-format database file: seqwords.seq
Domain hits output file [test.hits]: seqwords.dhf

The input files for this example are shown in section 3.0 and the output file in section 4.0.
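
The same run can also be made without prompts by naming the three parameters on the command line and adding -auto, all of which are qualifiers listed in section 6.0. The Python sketch below builds that command and only runs it on request; the helper names are hypothetical, and the sketch assumes seqwords is installed and on the PATH.

```python
import shutil
import subprocess


def build_seqwords_command(keyfile, spfile, outfile):
    """Build an argument list for a non-interactive seqwords run."""
    return ["seqwords", "-keyfile", keyfile, "-spfile", spfile,
            "-outfile", outfile, "-auto"]


def run_seqwords(keyfile, spfile, outfile):
    """Run seqwords if it is installed; raise FileNotFoundError otherwise."""
    if shutil.which("seqwords") is None:
        raise FileNotFoundError("seqwords not found on PATH")
    subprocess.run(build_seqwords_command(keyfile, spfile, outfile), check=True)


command = build_seqwords_command("seqwords.terms", "seqwords.seq", "seqwords.dhf")
print(" ".join(command))
```

Calling run_seqwords("seqwords.terms", "seqwords.seq", "seqwords.dhf") would then correspond to the interactive session shown above.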




7.0 KNOWN BUGS & WARNINGS

SEQWORDS is slow because swissprot is read multiple times (once for each list of terms). Reading the file only once would require modifying the program to take an array of hitlist and terms structures, which should be a fairly easy change.
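
The single-read approach mentioned above can be sketched as follows. This is an illustration in Python rather than a patch to SEQWORDS: the function names are hypothetical, entries are assumed to be terminated by '//', and matching is assumed to be a case-insensitive substring test. Each entry is read once and tested against every list of terms, with one hit list kept per term list.

```python
def iter_entries(lines):
    """Group swissprot-format lines into entries terminated by '//'."""
    entry = []
    for line in lines:
        if line.startswith("//"):
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line.rstrip("\n"))
    if entry:
        yield entry  # tolerate a missing final terminator


def single_pass_search(lines, term_lists):
    """Return one hit list per term list after reading the entries once."""
    hitlists = [[] for _ in term_lists]
    for entry in iter_entries(lines):
        text = "\n".join(entry).lower()
        for i, terms in enumerate(term_lists):
            if any(term.lower() in text for term in terms):
                hitlists[i].append(entry[0])  # record the entry's first (ID) line
    return hitlists


# Made-up two-entry database, for illustration only.
database = [
    "ID   A", "DE   Malate dehydrogenase.", "//",
    "ID   B", "DE   Isocitrate lyase.", "//",
]
print(single_pass_search(database, [["Malate dehydrogenase"], ["lyase", "kinase"]]))
```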


8.0 NOTES

8.1 GLOSSARY OF FILE TYPES

Domain hits file
  Format:      DHF format (FASTA-like).
  Description: Database hits (sequences) with domain classification
               information. The hits are relatives of a SCOP or CATH family
               (or other node in the structural hierarchies) and are found
               from a search of a discriminating element (e.g. a protein
               signature, hidden Markov model, simple frequency matrix,
               Gribskov profile or Henikoff profile) against a sequence
               database.
  Created by:  SEQSEARCH (hits retrieved by PSIBLAST). SIGSCAN (hits
               retrieved by sparse protein signature). LIBSCAN (hits
               retrieved by various types of HMM and profile).
  See also:    N.A.

Keywords file
  Format:      Text
  Description: Contains a list of keywords specific to a number of SCOP
               families and superfamilies used by SEQWORDS to search a
               sequence database.
  Created by:  N.A.
  See also:    N.A.
None


9.0 DESCRIPTION

Retrieval of domain sequence information from swissprot by multiple keywords can be a time-consuming process. SEQWORDS parses a file of keywords and the swissprot database and writes a file of sequences whose swissprot entries contain at least one of the keywords. The domain sequence is taken if the keyword appears in a domain record of the feature table; otherwise the full-length sequence is taken.


10.0 ALGORITHM

None.


11.0 RELATED APPLICATIONS

See also

Program name    Description
cathparse       Generate DCF file from raw CATH files
domainalign     Generate alignments (DAF file) for nodes in a DCF file
domainnr        Remove redundant domains from a DCF file
domainrep       Reorder DCF file to identify representative structures
domainseqs      Add sequence records to a DCF file
domainsse       Add secondary structure records to a DCF file
helixturnhelix  Identify nucleic acid-binding motifs in protein sequences
libgen          Generate discriminating elements from alignments
matgen3d        Generate a 3D-1D scoring matrix from CCF files
pepcoil         Predict coiled coil regions in protein sequences
rocon           Generate a hits file from comparing two DHF files
rocplot         Perform ROC analysis on hits files
scopparse       Generate DCF file from raw SCOP files
seqalign        Extend alignments (DAF file) with sequences (DHF file)
seqfraggle      Remove fragment sequences from DHF files
seqsort         Remove ambiguous classified sequences from DHF files
ssematch        Search a DCF file for secondary structure matches



12.0 DIAGNOSTIC ERROR MESSAGES

None.


13.0 AUTHORS

Jon Ison (jison@ebi.ac.uk)
The European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK


14.0 REFERENCES

Please cite the authors and EMBOSS.

Rice P, Longden I and Bleasby A (2000) "EMBOSS - The European Molecular Biology Open Software Suite" Trends in Genetics, 16(6):276-277.

See also http://emboss.sourceforge.net/

14.1 Other useful references