|   | SIGGEN documentation | 
| TY SCOP XX TS 1D XX CL Alpha and beta proteins (a+b) XX FO Ferredoxin-like XX SF Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain XX FA Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain XX SI 54894 XX NP 15 XX NN [1] XX IN NRES 1 ; NGAP 1 ; WSIZ 0 XX AA H ; 2 XX GA 12 ; 2 XX NN [2] XX IN NRES 1 ; NGAP 1 ; WSIZ 0 XX AA P ; 2 XX GA 1 ; 2 XX NN [3] XX IN NRES 1 ; NGAP 1 ; WSIZ 0 XX AA P ; 2 XX GA 26 ; 2 XX NN [4] XX IN NRES 1 ; NGAP 1 ; WSIZ 0 XX AA T ; 2 XX GA 15 ; 2 XX NN [5] XX [Part of this file has been deleted for brevity] XX GA 4 ; 2 XX NN [10] XX IN NRES 1 ; NGAP 1 ; WSIZ 0 XX AA I ; 2 XX GA 2 ; 2 XX NN [11] XX IN NRES 1 ; NGAP 1 ; WSIZ 0 XX AA D ; 2 XX GA 0 ; 2 XX NN [12] XX IN NRES 1 ; NGAP 1 ; WSIZ 0 XX AA N ; 2 XX GA 0 ; 2 XX NN [13] XX IN NRES 1 ; NGAP 1 ; WSIZ 0 XX AA V ; 2 XX GA 3 ; 2 XX NN [14] XX IN NRES 1 ; NGAP 1 ; WSIZ 0 XX AA R ; 2 XX GA 3 ; 2 XX NN [15] XX IN NRES 1 ; NGAP 1 ; WSIZ 0 XX AA L ; 2 XX GA 2 ; 2 // | 
| TY SCOP XX TS 1D XX CL Alpha and beta proteins (a+b) XX FO Ferredoxin-like XX SF Adenylyl and guanylyl cyclase catalytic domain XX FA Adenylyl and guanylyl cyclase catalytic domain XX SI 55074 XX NP 38 XX NN [1] XX IN NRES 2 ; NGAP 2 ; WSIZ 0 XX AA H ; 1 AA E ; 1 XX GA 10 ; 1 GA 11 ; 1 XX NN [2] XX IN NRES 2 ; NGAP 1 ; WSIZ 0 XX AA D ; 1 AA T ; 1 XX GA 1 ; 2 XX NN [3] XX IN NRES 2 ; NGAP 1 ; WSIZ 0 XX AA I ; 1 AA T ; 1 XX GA 3 ; 2 XX NN [4] XX IN NRES 2 ; NGAP 1 ; WSIZ 0 XX AA F ; 1 AA I ; 1 [Part of this file has been deleted for brevity] AA N ; 1 XX GA 4 ; 2 XX NN [34] XX IN NRES 2 ; NGAP 2 ; WSIZ 0 XX AA K ; 1 AA A ; 1 XX GA 4 ; 1 GA 8 ; 1 XX NN [35] XX IN NRES 2 ; NGAP 1 ; WSIZ 0 XX AA W ; 1 AA A ; 1 XX GA 0 ; 2 XX NN [36] XX IN NRES 2 ; NGAP 2 ; WSIZ 0 XX AA A ; 1 AA T ; 1 XX GA 14 ; 1 GA 16 ; 1 XX NN [37] XX IN NRES 2 ; NGAP 1 ; WSIZ 0 XX AA K ; 1 AA X ; 1 XX GA 2 ; 2 XX NN [38] XX IN NRES 2 ; NGAP 1 ; WSIZ 0 XX AA G ; 1 AA D ; 1 XX GA 1 ; 2 // | 
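The example signature files above use two-letter record codes: NN opens a signature position, IN carries NRES, NGAP and WSIZ, each AA record gives a residue and each GA record a gap length, and `//` ends the entry. The following is a minimal Python sketch for reading such records. It assumes that the real file holds one record per line, and that the number after the `;` in AA and GA records is an occurrence count. The function name `parse_signature` and the dictionary layout are illustrative only and are not part of EMBOSS. The sample text is a shortened two-position excerpt modelled on the second example file.

```python
def parse_signature(text):
    """Parse signature-file text into a header dict and a list of positions."""
    header = {}
    positions = []
    current = None  # the signature position currently being filled in
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "XX":   # XX lines are spacers
            continue
        if line == "//":               # end of entry
            break
        code, _, value = line.partition(" ")
        value = value.strip()
        if code == "NN":
            current = {"number": int(value.strip("[]")), "residues": {}, "gaps": {}}
            positions.append(current)
        elif code == "IN" and current is not None:
            # e.g. "NRES 2 ; NGAP 2 ; WSIZ 0"
            for field in value.split(";"):
                name, number = field.split()
                current[name.lower()] = int(number)
        elif code == "AA" and current is not None:
            residue, count = (part.strip() for part in value.split(";"))
            current["residues"][residue] = int(count)
        elif code == "GA" and current is not None:
            gap, count = (part.strip() for part in value.split(";"))
            current["gaps"][int(gap)] = int(count)
        else:
            header[code] = value       # TY, TS, CL, FO, SF, FA, SI, NP ...
    return header, positions


example = """\
TY   SCOP
XX
SI   55074
XX
NN   [1]
XX
IN   NRES 2 ; NGAP 2 ; WSIZ 0
XX
AA   H ; 1
AA   E ; 1
XX
GA   10 ; 1
GA   11 ; 1
XX
NN   [2]
XX
IN   NRES 2 ; NGAP 1 ; WSIZ 0
XX
AA   D ; 1
AA   T ; 1
XX
GA   1 ; 2
//
"""

header, positions = parse_signature(example)
print(header["SI"], len(positions), positions[0]["gaps"])
```

Keeping residues and gaps as dictionaries makes it easy to check that the number of distinct keys matches the NRES and NGAP values in the IN record.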
   Standard (Mandatory) qualifiers (* if not always prompted):
  [-algpath]           dirlist    [./] This option specifies the location of
                                  DAF files (domain alignment files) (input).
                                  A 'domain alignment file' contains a
                                  sequence alignment of domains belonging to
                                  the same SCOP or CATH family (or other node
                                  in the structural hierarchies). The file is
                                  in DAF format (CLUSTAL-like) and is
                                  annotated with domain family classification
                                  information. The files generated by using
                                  SCOPALIGN will contain a structure-based
                                  sequence alignment of domains of known
                                  structure only. Such alignments can be
                                  extended with sequence relatives (of unknown
                                  structure) by using SEQALIGN.
   -mode               menu       [1] This option specifies the mode of
                                  signature generation. There are 3 modes of
                                  signature generation: (1) Use positions
                                  specified in alignment file. The alignment
                                  file must contain a line beginning with the
                                  text 'Positions' for each line of the
                                  alignment. A '1' in the 'Positions' line
                                  indicates that the signature should include
                                  data from the corresponding alignment site.
                                  The signature will only include the
                                  positions that are marked with a '1'. (2)
                                  Use a scoring method. The alignment is
                                  scored (see 'Algorithm') and the signature
                                  of a specified sparsity is sampled from high
                                  scoring positions. (3) Generate a
                                  randomised signature. A signature of a
                                  specified sparsity is sampled at random from
                                  the alignment. (Values: 1 (Use positions
                                  specified in alignment file); 2 (Use a
                                  scoring method); 3 (Generate a randomised
                                  signature))
*  -conoption          menu       [5] This option specifies the
                                  structure-based scoring scheme. SIGGEN
                                  provides 2 structure-based scoring schemes
                                  (plus a combination method) that are used to
                                  score the input alignment. (Values: 1
                                  (Number); 2 (Conservation); 3 (Number and
                                  conservation); 4 (None (structural data
                                  available)); 5 (None (no structural data
                                  available)))
*  -conpath            directory  [./] This option specifies the location of
                                  CON files (contact files) (input). A
                                  'contact file' contains contact data for a
                                  protein or a domain from SCOP or CATH, in
                                  the CON format (EMBL-like). The contacts may
                                  be intra-chain residue-residue, inter-chain
                                  residue-residue or residue-ligand. The
                                  files are generated by using CONTACTS,
                                  INTERFACE and SITES.
*  -cpdbpath           directory  [./] This option specifies the location of
                                  domain CCF files (clean coordinate files)
                                  (input). A 'clean coordinate file' contains
                                  protein coordinate and derived data for a
                                  single PDB file ('protein clean coordinate
                                  file') or a single domain from SCOP or CATH
                                  ('domain clean coordinate file'), in CCF
                                  format (EMBL-like). The files, generated by
                                  using PDBPARSE (PDB files) or DOMAINER
                                  (domains), contain 'cleaned-up' data that is
                                  self-consistent and error-corrected.
                                  Records for residue solvent accessibility
                                  and secondary structure are added to the
                                  file by using PDBPLUS.
*  -seqoption          menu       [3] This option specifies the sequence-based
                                  scoring scheme. SIGGEN provides 2
                                  sequence-based scoring schemes that are used
                                  to score the input alignment. (Values: 1
                                  (Substitution matrix); 2 (Residue class); 3
                                  (None))
*  -datafile           matrixf    [EBLOSUM62] This option specifies the
                                  substitution matrix. The substitution matrix
                                  is used by the sequence-based scoring
                                  schemes.
*  -sparsity           integer    [10] This option specifies the % sparsity of
                                  the signature. The signature sparsity is a
                                  user-defined parameter that determines how
                                  many residues the final signature will
                                  contain. For example, if the average
                                  sequence length of the proteins in the
                                  alignment is 250 residues, then a signature
                                  of sparsity 10% (default value) will contain
                                  25 key residues or signature positions,
                                  corresponding to the 25 highest-scoring
                                  alignment positions. (Any integer value)
   -wsiz               integer    [0] This option specifies the window size.
                                  When a signature is aligned to a protein
                                  sequence, the permissible gap between two
                                  signature positions is determined by the
                                  empirical gaps and the window size. The user
                                  is prompted for a single window size that is
                                  used for every position in the signature.
                                  This is unlikely to be optimal. A future
                                  implementation will provide a range of
                                  methods for generating window sizes
                                  depending upon the alignment (the window
                                  size is identified by the WSIZ record in the
                                  signature output file). (Any integer value)
*  -filtercon          toggle     [N] This option specifies whether to
                                  disregard positions that form only a few
                                  contacts when selecting signature
                                  positions.
*  -conthresh          integer    [10] This option specifies the threshold
                                  contact number. This controls the selection
                                  of key positions for the structure-based
                                  scoring scheme (number of contacts). (Any
                                  integer value)
*  -[no]filterpsim     boolean    [Y] This option specifies whether to
                                  disregard alignment sites that were not
                                  aligned satisfactorily (STAMP alignments
                                  only).
  [-sigoutdir]         outdir     [./] This option specifies the location of
                                  signature files (output). A 'signature file'
                                  contains a sparse sequence signature
                                  suitable for use with the SIGSCAN and SIGGEN
                                  programs. The files are generated by using
                                  SIGGEN & SIGGENLIG.
   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers: (none)
   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
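In mode 1, the 'Positions' line described under -mode acts as a column mask over the alignment: a '1' selects the corresponding alignment site for the signature. The sketch below illustrates that idea in Python. It assumes the mask and the aligned sequences have already been extracted from the DAF file as equal-length strings and that '-' marks a gap. The function names and the toy alignment are hypothetical and do not reproduce SIGGEN's own code.

```python
def marked_columns(positions_mask):
    """Return the 0-based alignment columns marked with '1' in a Positions mask."""
    return [i for i, ch in enumerate(positions_mask) if ch == "1"]


def residues_at(columns, aligned_sequences):
    """Collect the residues seen at each selected column, skipping gap characters."""
    return {
        col: [seq[col] for seq in aligned_sequences if seq[col] != "-"]
        for col in columns
    }


# Toy data: one mask character per alignment column.
mask = "0100010010"
alignment = [
    "MHKLA-PTRV",
    "MHRLAGPT-V",
]

cols = marked_columns(mask)
print(cols)                          # [1, 5, 8]
print(residues_at(cols, alignment))  # {1: ['H', 'H'], 5: ['G'], 8: ['R']}
```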
| Standard (Mandatory) qualifiers | Description | Allowed values | Default |
|---|---|---|---|
| [-algpath] (Parameter 1) | This option specifies the location of DAF files (domain alignment files) (input). A 'domain alignment file' contains a sequence alignment of domains belonging to the same SCOP or CATH family (or other node in the structural hierarchies). The file is in DAF format (CLUSTAL-like) and is annotated with domain family classification information. The files generated by using SCOPALIGN will contain a structure-based sequence alignment of domains of known structure only. Such alignments can be extended with sequence relatives (of unknown structure) by using SEQALIGN. | Directory with files | ./ |
| -mode | This option specifies the mode of signature generation. There are 3 modes of signature generation: (1) Use positions specified in alignment file. The alignment file must contain a line beginning with the text 'Positions' for each line of the alignment. A '1' in the 'Positions' line indicates that the signature should include data from the corresponding alignment site. The signature will only include the positions that are marked with a '1'. (2) Use a scoring method. The alignment is scored (see 'Algorithm') and the signature of a specified sparsity is sampled from high scoring positions. (3) Generate a randomised signature. A signature of a specified sparsity is sampled at random from the alignment. | 1 (Use positions specified in alignment file); 2 (Use a scoring method); 3 (Generate a randomised signature) | 1 |
| -conoption | This option specifies the structure-based scoring scheme. SIGGEN provides 2 structure-based scoring schemes (plus a combination method) that are used to score the input alignment. | 1 (Number); 2 (Conservation); 3 (Number and conservation); 4 (None (structural data available)); 5 (None (no structural data available)) | 5 |
| -conpath | This option specifies the location of CON files (contact files) (input). A 'contact file' contains contact data for a protein or a domain from SCOP or CATH, in the CON format (EMBL-like). The contacts may be intra-chain residue-residue, inter-chain residue-residue or residue-ligand. The files are generated by using CONTACTS, INTERFACE and SITES. | Directory | ./ |
| -cpdbpath | This option specifies the location of domain CCF files (clean coordinate files) (input). A 'clean coordinate file' contains protein coordinate and derived data for a single PDB file ('protein clean coordinate file') or a single domain from SCOP or CATH ('domain clean coordinate file'), in CCF format (EMBL-like). The files, generated by using PDBPARSE (PDB files) or DOMAINER (domains), contain 'cleaned-up' data that is self-consistent and error-corrected. Records for residue solvent accessibility and secondary structure are added to the file by using PDBPLUS. | Directory | ./ |
| -seqoption | This option specifies the sequence-based scoring scheme. SIGGEN provides 2 sequence-based scoring schemes that are used to score the input alignment. | 1 (Substitution matrix); 2 (Residue class); 3 (None) | 3 |
| -datafile | This option specifies the substitution matrix. The substitution matrix is used by the sequence-based scoring schemes. | Comparison matrix file in EMBOSS data path | EBLOSUM62 |
| -sparsity | This option specifies the % sparsity of the signature. The signature sparsity is a user-defined parameter that determines how many residues the final signature will contain. For example, if the average sequence length of the proteins in the alignment is 250 residues, then a signature of sparsity 10% (default value) will contain 25 key residues or signature positions, corresponding to the 25 highest-scoring alignment positions. | Any integer value | 10 |
| -wsiz | This option specifies the window size. When a signature is aligned to a protein sequence, the permissible gap between two signature positions is determined by the empirical gaps and the window size. The user is prompted for a single window size that is used for every position in the signature. This is unlikely to be optimal. A future implementation will provide a range of methods for generating window sizes depending upon the alignment (the window size is identified by the WSIZ record in the signature output file). | Any integer value | 0 |
| -filtercon | This option specifies whether to disregard positions that form only a few contacts when selecting signature positions. | Toggle value Yes/No | No |
| -conthresh | This option specifies the threshold contact number. This controls the selection of key positions for the structure-based scoring scheme (number of contacts). | Any integer value | 10 |
| -[no]filterpsim | This option specifies whether to disregard alignment sites that were not aligned satisfactorily (STAMP alignments only). | Boolean value Yes/No | Yes |
| [-sigoutdir] (Parameter 2) | This option specifies the location of signature files (output). A 'signature file' contains a sparse sequence signature suitable for use with the SIGSCAN and SIGGEN programs. The files are generated by using SIGGEN & SIGGENLIG. | Output directory | ./ |
| Additional (Optional) qualifiers | | Allowed values | Default |
| (none) | | | |
| Advanced (Unprompted) qualifiers | | Allowed values | Default |
| (none) | | | |
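The -sparsity description amounts to a small calculation: the number of signature positions is the stated percentage of the average sequence length, and in scoring mode those positions are drawn from the highest-scoring alignment sites. The Python sketch below works through that arithmetic. Rounding to the nearest integer and simply taking the top-ranked sites are assumptions made for illustration, and both function names are hypothetical. SIGGEN's actual sampling may differ.

```python
def signature_size(sequence_lengths, sparsity_percent=10):
    """Estimate the number of signature positions for a given % sparsity."""
    average_length = sum(sequence_lengths) / len(sequence_lengths)
    # Assumption: round to the nearest whole position.
    return round(average_length * sparsity_percent / 100)


def top_scoring_positions(scores, n):
    """Return the indices of the n highest-scoring sites, in alignment order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n])


# Average length 250 at the default 10% sparsity gives 25 positions.
print(signature_size([240, 250, 260]))

# Toy per-site scores for a 10-column alignment; keep the best three sites.
scores = [0.1, 0.9, 0.4, 0.8, 0.2, 0.7, 0.3, 0.6, 0.5, 0.0]
print(top_scoring_positions(scores, 3))
```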
| 
% siggen 
Generates a sparse protein signature from an alignment
Domain alignment directories [./]: ../domainalign-keep/daf
Specify mode of signature generation
         1 : Use positions specified in alignment file
         2 : Use a scoring method
         3 : Generate a randomised signature
Select number [1]: 2
Residue contacts scoring method
         1 : Number
         2 : Conservation
         3 : Number and conservation
         4 : None (structural data available)
         5 : None (no structural data available)
Select number [5]: 5
Sequence variability scoring method
         1 : Substitution matrix
         2 : Residue class
         3 : None
Select number [3]: 1
Substitution matrix to be used [EBLOSUM62]: EBLOSUM62
The % sparsity of signature [10]: 15
Window size [0]: 0
Ignore alignment positions with post_similar value of 0 [Y]: Y
Domainatrix signature file output directory [./]: 
 | 
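The interactive session above can also be expressed as a single command line, using the qualifiers listed in this document together with the general -auto qualifier to turn off prompts. The Python sketch below only assembles and prints such a command; it does not run SIGGEN. The helper name `build_siggen_command` is hypothetical, and whether this exact combination of qualifiers runs without further prompting on a given EMBOSS installation is an assumption that should be checked against `siggen -help`.

```python
import shlex


def build_siggen_command(algpath, sigoutdir, *, mode=2, conoption=5, seqoption=1,
                         datafile="EBLOSUM62", sparsity=15, wsiz=0, filterpsim=True):
    """Assemble a siggen argument list mirroring the answers in the example session."""
    argv = [
        "siggen",
        "-algpath", algpath,
        "-mode", str(mode),
        "-conoption", str(conoption),
        "-seqoption", str(seqoption),
        "-datafile", datafile,
        "-sparsity", str(sparsity),
        "-wsiz", str(wsiz),
    ]
    # -[no]filterpsim is a boolean qualifier: the 'no' prefix switches it off.
    argv.append("-filterpsim" if filterpsim else "-nofilterpsim")
    argv += ["-sigoutdir", sigoutdir, "-auto"]
    return argv


command = build_siggen_command("../domainalign-keep/daf", "./")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where SIGGEN is installed.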
| FILE TYPE | FORMAT | DESCRIPTION | CREATED BY | SEE ALSO |
|---|---|---|---|---|
| Clean coordinate file (for domain) | CCF format (EMBL-like). | Protein coordinate and derived data for a single domain from SCOP or CATH. The data are 'cleaned-up': self-consistent and error-corrected. | DOMAINER | Records for residue solvent accessibility and secondary structure are added to the file by using PDBPLUS. |
| Contact file (intra-chain residue-residue contacts) | CON format (EMBL-like). | Intra-chain residue-residue contact data for a protein or a domain from SCOP or CATH. | CONTACTS | N.A. |
| Domain alignment file | DAF format (CLUSTAL-like). | Sequence alignment of domains belonging to the same SCOP or CATH family (or other node in the structural hierarchies). The file is annotated with domain family classification information. | DOMAINALIGN (structure-based sequence alignment of domains of known structure). | DOMAINALIGN alignments can be extended with sequence relatives (of unknown structure) to the family in question by using SEQALIGN. |
| Signature file | SIG format | A sparse sequence signature suitable for use with the SIGSCAN program. | SIGGENLIG, LIBGEN | The files are generated by using SIGGEN. |
| Program name | Description | 
|---|---|
| contacts | Generate intra-chain CON files from CCF files | 
| domainalign | Generate alignments (DAF file) for nodes in a DCF file | 
| domainrep | Reorder DCF file to identify representative structures | 
| domainreso | Remove low resolution domains from a DCF file | 
| interface | Generate inter-chain CON files from CCF files | 
| libgen | Generate discriminating elements from alignments | 
| matgen3d | Generate a 3D-1D scoring matrix from CCF files | 
| psiphi | Calculates phi and psi torsion angles from protein coordinates | 
| rocon | Generates a hits file from comparing two DHF files | 
| rocplot | Performs ROC analysis on hits files | 
| seqalign | Extend alignments (DAF file) with sequences (DHF file) | 
| seqfraggle | Removes fragment sequences from DHF files | 
| seqsearch | Generate PSI-BLAST hits (DHF file) from a DAF file | 
| seqsort | Remove ambiguous classified sequences from DHF files | 
| seqwords | Generates DHF files from keyword search of UniProt | 
| siggenlig | Generates ligand-binding signatures from a CON file | 
| sigscan | Generates hits (DHF file) from a signature search | 
| sigscanlig | Searches ligand-signature library & writes hits (LHF file) | 
See also http://emboss.sourceforge.net/
Automatic generation and evaluation of sparse protein signatures for families of protein 
structural domains. MJ Blades, JC Ison, R Ranasinghe, and JBC Findlay. Protein Science. 2005 (accepted)
A key residues approach to the definition of protein families and analysis
of sparse family signatures.  JC Ison, AJ Bleasby, MJ Blades, SC Daniel, 
JH Parish, JBC Findlay.  PROTEINS: Structure, Function & Genetics.  2000, 
40:330-341
Alignment of a sparse protein signature with protein sequences: application
to fold prediction for three small globulins.  SC Daniel, JH Parish, 
JC Ison, MJ Blades & JBC Findlay.  FEBS Letters.  1999, 459:349-352.