SIGGEN documentation
TY SCOP
XX
TS 1D
XX
CL Alpha and beta proteins (a+b)
XX
FO Ferredoxin-like
XX
SF Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain
XX
FA Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain
XX
SI 54894
XX
NP 15
XX
NN [1]
XX
IN NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA H ; 2
XX
GA 12 ; 2
XX
NN [2]
XX
IN NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA P ; 2
XX
GA 1 ; 2
XX
NN [3]
XX
IN NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA P ; 2
XX
GA 26 ; 2
XX
NN [4]
XX
IN NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA T ; 2
XX
GA 15 ; 2
XX
NN [5]
XX
[Part of this file has been deleted for brevity]
XX
GA 4 ; 2
XX
NN [10]
XX
IN NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA I ; 2
XX
GA 2 ; 2
XX
NN [11]
XX
IN NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA D ; 2
XX
GA 0 ; 2
XX
NN [12]
XX
IN NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA N ; 2
XX
GA 0 ; 2
XX
NN [13]
XX
IN NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA V ; 2
XX
GA 3 ; 2
XX
NN [14]
XX
IN NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA R ; 2
XX
GA 3 ; 2
XX
NN [15]
XX
IN NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA L ; 2
XX
GA 2 ; 2
//

TY SCOP
XX
TS 1D
XX
CL Alpha and beta proteins (a+b)
XX
FO Ferredoxin-like
XX
SF Adenylyl and guanylyl cyclase catalytic domain
XX
FA Adenylyl and guanylyl cyclase catalytic domain
XX
SI 55074
XX
NP 38
XX
NN [1]
XX
IN NRES 2 ; NGAP 2 ; WSIZ 0
XX
AA H ; 1
AA E ; 1
XX
GA 10 ; 1
GA 11 ; 1
XX
NN [2]
XX
IN NRES 2 ; NGAP 1 ; WSIZ 0
XX
AA D ; 1
AA T ; 1
XX
GA 1 ; 2
XX
NN [3]
XX
IN NRES 2 ; NGAP 1 ; WSIZ 0
XX
AA I ; 1
AA T ; 1
XX
GA 3 ; 2
XX
NN [4]
XX
IN NRES 2 ; NGAP 1 ; WSIZ 0
XX
AA F ; 1
AA I ; 1
[Part of this file has been deleted for brevity]
AA N ; 1
XX
GA 4 ; 2
XX
NN [34]
XX
IN NRES 2 ; NGAP 2 ; WSIZ 0
XX
AA K ; 1
AA A ; 1
XX
GA 4 ; 1
GA 8 ; 1
XX
NN [35]
XX
IN NRES 2 ; NGAP 1 ; WSIZ 0
XX
AA W ; 1
AA A ; 1
XX
GA 0 ; 2
XX
NN [36]
XX
IN NRES 2 ; NGAP 2 ; WSIZ 0
XX
AA A ; 1
AA T ; 1
XX
GA 14 ; 1
GA 16 ; 1
XX
NN [37]
XX
IN NRES 2 ; NGAP 1 ; WSIZ 0
XX
AA K ; 1
AA X ; 1
XX
GA 2 ; 2
XX
NN [38]
XX
IN NRES 2 ; NGAP 1 ; WSIZ 0
XX
AA G ; 1
AA D ; 1
XX
GA 1 ; 2
//
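The record layout of the signature files shown above (header records, then one NN block per signature position, with XX spacer records and a closing //) lends itself to a simple line-based reader. The following Python sketch is illustrative only and is not part of EMBOSS: the function name `parse_signature` and the returned structure are hypothetical, it assumes one record per line, and it reads the second number on AA and GA records as an observation count, which is an interpretation of the examples rather than something the text states.

```python
def parse_signature(text):
    """Parse one signature entry into a header dict and a list of positions."""
    header, positions = {}, []
    current = None  # the NN block currently being filled, if any
    for raw in text.splitlines():
        line = raw.strip()
        code, _, value = line.partition(" ")
        value = value.strip()
        # Skip blank lines, XX spacers, the // terminator and non-record text.
        if len(code) != 2 or not code.isalpha() or not code.isupper() or code == "XX":
            continue
        if code == "NN":
            # "NN [3]" opens the block for signature position 3.
            current = {"number": int(value.strip("[]")), "info": {},
                       "residues": [], "gaps": []}
            positions.append(current)
        elif current is None:
            # Records before the first NN block are treated as header fields.
            header[code] = value
        elif code == "IN":
            # "NRES 2 ; NGAP 2 ; WSIZ 0" -> {"NRES": 2, "NGAP": 2, "WSIZ": 0}
            for field in value.split(";"):
                name, number = field.split()
                current["info"][name] = int(number)
        elif code == "AA":
            # "H ; 1" -> ("H", 1); the count interpretation is an assumption.
            residue, count = (part.strip() for part in value.split(";"))
            current["residues"].append((residue, int(count)))
        elif code == "GA":
            # "10 ; 1" -> (10, 1); the count interpretation is an assumption.
            gap, count = (int(part) for part in value.split(";"))
            current["gaps"].append((gap, count))
    return header, positions


example = """\
TY SCOP
XX
SI 55074
XX
NP 38
XX
NN [1]
XX
IN NRES 2 ; NGAP 2 ; WSIZ 0
XX
AA H ; 1
AA E ; 1
XX
GA 10 ; 1
GA 11 ; 1
//
"""

header, positions = parse_signature(example)
print(header["SI"], positions[0]["residues"], positions[0]["gaps"])
```

In the examples, NRES and NGAP match the number of AA and GA records in each block, so a reader like this could also be used to sanity-check a file.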
Standard (Mandatory) qualifiers (* if not always prompted):
[-algpath] dirlist [./] This option specifies the location of
DAF files (domain alignment files) (input).
A 'domain alignment file' contains a
sequence alignment of domains belonging to
the same SCOP or CATH family (or other node
in the structural hierarchies). The file is
in DAF format (CLUSTAL-like) and is
annotated with domain family classification
information. The files generated by using
SCOPALIGN will contain a structure-based
sequence alignment of domains of known
structure only. Such alignments can be
extended with sequence relatives (of unknown
structure) by using SEQALIGN.
-mode menu [1] This option specifies the mode of
signature generation. There are 3 modes of
signature generation: (1) Use positions
specified in alignment file. The alignment
file must contain a line beginning with the
text 'Positions' for each line of the
alignment. A '1' in the 'Positions' line
indicates that the signature should include
data from the corresponding alignment site.
The signature will only include the
positions that are marked with a '1'. (2)
Use a scoring method. The alignment is
scored (see 'Algorithm') and the signature
of a specified sparsity is sampled from high
scoring positions. (3) Generate a
randomised signature. A signature of a
specified sparsity is sampled at random from
the alignment. (Values: 1 (Use positions
specified in alignment file); 2 (Use a
scoring method); 3 (Generate a randomised
signature))
* -conoption menu [5] This option specifies the
structure-based scoring scheme. SIGGEN
provides 2 structure-based scoring schemes
(plus a combination method) that are used to
score the input alignment. (Values: 1
(Number); 2 (Conservation); 3 (Number and
conservation); 4 (None (structural data
available)); 5 (None (no structural data
available)))
* -conpath directory [./] This option specifies the location of
CON files (contact files) (input). A
'contact file' contains contact data for a
protein or a domain from SCOP or CATH, in
the CON format (EMBL-like). The contacts may
be intra-chain residue-residue, inter-chain
residue-residue or residue-ligand. The
files are generated by using CONTACTS,
INTERFACE and SITES.
* -cpdbpath directory [./] This option specifies the location of
domain CCF files (clean coordinate files)
(input). A 'clean coordinate file' contains
protein coordinate and derived data for a
single PDB file ('protein clean coordinate
file') or a single domain from SCOP or CATH
('domain clean coordinate file'), in CCF
format (EMBL-like). The files, generated by
using PDBPARSE (PDB files) or DOMAINER
(domains), contain 'cleaned-up' data that is
self-consistent and error-corrected.
Records for residue solvent accessibility
and secondary structure are added to the
file by using PDBPLUS.
* -seqoption menu [3] This option specifies the sequence-based
scoring scheme. SIGGEN provides 2
sequence-based scoring schemes that are used
to score the input alignment. (Values: 1
(Substitution matrix); 2 (Residue class); 3
(None))
* -datafile matrixf [EBLOSUM62] This option specifies the
substitution matrix. The substitution matrix
is used by the sequence-based scoring
schemes.
* -sparsity integer [10] This option specifies the % sparsity of
signature. The signature sparsity is a
user-defined parameter that determines how
many residues the final signature will
contain. For example, if the average
sequence length of the proteins in the
alignment is 250 residues, then a signature
of sparsity 10% (default value) will contain
25 key residues (signature positions),
corresponding to the 25 highest-scoring
alignment positions. (Any integer
value)
-wsiz integer [0] This option specifies the window size.
When a signature is aligned to a protein
sequence, the permissible gap between two
signature positions is determined by the
empirical gaps and the window size. The user
is prompted for a window size that is used
for every position in the signature. This
is probably not optimal. A future implementation
will provide a range of methods for
generating values of window size depending
upon the alignment (window size is
identified by the WSIZ record in the
signature output file). (Any integer value)
* -filtercon toggle [N] This option specifies whether to
disregard positions that form only a few
contacts during the selection of signature
positions.
* -conthresh integer [10] This option specifies the threshold
contact number. This controls the selection
of key positions for the structure-based
scoring scheme (number of contacts). (Any
integer value)
* -[no]filterpsim boolean [Y] This option specifies whether to
disregard alignment sites that were not
aligned satisfactorily (STAMP alignments
only).
[-sigoutdir] outdir [./] This option specifies the location of
signature files (output). A 'signature file'
contains a sparse sequence signature
suitable for use with the SIGSCAN and SIGGEN
programs. The files are generated by using
SIGGEN & SIGGENLIG.
Additional (Optional) qualifiers: (none)
Advanced (Unprompted) qualifiers: (none)
Associated qualifiers: (none)
General qualifiers:
-auto boolean Turn off prompts
-stdout boolean Write first file to standard output
-filter boolean Read first file from standard input, write
first file to standard output
-options boolean Prompt for standard and additional values
-debug boolean Write debug output to program.dbg
-verbose boolean Report some/full command line options
-help boolean Report command line options. More
information on associated and general
qualifiers can be found with -help -verbose
-warning boolean Report warnings
-error boolean Report errors
-fatal boolean Report fatal errors
-die boolean Report dying program messages
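The -sparsity arithmetic and the idea of taking a signature from the highest-scoring positions (mode 2) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SIGGEN's implementation: the function names are hypothetical, the rounding rule is assumed, and the per-position scores are taken as given rather than computed by any of the scoring schemes.

```python
def signature_size(mean_length, sparsity_percent):
    """Number of signature positions for a given % sparsity.

    Rounding to the nearest integer is an assumption; the documentation
    only gives the 250-residue, 10% example.
    """
    return round(mean_length * sparsity_percent / 100)


def top_positions(scores, n):
    """Indices of the n highest-scoring alignment positions, in alignment order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n])


# The worked example from the -sparsity description: 250 residues at 10%.
print(signature_size(250, 10))

# Hypothetical per-position scores for a four-column alignment.
print(top_positions([0.1, 0.9, 0.5, 0.7], 2))
```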
| Standard (Mandatory) qualifiers | Description | Allowed values | Default |
|---|---|---|---|
| [-algpath] (Parameter 1) | This option specifies the location of DAF files (domain alignment files) (input). A 'domain alignment file' contains a sequence alignment of domains belonging to the same SCOP or CATH family (or other node in the structural hierarchies). The file is in DAF format (CLUSTAL-like) and is annotated with domain family classification information. The files generated by using SCOPALIGN will contain a structure-based sequence alignment of domains of known structure only. Such alignments can be extended with sequence relatives (of unknown structure) by using SEQALIGN. | Directory with files | ./ |
| -mode | This option specifies the mode of signature generation. There are 3 modes of signature generation: (1) Use positions specified in alignment file. The alignment file must contain a line beginning with the text 'Positions' for each line of the alignment. A '1' in the 'Positions' line indicates that the signature should include data from the corresponding alignment site. The signature will only include the positions that are marked with a '1'. (2) Use a scoring method. The alignment is scored (see 'Algorithm') and the signature of a specified sparsity is sampled from high scoring positions. (3) Generate a randomised signature. A signature of a specified sparsity is sampled at random from the alignment. | 1 (Use positions specified in alignment file); 2 (Use a scoring method); 3 (Generate a randomised signature) | 1 |
| -conoption | This option specifies the structure-based scoring scheme. SIGGEN provides 2 structure-based scoring schemes (plus a combination method) that are used to score the input alignment. | 1 (Number); 2 (Conservation); 3 (Number and conservation); 4 (None (structural data available)); 5 (None (no structural data available)) | 5 |
| -conpath | This option specifies the location of CON files (contact files) (input). A 'contact file' contains contact data for a protein or a domain from SCOP or CATH, in the CON format (EMBL-like). The contacts may be intra-chain residue-residue, inter-chain residue-residue or residue-ligand. The files are generated by using CONTACTS, INTERFACE and SITES. | Directory | ./ |
| -cpdbpath | This option specifies the location of domain CCF files (clean coordinate files) (input). A 'clean coordinate file' contains protein coordinate and derived data for a single PDB file ('protein clean coordinate file') or a single domain from SCOP or CATH ('domain clean coordinate file'), in CCF format (EMBL-like). The files, generated by using PDBPARSE (PDB files) or DOMAINER (domains), contain 'cleaned-up' data that is self-consistent and error-corrected. Records for residue solvent accessibility and secondary structure are added to the file by using PDBPLUS. | Directory | ./ |
| -seqoption | This option specifies the sequence-based scoring scheme. SIGGEN provides 2 sequence-based scoring schemes that are used to score the input alignment. | 1 (Substitution matrix); 2 (Residue class); 3 (None) | 3 |
| -datafile | This option specifies the substitution matrix. The substitution matrix is used by the sequence-based scoring schemes. | Comparison matrix file in EMBOSS data path | EBLOSUM62 |
| -sparsity | This option specifies the % sparsity of signature. The signature sparsity is a user-defined parameter that determines how many residues the final signature will contain. For example, if the average sequence length of the proteins in the alignment is 250 residues, then a signature of sparsity 10% (default value) will contain 25 key residues (signature positions), corresponding to the 25 highest-scoring alignment positions. | Any integer value | 10 |
| -wsiz | This option specifies the window size. When a signature is aligned to a protein sequence, the permissible gap between two signature positions is determined by the empirical gaps and the window size. The user is prompted for a window size that is used for every position in the signature. This is probably not optimal. A future implementation will provide a range of methods for generating values of window size depending upon the alignment (window size is identified by the WSIZ record in the signature output file). | Any integer value | 0 |
| -filtercon | This option specifies whether to disregard positions that form only a few contacts during the selection of signature positions. | Toggle value Yes/No | No |
| -conthresh | This option specifies the threshold contact number. This controls the selection of key positions for the structure-based scoring scheme (number of contacts). | Any integer value | 10 |
| -[no]filterpsim | This option specifies whether to disregard alignment sites that were not aligned satisfactorily (STAMP alignments only). | Boolean value Yes/No | Yes |
| [-sigoutdir] (Parameter 2) | This option specifies the location of signature files (output). A 'signature file' contains a sparse sequence signature suitable for use with the SIGSCAN and SIGGEN programs. The files are generated by using SIGGEN & SIGGENLIG. | Output directory | ./ |
| Additional (Optional) qualifiers | | Allowed values | Default |
| (none) | | | |
| Advanced (Unprompted) qualifiers | | Allowed values | Default |
| (none) | | | |
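In mode 1 the signature is built only from alignment sites marked with a '1' on the 'Positions' line. The Python sketch below illustrates that selection on a toy alignment. It is not SIGGEN code: the function names are hypothetical, the 'Positions' marks are assumed to have already been joined into one string as long as the alignment, and any character other than '1' is treated as "not selected".

```python
def marked_columns(positions_line):
    """Zero-based indices of alignment sites marked with '1'."""
    return [i for i, mark in enumerate(positions_line) if mark == "1"]


def extract_sites(aligned, positions_line):
    """Keep only the marked sites from each aligned sequence."""
    columns = marked_columns(positions_line)
    return {name: "".join(seq[i] for i in columns) for name, seq in aligned.items()}


# Toy alignment of two hypothetical domains and a matching mask.
aligned = {"dom1": "HPA-T", "dom2": "HPG-T"}
mask = "11001"

print(marked_columns(mask))
print(extract_sites(aligned, mask))
```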
% siggen
Generates a sparse protein signature from an alignment
Domain alignment directories [./]: ../domainalign-keep/daf
Specify mode of signature generation
1 : Use positions specified in alignment file
2 : Use a scoring method
3 : Generate a randomised signature
Select number [1]: 2
Residue contacts scoring method
1 : Number
2 : Conservation
3 : Number and conservation
4 : None (structural data available)
5 : None (no structural data available)
Select number [5]: 5
Sequence variability scoring method
1 : Substitution matrix
2 : Residue class
3 : None
Select number [3]: 1
Substitution matrix to be used [EBLOSUM62]: EBLOSUM62
The % sparsity of signature [10]: 15
Window size [0]: 0
Ignore alignment positions with post_similar value of 0 [Y]: Y
Domainatrix signature file output directory [./]:
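The answers given in the interactive session above can also be supplied as command-line qualifiers, with -auto turning off the prompts. The Python sketch below only assembles such a command from the qualifiers documented on this page; the helper name `siggen_command` is hypothetical and the resulting command has not been run against a real EMBOSS installation, so treat it as a starting point rather than a verified invocation.

```python
import shlex


def siggen_command(algpath, sigoutdir, mode, conoption, seqoption,
                   datafile, sparsity, wsiz, filterpsim):
    """Build an argument list for a non-interactive siggen run (illustrative)."""
    args = [
        "siggen",
        "-algpath", algpath,
        "-mode", str(mode),
        "-conoption", str(conoption),
        "-seqoption", str(seqoption),
        "-datafile", datafile,
        "-sparsity", str(sparsity),
        "-wsiz", str(wsiz),
    ]
    # The boolean qualifier is documented as -[no]filterpsim.
    args.append("-filterpsim" if filterpsim else "-nofilterpsim")
    # -auto turns off prompts, per the general qualifiers listed on this page.
    args += ["-sigoutdir", sigoutdir, "-auto"]
    return args


# Same values as the interactive session above.
command = siggen_command(
    algpath="../domainalign-keep/daf", sigoutdir="./",
    mode=2, conoption=5, seqoption=1, datafile="EBLOSUM62",
    sparsity=15, wsiz=0, filterpsim=True,
)
print(shlex.join(command))
# To execute it, one could pass `command` to subprocess.run(command, check=True).
```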
| FILE TYPE | FORMAT | DESCRIPTION | CREATED BY | SEE ALSO |
|---|---|---|---|---|
| Clean coordinate file (for domain) | CCF format (EMBL-like). | Protein coordinate and derived data for a single domain from SCOP or CATH. The data are 'cleaned-up': self-consistent and error-corrected. | DOMAINER | Records for residue solvent accessibility and secondary structure are added to the file by using PDBPLUS. |
| Contact file (intra-chain residue-residue contacts) | CON format (EMBL-like). | Intra-chain residue-residue contact data for a protein or a domain from SCOP or CATH. | CONTACTS | N.A. |
| Domain alignment file | DAF format (CLUSTAL-like). | Sequence alignment of domains belonging to the same SCOP or CATH family (or other node in the structural hierarchies). The file is annotated with domain family classification information. | DOMAINALIGN (structure-based sequence alignment of domains of known structure). | DOMAINALIGN alignments can be extended with sequence relatives (of unknown structure) to the family in question by using SEQALIGN. |
| Signature file | SIG format | Contains a sparse sequence signature suitable for use with the SIGSCAN program. | SIGGENLIG, LIBGEN | The files are generated by using SIGGEN. |
| Program name | Description |
|---|---|
| contacts | Generate intra-chain CON files from CCF files |
| domainalign | Generate alignments (DAF file) for nodes in a DCF file |
| domainrep | Reorder DCF file to identify representative structures |
| domainreso | Remove low resolution domains from a DCF file |
| interface | Generate inter-chain CON files from CCF files |
| libgen | Generate discriminating elements from alignments |
| matgen3d | Generate a 3D-1D scoring matrix from CCF files |
| psiphi | Calculates phi and psi torsion angles from protein coordinates |
| rocon | Generates a hits file from comparing two DHF files |
| rocplot | Performs ROC analysis on hits files |
| seqalign | Extend alignments (DAF file) with sequences (DHF file) |
| seqfraggle | Removes fragment sequences from DHF files |
| seqsearch | Generate PSI-BLAST hits (DHF file) from a DAF file |
| seqsort | Remove ambiguous classified sequences from DHF files |
| seqwords | Generates DHF files from keyword search of UniProt |
| siggenlig | Generates ligand-binding signatures from a CON file |
| sigscan | Generates hits (DHF file) from a signature search |
| sigscanlig | Searches ligand-signature library & writes hits (LHF file) |
See also http://emboss.sourceforge.net/
Automatic generation and evaluation of sparse protein signatures for families of protein
structural domains. MJ Blades, JC Ison, R Ranasinghe, and JBC Findlay. Protein Science. 2005 (accepted)
A key residues approach to the definition of protein families and analysis
of sparse family signatures. JC Ison, AJ Bleasby, MJ Blades, SC Daniel,
JH Parish, JBC Findlay. PROTEINS: Structure, Function & Genetics. 2000,
40:330-341
Alignment of a sparse protein signature with protein sequences: application
to fold prediction for three small globulins. SC Daniel, JH Parish,
JC Ison, MJ Blades & JBC Findlay. FEBS Letters. 1999, 459:349-352.