SIGGEN documentation
TY   SCOP
XX
TS   1D
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Ferredoxin-like
XX
SF   Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain
XX
FA   Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain
XX
SI   54894
XX
NP   15
XX
NN   [1]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   H ; 2
XX
GA   12 ; 2
XX
NN   [2]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   P ; 2
XX
GA   1 ; 2
XX
NN   [3]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   P ; 2
XX
GA   26 ; 2
XX
NN   [4]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   T ; 2
XX
GA   15 ; 2
XX
NN   [5]
XX
[Part of this file has been deleted for brevity]
XX
GA   4 ; 2
XX
NN   [10]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   I ; 2
XX
GA   2 ; 2
XX
NN   [11]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   D ; 2
XX
GA   0 ; 2
XX
NN   [12]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   N ; 2
XX
GA   0 ; 2
XX
NN   [13]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   V ; 2
XX
GA   3 ; 2
XX
NN   [14]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   R ; 2
XX
GA   3 ; 2
XX
NN   [15]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   L ; 2
XX
GA   2 ; 2
//

TY   SCOP
XX
TS   1D
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Ferredoxin-like
XX
SF   Adenylyl and guanylyl cyclase catalytic domain
XX
FA   Adenylyl and guanylyl cyclase catalytic domain
XX
SI   55074
XX
NP   38
XX
NN   [1]
XX
IN   NRES 2 ; NGAP 2 ; WSIZ 0
XX
AA   H ; 1
AA   E ; 1
XX
GA   10 ; 1
GA   11 ; 1
XX
NN   [2]
XX
IN   NRES 2 ; NGAP 1 ; WSIZ 0
XX
AA   D ; 1
AA   T ; 1
XX
GA   1 ; 2
XX
NN   [3]
XX
IN   NRES 2 ; NGAP 1 ; WSIZ 0
XX
AA   I ; 1
AA   T ; 1
XX
GA   3 ; 2
XX
NN   [4]
XX
IN   NRES 2 ; NGAP 1 ; WSIZ 0
XX
AA   F ; 1
AA   I ; 1
[Part of this file has been deleted for brevity]
AA   N ; 1
XX
GA   4 ; 2
XX
NN   [34]
XX
IN   NRES 2 ; NGAP 2 ; WSIZ 0
XX
AA   K ; 1
AA   A ; 1
XX
GA   4 ; 1
GA   8 ; 1
XX
NN   [35]
XX
IN   NRES 2 ; NGAP 1 ; WSIZ 0
XX
AA   W ; 1
AA   A ; 1
XX
GA   0 ; 2
XX
NN   [36]
XX
IN   NRES 2 ; NGAP 2 ; WSIZ 0
XX
AA   A ; 1
AA   T ; 1
XX
GA   14 ; 1
GA   16 ; 1
XX
NN   [37]
XX
IN   NRES 2 ; NGAP 1 ; WSIZ 0
XX
AA   K ; 1
AA   X ; 1
XX
GA   2 ; 2
XX
NN   [38]
XX
IN   NRES 2 ; NGAP 1 ; WSIZ 0
XX
AA   G ; 1
AA   D ; 1
XX
GA   1 ; 2
//
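
The record layout of the two example signature files above can be read with a few lines of Python. This is a minimal sketch and not part of SIGGEN. It assumes, based only on these examples, that every record is a two-letter code followed by its value, that `XX` lines are spacers, and that `//` ends an entry. The names `SigPosition` and `parse_signature` are hypothetical, and the embedded sample is an abridged copy of the second example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SigPosition:
    """One signature position (an NN block); a hypothetical container."""
    number: int
    wsiz: int = 0
    residues: dict = field(default_factory=dict)  # residue code -> count (AA records)
    gaps: dict = field(default_factory=dict)      # gap length -> count (GA records)


def parse_signature(text):
    """Parse one signature entry into (header dict, list of SigPosition)."""
    header, positions = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "XX":      # assumed spacer record
            continue
        if line == "//":                  # assumed end-of-entry marker
            break
        code, _, value = line.partition(" ")
        value = value.strip()
        if code == "NN":
            # "NN [3]" starts signature position 3
            positions.append(SigPosition(number=int(value.strip("[]"))))
        elif code == "IN" and positions:
            match = re.search(r"WSIZ\s+(\d+)", value)
            if match:
                positions[-1].wsiz = int(match.group(1))
        elif code in ("AA", "GA") and positions:
            key, count = (part.strip() for part in value.split(";"))
            if code == "AA":
                positions[-1].residues[key] = int(count)
            else:
                positions[-1].gaps[int(key)] = int(count)
        else:
            # TY, TS, CL, FO, SF, FA, SI, NP and anything unrecognised
            header[code] = value
    return header, positions


SAMPLE = """\
TY   SCOP
XX
SI   55074
XX
NN   [1]
XX
IN   NRES 2 ; NGAP 2 ; WSIZ 0
XX
AA   H ; 1
AA   E ; 1
XX
GA   10 ; 1
GA   11 ; 1
//
"""

header, positions = parse_signature(SAMPLE)
print(header["SI"], positions[0].residues, positions[0].gaps)
# 55074 {'H': 1, 'E': 1} {10: 1, 11: 1}
```

Parsing on whitespace rather than fixed columns keeps the sketch independent of the exact spacing, which the examples here do not pin down.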
Standard (Mandatory) qualifiers (* if not always prompted):

[-algpath] dirlist [./] This option specifies the location of DAF files (domain alignment files) (input). A 'domain alignment file' contains a sequence alignment of domains belonging to the same SCOP or CATH family (or other node in the structural hierarchies). The file is in DAF format (CLUSTAL-like) and is annotated with domain family classification information. The files generated by using SCOPALIGN will contain a structure-based sequence alignment of domains of known structure only. Such alignments can be extended with sequence relatives (of unknown structure) by using SEQALIGN.

-mode menu [1] This option specifies the mode of signature generation. There are 3 modes of signature generation: (1) Use positions specified in alignment file. The alignment file must contain a line beginning with the text 'Positions' for each line of the alignment. A '1' in the 'Positions' line indicates that the signature should include data from the corresponding alignment site. The signature will only include the positions that are marked with a '1'. (2) Use a scoring method. The alignment is scored (see 'Algorithm') and the signature of a specified sparsity is sampled from high scoring positions. (3) Generate a randomised signature. A signature of a specified sparsity is sampled at random from the alignment. (Values: 1 (Use positions specified in alignment file); 2 (Use a scoring method); 3 (Generate a randomised signature))

* -conoption menu [5] This option specifies the structure-based scoring scheme. SIGGEN provides 2 structure-based scoring schemes (plus a combination method) that are used to score the input alignment. (Values: 1 (Number); 2 (Conservation); 3 (Number and conservation); 4 (None (structural data available)); 5 (None (no structural data available)))

* -conpath directory [./] This option specifies the location of CON files (contact files) (input). A 'contact file' contains contact data for a protein or a domain from SCOP or CATH, in the CON format (EMBL-like). The contacts may be intra-chain residue-residue, inter-chain residue-residue or residue-ligand. The files are generated by using CONTACTS, INTERFACE and SITES.

* -cpdbpath directory [./] This option specifies the location of domain CCF files (clean coordinate files) (input). A 'clean coordinate file' contains protein coordinate and derived data for a single PDB file ('protein clean coordinate file') or a single domain from SCOP or CATH ('domain clean coordinate file'), in CCF format (EMBL-like). The files, generated by using PDBPARSE (PDB files) or DOMAINER (domains), contain 'cleaned-up' data that is self-consistent and error-corrected. Records for residue solvent accessibility and secondary structure are added to the file by using PDBPLUS.

* -seqoption menu [3] This option specifies the sequence-based scoring scheme. SIGGEN provides 2 sequence-based scoring schemes that are used to score the input alignment. (Values: 1 (Substitution matrix); 2 (Residue class); 3 (None))

* -datafile matrixf [EBLOSUM62] This option specifies the substitution matrix. The substitution matrix is used by the sequence-based scoring schemes.

* -sparsity integer [10] This option specifies the % sparsity of the signature. The signature sparsity is a user-defined parameter that determines how many residues the final signature will contain. For example, if the average sequence length of the proteins in the alignment is 250 residues, a signature of sparsity 10% (the default) will contain 25 key residues, or signature positions, corresponding to the 25 highest-scoring alignment positions. (Any integer value)

-wsiz integer [0] This option specifies the window size. When a signature is aligned to a protein sequence, the permissible gap between two signature positions is determined by the empirical gaps and the window size. The user is prompted for a single window size that is used for every position in the signature. This is unlikely to be optimal. A future implementation will provide a range of methods for generating window size values depending upon the alignment (the window size is identified by the WSIZ record in the signature output file). (Any integer value)

* -filtercon toggle [N] This option specifies whether to disregard positions that form only a few contacts during the selection of signature positions.

* -conthresh integer [10] This option specifies the threshold contact number. This controls the selection of key positions for the structure-based scoring scheme (number of contacts). (Any integer value)

* -[no]filterpsim boolean [Y] This option specifies whether to disregard alignment sites that were not aligned satisfactorily (STAMP alignments only).

[-sigoutdir] outdir [./] This option specifies the location of signature files (output). A 'signature file' contains a sparse sequence signature suitable for use with the SIGSCAN and SIGGEN programs. The files are generated by using SIGGEN & SIGGENLIG.

Additional (Optional) qualifiers: (none)

Advanced (Unprompted) qualifiers: (none)

Associated qualifiers: (none)

General qualifiers:
-auto boolean Turn off prompts
-stdout boolean Write first file to standard output
-filter boolean Read first file from standard input, write first file to standard output
-options boolean Prompt for standard and additional values
-debug boolean Write debug output to program.dbg
-verbose boolean Report some/full command line options
-help boolean Report command line options. More information on associated and general qualifiers can be found with -help -verbose
-warning boolean Report warnings
-error boolean Report errors
-fatal boolean Report fatal errors
-die boolean Report dying program messages
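
Mode 1 of `-mode` can be illustrated with a short Python sketch. It is not SIGGEN's implementation. It assumes a single alignment block held as equal-length strings and a 'Positions' line reduced to a string of flags in which only '1' marks a selected site; the real DAF layout may differ. The names `marked_columns` and `residue_counts` are hypothetical, and the three-sequence block is invented for illustration.

```python
from collections import Counter


def marked_columns(positions_line):
    """Return 0-based indices of alignment sites flagged with '1'."""
    return [i for i, flag in enumerate(positions_line) if flag == "1"]


def residue_counts(aligned_seqs, positions_line, gap_chars="-."):
    """Count the residues observed at each flagged site, skipping gap characters."""
    counts = []
    for col in marked_columns(positions_line):
        column = (seq[col] for seq in aligned_seqs if col < len(seq))
        counts.append(Counter(res for res in column if res not in gap_chars))
    return counts


# Invented toy alignment block and flags (not taken from a real DAF file)
block = ["HP-TL", "EPATL", "HP-SL"]
mask = "10010"

print(marked_columns(mask))  # [0, 3]
print(residue_counts(block, mask))
```

The per-site residue counts correspond loosely to the AA records of a signature file, where each residue is listed with the number of times it was observed.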
Qualifier | Description | Allowed values | Default |
---|---|---|---|
Standard (Mandatory) qualifiers | | | |
[-algpath] (Parameter 1) | This option specifies the location of DAF files (domain alignment files) (input). A 'domain alignment file' contains a sequence alignment of domains belonging to the same SCOP or CATH family (or other node in the structural hierarchies). The file is in DAF format (CLUSTAL-like) and is annotated with domain family classification information. The files generated by using SCOPALIGN will contain a structure-based sequence alignment of domains of known structure only. Such alignments can be extended with sequence relatives (of unknown structure) by using SEQALIGN. | Directory with files | ./ |
-mode | This option specifies the mode of signature generation. There are 3 modes of signature generation: (1) Use positions specified in alignment file. The alignment file must contain a line beginning with the text 'Positions' for each line of the alignment. A '1' in the 'Positions' line indicates that the signature should include data from the corresponding alignment site. The signature will only include the positions that are marked with a '1'. (2) Use a scoring method. The alignment is scored (see 'Algorithm') and the signature of a specified sparsity is sampled from high scoring positions. (3) Generate a randomised signature. A signature of a specified sparsity is sampled at random from the alignment. | 1 (Use positions specified in alignment file); 2 (Use a scoring method); 3 (Generate a randomised signature) | 1 |
-conoption | This option specifies the structure-based scoring scheme. SIGGEN provides 2 structure-based scoring schemes (plus a combination method) that are used to score the input alignment. | 1 (Number); 2 (Conservation); 3 (Number and conservation); 4 (None (structural data available)); 5 (None (no structural data available)) | 5 |
-conpath | This option specifies the location of CON files (contact files) (input). A 'contact file' contains contact data for a protein or a domain from SCOP or CATH, in the CON format (EMBL-like). The contacts may be intra-chain residue-residue, inter-chain residue-residue or residue-ligand. The files are generated by using CONTACTS, INTERFACE and SITES. | Directory | ./ |
-cpdbpath | This option specifies the location of domain CCF files (clean coordinate files) (input). A 'clean coordinate file' contains protein coordinate and derived data for a single PDB file ('protein clean coordinate file') or a single domain from SCOP or CATH ('domain clean coordinate file'), in CCF format (EMBL-like). The files, generated by using PDBPARSE (PDB files) or DOMAINER (domains), contain 'cleaned-up' data that is self-consistent and error-corrected. Records for residue solvent accessibility and secondary structure are added to the file by using PDBPLUS. | Directory | ./ |
-seqoption | This option specifies the sequence-based scoring scheme. SIGGEN provides 2 sequence-based scoring schemes that are used to score the input alignment. | 1 (Substitution matrix); 2 (Residue class); 3 (None) | 3 |
-datafile | This option specifies the substitution matrix. The substitution matrix is used by the sequence-based scoring schemes. | Comparison matrix file in EMBOSS data path | EBLOSUM62 |
-sparsity | This option specifies the % sparsity of the signature. The signature sparsity is a user-defined parameter that determines how many residues the final signature will contain. For example, if the average sequence length of the proteins in the alignment is 250 residues, a signature of sparsity 10% (the default) will contain 25 key residues, or signature positions, corresponding to the 25 highest-scoring alignment positions. | Any integer value | 10 |
-wsiz | This option specifies the window size. When a signature is aligned to a protein sequence, the permissible gap between two signature positions is determined by the empirical gaps and the window size. The user is prompted for a single window size that is used for every position in the signature. This is unlikely to be optimal. A future implementation will provide a range of methods for generating window size values depending upon the alignment (the window size is identified by the WSIZ record in the signature output file). | Any integer value | 0 |
-filtercon | This option specifies whether to disregard positions that form only a few contacts during the selection of signature positions. | Toggle value Yes/No | No |
-conthresh | This option specifies the threshold contact number. This controls the selection of key positions for the structure-based scoring scheme (number of contacts). | Any integer value | 10 |
-[no]filterpsim | This option specifies whether to disregard alignment sites that were not aligned satisfactorily (STAMP alignments only). | Boolean value Yes/No | Yes |
[-sigoutdir] (Parameter 2) | This option specifies the location of signature files (output). A 'signature file' contains a sparse sequence signature suitable for use with the SIGSCAN and SIGGEN programs. The files are generated by using SIGGEN & SIGGENLIG. | Output directory | ./ |
Additional (Optional) qualifiers | | | |
(none) | | | |
Advanced (Unprompted) qualifiers | | | |
(none) | | | |
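
The arithmetic behind `-sparsity` can be sketched in Python as follows. This only illustrates the description above and is not SIGGEN's code. The rounding rule and the handling of tied scores are assumptions, the function names `signature_size` and `top_positions` are hypothetical, and the scores in the example are invented.

```python
def signature_size(mean_length, sparsity_percent):
    """Number of signature positions for a given % sparsity (rounding is an assumption)."""
    return round(mean_length * sparsity_percent / 100)


def top_positions(scores, n):
    """Indices of the n highest-scoring alignment positions, in alignment order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n])


# The example from the text: mean length 250 residues, sparsity 10%
print(signature_size(250, 10))  # 25

# Invented per-position scores, keeping the best two positions
print(top_positions([0.1, 0.9, 0.5, 0.7], 2))  # [1, 3]
```

Returning the selected indices in alignment order keeps the positions in the sequence order that the numbered NN records of a signature file suggest.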
% siggen
Generates a sparse protein signature from an alignment
Domain alignment directories [./]: ../domainalign-keep/daf
Specify mode of signature generation
  1 : Use positions specified in alignment file
  2 : Use a scoring method
  3 : Generate a randomised signature
Select number [1]: 2
Residue contacts scoring method
  1 : Number
  2 : Conservation
  3 : Number and conservation
  4 : None (structural data available)
  5 : None (no structural data available)
Select number [5]: 5
Sequence variability scoring method
  1 : Substitution matrix
  2 : Residue class
  3 : None
Select number [3]: 1
Substitution matrix to be used [EBLOSUM62]: EBLOSUM62
The % sparsity of signature [10]: 15
Window size [0]: 0
Ignore alignment positions with post_similar value of 0 [Y]: Y
Domainatrix signature file output directory [./]:
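
The interactive session above can also be written as a single non-interactive command line using the qualifiers from the table. The Python sketch below only builds the argument list and does not run anything. The helper name `siggen_command` is hypothetical, the defaults are copied from the qualifier table, and it is an assumption that SIGGEN accepts all of these qualifiers together in every mode.

```python
import shlex


def siggen_command(algpath, sigoutdir, *, mode=1, conoption=5, seqoption=3,
                   datafile="EBLOSUM62", sparsity=10, wsiz=0, filterpsim=True):
    """Build an argument list for a non-interactive siggen run (sketch only)."""
    return [
        "siggen",
        "-algpath", algpath,
        "-mode", str(mode),
        "-conoption", str(conoption),
        "-seqoption", str(seqoption),
        "-datafile", datafile,
        "-sparsity", str(sparsity),
        "-wsiz", str(wsiz),
        # -[no]filterpsim is a boolean: the "no" prefix switches it off
        "-filterpsim" if filterpsim else "-nofilterpsim",
        "-sigoutdir", sigoutdir,
        "-auto",  # general qualifier: turn off prompts
    ]


# Same answers as the interactive example session
cmd = siggen_command("../domainalign-keep/daf", "./",
                     mode=2, conoption=5, seqoption=1, sparsity=15)
print(shlex.join(cmd))
```

On a machine with EMBOSS installed, the list could be passed to `subprocess.run(cmd, check=True)`. Building a list rather than a shell string avoids quoting problems with directory names.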
FILE TYPE | FORMAT | DESCRIPTION | CREATED BY | SEE ALSO |
---|---|---|---|---|
Clean coordinate file (for domain) | CCF format (EMBL-like). | Protein coordinate and derived data for a single domain from SCOP or CATH. The data are 'cleaned-up': self-consistent and error-corrected. | DOMAINER | Records for residue solvent accessibility and secondary structure are added to the file by using PDBPLUS. |
Contact file (intra-chain residue-residue contacts) | CON format (EMBL-like). | Intra-chain residue-residue contact data for a protein or a domain from SCOP or CATH. | CONTACTS | N.A. |
Domain alignment file | DAF format (CLUSTAL-like). | Sequence alignment of domains belonging to the same SCOP or CATH family (or other node in the structural hierarchies). The file is annotated with domain family classification information. | DOMAINALIGN (structure-based sequence alignment of domains of known structure). | DOMAINALIGN alignments can be extended with sequence relatives (of unknown structure) to the family in question by using SEQALIGN. |
Signature file | SIG format | Contains a sparse sequence signature suitable for use with the SIGSCAN program. | SIGGENLIG, LIBGEN | The files are generated by using SIGGEN. |
Program name | Description |
---|---|
contacts | Generate intra-chain CON files from CCF files |
domainalign | Generate alignments (DAF file) for nodes in a DCF file |
domainrep | Reorder DCF file to identify representative structures |
domainreso | Remove low resolution domains from a DCF file |
interface | Generate inter-chain CON files from CCF files |
libgen | Generate discriminating elements from alignments |
matgen3d | Generate a 3D-1D scoring matrix from CCF files |
psiphi | Calculates phi and psi torsion angles from protein coordinates |
rocon | Generates a hits file from comparing two DHF files |
rocplot | Performs ROC analysis on hits files |
seqalign | Extend alignments (DAF file) with sequences (DHF file) |
seqfraggle | Removes fragment sequences from DHF files |
seqsearch | Generate PSI-BLAST hits (DHF file) from a DAF file |
seqsort | Remove ambiguously classified sequences from DHF files |
seqwords | Generates DHF files from keyword search of UniProt |
siggenlig | Generates ligand-binding signatures from a CON file |
sigscan | Generates hits (DHF file) from a signature search |
sigscanlig | Searches ligand-signature library & writes hits (LHF file) |
See also http://emboss.sourceforge.net/
Automatic generation and evaluation of sparse protein signatures for families of protein structural domains. MJ Blades, JC Ison, R Ranasinghe and JBC Findlay. Protein Science. 2005 (accepted).
A key residues approach to the definition of protein families and analysis of sparse family signatures. JC Ison, AJ Bleasby, MJ Blades, SC Daniel, JH Parish and JBC Findlay. PROTEINS: Structure, Function & Genetics. 2000, 40:330-341.
Alignment of a sparse protein signature with protein sequences: application to fold prediction for three small globulins. SC Daniel, JH Parish, JC Ison, MJ Blades and JBC Findlay. FEBS Letters. 1999, 459:349-352.