SITES documentation


 


CONTENTS

1.0 SUMMARY
2.0 INPUTS & OUTPUTS
3.0 INPUT FILE FORMAT
4.0 OUTPUT FILE FORMAT
5.0 DATA FILES
6.0 USAGE
7.0 KNOWN BUGS & WARNINGS
8.0 NOTES
9.0 DESCRIPTION
10.0 ALGORITHM
11.0 RELATED APPLICATIONS
12.0 DIAGNOSTIC ERROR MESSAGES
13.0 AUTHORS
14.0 REFERENCES



1.0 SUMMARY

Generate residue-ligand CON files from CCF files


2.0 INPUTS & OUTPUTS

SITES reads CCF files (clean coordinate files) and writes a CON file (contacts file) of residue-ligand contact data for domains in a DCF file (domain classification file). The CON file contains contact data for all ligand-domain pairs found in the CCF files, using domain definitions from the DCF file. The input and output files are specified by the user; file extensions are given in the ACD file. A log file is also written.


3.0 INPUT FILE FORMAT

The format of the protein CCF file is described in the PDBPARSE documentation.

Input files for usage example

File: ../scopparse-keep/all.scop

ID   D1CS4A_
XX
EN   1CS4
XX
TY   SCOP
XX
SI   53931 CL; 54861 FO; 55073 SF; 55074 FA; 55077 DO; 55078 SO; 39418 DD;
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Ferredoxin-like
XX
SF   Adenylyl and guanylyl cyclase catalytic domain
XX
FA   Adenylyl and guanylyl cyclase catalytic domain
XX
DO   Adenylyl cyclase VC1, domain C1a
XX
OS   Dog (Canis familiaris)
XX
NC   1
XX
CN   [1]
XX
CH   A CHAIN; . START; . END;
//
ID   D1II7A_
XX
EN   1II7
XX
TY   SCOP
XX
SI   53931 CL; 56299 FO; 56300 SF; 64427 FA; 64428 DO; 64429 SO; 62415 DD;
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Metallo-dependent phosphatases
XX
SF   Metallo-dependent phosphatases
XX
FA   DNA double-strand break repair nuclease
XX
DO   Mre11
XX
OS   Archaeon Pyrococcus furiosus
XX
NC   1
XX
CN   [1]
XX
CH   A CHAIN; . START; . END;
//
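
The DCF excerpt above follows an EMBL-like layout: a two-letter record code, the record text, XX spacer lines, and // closing each entry. A minimal Python sketch of reading such entries is given below. The function name parse_dcf and the dictionary layout are illustrative assumptions and are not part of EMBOSS; repeated record codes are simply joined with a space, which may not suit every record type.

```python
def parse_dcf(text):
    """Split EMBL-like DCF text into one dict per entry (illustrative only)."""
    entries, current = [], {}
    for line in text.splitlines():
        line = line.rstrip()
        if not line or line.startswith("XX"):
            continue  # blank lines and XX spacers carry no data
        if line.startswith("//"):
            if current:
                entries.append(current)
            current = {}
            continue
        code, _, value = line.partition(" ")
        value = value.strip()
        # Assumption: repeated codes are joined; a real parser may need per-record rules.
        current[code] = (current[code] + " " + value) if code in current else value
    if current:
        entries.append(current)
    return entries


# Sample records copied from the first entry of the excerpt above.
sample = """ID   D1CS4A_
XX
EN   1CS4
XX
TY   SCOP
XX
CH   A CHAIN; . START; . END;
//
"""
entries = parse_dcf(sample)
print(entries[0]["ID"], entries[0]["EN"], entries[0]["CH"])
```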




4.0 OUTPUT FILE FORMAT

The CON format used for the contact files (Figure 1) is similar to EMBL format and is described in the CONTACTS documentation. A few of the records differ in the SITES output compared to the CONTACTS output, however, so for the sake of clarity all records are described below.

Output files for usage example

File: SITES.con

XX   Residue-ligand contact data (for domains).
XX
TY   LIGAND
XX
EX   THRESH 1.0; IGNORE .; NMOD .; NCHA .;
XX
NE   11
XX
EN   [1]
XX
ID   PDB 1cs4; DOM d1cs4a_; LIG 101;
XX
DE   2'-DEOXY-ADENOSINE 3'-MONOPHOSPHATE
XX
SI   SN 1; NS 2
XX
CN   MO .; CN1 1; CN2 .; ID1 A; ID2 .; NRES1 52; NRES2 .
XX
S1   SEQUENCE    52 AA;   5817 MW;  D8CCAE0E1FC0849A CRC64;
     ADIEGFTSLA SQCTAQELVM TLNELFARFD KLAAENHCLR IKILGDCYYC VS
XX
NC   SM .; LI 6
XX
LI   ASP 2
LI   PHE 6
LI   THR 7
LI   LEU 44
LI   GLY 45
LI   ASP 46
XX
//
EN   [2]
XX
ID   PDB 1ii7; DOM d1ii7a_; LIG 101;
XX
DE   2'-DEOXY-ADENOSINE 3'-MONOPHOSPHATE
XX
SI   SN 2; NS 2
XX
CN   MO .; CN1 1; CN2 .; ID1 A; ID2 .; NRES1 65; NRES2 .
XX
S1   SEQUENCE    65 AA;   7395 MW;  75FBE75B22FD3678 CRC64;
     MKFAHLADIH LGYEQFHKPQ REEEFAEAFK NALEIAVQEN VDFILIAGDL FHSSRPSPGT
     LKKAI
XX
NC   SM .; LI 2
XX
LI   HIS 10
LI   ASP 49
XX


  [Part of this file has been deleted for brevity]

NC   SM .; LI 3
XX
LI   ASP 8
LI   HIS 10
LI   ASP 49
XX
//
EN   [10]
XX
ID   PDB 2hhb; DOM .; LIG PO4;
XX
DE   PHOSPHATE ION
XX
SI   SN 1; NS 1
XX
CN   MO .; CN1 1; CN2 .; ID1 D; ID2 .; NRES1 146; NRES2 .
XX
S1   SEQUENCE   146 AA;  15867 MW;  EACBC707CFD466A1 CRC64;
     VHLTPEEKSA VTALWGKVNV DEVGGEALGR LLVVYPWTQR FFESFGDLST PDAVMGNPKV
     KAHGKKVLGA FSDGLAHLDN LKGTFATLSE LHCDKLHVDP ENFRLLGNVL VCVLAHHFGK
     EFTPPVQAAY QKVVAGVANA LAHKYH
XX
NC   SM .; LI 2
XX
LI   VAL 1
LI   LEU 81
XX
//
EN   [11]
XX
ID   PDB 1cs4; DOM d1cs4a_; LIG POP;
XX
DE   PYROPHOSPHATE 2-
XX
SI   SN 1; NS 1
XX
CN   MO .; CN1 1; CN2 .; ID1 A; ID2 .; NRES1 52; NRES2 .
XX
S1   SEQUENCE    52 AA;   5817 MW;  D8CCAE0E1FC0849A CRC64;
     ADIEGFTSLA SQCTAQELVM TLNELFARFD KLAAENHCLR IKILGDCYYC VS
XX
NC   SM .; LI 6
XX
LI   ASP 2
LI   ILE 3
LI   GLU 4
LI   GLY 5
LI   PHE 6
LI   THR 7
XX
//

File: sites.log

CCF: /homes/user/test/qa/pdbplus-keep/1cs4.ccf HETS:YES NHETS:7 SCOP:YES NDOMS: 1
CCF: /homes/user/test/qa/pdbplus-keep/1ii7.ccf HETS:YES NHETS:5 SCOP:YES NDOMS: 1
CCF: /homes/user/test/qa/pdbplus-keep/2hhb.ccf HETS:YES NHETS:5 SCOP:NO NCHN:4
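
As a rough illustration of how the SITES.con listing above might be consumed, the Python sketch below collects the ID fields and LI records of each entry. The function name parse_con and the returned structure are assumptions made for this example only; all other records (SI, CN, S1, NC and so on) are ignored, and the sketch is not the parser used by EMBOSS.

```python
def parse_con(text):
    """Collect ligand contacts per entry from CON-format text (illustrative only)."""
    entries, current = [], None
    for line in text.splitlines():
        code, value = line[:2], line[2:].strip()
        if code == "EN":
            current = {"entry": value, "id": {}, "contacts": []}
        elif code == "ID" and current is not None:
            # e.g. "PDB 2hhb; DOM .; LIG PO4;" -> {"PDB": "2hhb", "DOM": ".", "LIG": "PO4"}
            for field in value.split(";"):
                parts = field.split()
                if len(parts) == 2:
                    current["id"][parts[0]] = parts[1]
        elif code == "LI" and current is not None:
            residue, position = value.split()
            current["contacts"].append((residue, int(position)))
        elif code == "//" and current is not None:
            entries.append(current)
            current = None
    return entries


# Records copied from entry [10] of the listing above (sequence omitted).
sample = """EN   [10]
XX
ID   PDB 2hhb; DOM .; LIG PO4;
XX
NC   SM .; LI 2
XX
LI   VAL 1
LI   LEU 81
XX
//
"""
entries = parse_con(sample)
print(entries[0]["id"]["PDB"], entries[0]["id"]["LIG"], entries[0]["contacts"])
```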




5.0 DATA FILES

SITES uses a data file containing van der Waals radii for atoms in proteins (see the CONTACTS documentation). The file Evdw.dat is such a data file and is part of the EMBOSS distribution.

SITES also uses a data file containing a dictionary of heterogen groups in PDB. Such a file may be generated by using HETPARSE. The file Ehet.dat is such a data file and is part of the EMBOSS distribution.


6.0 USAGE

6.1 COMMAND LINE ARGUMENTS

Generate residue-ligand CON files from CCF files.
Version: EMBOSS:6.2.0

   Standard (Mandatory) qualifiers:
  [-protpath]          dirlist    [./] This option specifies the location of
                                  the protein CCF files (clean coordinate
                                  files) (input). A 'clean cordinate file'
                                  contains protein coordinate and derived data
                                  for a single PDB file ('protein clean
                                  coordinate file') or a single domain from
                                  SCOP or CATH ('domain clean coordinate
                                  file'), in CCF format (EMBL-like). The
                                  files, generated by using PDBPARSE (PDB
                                  files) or DOMAINER (domains), contain
                                  'cleaned-up' data that is self-consistent
                                  and error-corrected. Records for residue
                                  solvent accessibility and secondary
                                  structure are added to the file by using
                                  PDBPLUS.
  [-domaindir]         directory  [./] This option specifies the location of
                                  the domain CCF files (clean coordinate
                                  files) (input). A 'clean cordinate file'
                                  contains protein coordinate and derived data
                                  for a single PDB file ('protein clean
                                  coordinate file') or a single domain from
                                  SCOP or CATH ('domain clean coordinate
                                  file'), in CCF format (EMBL-like). The
                                  files, generated by using PDBPARSE (PDB
                                  files) or DOMAINER (domains), contain
                                  'cleaned-up' data that is self-consistent
                                  and error-corrected. Records for residue
                                  solvent accessibility and secondary
                                  structure are added to the file by using
                                  PDBPLUS.
  [-dcffile]           infile     This option specifies the name of the DCF
                                  file (domain classification file) (input). A
                                  'domain classification file' contains
                                  classification and other data for domains
                                  from SCOP or CATH, in DCF format
                                  (EMBL-like). The files are generated by
                                  using SCOPPARSE and CATHPARSE. Domain
                                  sequence information can be added to the
                                  file by using DOMAINSEQS.
   -threshold          float      [1.0] This option specifies the threshold
                                  contact distance. (Any numeric value)
  [-outfile]           outfile    [SITES.con] This option specifies the name
                                  of the output file.
   -logfile            outfile    [sites.log] This option specifies the name
                                  of the log file.

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers:
   -dicfile            datafile   [Ehet.dat] This option specifies the
                                  dictionary of heterogen groups in PDB. This
                                  file is generated by using HETPARSE and is
                                  part of the EMBOSS distribution.
   -vdwfile            datafile   [Evdw.dat] This option specifies the name of
                                  the data file with van der Waals radii for
                                  atoms in amino acid residues. This file is
                                  part of the EMBOSS distribution.

   Associated qualifiers:

   "-outfile" associated qualifiers
   -odirectory4        string     Output directory

   "-logfile" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit

Each qualifier is listed below with its type, description, allowed values and default.

Standard (Mandatory) qualifiers

[-protpath] (Parameter 1)
    Type:           dirlist
    Description:    This option specifies the location of the protein CCF files (clean coordinate files) (input). A 'clean coordinate file' contains protein coordinate and derived data for a single PDB file ('protein clean coordinate file') or a single domain from SCOP or CATH ('domain clean coordinate file'), in CCF format (EMBL-like). The files, generated by using PDBPARSE (PDB files) or DOMAINER (domains), contain 'cleaned-up' data that is self-consistent and error-corrected. Records for residue solvent accessibility and secondary structure are added to the file by using PDBPLUS.
    Allowed values: Directory with files
    Default:        ./

[-domaindir] (Parameter 2)
    Type:           directory
    Description:    This option specifies the location of the domain CCF files (clean coordinate files) (input). A 'clean coordinate file' contains protein coordinate and derived data for a single PDB file ('protein clean coordinate file') or a single domain from SCOP or CATH ('domain clean coordinate file'), in CCF format (EMBL-like). The files, generated by using PDBPARSE (PDB files) or DOMAINER (domains), contain 'cleaned-up' data that is self-consistent and error-corrected. Records for residue solvent accessibility and secondary structure are added to the file by using PDBPLUS.
    Allowed values: Directory
    Default:        ./

[-dcffile] (Parameter 3)
    Type:           infile
    Description:    This option specifies the name of the DCF file (domain classification file) (input). A 'domain classification file' contains classification and other data for domains from SCOP or CATH, in DCF format (EMBL-like). The files are generated by using SCOPPARSE and CATHPARSE. Domain sequence information can be added to the file by using DOMAINSEQS.
    Allowed values: Input file
    Default:        Required

-threshold
    Type:           float
    Description:    This option specifies the threshold contact distance.
    Allowed values: Any numeric value
    Default:        1.0

[-outfile] (Parameter 4)
    Type:           outfile
    Description:    This option specifies the name of the output file.
    Allowed values: Output file
    Default:        SITES.con

-logfile
    Type:           outfile
    Description:    This option specifies the name of the log file.
    Allowed values: Output file
    Default:        sites.log

Additional (Optional) qualifiers

(none)

Advanced (Unprompted) qualifiers

-dicfile
    Type:           datafile
    Description:    This option specifies the dictionary of heterogen groups in PDB. This file is generated by using HETPARSE and is part of the EMBOSS distribution.
    Allowed values: Data file
    Default:        Ehet.dat

-vdwfile
    Type:           datafile
    Description:    This option specifies the name of the data file with van der Waals radii for atoms in amino acid residues. This file is part of the EMBOSS distribution.
    Allowed values: Data file
    Default:        Evdw.dat

Associated qualifiers

"-outfile" associated outfile qualifiers

-odirectory4, -odirectory_outfile
    Type:           string
    Description:    Output directory
    Allowed values: Any string
    Default:        (blank)

"-logfile" associated outfile qualifiers

-odirectory
    Type:           string
    Description:    Output directory
    Allowed values: Any string
    Default:        (blank)

General qualifiers (all of type boolean; allowed values: Boolean value Yes/No)

-auto      Turn off prompts. Default: N
-stdout    Write first file to standard output. Default: N
-filter    Read first file from standard input, write first file to standard output. Default: N
-options   Prompt for standard and additional values. Default: N
-debug     Write debug output to program.dbg. Default: N
-verbose   Report some/full command line options. Default: Y
-help      Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose. Default: N
-warning   Report warnings. Default: Y
-error     Report errors. Default: Y
-fatal     Report fatal errors. Default: Y
-die       Report dying program messages. Default: Y
-version   Report version number and exit. Default: N

6.2 EXAMPLE SESSION

An example of interactive use of SITES is shown below.


% sites 
Generate residue-ligand CON files from CCF files.
Clean protein structure coordinates directories [./]: ../pdbplus-keep
Clean domain coordinates directory [./]: ../domainer-keep
Domain classification file: ../scopparse-keep/all.scop
Threshold contact distance [1.0]: 1
Structure contacts output file [SITES.con]: 
Domainatrix log output file [sites.log]: 

Entries in HetDic 4306
Entries in Dbase 4306
CCF FILE: /homes/user/test/qa/pdbplus-keep/1cs4.ccf (1/3)
CCF FILE: /homes/user/test/qa/pdbplus-keep/1ii7.ccf (2/3)
CCF FILE: /homes/user/test/qa/pdbplus-keep/2hhb.ccf (3/3)
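
The interactive session above can in principle be reproduced without prompts by passing the same values as qualifiers. The Python sketch below only assembles such a command line; it uses the qualifier names from section 6.1, while the helper name build_sites_command is hypothetical and the sketch assumes that an EMBOSS installation providing the sites program is on the PATH if the command is actually run.

```python
import shlex


def build_sites_command(protpath, domaindir, dcffile, threshold=1.0,
                        outfile="SITES.con", logfile="sites.log"):
    """Return the argument list for a non-interactive sites run (illustrative)."""
    return [
        "sites",
        "-protpath", protpath,
        "-domaindir", domaindir,
        "-dcffile", dcffile,
        "-threshold", str(threshold),
        "-outfile", outfile,
        "-logfile", logfile,
        "-auto",  # general qualifier: turn off prompts
    ]


# Values taken from the example session above.
command = build_sites_command("../pdbplus-keep", "../domainer-keep",
                              "../scopparse-keep/all.scop")
print(shlex.join(command))

# To execute (requires EMBOSS with the sites program installed):
# import subprocess
# subprocess.run(command, check=True)
```

Building the arguments as a list rather than a single string avoids shell quoting problems when directory names contain spaces.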





7.0 KNOWN BUGS & WARNINGS

None.


8.0 NOTES

Types of contact
LI records are used for contacts to ligands (as defined above). In CONTACTS and INTERFACE output, SM records are used for contacts between either side-chain or main-chain atoms. In a future implementation, SS will be used for side-chain only contacts, MM will be used for main-chain only contacts, and there will probably be several other forms of contact too.

8.1 GLOSSARY OF FILE TYPES

Each file type is listed below with its format, description, the application that creates it, and related information.

FILE TYPE: Clean coordinate file (for protein)
    FORMAT:      CCF format (EMBL-like).
    DESCRIPTION: Protein coordinate and derived data for a single PDB file. The data are 'cleaned-up': self-consistent and error-corrected.
    CREATED BY:  PDBPARSE
    SEE ALSO:    Records for residue solvent accessibility and secondary structure are added to the file by using PDBPLUS.

FILE TYPE: Clean coordinate file (for domain)
    FORMAT:      CCF format (EMBL-like).
    DESCRIPTION: Protein coordinate and derived data for a single domain from SCOP or CATH. The data are 'cleaned-up': self-consistent and error-corrected.
    CREATED BY:  DOMAINER
    SEE ALSO:    Records for residue solvent accessibility and secondary structure are added to the file by using PDBPLUS.

FILE TYPE: Contact file (intra-chain residue-residue contacts)
    FORMAT:      CON format (EMBL-like).
    DESCRIPTION: Intra-chain residue-residue contact data for a protein or a domain from SCOP or CATH.
    CREATED BY:  CONTACTS
    SEE ALSO:    N.A.

FILE TYPE: Contact file (inter-chain residue-residue contacts)
    FORMAT:      CON format (EMBL-like).
    DESCRIPTION: Inter-chain residue-residue contact data for a protein or a domain from SCOP or CATH.
    CREATED BY:  INTERFACE
    SEE ALSO:    N.A.

FILE TYPE: Contact file (residue-ligand contacts)
    FORMAT:      CON format (EMBL-like).
    DESCRIPTION: Residue-ligand contact data for a protein or a domain from SCOP or CATH.
    CREATED BY:  SITES
    SEE ALSO:    N.A.

FILE TYPE: van der Waals radii
    DESCRIPTION: A file of van der Waals radii for atoms in amino acid residues. Part of the EMBOSS distribution.
    CREATED BY:  N.A.
    SEE ALSO:    N.A.

FILE TYPE: Dictionary of heterogen groups
    DESCRIPTION: A file of the dictionary of heterogen groups in PDB.
    CREATED BY:  HETPARSE
    SEE ALSO:    N.A.



9.0 DESCRIPTION

Knowledge of the physical contacts that amino acid residues make with protein ligands is required for several different analyses. SITES calculates residue-ligand contact data from protein CCF files (clean coordinate files) and organises the data according to domains taken from a DCF file (domain classification file).


10.0 ALGORITHM

A residue and a ligand are defined as being in contact when the van der Waals surface of any atom of the residue comes within the threshold contact distance of the van der Waals surface of any atom of the ligand. The threshold contact distance is a user-defined distance with a default value of 1 Angstrom.
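
The contact criterion above can be sketched in Python as follows. This is an illustration under stated assumptions, not the SITES source code: the function names are hypothetical, the coordinates and radii are placeholder values rather than entries from Evdw.dat, and treating a surface gap exactly equal to the threshold as a contact is a guess.

```python
import math


def atoms_in_contact(xyz1, radius1, xyz2, radius2, threshold=1.0):
    """True if the two van der Waals surfaces lie within `threshold` Angstroms."""
    # Gap between surfaces = distance between centres minus both radii.
    surface_gap = math.dist(xyz1, xyz2) - radius1 - radius2
    return surface_gap <= threshold  # inclusive comparison is an assumption


def groups_in_contact(atoms1, atoms2, threshold=1.0):
    """True if any atom pair across the two groups is in contact.

    Each atom is a ((x, y, z), radius) tuple, all values in Angstroms.
    """
    return any(
        atoms_in_contact(p1, r1, p2, r2, threshold)
        for p1, r1 in atoms1
        for p2, r2 in atoms2
    )


# Placeholder coordinates and radii, not values from Evdw.dat.
residue = [((0.0, 0.0, 0.0), 1.7), ((1.5, 0.0, 0.0), 1.5)]
ligand = [((5.5, 0.0, 0.0), 1.8)]

print(groups_in_contact(residue, ligand))                 # default 1 Angstrom threshold
print(groups_in_contact(residue, ligand, threshold=0.5))  # stricter threshold
```

In this example the closest pair has a surface gap of about 0.7 Angstrom, so the groups count as being in contact at the default threshold but not at 0.5 Angstrom.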


11.0 RELATED APPLICATIONS

See also

Program name Description
aaindexextract Extract amino acid property data from AAINDEX
allversusall Sequence similarity data from all-versus-all comparison
cathparse Generates DCF file from raw CATH files
cutgextract Extract codon usage tables from CUTG database
domainer Generates domain CCF files from protein CCF files
domainnr Removes redundant domains from a DCF file
domainseqs Adds sequence records to a DCF file
domainsse Add secondary structure records to a DCF file
hetparse Converts heterogen group dictionary to EMBL-like format
jaspextract Extract data from JASPAR
pdbparse Parses PDB files and writes protein CCF files
pdbplus Add accessibility & secondary structure to a CCF file
pdbtosp Convert swissprot:PDB codes file to EMBL-like format
printsextract Extract data from PRINTS database for use by pscan
prosextract Processes the PROSITE motif database for use by patmatmotifs
rebaseextract Process the REBASE database for use by restriction enzyme applications
scopparse Generate DCF file from raw SCOP files
seqnr Removes redundancy from DHF files
ssematch Search a DCF file for secondary structure matches
tfextract Process TRANSFAC transcription factor database for use by tfscan



12.0 DIAGNOSTIC ERROR MESSAGES

SITES generates a log file, an excerpt of which is shown below. The file contains one line of diagnostic information for each protein CCF file that was read. In case of difficulty, email Jon Ison (jison@ebi.ac.uk).

Figure 2 Excerpt from a SITES log file
CCF: 000_testdata_new/sites/in/1cs4.ccf	HETS:YES	NHETS:7	SCOP:YES	NDOMS: 1
CCF: 000_testdata_new/sites/in/1ii7.ccf	HETS:YES	NHETS:5	SCOP:YES	NDOMS: 1
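
When many CCF files are processed, it can be convenient to read the log programmatically. The Python sketch below splits one log line into its KEY:value pairs. The function name parse_log_line is hypothetical, the sketch assumes that file paths contain no whitespace, and it makes no claim about the meaning of individual fields, which this document does not define.

```python
import re

# A key, a colon, optional whitespace, then a run of non-whitespace characters.
FIELD = re.compile(r"(\w+):\s*(\S+)")


def parse_log_line(line):
    """Turn one SITES log line into a dict of KEY -> value strings."""
    return dict(FIELD.findall(line))


# Second line of the excerpt above (fields are tab-separated).
line = "CCF: 000_testdata_new/sites/in/1ii7.ccf\tHETS:YES\tNHETS:5\tSCOP:YES\tNDOMS: 1"
record = parse_log_line(line)
print(record["CCF"], record["HETS"], int(record["NHETS"]), int(record["NDOMS"]))
```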



13.0 AUTHORS

Waqas Awan
Jon Ison (jison@ebi.ac.uk)
The European Bioinformatics Institute Wellcome Trust Genome Campus Cambridge CB10 1SD UK


14.0 REFERENCES

Please cite the authors and EMBOSS.

Rice P, Longden I and Bleasby A (2000) "EMBOSS - The European Molecular Biology Open Software Suite" Trends in Genetics, 16(6):276-277.

See also http://emboss.sourceforge.net/

14.1 Other useful references