CONTACTS documentation


 

CONTENTS

1.0 SUMMARY
2.0 INPUTS & OUTPUTS
3.0 INPUT FILE FORMAT
4.0 OUTPUT FILE FORMAT
5.0 DATA FILES
6.0 USAGE
7.0 KNOWN BUGS & WARNINGS
8.0 NOTES
9.0 DESCRIPTION
10.0 ALGORITHM
11.0 RELATED APPLICATIONS
12.0 DIAGNOSTIC ERROR MESSAGES
13.0 AUTHORS
14.0 REFERENCES



1.0 SUMMARY

CONTACTS reads CCF files (clean coordinate files) and writes CON files (contact files) of intra-chain residue-residue contact data.


2.0 INPUTS & OUTPUTS

CONTACTS reads a directory of domain or protein CCF files (clean coordinate files) and writes a CON file (contact file) of intra-chain residue-residue contact data for each file in the input directory. A CON file contains residue contact data for every chain of every model in a protein CCF file or, where a domain CCF file is read, for a single domain. The user is prompted for the paths of the CCF (input) and CON (output) files; the file extensions are specified in the ACD file. Output files are named with the domain identifier code or PDB identifier code as appropriate. A log file is also written.


3.0 INPUT FILE FORMAT

The format of the clean coordinate file is described in the PDBPARSE documentation.


4.0 OUTPUT FILE FORMAT

The CON format used for the contact files (Figure 1) is similar to EMBL format. Its records are illustrated by the example output files listed below.

Output files for usage example

File: 1cs4.con

XX   Intra-chain residue-residue contact data.
XX
TY   INTRA
XX
EX   THRESH 1.0; IGNORE 20.0; NMOD 1; NCHA 1
XX
NE   1
XX
EN   [1]
XX
ID   PDB 1cs4; DOM .; LIG .
XX
CN   MO 1; CN1 1; CN2 .; ID1 A; ID2 .; NRES1 52; NRES2 .
XX
S1   SEQUENCE    52 AA;   5817 MW;  D8CCAE0E1FC0849A CRC64;
     ADIEGFTSLA SQCTAQELVM TLNELFARFD KLAAENHCLR IKILGDCYYC VS
XX
NC   SM 163; LI .
XX
SM   ASP 2 ; ILE 3
SM   ASP 2 ; GLU 4
SM   ASP 2 ; ASP 46
SM   ASP 2 ; CYS 47
SM   ILE 3 ; GLU 4
SM   ILE 3 ; GLY 5
SM   ILE 3 ; PHE 6
SM   ILE 3 ; LEU 9
SM   ILE 3 ; LEU 25
SM   ILE 3 ; ASP 46
SM   GLU 4 ; GLY 5
SM   GLU 4 ; PHE 6
SM   GLY 5 ; PHE 6
SM   GLY 5 ; THR 7
SM   GLY 5 ; SER 8
SM   GLY 5 ; LEU 9
SM   PHE 6 ; THR 7
SM   PHE 6 ; SER 8
SM   PHE 6 ; LEU 9
SM   PHE 6 ; ALA 10
SM   PHE 6 ; LEU 18
SM   PHE 6 ; LEU 22
SM   PHE 6 ; GLY 45
SM   PHE 6 ; ASP 46
SM   THR 7 ; SER 8
SM   THR 7 ; LEU 9
SM   THR 7 ; ALA 10
SM   THR 7 ; SER 11
SM   SER 8 ; LEU 9
SM   SER 8 ; ALA 10
SM   SER 8 ; SER 11


  [Part of this file has been deleted for brevity]

SM   PHE 29 ; LYS 31
SM   PHE 29 ; LEU 32
SM   PHE 29 ; ALA 33
SM   ASP 30 ; LYS 31
SM   ASP 30 ; LEU 32
SM   ASP 30 ; ALA 33
SM   ASP 30 ; ALA 34
SM   ASP 30 ; ARG 40
SM   LYS 31 ; LEU 32
SM   LYS 31 ; ALA 33
SM   LYS 31 ; ALA 34
SM   LYS 31 ; GLU 35
SM   LEU 32 ; ALA 33
SM   LEU 32 ; ALA 34
SM   LEU 32 ; GLU 35
SM   LEU 32 ; ASN 36
SM   ALA 33 ; ALA 34
SM   ALA 33 ; GLU 35
SM   ALA 33 ; ASN 36
SM   ALA 33 ; HIS 37
SM   ALA 33 ; CYS 38
SM   ALA 34 ; GLU 35
SM   ALA 34 ; ASN 36
SM   ALA 34 ; HIS 37
SM   GLU 35 ; ASN 36
SM   GLU 35 ; HIS 37
SM   ASN 36 ; HIS 37
SM   ASN 36 ; CYS 38
SM   HIS 37 ; CYS 38
SM   HIS 37 ; LEU 39
SM   CYS 38 ; LEU 39
SM   CYS 38 ; ARG 40
SM   LEU 39 ; ARG 40
SM   LEU 39 ; ILE 41
SM   ARG 40 ; ILE 41
SM   ARG 40 ; LYS 42
SM   ARG 40 ; ILE 43
SM   ILE 41 ; LYS 42
SM   LYS 42 ; ILE 43
SM   LYS 42 ; LEU 44
SM   LYS 42 ; CYS 47
SM   ILE 43 ; LEU 44
SM   ILE 43 ; GLY 45
SM   ILE 43 ; CYS 47
SM   LEU 44 ; GLY 45
SM   LEU 44 ; ASP 46
SM   LEU 44 ; CYS 47
SM   GLY 45 ; ASP 46
SM   GLY 45 ; CYS 47
SM   ASP 46 ; CYS 47
//

File: 1ii7.con

XX   Intra-chain residue-residue contact data.
XX
TY   INTRA
XX
EX   THRESH 1.0; IGNORE 20.0; NMOD 1; NCHA 1
XX
NE   1
XX
EN   [1]
XX
ID   PDB 1ii7; DOM .; LIG .
XX
CN   MO 1; CN1 1; CN2 .; ID1 A; ID2 .; NRES1 65; NRES2 .
XX
S1   SEQUENCE    65 AA;   7395 MW;  75FBE75B22FD3678 CRC64;
     MKFAHLADIH LGYEQFHKPQ REEEFAEAFK NALEIAVQEN VDFILIAGDL FHSSRPSPGT
     LKKAI
XX
NC   SM 151; LI .
XX
SM   ASP 8 ; ILE 9
SM   ASP 8 ; HIS 10
SM   ASP 8 ; GLY 48
SM   ASP 8 ; ASP 49
SM   ILE 9 ; HIS 10
SM   ILE 9 ; LEU 11
SM   ILE 9 ; PHE 25
SM   ILE 9 ; PHE 29
SM   ILE 9 ; ILE 46
SM   ILE 9 ; ASP 49
SM   ILE 9 ; LEU 50
SM   HIS 10 ; LEU 11
SM   HIS 10 ; GLY 12
SM   HIS 10 ; TYR 13
SM   HIS 10 ; PHE 25
SM   HIS 10 ; ASP 49
SM   HIS 10 ; LEU 50
SM   LEU 11 ; GLY 12
SM   LEU 11 ; TYR 13
SM   LEU 11 ; ALA 26
SM   LEU 11 ; PHE 29
SM   LEU 11 ; LEU 50
SM   GLY 12 ; TYR 13
SM   GLY 12 ; GLU 14
SM   GLY 12 ; GLU 22
SM   TYR 13 ; GLU 14
SM   TYR 13 ; GLN 15
SM   TYR 13 ; GLU 22
SM   TYR 13 ; PHE 25
SM   GLU 14 ; GLN 15


  [Part of this file has been deleted for brevity]

SM   ASN 31 ; ILE 35
SM   ALA 32 ; LEU 33
SM   ALA 32 ; GLU 34
SM   ALA 32 ; ILE 35
SM   ALA 32 ; ALA 36
SM   LEU 33 ; GLU 34
SM   LEU 33 ; ILE 35
SM   LEU 33 ; ALA 36
SM   LEU 33 ; VAL 37
SM   LEU 33 ; ILE 44
SM   GLU 34 ; ILE 35
SM   GLU 34 ; ALA 36
SM   GLU 34 ; VAL 37
SM   GLU 34 ; GLN 38
SM   ILE 35 ; ALA 36
SM   ILE 35 ; VAL 37
SM   ILE 35 ; GLN 38
SM   ILE 35 ; GLU 39
SM   ALA 36 ; VAL 37
SM   ALA 36 ; GLN 38
SM   ALA 36 ; GLU 39
SM   ALA 36 ; ASN 40
SM   ALA 36 ; VAL 41
SM   ALA 36 ; ILE 44
SM   VAL 37 ; GLN 38
SM   VAL 37 ; GLU 39
SM   VAL 37 ; ASN 40
SM   GLN 38 ; GLU 39
SM   GLN 38 ; ASN 40
SM   GLU 39 ; ASN 40
SM   GLU 39 ; VAL 41
SM   ASN 40 ; VAL 41
SM   ASN 40 ; ASP 42
SM   VAL 41 ; ASP 42
SM   VAL 41 ; PHE 43
SM   VAL 41 ; ILE 44
SM   ASP 42 ; PHE 43
SM   PHE 43 ; ILE 44
SM   PHE 43 ; LEU 45
SM   ILE 44 ; LEU 45
SM   ILE 44 ; ILE 46
SM   LEU 45 ; ILE 46
SM   LEU 45 ; ALA 47
SM   ILE 46 ; ALA 47
SM   ILE 46 ; GLY 48
SM   ILE 46 ; LEU 50
SM   ALA 47 ; GLY 48
SM   GLY 48 ; ASP 49
SM   GLY 48 ; LEU 50
SM   ASP 49 ; LEU 50
//

File: 2hhb.con

XX   Intra-chain residue-residue contact data.
XX
TY   INTRA
XX
EX   THRESH 1.0; IGNORE 20.0; NMOD 1; NCHA 4
XX
NE   4
XX
EN   [1]
XX
ID   PDB 2hhb; DOM .; LIG .
XX
CN   MO 1; CN1 1; CN2 .; ID1 A; ID2 .; NRES1 141; NRES2 .
XX
S1   SEQUENCE   141 AA;  15126 MW;  34D13618E62A33C1 CRC64;
     VLSPADKTNV KAAWGKVGAH AGEYGAEALE RMFLSFPTTK TYFPHFDLSH GSAQVKGHGK
     KVADALTNAV AHVDDMPNAL SALSDLHAHK LRVDPVNFKL LSHCLLVTLA AHLPAEFTPA
     VHASLDKFLA SVSTVLTSKY R
XX
NC   SM 643; LI .
XX
SM   VAL 1 ; LEU 2
SM   VAL 1 ; SER 3
SM   VAL 1 ; LYS 127
SM   LEU 2 ; SER 3
SM   LEU 2 ; PRO 4
SM   LEU 2 ; ASP 6
SM   LEU 2 ; LYS 7
SM   LEU 2 ; VAL 73
SM   LEU 2 ; MET 76
SM   LEU 2 ; LYS 127
SM   LEU 2 ; PHE 128
SM   LEU 2 ; SER 131
SM   SER 3 ; PRO 4
SM   SER 3 ; ALA 5
SM   SER 3 ; ASP 6
SM   SER 3 ; LYS 7
SM   SER 3 ; LYS 127
SM   PRO 4 ; ALA 5
SM   PRO 4 ; ASP 6
SM   PRO 4 ; LYS 7
SM   PRO 4 ; THR 8
SM   ALA 5 ; ASP 6
SM   ALA 5 ; LYS 7
SM   ALA 5 ; THR 8
SM   ALA 5 ; ASN 9
SM   ASP 6 ; LYS 7
SM   ASP 6 ; THR 8
SM   ASP 6 ; ASN 9
SM   ASP 6 ; VAL 10


  [Part of this file has been deleted for brevity]

SM   GLN 131 ; LYS 132
SM   GLN 131 ; VAL 133
SM   GLN 131 ; VAL 134
SM   GLN 131 ; ALA 135
SM   LYS 132 ; VAL 133
SM   LYS 132 ; VAL 134
SM   LYS 132 ; ALA 135
SM   LYS 132 ; GLY 136
SM   VAL 133 ; VAL 134
SM   VAL 133 ; ALA 135
SM   VAL 133 ; GLY 136
SM   VAL 133 ; VAL 137
SM   VAL 134 ; ALA 135
SM   VAL 134 ; GLY 136
SM   VAL 134 ; VAL 137
SM   VAL 134 ; ALA 138
SM   ALA 135 ; GLY 136
SM   ALA 135 ; VAL 137
SM   ALA 135 ; ALA 138
SM   ALA 135 ; ASN 139
SM   GLY 136 ; VAL 137
SM   GLY 136 ; ALA 138
SM   GLY 136 ; ASN 139
SM   GLY 136 ; ALA 140
SM   VAL 137 ; ALA 138
SM   VAL 137 ; ASN 139
SM   VAL 137 ; ALA 140
SM   VAL 137 ; LEU 141
SM   ALA 138 ; ASN 139
SM   ALA 138 ; ALA 140
SM   ALA 138 ; LEU 141
SM   ALA 138 ; ALA 142
SM   ASN 139 ; ALA 140
SM   ASN 139 ; LEU 141
SM   ASN 139 ; ALA 142
SM   ASN 139 ; HIS 143
SM   ALA 140 ; LEU 141
SM   ALA 140 ; ALA 142
SM   ALA 140 ; HIS 143
SM   LEU 141 ; ALA 142
SM   LEU 141 ; HIS 143
SM   LEU 141 ; TYR 145
SM   ALA 142 ; HIS 143
SM   ALA 142 ; LYS 144
SM   ALA 142 ; TYR 145
SM   HIS 143 ; LYS 144
SM   HIS 143 ; TYR 145
SM   LYS 144 ; TYR 145
SM   LYS 144 ; HIS 146
SM   TYR 145 ; HIS 146
//

File: contacts.log

1cs4
1ii7
2hhb
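
The SM records in the CON files listed in this section could be read with a short script. The following is a minimal Python sketch, assuming that each SM line has the form "SM   RES n ; RES n" as in the examples; the names SM_PATTERN and parse_sm_records are hypothetical and are not part of EMBOSS.

```python
import re

# Matches lines such as "SM   ASP 2 ; ILE 3" (layout assumed from the examples).
SM_PATTERN = re.compile(r"^SM\s+(\w{3})\s+(\d+)\s*;\s*(\w{3})\s+(\d+)\s*$")


def parse_sm_records(text):
    """Return a list of ((res1, pos1), (res2, pos2)) tuples from SM lines."""
    contacts = []
    for line in text.splitlines():
        match = SM_PATTERN.match(line)
        if match:
            res1, pos1, res2, pos2 = match.groups()
            contacts.append(((res1, int(pos1)), (res2, int(pos2))))
    return contacts


# A few lines copied from the 1cs4.con listing.
example = """\
NC   SM 163; LI .
XX
SM   ASP 2 ; ILE 3
SM   ASP 2 ; GLU 4
SM   ASP 2 ; ASP 46
//
"""

contacts = parse_sm_records(example)
for first, second in contacts:
    print(first, second)
```

The NC line also contains the text "SM" but does not start with the SM record code, so it is not treated as a contact.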




5.0 DATA FILES

CONTACTS uses a data file containing van der Waals radii for atoms in proteins. The file Evdw.dat is such a data file and is part of the EMBOSS distribution.


6.0 USAGE

   Standard (Mandatory) qualifiers:
  [-cpdbdir]           dirlist    [./] This option specifies the location of
                                  CCF files (clean coordinate files) (input).
                                  A 'clean coordinate file' contains protein
                                  coordinate and derived data for a single PDB
                                  file ('protein clean coordinate file') or a
                                  single domain from SCOP or CATH ('domain
                                  clean coordinate file'), in CCF format
                                  (EMBL-like). The files, generated by using
                                  PDBPARSE (PDB files) or DOMAINER (domains),
                                  contain 'cleaned-up' data that is
                                  self-consistent and error-corrected. Records
                                  for residue solvent accessibility and
                                  secondary structure are added to the file by
                                  using PDBPLUS.
   -vdwfile            datafile   [Evdw.dat] This option specifies the name of
                                  the data file with van der Waals radii of
                                  atoms for different amino acid residues.
   -threshold          float      [1.0] Contact between two residues is
                                  defined as when the van der Waals surface of
                                  any atom of the first residue comes within
                                  the threshold contact distance of the van
                                  der Waals surface of any atom of the second
                                  residue. The threshold contact distance is a
                                  user-defined distance with a default value
                                  of 1 Angstrom. (Any numeric value)
  [-conoutdir]         outdir     [./] This option specifies the location of
                                  CON files (contact files) (output). A
                                  'contact file' contains contact data for a
                                  protein or a domain from SCOP or CATH, in
                                  the CON format (EMBL-like). The contacts may
                                  be intra-chain residue-residue, inter-chain
                                  residue-residue or residue-ligand. The
                                  files are generated by using CONTACTS,
                                  INTERFACE and SITES.
   -conlogfile         outfile    [contacts.log] The log file contains
                                  messages about any errors arising while
                                  contacts ran.

   Additional (Optional) qualifiers:
   -[no]ccfnaming      boolean    [Y] This option specifies whether to use
                                  pdbid code to name the output files. If set,
                                  the PDB identifier code (from the PDB file)
                                  is used to name the file. Otherwise, the
                                  output files have the same names as the
                                  input files.
   -skip               boolean    [N] Whether to calculate contacts between
                                  residues adjacent in sequence.
   -ignore             float      [20.0] If any two atoms from two different
                                  residues are at least this distance apart
                                  then no further inter-atomic contacts will
                                  be checked for that residue pair. This
                                  speeds the calculation up considerably. (Any
                                  numeric value)

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-conlogfile" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

6.1 COMMAND LINE ARGUMENTS

Standard (Mandatory) qualifiers

[-cpdbdir] (Parameter 1)
  Description:    This option specifies the location of CCF files (clean coordinate files) (input). A 'clean coordinate file' contains protein coordinate and derived data for a single PDB file ('protein clean coordinate file') or a single domain from SCOP or CATH ('domain clean coordinate file'), in CCF format (EMBL-like). The files, generated by using PDBPARSE (PDB files) or DOMAINER (domains), contain 'cleaned-up' data that is self-consistent and error-corrected. Records for residue solvent accessibility and secondary structure are added to the file by using PDBPLUS.
  Allowed values: Directory with files
  Default:        ./

-vdwfile
  Description:    This option specifies the name of the data file with van der Waals radii of atoms for different amino acid residues.
  Allowed values: Data file
  Default:        Evdw.dat

-threshold
  Description:    Contact between two residues is defined as when the van der Waals surface of any atom of the first residue comes within the threshold contact distance of the van der Waals surface of any atom of the second residue. The threshold contact distance is a user-defined distance with a default value of 1 Angstrom.
  Allowed values: Any numeric value
  Default:        1.0

[-conoutdir] (Parameter 2)
  Description:    This option specifies the location of CON files (contact files) (output). A 'contact file' contains contact data for a protein or a domain from SCOP or CATH, in the CON format (EMBL-like). The contacts may be intra-chain residue-residue, inter-chain residue-residue or residue-ligand. The files are generated by using CONTACTS, INTERFACE and SITES.
  Allowed values: Output directory
  Default:        ./

-conlogfile
  Description:    The log file contains messages about any errors arising while contacts ran.
  Allowed values: Output file
  Default:        contacts.log

Additional (Optional) qualifiers

-[no]ccfnaming
  Description:    This option specifies whether to use pdbid code to name the output files. If set, the PDB identifier code (from the PDB file) is used to name the file. Otherwise, the output files have the same names as the input files.
  Allowed values: Boolean value Yes/No
  Default:        Yes

-skip
  Description:    Whether to calculate contacts between residues adjacent in sequence.
  Allowed values: Boolean value Yes/No
  Default:        No

-ignore
  Description:    If any two atoms from two different residues are at least this distance apart then no further inter-atomic contacts will be checked for that residue pair. This speeds the calculation up considerably.
  Allowed values: Any numeric value
  Default:        20.0

Advanced (Unprompted) qualifiers

(none)

6.2 EXAMPLE SESSION

An example of interactive use of CONTACTS is shown below.


% contacts 
Generate intra-chain CON files from CCF files.
Clean protein structure coordinates directories [./]: ../pdbplus-keep/
Van der waals radii data file [Evdw.dat]: 
Threshold contact distance [1.0]: 1
Structure contacts file output directory [./]: 
Domainatrix log output file [contacts.log]: 

1cs4
1ii7
2hhb

The output files for this example are shown in section 4.0.
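
The same run could be started without prompts by giving the qualifiers from section 6.0 on the command line. The following Python sketch only assembles and prints such a command; the function name build_contacts_command is hypothetical, and the directory names are those of the session above. It assumes that the contacts program is installed and on the search path if the command is later executed.

```python
import shlex


def build_contacts_command(ccf_dir, con_dir, threshold=1.0, ignore=20.0,
                           logfile="contacts.log"):
    """Assemble an argument list using qualifiers documented in section 6.0."""
    return [
        "contacts",
        "-cpdbdir", ccf_dir,          # input directory of CCF files
        "-conoutdir", con_dir,        # output directory for CON files
        "-threshold", str(threshold),
        "-ignore", str(ignore),
        "-conlogfile", logfile,
        "-auto",                      # general qualifier: turn off prompts
    ]


command = build_contacts_command("../pdbplus-keep/", "./")
print(shlex.join(command))
```

The list form can be passed directly to subprocess.run(command, check=True), which avoids shell quoting problems with directory names.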




7.0 KNOWN BUGS & WARNINGS

None.


8.0 NOTES

Types of contact
SM records are used for contacts between either side-chain or main-chain atoms, as defined in section 10.0. In a future implementation, SS will be used for side-chain only contacts, MM for main-chain only contacts and LI for contacts to ligands; several other forms of contact will probably be added too.

Threshold ignore distance option
The threshold ignore distance can be adjusted to speed up the calculation of contacts considerably. If any two atoms from two different residues are at least this distance apart then no further inter-atomic contacts will be checked for that residue pair.

Skip option
Two residues that are adjacent in the sequence will always be in contact through their peptide bond, so it is not normally necessary to calculate such contacts directly. Furthermore, some analyses require such contacts to be omitted. The skip option determines whether contacts between residues adjacent in sequence are calculated and written to the output file.
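
Where a CON file already contains contacts between sequence-adjacent residues, an analysis that needs them omitted could filter them afterwards. The following Python sketch illustrates that idea only; it is not how the skip option is implemented. It assumes contacts are held as ((residue, position), (residue, position)) tuples, and the name drop_adjacent is hypothetical.

```python
def drop_adjacent(contacts):
    """Keep only contacts whose residue positions differ by more than one."""
    return [
        (first, second)
        for first, second in contacts
        if abs(first[1] - second[1]) > 1
    ]


# Three contacts taken from the 1cs4.con listing.
all_contacts = [
    (("ASP", 2), ("ILE", 3)),   # adjacent in sequence
    (("ASP", 2), ("GLU", 4)),
    (("ASP", 2), ("ASP", 46)),
]

non_adjacent = drop_adjacent(all_contacts)
print(non_adjacent)
```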

8.1 GLOSSARY OF FILE TYPES

FILE TYPE:   Clean coordinate file (for protein)
FORMAT:      CCF format (EMBL-like).
DESCRIPTION: Protein coordinate and derived data for a single PDB file. The data are 'cleaned-up': self-consistent and error-corrected.
CREATED BY:  PDBPARSE
SEE ALSO:    Records for residue solvent accessibility and secondary structure are added to the file by using PDBPLUS.

FILE TYPE:   Clean coordinate file (for domain)
FORMAT:      CCF format (EMBL-like).
DESCRIPTION: Protein coordinate and derived data for a single domain from SCOP or CATH. The data are 'cleaned-up': self-consistent and error-corrected.
CREATED BY:  DOMAINER
SEE ALSO:    Records for residue solvent accessibility and secondary structure are added to the file by using PDBPLUS.

FILE TYPE:   Contact file (intra-chain residue-residue contacts)
FORMAT:      CON format (EMBL-like).
DESCRIPTION: Intra-chain residue-residue contact data for a protein or a domain from SCOP or CATH.
CREATED BY:  CONTACTS
SEE ALSO:    N.A.

FILE TYPE:   Contact file (inter-chain residue-residue contacts)
FORMAT:      CON format (EMBL-like).
DESCRIPTION: Inter-chain residue-residue contact data for a protein or a domain from SCOP or CATH.
CREATED BY:  INTERFACE
SEE ALSO:    N.A.

FILE TYPE:   Contact file (residue-ligand contacts)
FORMAT:      CON format (EMBL-like).
DESCRIPTION: Residue-ligand contact data for a protein or a domain from SCOP or CATH.
CREATED BY:  SITES
SEE ALSO:    N.A.

FILE TYPE:   van der Waals radii
FORMAT:      A file of van der Waals radii for atoms in amino acid residues.
DESCRIPTION: Part of the EMBOSS distribution.
CREATED BY:  N.A.
SEE ALSO:    N.A.


9.0 DESCRIPTION

Knowledge of the physical contacts that amino acid residues within a protein or domain make with one another is required for several different analyses. CONTACTS calculates intra-chain residue-residue contact data from protein and domain CCF files (clean coordinate files).


10.0 ALGORITHM

Contact between two residues is defined when the van der Waals surface of any atom of the first residue comes within the threshold contact distance of the van der Waals surface of any atom of the second residue. The threshold contact distance is a user-defined distance with a default value of 1 Angstrom.
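
The definition above can be sketched in Python as follows. This is an illustration under stated assumptions, not the CONTACTS source: atoms are given as (element, coordinates) pairs, the ignore distance is taken to be measured between atom centres, and the radii are illustrative values keyed by element rather than the per-residue values in Evdw.dat. The names RADII and residues_in_contact are hypothetical.

```python
import math

# Hypothetical radii in Angstroms, for illustration only (not from Evdw.dat).
RADII = {"C": 1.7, "N": 1.55, "O": 1.52, "S": 1.8}


def residues_in_contact(atoms1, atoms2, threshold=1.0, ignore=20.0):
    """Return True if any pair of atom surfaces lies within the threshold.

    atoms1 and atoms2 are lists of (element, (x, y, z)) tuples.
    """
    for elem1, xyz1 in atoms1:
        for elem2, xyz2 in atoms2:
            centre_distance = math.dist(xyz1, xyz2)
            if centre_distance >= ignore:
                # Assumed reading of the ignore option: stop checking
                # this residue pair as soon as one atom pair is this far apart.
                return False
            surface_gap = centre_distance - RADII[elem1] - RADII[elem2]
            if surface_gap <= threshold:
                return True
    return False


residue_a = [("C", (0.0, 0.0, 0.0))]
residue_b = [("O", (4.0, 0.0, 0.0))]    # surface gap 0.78 Angstrom
residue_c = [("O", (6.0, 0.0, 0.0))]    # surface gap 2.78 Angstrom
residue_d = [("C", (25.0, 0.0, 0.0))]   # beyond the ignore distance

print(residues_in_contact(residue_a, residue_b))
print(residues_in_contact(residue_a, residue_c))
print(residues_in_contact(residue_a, residue_d))
```

Subtracting both radii from the centre-to-centre distance gives the gap between the two van der Waals surfaces, which is the quantity compared with the threshold contact distance.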


11.0 RELATED APPLICATIONS

See also

Program name  Description
domainalign   Generate alignments (DAF file) for nodes in a DCF file
domainrep     Reorder DCF file to identify representative structures
domainreso    Remove low resolution domains from a DCF file
interface     Generate inter-chain CON files from CCF files
libgen        Generate discriminating elements from alignments
matgen3d      Generate a 3D-1D scoring matrix from CCF files
psiphi        Calculates phi and psi torsion angles from protein coordinates
rocon         Generates a hits file from comparing two DHF files
rocplot       Performs ROC analysis on hits files
seqalign      Extend alignments (DAF file) with sequences (DHF file)
seqfraggle    Removes fragment sequences from DHF files
seqsearch     Generate PSI-BLAST hits (DHF file) from a DAF file
seqsort       Remove ambiguous classified sequences from DHF files
seqwords      Generates DHF files from keyword search of UniProt
siggen        Generates a sparse protein signature from an alignment
siggenlig     Generates ligand-binding signatures from a CON file
sigscan       Generates hits (DHF file) from a signature search
sigscanlig    Searches ligand-signature library & writes hits (LHF file)



12.0 DIAGNOSTIC ERROR MESSAGES

CONTACTS generates a log file, an excerpt of which is shown in Figure 2. If there is a problem in processing a CCF file, three lines are written, containing respectively the record '//', the domain or PDB identifier code, and an error message. The text 'WARN file open error filename', 'ERROR file read error filename' or 'ERROR file write error filename' is reported when an error is encountered during a file open, read or write respectively. Various other error messages may also be given (in case of difficulty, email jison@ebi.ac.uk).
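
A log file of this kind could be scanned for problems with a short script. The following Python sketch assumes the three-line layout described above ('//', then the identifier code, then the error message) and that every other non-blank line is the identifier of a file processed without error. The sample log text, including the file name in its error message, is invented for illustration, and the name parse_log is hypothetical.

```python
def parse_log(text):
    """Split log text into processed identifiers and (identifier, message) problems."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    processed, problems = [], []
    index = 0
    while index < len(lines):
        if lines[index] == "//" and index + 2 < len(lines):
            # '//' is followed by the identifier code and the error message.
            problems.append((lines[index + 1], lines[index + 2]))
            index += 3
        else:
            processed.append(lines[index])
            index += 1
    return processed, problems


# Invented example: 1ii7 fails with a read error, the other two succeed.
sample_log = """\
1cs4
//
1ii7
ERROR file read error 1ii7.ccf
2hhb
"""

processed, problems = parse_log(sample_log)
print(processed)
print(problems)
```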


13.0 AUTHORS

Jon Ison (jison@ebi.ac.uk)
The European Bioinformatics Institute Wellcome Trust Genome Campus Cambridge CB10 1SD UK


14.0 REFERENCES

Please cite the authors and EMBOSS.

Rice P, Longden I and Bleasby A (2000) "EMBOSS - The European Molecular Biology Open Software Suite" Trends in Genetics, 16(6):276-277.

See also http://emboss.sourceforge.net/
