HETPARSE documentation


 

CONTENTS

1.0 SUMMARY
2.0 INPUTS & OUTPUTS
3.0 INPUT FILE FORMAT
4.0 OUTPUT FILE FORMAT
5.0 DATA FILES
6.0 USAGE
7.0 KNOWN BUGS & WARNINGS
8.0 NOTES
9.0 DESCRIPTION
10.0 ALGORITHM
11.0 RELATED APPLICATIONS
12.0 DIAGNOSTIC ERROR MESSAGES
13.0 AUTHORS
14.0 REFERENCES



1.0 SUMMARY

Converts the raw dictionary of heterogen groups to EMBL-like format.


2.0 INPUTS & OUTPUTS

HETPARSE parses the dictionary of heterogen groups available at http://pdb.rutgers.edu/het_dictionary.txt and writes a file containing the group names, synonyms and 3-letter codes in EMBL-like format. Optionally, HETPARSE searches a directory of PDB files and counts the number of files in which each heterogen appears. The path and extension of the PDB files and the names of the input and output files are user-specified (the file extension is set in the ACD file).
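
The optional counting step can be pictured with a short Python sketch. It is an illustration only, not HETPARSE's own code: the function name, the ".pdb" default extension and the rule of reading the heterogen code from columns 8-10 of PDB HET records are all assumptions made for this example.

```python
from collections import Counter
from pathlib import Path


def count_files_per_heterogen(directory, extension=".pdb"):
    """Count, for each heterogen code, how many files mention it.

    Assumption: a file "contains" a heterogen if it has a PDB HET
    record whose columns 8-10 hold that heterogen's code.
    """
    counts = Counter()
    for path in sorted(Path(directory).glob(f"*{extension}")):
        seen = set()  # codes found in this file; each counts once per file
        with path.open() as handle:
            for line in handle:
                if line.startswith("HET   "):
                    seen.add(line[7:10].strip())
        counts.update(seen)
    return counts
```

Each code is counted at most once per file, so the result is a number of files rather than a number of HET records.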


3.0 INPUT FILE FORMAT

An excerpt from the raw dictionary of heterogen groups is shown below (Figure 1).

Input files for usage example

File: het.txt

RESIDUE   061     58
CONECT      N1     2 N2   C5  
CONECT      N2     2 N1   N3  
CONECT      N3     2 N2   N4  
CONECT      N4     3 N3   C5   HN4 
CONECT      C5     3 N1   N4   C6  
CONECT      C6     3 C5   C7   C11 
CONECT      C7     3 C6   C8   C12 
CONECT      C8     3 C7   C9   H8  
CONECT      C9     3 C8   C10  H9  
CONECT      C10    3 C9   C11  H10 
CONECT      C11    3 C6   C10  H11 
CONECT      C12    3 C7   C13  C17 
CONECT      C13    3 C12  C14  H13 
CONECT      C14    3 C13  C15  H14 
CONECT      C15    3 C14  C16  C18 
CONECT      C16    3 C15  C17  H16 
CONECT      C17    3 C12  C16  H17 
CONECT      C18    4 C15  N19 1H18 2H18 
CONECT      N19    3 C18  C20  C33 
CONECT      C20    3 N19  C21  N25 
CONECT      C21    4 C20  C22 1H21 2H21 
CONECT      C22    4 C21  C23 1H22 2H22 
CONECT      C23    4 C22  C24 1H23 2H23 
CONECT      C24    4 C23 1H24 2H24 3H24 
CONECT      N25    2 C20  C26 
CONECT      C26    3 N25  C27  C32 
CONECT      C27    3 C26  C28  H27 
CONECT      C28    3 C27  C29  H28 
CONECT      C29    3 C28  O30  C31 
CONECT      O30    2 C29  HOU 
CONECT      C31    3 C29  C32  H31 
CONECT      C32    3 C26  C31  C33 
CONECT      C33    3 N19  C32  O34 
CONECT      O34    1 C33 
CONECT      HN4    1 N4  
CONECT      H8     1 C8  
CONECT      H9     1 C9  
CONECT      H10    1 C10 
CONECT      H11    1 C11 
CONECT      H13    1 C13 
CONECT      H14    1 C14 
CONECT      H16    1 C16 
CONECT      H17    1 C17 
CONECT     1H18    1 C18 
CONECT     2H18    1 C18 
CONECT     1H21    1 C21 
CONECT     2H21    1 C21 
CONECT     1H22    1 C22 
CONECT     2H22    1 C22 


  [Part of this file has been deleted for brevity]

CONECT     2H6     1 C6  
CONECT     1H8     1 C8  
CONECT     2H8     1 C8  
CONECT     1H9     1 C9  
CONECT     2H9     1 C9  
END
HET    104             28
HETSYN     104 TRIENTINE
HETNAM     104 N,N'-BIS(2-AMINOETHYL)-1,2-ETHANEDIAMINE
FORMUL      104    C6 H18 N4

RESIDUE   105     32
CONECT      B      3 O1   O2   C3  
CONECT      O1     2 B    H1  
CONECT      O2     2 B    H2  
CONECT      C3     4 B    N4  1H3  2H3  
CONECT      N4     3 C3   C5   H4  
CONECT      C5     3 N4   O6   C7  
CONECT      O6     1 C5  
CONECT      C7     3 C5   C8   C12 
CONECT      N11    2 O10  C12 
CONECT      O10    2 N11  C8  
CONECT      C8     3 C7   O10  C9  
CONECT      C12    3 C7   N11  C13 
CONECT      C9     4 C8  1H9  2H9  3H9  
CONECT      C13    3 C12  C14  C18 
CONECT      C14    3 C13  C15 CL1  
CONECT     CL1     1 C14 
CONECT      C15    3 C14  C16  H15 
CONECT      C16    3 C15  C17  H16 
CONECT      C17    3 C16  C18  H17 
CONECT      C18    3 C13  C17  H18 
CONECT      H1     1 O1  
CONECT      H2     1 O2  
CONECT     1H3     1 C3  
CONECT     2H3     1 C3  
CONECT      H4     1 N4  
CONECT     1H9     1 C9  
CONECT     2H9     1 C9  
CONECT     3H9     1 C9  
CONECT      H15    1 C15 
CONECT      H16    1 C16 
CONECT      H17    1 C17 
CONECT      H18    1 C18 
END
HET    105             32
HETSYN     105 CLOXACILLIN DERIVATIVE
HETNAM     105 N-[5-METHYL-3-O-TOLYL-ISOXAZOLE-4-CARBOXYLIC ACID
HETNAM   2 105 AMIDE] BORONIC ACID
FORMUL      105    C12 H12 N2 O4 B1 CL1
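
As a rough illustration of how the HETNAM and HETSYN records in such an excerpt might be read, the following Python sketch collects the name and synonym text for each heterogen code. It is not HETPARSE's implementation. The function name is hypothetical, and the column positions (code in columns 12-14, text from column 16) are an assumption based on the layout of the excerpt. Continuation lines are appended without a separating space, which matches the "ACIDAMIDE]" seen in the output for entry 105.

```python
def parse_het_dictionary(lines):
    """Collect name and synonym text per heterogen code.

    Assumed fixed columns (1-based): record name 1-6, heterogen code
    12-14, free text from 16. Other records (RESIDUE, CONECT, HET,
    FORMUL, END) are ignored.
    """
    entries = {}
    for line in lines:
        tag = line[:6].strip()
        if tag not in ("HETNAM", "HETSYN"):
            continue
        code = line[11:14].strip()
        text = line[15:].strip()
        entry = entries.setdefault(code, {"name": "", "synonym": ""})
        key = "name" if tag == "HETNAM" else "synonym"
        # Continuation lines are concatenated directly (assumption,
        # chosen to mirror the sample output).
        entry[key] += text
    return entries
```

Applied to the records for entry 105 above, this would give the name "N-[5-METHYL-3-O-TOLYL-ISOXAZOLE-4-CARBOXYLIC ACIDAMIDE] BORONIC ACID" and the synonym "CLOXACILLIN DERIVATIVE".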




4.0 OUTPUT FILE FORMAT

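The records used in the output file (Figure 2) are as follows:

ID   3-letter code of the heterogen group.
DE   Name of the group.
SY   Synonym of the group ('.' where the dictionary gives none).
NN   Number of PDB files in which the heterogen appears (see -dogrep).
//   End of entry.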

Output files for usage example

File: Ehet.dat

ID   105
DE   N-[5-METHYL-3-O-TOLYL-ISOXAZOLE-4-CARBOXYLIC ACIDAMIDE] BORONIC ACID
SY   CLOXACILLIN DERIVATIVE
NN   0
//
ID   104
DE   N,N'-BIS(2-AMINOETHYL)-1,2-ETHANEDIAMINE
SY   TRIENTINE
NN   0
//
ID   103
DE   2',5'-DIDEOXY-ADENOSINE 3'-MONOPHOSPHATE
SY   .
NN   0
//
ID   102
DE   GAMMA-DEOXY-GAMMA-SULFO-GUANOSINE-5'-TRIPHOSPHATE
SY   .
NN   0
//
ID   101
DE   2'-DEOXY-ADENOSINE 3'-MONOPHOSPHATE
SY   .
NN   0
//
ID   100
DE   1-(5-CHLOROINDOL-3-YL)-3-HYDROXY-3-(2H-TETRAZOL-5-YL)-PROPENONE
SY   .
NN   0
//
ID   074
DE   [PROPYLAMINO-3-HYDROXY-BUTAN-1,4-DIONYL]-ISOLEUCYL-PROLINE
SY   CA-074;
SY   [N-(L-3-TRANS-PROPYLCARBAMOYL-OXIRANE-2-CARBONYL)-L-ISOLEUCYL-L-PROLINE]
NN   0
//
ID   072
DE   
DE   (+/-)(2S,5S)-3-(4-(4-CARBOXYPHENYL)BUTYL)-2-HEPTYL-4-OXO-5-THIAZOLIDINE
SY   THIAZOLIDINONE; GW0072
NN   0
//
ID   061
DE   
DE   2-BUTYL-6-HYDROXY-3-[2'-(1H-TETRAZOL-5-YL)-BIPHENYL-4-YLMETHYL]-3H-QUINAZOLIN-4-ONE
SY   L-159,061
NN   0
//
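
For readers who want to generate or check records of this shape, here is a minimal Python sketch of a formatter. The function name and arguments are hypothetical; the layout (a two-letter tag, three spaces, then the value, with '.' standing in for a missing synonym and '//' closing the entry) is simply copied from the example output above.

```python
def format_entry(code, name, synonyms=(), count=0):
    """Return one EMBL-like entry as text, in the layout shown above."""
    lines = [f"ID   {code}", f"DE   {name}"]
    # A lone '.' is written when no synonym is available.
    for synonym in (synonyms or ["."]):
        lines.append(f"SY   {synonym}")
    lines.append(f"NN   {count}")
    lines.append("//")
    return "\n".join(lines) + "\n"
```

Calling format_entry("104", "N,N'-BIS(2-AMINOETHYL)-1,2-ETHANEDIAMINE", ["TRIENTINE"]) would reproduce the entry for 104 shown above.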




5.0 DATA FILES

HETPARSE does not use a data file.


6.0 USAGE

   Standard (Mandatory) qualifiers (* if not always prompted):
  [-infile]            infile     This option specifies the name of input file
                                  (raw dictionary of heterogen groups) to
                                  parse, which should be of the format
                                  specified at
                                  http://pdb.rutgers.edu/het_dictionary.txt
   -dogrep             toggle     [N] This option specifies whether to search
                                  a directory of files (typically PDB files)
                                  with keywords. If set, HETPARSE will search
                                  the directory and will count the number of
                                  files that each heterogen appears in.
*  -dirlistpath        dirlist    [./] This option specifies the directory to
                                  search with keywords.
  [-outfile]           outfile    [Ehet.dat] This option specifies the name of
                                  EMBL-like format dictionary of heterogen
                                  groups.

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

6.1 COMMAND LINE ARGUMENTS

Standard (Mandatory) qualifiers:

  [-infile] (Parameter 1)
      Description:     This option specifies the name of input file (raw dictionary of heterogen groups) to parse, which should be of the format specified at http://pdb.rutgers.edu/het_dictionary.txt
      Allowed values:  Input file
      Default:         Required

  -dogrep
      Description:     This option specifies whether to search a directory of files (typically PDB files) with keywords. If set, HETPARSE will search the directory and will count the number of files that each heterogen appears in.
      Allowed values:  Toggle value Yes/No
      Default:         No

  -dirlistpath
      Description:     This option specifies the directory to search with keywords.
      Allowed values:  Directory with files
      Default:         ./

  [-outfile] (Parameter 2)
      Description:     This option specifies the name of EMBL-like format dictionary of heterogen groups.
      Allowed values:  Output file
      Default:         Ehet.dat

Additional (Optional) qualifiers: (none)
Advanced (Unprompted) qualifiers: (none)

6.2 EXAMPLE SESSION

An example of interactive use of HETPARSE is shown below.


% hetparse 
Converts heterogen group dictionary to EMBL-like format.
Raw dictionary of heterogen groups file: het.txt
Search a directory of PDB files with keywords? [N]: Y
Pdb entry directories [./]: 
Dictionary of heterogen groups output file [Ehet.dat]: Ehet.dat
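
The same run could in principle be scripted without prompts. The Python sketch below only assembles a command line from the qualifiers listed in section 6.0 (-infile, -outfile, -dogrep, -dirlistpath, -auto); the helper names are hypothetical, and passing -dogrep on its own to switch the toggle on is an assumption about the usual EMBOSS command-line convention that should be checked against the installed version.

```python
import subprocess


def build_hetparse_command(infile, outfile="Ehet.dat", pdb_dir=None):
    """Assemble a non-interactive hetparse command line as a list."""
    cmd = ["hetparse", "-infile", infile, "-outfile", outfile, "-auto"]
    if pdb_dir is not None:
        # Assumption: a bare -dogrep turns the toggle on.
        cmd += ["-dogrep", "-dirlistpath", pdb_dir]
    return cmd


def run_hetparse(infile, outfile="Ehet.dat", pdb_dir=None):
    """Run hetparse; requires an EMBOSS installation on the PATH."""
    return subprocess.run(build_hetparse_command(infile, outfile, pdb_dir), check=True)
```

For the session above, build_hetparse_command("het.txt", pdb_dir="./") yields the arguments corresponding to the answers typed at the prompts.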





7.0 KNOWN BUGS & WARNINGS

None.


8.0 NOTES

HETPARSE is used to create the EMBOSS data file Ehet.dat that is included in the EMBOSS distribution.

8.1 GLOSSARY OF FILE TYPES

FILE TYPE FORMAT DESCRIPTION CREATED BY SEE ALSO
Dictionary of heterogen groups A file of the dictionary of heterogen groups in PDB. HETPARSE N.A.
None


9.0 DESCRIPTION

Some research applications require knowledge of the types of small molecules or 'heterogens' (non-protein groups) that are represented in PDB files. A dictionary of such groups containing various data for all of the heterogens found in PDB is available, but is not in a convenient format. HETPARSE parses the dictionary in its raw format and converts it to an EMBL-like format.


10.0 ALGORITHM

None.


11.0 RELATED APPLICATIONS

See also

Program name    Description
aaindexextract  Extract amino acid property data from AAINDEX
allversusall    Sequence similarity data from all-versus-all comparison
cathparse       Generates DCF file from raw CATH files
cutgextract     Extract codon usage tables from CUTG database
domainer        Generates domain CCF files from protein CCF files
domainnr        Removes redundant domains from a DCF file
domainseqs      Adds sequence records to a DCF file
domainsse       Add secondary structure records to a DCF file
jaspextract     Extract data from JASPAR
pdbparse        Parses PDB files and writes protein CCF files
pdbplus         Add accessibility & secondary structure to a CCF file
pdbtosp         Convert swissprot:PDB codes file to EMBL-like format
printsextract   Extract data from PRINTS database for use by pscan
prosextract     Processes the PROSITE motif database for use by patmatmotifs
rebaseextract   Process the REBASE database for use by restriction enzyme applications
scopparse       Generate DCF file from raw SCOP files
seqnr           Removes redundancy from DHF files
sites           Generate residue-ligand CON files from CCF files
ssematch        Search a DCF file for secondary structure matches
tfextract       Process TRANSFAC transcription factor database for use by tfscan



12.0 DIAGNOSTIC ERROR MESSAGES

None.


13.0 AUTHORS

Jon Ison (jison@ebi.ac.uk)
The European Bioinformatics Institute Wellcome Trust Genome Campus Cambridge CB10 1SD UK


14.0 REFERENCES

Please cite the authors and EMBOSS.

Rice P, Longden I and Bleasby A (2000) "EMBOSS - The European Molecular Biology Open Software Suite" Trends in Genetics, 16(6):276-277.

See also http://emboss.sourceforge.net/

14.1 Other useful references