SCOPPARSE documentation


 

CONTENTS

1.0 SUMMARY
2.0 INPUTS & OUTPUTS
3.0 INPUT FILE FORMAT
4.0 OUTPUT FILE FORMAT
5.0 DATA FILES
6.0 USAGE
7.0 KNOWN BUGS & WARNINGS
8.0 NOTES
9.0 DESCRIPTION
10.0 ALGORITHM
11.0 RELATED APPLICATIONS
12.0 DIAGNOSTIC ERROR MESSAGES
13.0 AUTHORS
14.0 REFERENCES



1.0 SUMMARY

Generate DCF file from raw SCOP files


2.0 INPUTS & OUTPUTS

SCOPPARSE parses the SCOP classification files dir.cla.scop.txt and dir.des.scop.txt, which are available, for example, at:
http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.cla.scop.txt_1.57
http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.des.scop.txt_1.57

The format of these files is explained at URL:
http://scop.mrc-lmb.cam.ac.uk/scop/release-notes-1.55.html

SCOPPARSE writes the classification to a DCF file (EMBL-like format). No changes are made to the data other than changing the format in which it is held. The file does not include domain sequence information. The input and output files are specified by the user.


3.0 INPUT FILE FORMAT

An excerpt from each of the SCOP input files, dir.cla.scop.txt (Figure 1) and dir.des.scop.txt (Figure 2), is shown below. The format of these files is explained on the SCOP website:
http://scop.mrc-lmb.cam.ac.uk/scop/release-notes-1.55.html

Input files for usage example

File: scop.cla.raw

# dir.cla.scop.txt 
# SCOP release 1.57 (January 2002)  [File format version 1.00]
# http://scop.mrc-lmb.cam.ac.uk/scop/
# Copyright (c) 1994-2002 the scop authors; see http://scop.mrc-lmb.cam.ac.uk/scop/lic/copy.html
d1cs4a_	1cs4	A:	d.58.29.1	39418	cl=53931,cf=54861,sf=55073,fa=55074,dm=55077,sp=55078,px=39418
d1ii7a_	1ii7	A:	d.159.1.4	62415	cl=53931,cf=56299,sf=56300,fa=64427,dm=64428,sp=64429,px=62415

File: scop.des.raw

# dir.des.scop.txt 
# SCOP release 1.57 (January 2002)  [File format version 1.00]
# http://scop.mrc-lmb.cam.ac.uk/scop/
# Copyright (c) 1994-2002 the scop authors; see http://scop.mrc-lmb.cam.ac.uk/scop/lic/copy.html
53931	cl	d	-	Alpha and beta proteins (a+b)
54861	cf	d.58	-	Ferredoxin-like
55073	sf	d.58.29	-	Adenylyl and guanylyl cyclase catalytic domain
55074	fa	d.58.29.1	-	Adenylyl and guanylyl cyclase catalytic domain
55077	dm	d.58.29.1	-	Adenylyl cyclase VC1, domain C1a
55078	sp	d.58.29.1	-	Dog (Canis familiaris)
39418	px	d.58.29.1	d1cs4a_	1cs4 A:
56299	cf	d.159	-	Metallo-dependent phosphatases
56300	sf	d.159.1	-	Metallo-dependent phosphatases
64427	fa	d.159.1.4	-	DNA double-strand break repair nuclease
64428	dm	d.159.1.4	-	Mre11
64429	sp	d.159.1.4	-	Archaeon Pyrococcus furiosus
62415	px	d.159.1.4	d1ii7a_	1ii7 A:
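
As a rough illustration of how the two raw files are laid out, the sketch below splits one line of each file into its fields. It assumes the fields are tab-separated, as in the excerpts above, and that lines starting with '#' are comments. The names ClaRecord, parse_cla_line and parse_des_line are hypothetical and are not part of SCOPPARSE.

```python
from typing import Dict, NamedTuple, Tuple


class ClaRecord(NamedTuple):
    """One data line of dir.cla.scop.txt (field names chosen for illustration)."""
    sid: str               # domain identifier, e.g. "d1cs4a_"
    pdb_id: str            # PDB identifier, e.g. "1cs4"
    region: str            # chain/region definition, e.g. "A:"
    sccs: str              # classification string, e.g. "d.58.29.1"
    sunid: int             # numeric identifier of the domain
    nodes: Dict[str, int]  # level code (cl, cf, sf, fa, dm, sp, px) -> identifier


def parse_cla_line(line: str) -> ClaRecord:
    """Split one non-comment line of the classification file (assumed tab-separated)."""
    sid, pdb_id, region, sccs, sunid, hierarchy = line.rstrip("\n").split("\t")
    nodes = {}
    for pair in hierarchy.split(","):
        level, value = pair.split("=")
        nodes[level] = int(value)
    return ClaRecord(sid, pdb_id, region, sccs, int(sunid), nodes)


def parse_des_line(line: str) -> Tuple[int, str, str, str, str]:
    """Split one non-comment line of the description file (assumed tab-separated)."""
    sunid, level, sccs, sid, description = line.rstrip("\n").split("\t", 4)
    return int(sunid), level, sccs, sid, description
```

A caller would skip any line beginning with '#' before passing it to either function.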




4.0 OUTPUT FILE FORMAT

An example of the DCF output file is shown in Figure 3. The records used to describe an entry are as follows. Records (4) to (9) are used to describe the position of the domain in the SCOP hierarchy. Various other ADDITIONAL RECORDS may be present if the file is processed by other programs, e.g. DOMAINSEQS or DOMAINSSE.

Output files for usage example

File: all.scop

ID   D1CS4A_
XX
EN   1CS4
XX
TY   SCOP
XX
SI   53931 CL; 54861 FO; 55073 SF; 55074 FA; 55077 DO; 55078 SO; 39418 DD;
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Ferredoxin-like
XX
SF   Adenylyl and guanylyl cyclase catalytic domain
XX
FA   Adenylyl and guanylyl cyclase catalytic domain
XX
DO   Adenylyl cyclase VC1, domain C1a
XX
OS   Dog (Canis familiaris)
XX
NC   1
XX
CN   [1]
XX
CH   A CHAIN; . START; . END;
//
ID   D1II7A_
XX
EN   1II7
XX
TY   SCOP
XX
SI   53931 CL; 56299 FO; 56300 SF; 64427 FA; 64428 DO; 64429 SO; 62415 DD;
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Metallo-dependent phosphatases
XX
SF   Metallo-dependent phosphatases
XX
FA   DNA double-strand break repair nuclease
XX
DO   Mre11
XX
OS   Archaeon Pyrococcus furiosus
XX
NC   1
XX
CN   [1]
XX
CH   A CHAIN; . START; . END;
//
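
The correspondence between the raw SCOP data and the DCF records can be sketched in a few lines of Python. The function below reproduces the layout of the example entries above for the simplest case only: a domain that is a single whole chain. The NC, CN and CH records are hard-coded to that case, line wrapping of long descriptions is not handled, and the names LEVELS and format_dcf_entry are hypothetical. This is an illustration of the format, not the code SCOPPARSE uses.

```python
# SCOP level code -> code used in the SI record of the example output.
LEVELS = [("cl", "CL"), ("cf", "FO"), ("sf", "SF"), ("fa", "FA"),
          ("dm", "DO"), ("sp", "SO"), ("px", "DD")]


def format_dcf_entry(sid, pdb_id, nodes, descriptions, chain):
    """Build one DCF entry for a single-segment, whole-chain domain.

    nodes maps SCOP level codes to identifiers; descriptions maps
    identifiers to the text taken from the description file.
    """
    si = " ".join(f"{nodes[scop]} {dcf};" for scop, dcf in LEVELS)
    records = [
        f"ID   {sid.upper()}",
        f"EN   {pdb_id.upper()}",
        "TY   SCOP",
        f"SI   {si}",
    ]
    # CL, FO, SF, FA and DO records carry the description of each level.
    for scop, dcf in LEVELS[:5]:
        records.append(f"{dcf}   {descriptions[nodes[scop]]}")
    # The species description is written to an OS record.
    records.append(f"OS   {descriptions[nodes['sp']]}")
    # Assumption: one chain, one segment, no explicit start/end residues.
    records += ["NC   1", "CN   [1]", f"CH   {chain} CHAIN; . START; . END;"]
    return "\nXX\n".join(records) + "\n//\n"
```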




5.0 DATA FILES

No data files are used.


6.0 USAGE

   Standard (Mandatory) qualifiers:
  [-classfile]         infile     This option specifies the name of raw SCOP
                                  classification file dir.cla.scop.txt_X.XX
                                  (input). This is the raw SCOP classification
                                  file available at
                                  http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.cla.scop.txt_1.57.
  [-desinfile]         infile     This option specifies the name of raw SCOP
                                  description file dir.des.scop.txt_X.XX
                                  (input). This is the raw SCOP description
                                  file available at
                                  http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.des.scop.txt_1.57.
   -nosegments         boolean    [N] This option specifies whether to omit
                                  domains comprising of more than one segment.
                                  This is necessary if a continuous residue
                                  sequence is required.
   -nomultichain       boolean    [N] This option specifies whether to omit
                                  domains comprising segments from more than
                                  one chain. This is necessary if a continuous
                                  residue sequence is required.
  [-dcffile]           outfile    [test.scop] This option specifies the name
                                  of SCOP DCF file (domain classification
                                  file) (output). A 'domain classification
                                  file' contains classification and other data
                                  for domains from the SCOP or CATH
                                  databases. The file is generated by using
                                  DOMAINER and is in DCF format (EMBL-like).
                                  Domain sequence information can be added to
                                  the file by using DOMAINSEQS.

   Additional (Optional) qualifiers:
   -nominor            boolean    [N] This option specifies whether to omit
                                  domains from minor classes (defined as
                                  anything not in class 'All alpha proteins',
                                  'All beta proteins', 'Alpha and beta
                                  proteins (a/b)' or 'Alpha and beta proteins
                                  (a+b)'). This is necessary or appropriate
                                  for many analyses.

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-dcffile" associated qualifiers
   -odirectory3        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

6.1 COMMAND LINE ARGUMENTS

Standard (Mandatory) qualifiers

[-classfile] (Parameter 1)
   Description:    This option specifies the name of raw SCOP classification file dir.cla.scop.txt_X.XX (input). This is the raw SCOP classification file available at http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.cla.scop.txt_1.57.
   Allowed values: Input file
   Default:        Required

[-desinfile] (Parameter 2)
   Description:    This option specifies the name of raw SCOP description file dir.des.scop.txt_X.XX (input). This is the raw SCOP description file available at http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.des.scop.txt_1.57.
   Allowed values: Input file
   Default:        Required

-nosegments
   Description:    This option specifies whether to omit domains comprising of more than one segment. This is necessary if a continuous residue sequence is required.
   Allowed values: Boolean value Yes/No
   Default:        No

-nomultichain
   Description:    This option specifies whether to omit domains comprising segments from more than one chain. This is necessary if a continuous residue sequence is required.
   Allowed values: Boolean value Yes/No
   Default:        No

[-dcffile] (Parameter 3)
   Description:    This option specifies the name of SCOP DCF file (domain classification file) (output). A 'domain classification file' contains classification and other data for domains from the SCOP or CATH databases. The file is generated by using DOMAINER and is in DCF format (EMBL-like). Domain sequence information can be added to the file by using DOMAINSEQS.
   Allowed values: Output file
   Default:        test.scop

Additional (Optional) qualifiers

-nominor
   Description:    This option specifies whether to omit domains from minor classes (defined as anything not in class 'All alpha proteins', 'All beta proteins', 'Alpha and beta proteins (a/b)' or 'Alpha and beta proteins (a+b)'). This is necessary or appropriate for many analyses.
   Allowed values: Boolean value Yes/No
   Default:        No

Advanced (Unprompted) qualifiers

(none)

6.2 EXAMPLE SESSION

An example of interactive use of SCOPPARSE is shown below.


% scopparse 
Generate DCF file from raw SCOP files.
Raw scop classification file: scop.cla.raw
Raw scop description file: scop.des.raw
Omit domains comprising of more than one segment. [N]: Y
Omit domains comprising segments from more than one chain. [N]: N
Domain classification output file [test.scop]: all.scop

The input files for this example are shown in section 3.0 and the output file in section 4.0.
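
The same run can also be scripted without prompts. The sketch below builds an argument list from the qualifiers documented in section 6.0 and adds -auto to turn off prompting. It assumes that giving a boolean qualifier on its own sets it to true, which is the usual EMBOSS convention but should be checked against the installed version. The function name build_scopparse_command is hypothetical.

```python
def build_scopparse_command(classfile, desinfile, dcffile,
                            nosegments=False, nomultichain=False, nominor=False):
    """Return an argument list for a non-interactive scopparse run."""
    argv = [
        "scopparse",
        "-classfile", classfile,
        "-desinfile", desinfile,
        "-dcffile", dcffile,
    ]
    # Boolean qualifiers default to N, so they are added only when requested.
    if nosegments:
        argv.append("-nosegments")
    if nomultichain:
        argv.append("-nomultichain")
    if nominor:
        argv.append("-nominor")
    # -auto turns off prompts (see the general qualifiers in section 6.0).
    argv.append("-auto")
    return argv
```

The returned list could be passed to subprocess.run on a system where scopparse is installed. For the session above, the call would be build_scopparse_command("scop.cla.raw", "scop.des.raw", "all.scop", nosegments=True).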





7.0 KNOWN BUGS & WARNINGS

None.


8.0 NOTES

Some SCOP domains comprise more than one segment of polypeptide chain, and these segments may belong to one or to several polypeptide chains. It is debatable whether a domain (using the widely accepted definition) can truly consist of regions from more than one polypeptide chain. Accordingly, SCOPPARSE gives the option of omitting from the output file domains that consist of more than one segment (-nosegments) and domains whose segments come from more than one chain (-nomultichain).

SCOP includes several minor classes which are not appropriate for some analyses. Accordingly, SCOPPARSE gives the option to omit domains from minor classes (-nominor). A minor class is defined as anything not in class 'All alpha proteins', 'All beta proteins', 'Alpha and beta proteins (a/b)' or 'Alpha and beta proteins (a+b)'.
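
The three omission rules can be sketched as a single predicate. The code assumes that the region field of the classification file lists segments separated by commas, each optionally prefixed by a chain identifier and a colon (for example "A:", "A:1-100,A:150-200" or "A:,B:"). That reading of the field, and the names MAIN_CLASSES, split_region and keep_domain, are assumptions made for illustration and do not describe how SCOPPARSE is implemented.

```python
MAIN_CLASSES = {
    "All alpha proteins",
    "All beta proteins",
    "Alpha and beta proteins (a/b)",
    "Alpha and beta proteins (a+b)",
}


def split_region(region):
    """Return a list of (chain, residue_range) tuples, one per segment."""
    segments = []
    for part in region.split(","):
        chain, sep, residues = part.partition(":")
        if not sep:
            # No colon: the segment carries no chain identifier.
            chain, residues = "", part
        segments.append((chain, residues))
    return segments


def keep_domain(region, class_name,
                nosegments=False, nomultichain=False, nominor=False):
    """Return True if the domain would be kept under the chosen options."""
    segments = split_region(region)
    chains = {chain for chain, _ in segments}
    if nosegments and len(segments) > 1:
        return False  # more than one segment
    if nomultichain and len(chains) > 1:
        return False  # segments from more than one chain
    if nominor and class_name not in MAIN_CLASSES:
        return False  # domain belongs to a minor class
    return True
```

Note that a two-segment domain on a single chain is dropped by nosegments but kept by nomultichain, which matches the distinction drawn in the notes above.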

8.1 GLOSSARY OF FILE TYPES

FILE TYPE:    SCOP parsable files
FORMAT:       SCOP format.
DESCRIPTION:  Raw SCOP classification data.
CREATED BY:   Available from http://scop.mrc-lmb.cam.ac.uk/scop/parse/
SEE ALSO:     N.A.

FILE TYPE:    Domain classification file (for SCOP)
FORMAT:       DCF format (EMBL-like format for domain classification data).
DESCRIPTION:  Classification and other data for domains from SCOP. The file is in DCF format (EMBL-like).
CREATED BY:   SCOPPARSE
SEE ALSO:     Domain sequence information can be added to the file by using DOMAINSEQS.



8.3 ADDITIONAL RECORDS

The following records for database sequence and secondary structure may be present in a DCF file that has been processed by using DOMAINSEQS or DOMAINSSE.
XX
AC   P02213
XX
SP   GLB1_SCAIN
XX
RA   1 START; 146 END;
XX
SQ   SEQUENCE   146 AA;  15947 MW;  5868B4E5 CRC32;
     PSVYDAAAQL TADVKKDLRD SWKVIGSDKK GNGVALMTTL FADNQETIGY FKRLGDVSQG
     MANDKLRGHS ITLMYALQNF IDQLDNPDDL VCVVEKFAVN HITRKISAAE FGKINGPIKK
     VLASKNFGDK YANAWAKLVA VVQAAL


9.0 DESCRIPTION

The raw SCOP classification files are inconvenient for some uses for three reasons: the text describing the domain classification is held in a different file from the classification itself, the file formats are not easily extended, and the formats differ from those of related classifications such as CATH. SCOPPARSE reads the raw SCOP classification files and writes a single file in DCF (EMBL-like) format, which is easier to work with, more human-readable and more extensible than the native SCOP format.


10.0 ALGORITHM

None.


11.0 RELATED APPLICATIONS

See also

Program name     Description
aaindexextract   Extract amino acid property data from AAINDEX
allversusall     Sequence similarity data from all-versus-all comparison
cathparse        Generates DCF file from raw CATH files
cutgextract      Extract codon usage tables from CUTG database
domainer         Generates domain CCF files from protein CCF files
domainnr         Removes redundant domains from a DCF file
domainseqs       Adds sequence records to a DCF file
domainsse        Add secondary structure records to a DCF file
hetparse         Converts heterogen group dictionary to EMBL-like format
jaspextract      Extract data from JASPAR
pdbparse         Parses PDB files and writes protein CCF files
pdbplus          Add accessibility & secondary structure to a CCF file
pdbtosp          Convert swissprot:PDB codes file to EMBL-like format
printsextract    Extract data from PRINTS database for use by pscan
prosextract      Processes the PROSITE motif database for use by patmatmotifs
rebaseextract    Process the REBASE database for use by restriction enzyme applications
seqnr            Removes redundancy from DHF files
sites            Generate residue-ligand CON files from CCF files
ssematch         Search a DCF file for secondary structure matches
tfextract        Process TRANSFAC transcription factor database for use by tfscan



12.0 DIAGNOSTIC ERROR MESSAGES

None.


13.0 AUTHORS

Alan Bleasby (ableasby@ebi.ac.uk)

Jon Ison (jison@ebi.ac.uk)
MRC Rosalind Franklin Centre for Genomics Research, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SB, UK


14.0 REFERENCES

Please cite the authors and EMBOSS.

Rice P, Longden I and Bleasby A (2000) "EMBOSS - The European Molecular Biology Open Software Suite" Trends in Genetics, 16:276-277.

See also http://emboss.sourceforge.net/

14.1 Other useful references

1. Lo Conte, L., Ailey, B., Hubbard, T.J., Brenner, S.E., Murzin, A.G. and Chothia, C. (2000) SCOP: a structural classification of proteins database. Nucleic Acids Res. 28, 257-259.