ROCON documentation


 


CONTENTS

1.0 SUMMARY
2.0 INPUTS & OUTPUTS
3.0 INPUT FILE FORMAT
4.0 OUTPUT FILE FORMAT
5.0 DATA FILES
6.0 USAGE
7.0 KNOWN BUGS & WARNINGS
8.0 NOTES
9.0 DESCRIPTION
10.0 ALGORITHM
11.0 RELATED APPLICATIONS
12.0 DIAGNOSTIC ERROR MESSAGES
13.0 AUTHORS
14.0 REFERENCES



1.0 SUMMARY

ROCON generates a hits file by comparing two DHF files. It reads a DHF file (domain hits file) of hits (sequences of unknown structural classification) and a domain families file (validation sequences of known classification), and writes a "hits file" in which the hits are classified and rank-ordered on the basis of score.


2.0 INPUTS & OUTPUTS

ROCON reads a DHF file (domain hits file) of hits generated for a single node of a classification hierarchy, e.g. a SCOP family. These sequences are putatively related to the node in question but are, in fact, of unknown classification. ROCON also reads a domain families file (in DHF format) containing "validation" sequences of known classification, which are used to classify the input hits. A "hits file" (suitable for input into the ROCPLOT application) is written, which contains the input hits, classified and rank-ordered on the basis of score.


3.0 INPUT FILE FORMAT

The format of the DHF is described in the SEQSEARCH documentation. See also the example DHF file of hit sequences (rocon/rocon.dhf) and the example domain families file of validation sequences (rocon.valid) below.

Input files for usage example

File: rocon/rocon.dhf

> Q9YBD5^.^11^105^SCOP^.^54894^Class 1^.^.^Fold 1^Superfamily 1^Family 1^SPARSE^61.50^0.000e+00^4.000e-10
VRKIRSGVVIDHIPPGRAFTMLKALGLLPPRGYRWRIAVVINAESSKLGRKDILKIEGYKPRQRDLEVLGIIAPGATFNVIEDYKVVEKVKLKLP
> Q9YBD5^.^95^135^SCOP^.^54894^Class 1^.^.^Fold 1^Superfamily 1^Family 1^SPARSE^11.50^0.000e+00^4.000e-5
VRKIRSGVVIDHIPPGRAFTMLKALGLLPPRGYRWRIAVVINAESSKLGRKDILKIEGYKPRQRDLEVLGIIAPGATFNVIEDYKVVEKVKLKLP
> Q9YBD5^.^181^235^SCOP^.^54894^Class 1^.^.^Fold 1^Superfamily 1^Family 1^SPARSE^161.50^0.000e+00^4.000e-5
VRKIRSGVVIDHIPPGRAFTMLKALGLLPPRGYRWRIAVVINAESSKLGRKDILKIEGYKPRQRDLEVLGIIAPGATFNVIEDYKVVEKVKLKLP
> O26938^.^11^101^SCOP^.^54894^Class 1^.^.^Fold 1^Superfamily 1^Family 1^PSIBLAST^81.90^0.000e+00^3.000e-16
VKPIKNGTVIDHITANRSLNVLNILGLPDGRSKVTVAMNMDSSQLGSKDIVKIENRELKPSEVDQIALIAPRATINIVRDYKIVEKAKVRL
> Q8Z130^.^8^99^SCOP^.^54894^Class 1^.^.^Fold 1^Superfamily 1^Family 1^PSIBLAST^181.00^0.000e+00^0.000e+00
VEAIKCGTVIDHIPAQVGFKLLSLFKLTETDQRITIGLNLPSGEMGRKDLIKIENTFLTDEQVNQLALYAPQATVNRIDNYDVVGKSRPSLP
> Q7MX57^.^8^99^SCOP^.^54894^Class 1^.^.^Fold 1^Superfamily 1^Family 1^PSIBLAST^80.80^0.000e+00^7.000e-16
VAAIRNGIVIDHIPPTKLFKVATLLQLDDLDKRITIGNNLRSRSHGSKGVIKIEDKTFEEEELNRIALIAPNVRLNIIRDYEVVEKRQVEVP
> Q8TVB1^.^7^98^SCOP^.^54894^Class 1^.^.^Fold 1^Superfamily 1^Family 1^PSIBLAST^72.70^0.000e+00^2.000e-13
VKRIEMGTVLDHLPPGTAPQIMRILDIDPTETTLLVAINVESSKMGRKDILKIEGKILSEEEANKVALVAPNATVNIVRDYSVAEKFQVKPP
> P96175^.^8^99^SCOP^.^54894^Class 1^.^.^Fold 1^Superfamily 1^Family 1^PSIBLAST^107.00^0.000e+00^7.000e-24
VEAICNGYVIDHIPSGQGVKILRLFSLTDTKQRVTVGFNLPSHDGTTKDLIKVENTEITKSQANQLALLAPNATVNIIENFKVTDKHSLALP

File: rocon.valid

> Q9YBD5^.^11^105^SCOP^.^1^Class 1^.^.^Fold 1^Superfamily 1^Family 1^PSIBLAST^61.50^0.000e+00^4.000e-10
VRKIRSGVVIDHIPPGRAFTMLKALGLLPPRGYRWRIAVVINAESSKLGRKDILKIEGYKPRQRDLEVLGIIAPGATFNVIEDYKVVEKVKLKLP
> Q9UX07^.^12^104^SCOP^.^1^Class 1^.^.^Fold 1^Superfamily 1^Family 1^PSIBLAST^65.80^0.000e+00^2.000e-11
VSKIRNGTVIDHIPAGRALAVLRILGIRGSEGYRVALVMNVESKKIGRKDIVKIEDRVIDEKEASLITLIAPSATINIIRDYVVTEKRHLEVP
> Q9KP65^.^9^100^SCOP^.^1^Class 1^.^.^Fold 1^Superfamily 1^Family 1^PSIBLAST^128.00^0.000e+00^3.000e-30
VEAIKNGTVIDHIPAKVGIKVLKLFDMHNSAQRVTIGLNLPSSALGSKDLLKIENVFISEAQANKLALYAPHATVNQIENYEVVKKLALQLP
> Q9K1K9^.^8^99^SCOP^.^1^Class 1^.^.^Fold 1^Superfamily 1^Family 1^PSIBLAST^101.00^0.000e+00^5.000e-22
VEAIEKGTVIDHIPAGRGLTILRQFKLLHYGNAVTVGFNLPSKTQGSKDIIKIKGVCLDDKAADRLALFAPEAVVNTIDNFKVVQKRHLNLP
> Q9JWY6^.^8^99^SCOP^.^1^Class 1^.^.^Fold 1^Superfamily 1^Family 1^PSIBLAST^98.90^0.000e+00^2.000e-21
VEAIEKGTVIDHIPAGRGLTILRQFKLLHYGNAVTVGFNLPSKTQGSKDIIKIKGVCLDDKAADRLALFAPEAVVNTIDHFKVVQKRHLNLP
> Q9HKM3^.^7^99^SCOP^.^1^Class 1^.^.^Fold 1^Superfamily 1^Family 1^PSIBLAST^79.60^0.000e+00^2.000e-15
ISKIRDGTVIDHVPSGKGIRVIGVLGVHEDVNYTVSLAIHVPSNKMGFKDVIKIENRFLDRNELDMISLIAPNATISIIKNYEISEKFQVELP
> Q9HHN3^.^9^101^SCOP^.^1^Class 1^.^.^Fold 1^Superfamily 1^Family 1^PSIBLAST^78.50^0.000e+00^4.000e-15
VSKIQAGTVIDHIPAGQALQVLQILGTNGASDDQITVGMNVTSERHHRKDIVKIEGRELSQDEVDVLSLIAPDATINIVRDYEVDEKRRVDRP
> Q97FS4^.^4^93^SCOP^.^2^Class 1^.^.^Fold 1^Superfamily 1^Family 2^PSIBLAST^49.20^0.000e+00^2.000e-06
INSIKNGIVIDHIKAGHGIKIYNYLKLGEAEFPTALIMNAISKKNKAKDIIKIENVMDLDLAVLGFLDPNITVNIIEDEKIRQKIQLKLP
> Q97B28^.^8^100^SCOP^.^2^Class 1^.^.^Fold 1^Superfamily 1^Family 2^PSIBLAST^79.20^0.000e+00^2.000e-15
ISKIKDGTVIDHIPSGKALRVLSILGIRDDVDYTVSVGMHVPSSKMEYKDVIKIENRSLDKNELDMISLTAPNATISIIKNYEISEKFKVELP
> Q970X3^.^11^101^SCOP^.^2^Class 1^.^.^Fold 1^Superfamily 1^Family 2^PSIBLAST^78.50^0.000e+00^3.000e-15
VSKIKNGTVIDHIPAGRALAVLRILKIAEGYRIALVMNVESKKMGKKDIVKIENKEVDEKEANLITLIAPTATINIIRDYEVVEKKKLKIP
> Q8ZTG2^.^7^99^SCOP^.^2^Class 1^.^.^Fold 1^Superfamily 1^Family 2^PSIBLAST^66.10^0.000e+00^2.000e-11
VSKIENGTVIDHIPAGRALTVLRILGISGKEGLRVALVMNVESKKLGKKDIVKIEGRELTPEEVNIISAVAPTATINIIRNFAVVKKFKVTPP
> Q8ZB38^.^9^100^SCOP^.^2^Class 1^.^.^Fold 1^Superfamily 1^Family 2^PSIBLAST^156.00^0.000e+00^1.000e-38
VEAIKCGTVIDHIPAQIGFKLLSLFKLTATDQRITIGLNLPSKRSGRKDLIKIENTFLTEQQANQLAMYAPDATVNRIDNYEVVKKLTLSLP
> Q8Z130^.^8^99^SCOP^.^2^Class 1^.^.^Fold 1^Superfamily 1^Family 2^PSIBLAST^181.00^0.000e+00^0.000e+00
VEAIKCGTVIDHIPAQVGFKLLSLFKLTETDQRITIGLNLPSGEMGRKDLIKIENTFLTDEQVNQLALYAPQATVNRIDNYDVVGKSRPSLP
> Q8U374^.^6^99^SCOP^.^2^Class 1^.^.^Fold 1^Superfamily 1^Family 2^PSIBLAST^92.00^0.000e+00^3.000e-19
VSAIKEGTVIDHIPAGKGLKVIQILGLGELKNGGAVLLAMNVPSKKLGRKDIVKVEGKFLSEEEVNKIALVAPTATVNIIREYKVVEKFKVEIP
> Q8TVB1^.^7^98^SCOP^.^3^Class 1^.^.^Fold 1^Superfamily 2^Family 1^PSIBLAST^72.70^0.000e+00^2.000e-13
VKRIEMGTVLDHLPPGTAPQIMRILDIDPTETTLLVAINVESSKMGRKDILKIEGKILSEEEANKVALVAPNATVNIVRDYSVAEKFQVKPP
> Q8THL3^.^9^100^SCOP^.^3^Class 1^.^.^Fold 1^Superfamily 2^Family 1^PSIBLAST^69.20^0.000e+00^2.000e-12
IQAIENGTVIDHITAGQALNVLRILRISSAFRATVSFVMNAPGARGKKDVVKIEGKELSVEELNRIALISPKATINIIRDFEVVQKNKVVLP
> Q8PXK6^.^9^100^SCOP^.^3^Class 1^.^.^Fold 1^Superfamily 2^Family 1^PSIBLAST^62.70^0.000e+00^2.000e-10
VQAIESGTVIDHIKSGQALNVLRILGISSAFRATISFVMNAPGAGGKKDVVKIEGKELSVEELNRIALISPKATINIIRDFVVVQKNNVVLP
> Q8K9H8^.^8^99^SCOP^.^3^Class 1^.^.^Fold 1^Superfamily 2^Family 1^PSIBLAST^146.00^0.000e+00^1.000e-35
VEAIKSGSVIDHIPAHIGFKLLSLFRFTETEKRITIGLNLPSQKLDKKDIIKIENTFLSDDQINQLAIYAPCATVNYIEKYNLVGKIFPSLP
> Q8DCF7^.^9^100^SCOP^.^3^Class 1^.^.^Fold 1^Superfamily 2^Family 1^PSIBLAST^127.00^0.000e+00^9.000e-30
VEAIKNGTVIDHIPAQVGIKVLKLFDMHNSSQRVTIGLNLPSSALGNKDLLKIENVFINEEQASKLALYAPHATVNQIEDYQVVKKLALELP
> Q8D1W6^.^9^100^SCOP^.^4^Class 1^.^.^Fold 1^Superfamily 2^Family 2^PSIBLAST^123.00^0.000e+00^1.000e-28
VEAIFGGTVIDHIPAQVGLKLLSLFKWLHTKERITMGLNLPSNQQKKKDLIKLENVLLNEDQANQLSIYAPLATVNQIKNYIVIKKQKLKLP
> Q8A9S4^.^10^101^SCOP^.^4^Class 1^.^.^Fold 1^Superfamily 2^Family 2^PSIBLAST^63.80^0.000e+00^9.000e-11
VAALKNGTVIDHIPSEKLFTVVQLLGVEQMKCNITIGFNLDSKKLGKKGIIKIADKFFCDEEINRISVVAPYVKLNIIRDYEVVEKKEVRMP
> Q891I9^.^4^94^SCOP^.^4^Class 1^.^.^Fold 1^Superfamily 2^Family 2^PSIBLAST^52.30^0.000e+00^2.000e-07
ITSIKDGIVIDHIKSGYGIKIFNYLNLKNVEYSVALIMNVFSSKLGKKDIIKIANKEIDIDFTVLGLIDPTITINIIEDEKIKEKLNLELP
> Q87LF7^.^9^100^SCOP^.^4^Class 1^.^.^Fold 1^Superfamily 2^Family 2^PSIBLAST^130.00^0.000e+00^7.000e-31
VEAIKNGTVIDHIPAQIGIKVLKLFDMHNSSQRVTIGLNLPSSALGHKDLLKIENVFINEEQASKLALYAPHATVNQIENYEVVKKLALELP
> Q83IL8^.^8^99^SCOP^.^4^Class 1^.^.^Fold 1^Superfamily 2^Family 2^PSIBLAST^189.00^0.000e+00^0.000e+00
VEAIKRGTVIDHIPAQIGFKLLSLFKLTETDQRITIGLNLPSGEMGRKDLIKIENTFLSEEQVDQLALYAPQATVNRIDNYEVVGKSRPSLP
> Q7P144^.^7^98^SCOP^.^4^Class 1^.^.^Fold 1^Superfamily 2^Family 2^PSIBLAST^128.00^0.000e+00^3.000e-30
VEALKQGTVIDHIPAGEGVKILRLFKLTETGERVTVGLNLVSRHMGSKDLIKVENVALTEEQANELALFAPKATVNVIDNFEVVKKHKLTLP
> Q7MZ14^.^9^100^SCOP^.^4^Class 1^.^.^Fold 1^Superfamily 2^Family 2^PSIBLAST^150.00^0.000e+00^6.000e-37
VEAIRCGTVIDHIPAQVGFKLLSLFKLTETDQRITIGLNLPSNRLGKKDLIKIENTFLTEQQANQLAMYAPNATVNCIENYEVVKKLPINLP
> Q7MX57^.^8^99^SCOP^.^5^Class 1^.^.^Fold 2^Superfamily 1^Family 1^PSIBLAST^80.80^0.000e+00^7.000e-16
VAAIRNGIVIDHIPPTKLFKVATLLQLDDLDKRITIGNNLRSRSHGSKGVIKIEDKTFEEEELNRIALIAPNVRLNIIRDYEVVEKRQVEVP
> Q7MHF0^.^9^100^SCOP^.^5^Class 1^.^.^Fold 2^Superfamily 1^Family 1^PSIBLAST^127.00^0.000e+00^8.000e-30
VEAIKNGTVIDHIPAQVGIKVLKLFDMHNSSQRVTIGLNLPSSALGNKDLLKIENVFINEEQASKLALYAPHATVNQIEDYQVVKKLALELP
> Q58801^.^9^99^SCOP^.^5^Class 1^.^.^Fold 2^Superfamily 1^Family 1^PSIBLAST^61.50^0.000e+00^5.000e-10
VKKITNGTVIDHIDAGKALMVFKVLNVPKETSVMIAINVPSKKKGKKDILKIEGIELKKEDVDKISLISPDVTINIIRNGKVVEKLKPQIP
> P96175^.^8^99^SCOP^.^5^Class 1^.^.^Fold 2^Superfamily 1^Family 1^PSIBLAST^107.00^0.000e+00^7.000e-24
VEAICNGYVIDHIPSGQGVKILRLFSLTDTKQRVTVGFNLPSHDGTTKDLIKVENTEITKSQANQLALLAPNATVNIIENFKVTDKHSLALP
> P96111^.^375^472^SCOP^.^5^Class 1^.^.^Fold 2^Superfamily 1^Family 1^PSIBLAST^47.30^0.000e+00^9.000e-06
GIKPIENGTVIDHIAKGKTPEEIYSTILKIRKILRLYDVDSADGIFRSSDGSFKGYISLPDRYLSKKEIKKLSAISPNTTVNIIKNSTVVEKYRIKLP
> P77919^.^6^99^SCOP^.^6^Class 1^.^.^Fold 2^Superfamily 1^Family 2^PSIBLAST^93.50^0.000e+00^1.000e-19
VSAIKEGTVIDHIPAGKGLKVIEILKLGKLTNGGAVLLAMNVPSKKLGRKDIVKVEGRFLSEEEVNKIALVAPNATVNIIRDYKVVEKFKVEVP
> P74766^.^12^104^SCOP^.^6^Class 1^.^.^Fold 2^Superfamily 1^Family 2^PSIBLAST^74.20^0.000e+00^7.000e-14
VSKIKNGTVIDHIPAGRAFAVLNVLGIKGHEGFRIALVINVDSKKMGKKDIVKIEDKEISDTEANLITLIAPTATINIVREYEVVKKTKLEVP
> P57451^.^8^99^SCOP^.^7^Class 1^.^.^Fold 2^Superfamily 2^Family 1^PSIBLAST^143.00^0.000e+00^1.000e-34
VEAIKSGSVIDHIPEYIGFKLLSLFRFTETEKRITIGLNLPSKKLGRKDIIKIENTFLSDEQINQLAIYAPHATVNYINEYNLVRKVFPTLP
> P19936^.^8^99^SCOP^.^7^Class 1^.^.^Fold 2^Superfamily 2^Family 1^PSIBLAST^159.00^0.000e+00^1.000e-39
VEAIKCGTVIDHIPAQIGFKLLTLFKLTATDQRITIGLNLPSNELGRKDLIKIENTFLTEQQANQLAMYAPKATVNRIDNYEVVRKLTLSLP
> P08421^.^8^99^SCOP^.^7^Class 1^.^.^Fold 2^Superfamily 2^Family 1^PSIBLAST^183.00^0.000e+00^0.000e+00
VEAIKCGTVIDHIPAQVGFKLLSLFKLTETDQRITIGLNLPSGEMGRKDLIKIENTFLTEEQVNQLALYAPQATVNRIDNYDVVGKSRPSLP
> P00478^.^8^99^SCOP^.^8^Class 1^.^.^Fold 2^Superfamily 2^Family 2^PSIBLAST^191.00^0.000e+00^0.000e+00
VEAIKRGTVIDHIPAQIGFKLLSLFKLTETDQRITIGLNLPSGEMGRKDLIKIENTFLSEDQVDQLALYAPQATVNRIDNYEVVGKSRPSLP
> O58452^.^6^99^SCOP^.^8^Class 1^.^.^Fold 2^Superfamily 2^Family 2^PSIBLAST^94.30^0.000e+00^6.000e-20
VSAIKEGTVIDHIPAGKGLKVIEILGLSKLSNGGSVLLAMNVPSKKLGRKDIVKVEGKFLSEEEVNKIALVAPTATVNIIRNYKVVEKFKVEVP
> O30129^.^6^98^SCOP^.^9^Class 2^.^.^Fold 1^Superfamily 1^Family 1^PSIBLAST^79.60^0.000e+00^2.000e-15
VSKIKEGTVIDHINAGKALLVLKILKIQPGTDLTVSMAMNVPSSKMGKKDIVKVEGMFIRDEELNKIALISPNATINLIRDYEIERKFKVSPP
> O26938^.^11^101^SCOP^.^10^Class 1^.^.^Fold 1^Superfamily 1^Family 1^PSIBLAST^81.90^0.000e+00^3.000e-16
VKPIKNGTVIDHITANRSLNVLNILGLPDGRSKVTVAMNMDSSQLGSKDIVKIENRELKPSEVDQIALIAPRATINIVRDYKIVEKAKVRL
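
The header lines in the example files above are delimited by '^'. The following Python sketch splits one such header into named fields. The field positions and the field names (accession, start, end, node_id and so on) are assumptions inferred from the example files; the authoritative field list is the DHF description in the SEQSEARCH documentation.

```python
from typing import NamedTuple


class DhfHit(NamedTuple):
    """Selected fields of one DHF header line (names are illustrative)."""
    accession: str
    start: int
    end: int
    node_id: str
    klass: str
    fold: str
    superfamily: str
    family: str
    method: str
    score: float


def parse_dhf_header(line: str) -> DhfHit:
    """Split a '>' header line on '^' and pick out the fields of interest.

    Assumption: field positions are read off the example files, not taken
    from the DHF specification.
    """
    fields = line.lstrip(">").strip().split("^")
    return DhfHit(
        accession=fields[0],
        start=int(fields[2]),
        end=int(fields[3]),
        node_id=fields[6],
        klass=fields[7],
        fold=fields[10],
        superfamily=fields[11],
        family=fields[12],
        method=fields[13],
        score=float(fields[14]),
    )


header = ("> Q9YBD5^.^11^105^SCOP^.^54894^Class 1^.^.^Fold 1^Superfamily 1"
          "^Family 1^SPARSE^61.50^0.000e+00^4.000e-10")
hit = parse_dhf_header(header)
print(hit.accession, hit.start, hit.end, hit.family, hit.method, hit.score)
```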



4.0 OUTPUT FILE FORMAT

The format of the hits file is described in the ROCPLOT documentation. See also the example output file (rocon.hits) below.

Output files for usage example

File: rocon.hits

> RELATED 8 ; ROC 2
CROSS        Q8Z130    8     99    
UNKNOWN      Q9YBD5    181   235   
FALSE        P96175    8     99    
TRUE         O26938    11    101   
FALSE        Q7MX57    8     99    
CROSS        Q8TVB1    7     98    
TRUE         Q9YBD5    11    105   
TRUE         Q9YBD5    95    135   
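
The labels and ordering in rocon.hits can be reproduced from the example input files with a simple rule. The Python sketch below is inferred from those files only and ROCON's actual algorithm may differ in detail. It assumes that a hit matches a validation sequence when the accessions agree and the ranges overlap by at least the threshold, that the label is TRUE for the same family, CROSS for the same fold but a different family or superfamily, FALSE for a different fold and UNKNOWN when nothing matches, and that hits are sorted by descending score. Nodes are written as (class, fold, superfamily, family) number tuples, and only the validation entries whose accessions occur among the hits are listed.

```python
def overlap(s1, e1, s2, e2):
    """Number of residues shared by two inclusive ranges."""
    return max(0, min(e1, e2) - max(s1, s2) + 1)


def classify(hit, query, validation, thresh=10):
    """Label one hit using the first validation entry it matches."""
    acc, start, end, _score = hit
    for v_acc, v_start, v_end, v_node in validation:
        if v_acc == acc and overlap(start, end, v_start, v_end) >= thresh:
            if v_node == query:
                return "TRUE"       # same family as the query node
            if v_node[:2] == query[:2]:
                return "CROSS"      # same class and fold, other family
            return "FALSE"          # different fold
    return "UNKNOWN"                # no matching validation sequence


# Node searched for in rocon/rocon.dhf: Class 1, Fold 1, Superfamily 1, Family 1.
QUERY = (1, 1, 1, 1)

# (accession, start, end, score) taken from rocon/rocon.dhf.
HITS = [
    ("Q9YBD5", 11, 105, 61.50), ("Q9YBD5", 95, 135, 11.50),
    ("Q9YBD5", 181, 235, 161.50), ("O26938", 11, 101, 81.90),
    ("Q8Z130", 8, 99, 181.00), ("Q7MX57", 8, 99, 80.80),
    ("Q8TVB1", 7, 98, 72.70), ("P96175", 8, 99, 107.00),
]

# (accession, start, end, node) for the relevant entries of rocon.valid.
VALIDATION = [
    ("Q9YBD5", 11, 105, (1, 1, 1, 1)), ("Q8Z130", 8, 99, (1, 1, 1, 2)),
    ("Q8TVB1", 7, 98, (1, 1, 2, 1)), ("Q7MX57", 8, 99, (1, 2, 1, 1)),
    ("P96175", 8, 99, (1, 2, 1, 1)), ("O26938", 11, 101, (1, 1, 1, 1)),
]

# Rank by descending score, then attach a label to each hit.
ranked = [(classify(h, QUERY, VALIDATION), h)
          for h in sorted(HITS, key=lambda h: h[3], reverse=True)]

for label, (acc, start, end, _score) in ranked:
    print(f"{label:<12} {acc:<9} {start:<5} {end}")
```

With the default threshold of 10, the hit Q9YBD5 95-135 overlaps the validation range 11-105 by 11 residues and so is labelled TRUE, whereas Q9YBD5 181-235 overlaps nothing and is labelled UNKNOWN.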




5.0 DATA FILES

None.


6.0 USAGE

Generate a hits file from comparing two DHF files
Version: EMBOSS:6.5.0.0

   Standard (Mandatory) qualifiers:
  [-hitsinfile]        infile     This option specifies the location of the
                                  DHF file (domain hits file) (input). A
                                  'domain hits file' contains database hits
                                  (sequences) with domain classification
                                  information, in the DHF format (FASTA or
                                  EMBL-like). The hits are relatives to a SCOP
                                  or CATH family and are found from a search
                                  of a sequence database. Files containing
                                  hits retrieved by PSIBLAST are generated by
                                  using SEQSEARCH, hits retrieved by a sparse
                                  protein signature by using SIGSCAN or
                                  various types of HMM and profile by using
                                  LIBSCAN.
  [-validinfile]       infile     This option specifies the name of domain
                                  families file (input). A 'domain families
                                  file' contains sequence relatives (hits) for
                                  each of a number of different SCOP or CATH
                                  families found from searching a sequence
                                  database, e.g. by using SEQSEARCH
                                  (psiblast). The file contains the collated
                                  search results for the individual families;
                                  only those hits of unambiguous family
                                  assignment are included. Hits of ambiguous
                                  family assignment are assigned as relatives
                                  to a SCOP or CATH superfamily or fold
                                  instead and are collated into a 'domain
                                  ambiguities file'. The domain families and
                                  ambiguities files are generated by using
                                  SEQSORT and use the same format as a DHF
                                  file (domain hits file).
   -thresh             integer    [10] This option specifies the overlap
                                  threshold for hits. This is the minimum
                                  length (residues) of overlap required for
                                  two hits with the same accession number to be
                                  counted as the same hit. The accession
                                  number of the hit, and the start and end
                                  point respectively of the hit relative to
                                  full length sequence are provided in the
                                  lists of hits in the DHF input file. The
                                  overlap is determined from the start and end
                                  points of the hit. For example two hits
                                  with the start and end points of 1-100 and
                                  91-190 respectively are considered to be the
                                  same hit if they have the same accession
                                  numbers and the overlap threshold is 10 or
                                  less. (Any integer value)
   -mode               menu       [1] This option specifies the classification
                                  scheme to use. See ROCON on-line
                                  documentation for more information. (Values:
                                  1 (Family classification scheme); 2 ((Not
                                  yet available)))
  [-hitsoutfile]       outfile    [*.rocon] This option specifies the name of
                                  the hits files (output). A 'hits
                                  file' contains a list of hits (e.g. from a
                                  prediction method) that are classified and
                                  rank-ordered on the basis of score, p-value,
                                  E-value etc. The files generated by using
                                  SIGSCAN and LIBSCAN will contain the results
                                  of a search of a discriminating element
                                  (e.g. hidden Markov model, profile or
                                  signature) against a sequence database. The
                                  ROCPLOT application is run on the files to
                                  perform Receiver Operator Characteristic
                                  (ROC) analysis on the hits.

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-hitsoutfile" associated qualifiers
   -odirectory3        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit

6.1 COMMAND LINE ARGUMENTS

Standard (Mandatory) qualifiers

  Qualifier:       [-hitsinfile] (Parameter 1)
  Type:            infile
  Description:     This option specifies the location of the DHF file (domain hits file) (input). A 'domain hits file' contains database hits (sequences) with domain classification information, in the DHF format (FASTA or EMBL-like). The hits are relatives to a SCOP or CATH family and are found from a search of a sequence database. Files containing hits retrieved by PSIBLAST are generated by using SEQSEARCH, hits retrieved by a sparse protein signature by using SIGSCAN or various types of HMM and profile by using LIBSCAN.
  Allowed values:  Input file
  Default:         Required

  Qualifier:       [-validinfile] (Parameter 2)
  Type:            infile
  Description:     This option specifies the name of domain families file (input). A 'domain families file' contains sequence relatives (hits) for each of a number of different SCOP or CATH families found from searching a sequence database, e.g. by using SEQSEARCH (psiblast). The file contains the collated search results for the individual families; only those hits of unambiguous family assignment are included. Hits of ambiguous family assignment are assigned as relatives to a SCOP or CATH superfamily or fold instead and are collated into a 'domain ambiguities file'. The domain families and ambiguities files are generated by using SEQSORT and use the same format as a DHF file (domain hits file).
  Allowed values:  Input file
  Default:         Required

  Qualifier:       -thresh
  Type:            integer
  Description:     This option specifies the overlap threshold for hits. This is the minimum length (residues) of overlap required for two hits with the same accession number to be counted as the same hit. The accession number of the hit, and the start and end point respectively of the hit relative to full length sequence are provided in the lists of hits in the DHF input file. The overlap is determined from the start and end points of the hit. For example two hits with the start and end points of 1-100 and 91-190 respectively are considered to be the same hit if they have the same accession numbers and the overlap threshold is 10 or less.
  Allowed values:  Any integer value
  Default:         10

  Qualifier:       -mode
  Type:            list
  Description:     This option specifies the classification scheme to use. See ROCON on-line documentation for more information.
  Allowed values:  1 (Family classification scheme); 2 ((Not yet available))
  Default:         1

  Qualifier:       [-hitsoutfile] (Parameter 3)
  Type:            outfile
  Description:     This option specifies the name of the hits files (output). A 'hits file' contains a list of hits (e.g. from a prediction method) that are classified and rank-ordered on the basis of score, p-value, E-value etc. The files generated by using SIGSCAN and LIBSCAN will contain the results of a search of a discriminating element (e.g. hidden Markov model, profile or signature) against a sequence database. The ROCPLOT application is run on the files to perform Receiver Operator Characteristic (ROC) analysis on the hits.
  Allowed values:  Output file
  Default:         <*>.rocon

Additional (Optional) qualifiers

  (none)

Advanced (Unprompted) qualifiers

  (none)

Associated qualifiers

  "-hitsoutfile" associated outfile qualifiers

  Qualifier:       -odirectory3, -odirectory_hitsoutfile
  Type:            string
  Description:     Output directory
  Allowed values:  Any string
  Default:         (none)

General qualifiers

Qualifier   Type      Description                                   Allowed values         Default
-auto       boolean   Turn off prompts                              Boolean value Yes/No   N
-stdout     boolean   Write first file to standard output           Boolean value Yes/No   N
-filter     boolean   Read first file from standard input, write    Boolean value Yes/No   N
                      first file to standard output
-options    boolean   Prompt for standard and additional values     Boolean value Yes/No   N
-debug      boolean   Write debug output to program.dbg             Boolean value Yes/No   N
-verbose    boolean   Report some/full command line options         Boolean value Yes/No   Y
-help       boolean   Report command line options and exit. More    Boolean value Yes/No   N
                      information on associated and general
                      qualifiers can be found with -help -verbose
-warning    boolean   Report warnings                               Boolean value Yes/No   Y
-error      boolean   Report errors                                 Boolean value Yes/No   Y
-fatal      boolean   Report fatal errors                           Boolean value Yes/No   Y
-die        boolean   Report dying program messages                 Boolean value Yes/No   Y
-version    boolean   Report version number and exit                Boolean value Yes/No   N
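
The overlap rule described for -thresh can be illustrated with a few lines of Python. This is a sketch of the arithmetic only, assuming inclusive residue numbering; the function names and the accession "P00001" are hypothetical.

```python
def overlap_length(start1, end1, start2, end2):
    """Residues shared by two inclusive ranges (0 if they do not overlap)."""
    return max(0, min(end1, end2) - max(start1, start2) + 1)


def same_hit(acc1, range1, acc2, range2, thresh=10):
    """True if both hits have the same accession and overlap by >= thresh."""
    return acc1 == acc2 and overlap_length(*range1, *range2) >= thresh


# The ranges 1-100 and 91-190 share residues 91..100, i.e. 10 residues.
print(overlap_length(1, 100, 91, 190))
print(same_hit("P00001", (1, 100), "P00001", (91, 190), thresh=10))
print(same_hit("P00001", (1, 100), "P00001", (91, 190), thresh=11))
```

The two hits count as the same hit at a threshold of 10 or less, and as different hits at a threshold of 11 or more.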

6.2 EXAMPLE SESSION

An example of interactive use of ROCON is shown below.


% rocon 
Generate a hits file from comparing two DHF files
Domain hits file: rocon/rocon.dhf
Domain families file: rocon.valid
Overlap threshold for hits. [10]: 10
Classification scheme to use
         1 : Family classification scheme
         2 : (Not yet available)
Select number. [1]: 1
Hits output file [hits.dhf]: rocon.hits
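
The same run can in principle be started without prompts by passing every value as a qualifier. The Python sketch below only assembles the command line from the qualifiers listed in Section 6.0 and the file names of the session above. The helper names build_rocon_command and run_rocon are hypothetical, and the sketch assumes that a rocon executable is installed and on the PATH.

```python
import shutil
import subprocess


def build_rocon_command(hits_in, valid_in, hits_out, thresh=10, mode=1):
    """Return the argument list for a non-interactive rocon run."""
    return [
        "rocon",
        "-hitsinfile", hits_in,
        "-validinfile", valid_in,
        "-thresh", str(thresh),
        "-mode", str(mode),
        "-hitsoutfile", hits_out,
        "-auto",  # turn off prompts
    ]


def run_rocon(hits_in, valid_in, hits_out, thresh=10, mode=1):
    """Run rocon if it is installed; return None when it is not found."""
    cmd = build_rocon_command(hits_in, valid_in, hits_out, thresh, mode)
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=True)


cmd = build_rocon_command("rocon/rocon.dhf", "rocon.valid", "rocon.hits")
print(" ".join(cmd))
```

Calling run_rocon("rocon/rocon.dhf", "rocon.valid", "rocon.hits") would then execute the command, provided the input files exist in the working directory.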





7.0 KNOWN BUGS & WARNINGS

None.


8.0 NOTES

None.

8.1 GLOSSARY OF FILE TYPES

FILE TYPE: Domain hits file
FORMAT: DHF format (FASTA-like).
DESCRIPTION: Database hits (sequences) with domain classification information. The hits are relatives of a SCOP or CATH family (or other node in the structural hierarchies) and are found from a search of a sequence database.
CREATED BY: SEQSEARCH (hits retrieved by PSIBLAST)
SEE ALSO: N.A.

FILE TYPE: Hits file
FORMAT: Text file of classified hits
DESCRIPTION: A list of hits (e.g. from a prediction method) that are classified and rank-ordered on the basis of score, p-value, E-value etc.
CREATED BY: SIGSCAN and LIBSCAN (hits from searches of a discriminating element (hidden Markov model, profile or signature) against a sequence database).
SEE ALSO: ROCPLOT is run on the files to perform Receiver Operator Characteristic (ROC) analysis on the hits.

FILE TYPE: Domain families & ambiguities file
FORMAT: DHF format (FASTA-like).
DESCRIPTION: Contains sequence relatives (hits) for each of a number of different SCOP or CATH families found from PSIBLAST searches of a sequence database. The file contains the collated search results for the individual families; only those hits of unambiguous family assignment are included. Hits of ambiguous family assignment are assigned as relatives of a SCOP or CATH superfamily or fold instead and are collated into a 'domain ambiguities file'.
CREATED BY: SEQSORT
SEE ALSO: N.A.


9.0 DESCRIPTION

Discriminating elements such as hidden Markov models (HMMs), sparse protein signatures and profiles can be generated for a set of proteins with related sequence, structural or functional properties. These discriminators are characteristic of the property considered and can be used diagnostically, for instance, by screening a database of uncharacterised sequences.

Such screens can be performed by using the LIBSCAN and SIGSCAN applications, which generate a DHF file (domain hits file) of database hits (sequences). The hits are relatives of a SCOP or CATH family (or other node in the structural hierarchies) and are found from a search of a sequence database. The DHF file includes domain classification information for the family in question.

When assessing the performance of a predictive method, a "gold standard" of truth is required. This is a set of examples that are known to be related to the discriminating element and, ideally, a further set that are known not to be related. For example, assessing how well a protein family HMM detects true members of that family requires, at least, a list of the known family members. If a method works well for the "gold standard", we can infer that it will work well generally. Increasingly, use is made of databases such as SCOP, in which sequence, structural and functional relationships are classified.

Such a "gold standard" can be generated for SCOP families by using the DOMAINATRIX package, in particular the SEQSORT application. SEQSORT generates a "domain families file" containing sequence relatives (hits) for each of a number of different SCOP or CATH families found from PSIBLAST searches of a sequence database. The file contains the collated search results for the individual families; only those hits of unambiguous family assignment are included.

A powerful way to measure diagnostic performance is to use Receiver Operator Characteristic (ROC) curves, which display graphically the sensitivity and specificity of a method. ROC analysis is implemented in the ROCPLOT application. ROCPLOT requires a "hits file" containing a list of classified hits that are rank-ordered on the basis of score.

The ROCON application was developed to take as input a DHF file of hits and a domain families file of validation sequences (the gold standard), and to generate a hits file for use with ROCPLOT.


10.0 ALGORITHM

The domain families file uses the same format as a DHF file (domain hits file). Thus ROCON takes two DHF files as input: one containing hits (sequences of unknown classification) to be classified, and the domain families file containing sequences (of known classification) that are used to make the classification. A DHF file includes 6 tokens (shown in upper case in the example below) that collectively describe the classification of a sequence: domain class (SCOP and CATH domains), domain architecture (CATH only), domain topology (CATH only), domain fold (SCOP domains only), domain superfamily and domain family (SCOP only).

> Q9WVI4^.^513^667^SCOP^.^55074^CLASS^ARCHITECTURE^TOPOLOGY^FOLD^SUPERFAMILY^FAMILY^PSIBLAST^113.00^0.000e+00^2.000e-25
RKFDDVTMLFSDIVGFTAICAQCTPMQVISMLNELYTRFDHQCGFLDIYKVETIGDAYCVASGLHRKSLCHAKPIALMALKMMELSEEVLTPDGRPIQMRIGIHSG
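
As an illustration of how the classification tokens can be read from a DHF header line, here is a minimal Python sketch. It is not part of ROCON. It assumes that the fields are separated by '^' in the order seen in the example above, that fields 3 and 4 are the start and end points, and that the first numeric field after the search method is the score; these positions are inferred from the example line only. The names DhfRecord and parse_dhf_header are hypothetical.

```python
from typing import NamedTuple


class DhfRecord(NamedTuple):
    """Subset of one DHF header line (hypothetical container)."""
    accession: str
    start: int
    end: int
    classification: dict  # token name -> token value
    score: float


# Order of the six classification tokens as they appear in the example line.
TOKEN_NAMES = ("class", "architecture", "topology",
               "fold", "superfamily", "family")


def parse_dhf_header(line: str) -> DhfRecord:
    """Split a '>' header line on '^' and pick out the fields of interest.

    Field positions are an assumption based on the example line, not on a
    format specification.
    """
    fields = line.lstrip(">").strip().split("^")
    if len(fields) < 15:
        raise ValueError("too few '^'-separated fields in DHF header")
    classification = dict(zip(TOKEN_NAMES, fields[7:13]))
    return DhfRecord(
        accession=fields[0],
        start=int(fields[2]),
        end=int(fields[3]),
        classification=classification,
        score=float(fields[14]),
    )


example = ("> Q9WVI4^.^513^667^SCOP^.^55074^CLASS^ARCHITECTURE^TOPOLOGY"
           "^FOLD^SUPERFAMILY^FAMILY^PSIBLAST^113.00^0.000e+00^2.000e-25")
record = parse_dhf_header(example)
print(record.accession, record.start, record.end, record.score)
```

The SEQSEARCH documentation remains the reference for the DHF format; the sketch only mirrors the example line.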


The values of these tokens (CLASS through to FAMILY but, in the current implementation of the classification scheme, excluding ARCHITECTURE and TOPOLOGY) determine the classification of a hit that is given in the ROCON output file, as follows.

If a hit does not overlap significantly with any validation sequence then the hit is classified as UNKNOWN. A hit and a validation sequence are defined as overlapping if they have identical accession numbers and share a common region of at least a user-defined number of residues (the overlap threshold). The overlap is determined from the start and end points (relative to the full-length sequences) of the hit and validation sequences. For example, a hit and a validation sequence with the same accession number and with start and end points of 1-100 and 91-190 respectively share 10 residues (91-100), so they are defined as overlapping if the overlap threshold is 10 or less.
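
The overlap test can be sketched in Python as follows. This is an illustration under stated assumptions, not ROCON's source code: start and end points are taken to be inclusive residue positions (which is what the 1-100 / 91-190 example implies), and the function names overlap_length and is_overlapping are hypothetical.

```python
def overlap_length(start_a: int, end_a: int, start_b: int, end_b: int) -> int:
    """Number of residues shared by two inclusive ranges (0 if disjoint)."""
    return max(0, min(end_a, end_b) - max(start_a, start_b) + 1)


def is_overlapping(acc_a: str, start_a: int, end_a: int,
                   acc_b: str, start_b: int, end_b: int,
                   threshold: int = 10) -> bool:
    """True if both ranges lie on the same accession and share at least
    'threshold' residues."""
    if acc_a != acc_b:
        return False
    return overlap_length(start_a, end_a, start_b, end_b) >= threshold


# The example from the text: 1-100 and 91-190 share residues 91..100.
shared = overlap_length(1, 100, 91, 190)
print(shared)
print(is_overlapping("Q9YBD5", 1, 100, "Q9YBD5", 91, 190, threshold=10))
print(is_overlapping("Q9YBD5", 1, 100, "Q9YBD5", 91, 190, threshold=11))
```

With a threshold of 10 the two ranges count as overlapping; with a threshold of 11 they do not.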

If a hit does overlap significantly with a validation sequence it is defined as one of TRUE, CROSS or FALSE depending on the value of the tokens (CLASS through to FAMILY) as per the table below.
CLASS          FOLD           SUPERFAMILY    FAMILY         CLASSIFICATION
Not available  Not available  Not available  Not available  UNKNOWN
Different      Different      Different      Different      FALSE
Same           Different      Different      Different      FALSE
Same           Same           Different      Different      CROSS
Same           Same           Same           Different      CROSS
Same           Same           Same           Same           TRUE
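
The table can be expressed as a small decision function. The Python sketch below is an illustration only: it compares one hit with one overlapping validation sequence, represents each classification as a dictionary keyed by token name (a hypothetical layout), and uses None to stand for "no overlapping validation sequence". How ROCON treats a hit that overlaps several validation sequences is not described here and is not modelled.

```python
from typing import Optional

# Tokens compared by the current classification scheme, highest level first.
LEVELS = ("class", "fold", "superfamily", "family")


def classify_hit(hit: dict, validation: Optional[dict]) -> str:
    """Return UNKNOWN, TRUE, CROSS or FALSE following the table.

    'validation' is None when no validation sequence overlaps the hit.
    """
    if validation is None:
        return "UNKNOWN"
    same = [hit[level] == validation[level] for level in LEVELS]
    if all(same):
        return "TRUE"       # same class, fold, superfamily and family
    if same[0] and same[1]:
        return "CROSS"      # same class and fold, lower level differs
    return "FALSE"          # fold (or class) differs


hit = {"class": "Class 1", "fold": "Fold 1",
       "superfamily": "Superfamily 1", "family": "Family 1"}
other_family = dict(hit, family="Family 2")
other_fold = dict(hit, fold="Fold 2", superfamily="Superfamily 9",
                  family="Family 9")

print(classify_hit(hit, hit))
print(classify_hit(hit, other_family))
print(classify_hit(hit, other_fold))
print(classify_hit(hit, None))
```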


Putting this in the context of a real example, imagine an input DHF file containing hits derived from searching a sequence database with a novel type of profile specific to a SCOP family. In this case, the full SCOP classification (Class, Fold etc.) of each hit is putatively assigned. To validate the novel method, a validation file of manually curated sequences of known classification is used. A TRUE hit would be one that overlaps with a validation sequence belonging to the same Family (and by implication Superfamily, Fold and Class) as the hit. A CROSS hit overlaps with a validation sequence of the same Fold as the hit but a different Family, and a FALSE hit overlaps with a validation sequence of a different Fold to the hit.

The hits are rank-ordered on the basis of score before they are written to the hits (output) file.
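
A minimal Python sketch of the rank-ordering step is given below. It assumes that a higher score means a better hit, so hits are sorted in descending order; the sort direction ROCON actually uses is not stated here. The dictionary layout and the name rank_hits are hypothetical, and the scores are those of the three hits in the example input file.

```python
def rank_hits(hits: list) -> list:
    """Return the hits sorted by score, highest first (assumed direction)."""
    return sorted(hits, key=lambda hit: hit["score"], reverse=True)


hits = [
    {"accession": "Q9YBD5", "start": 11, "end": 105, "score": 61.50},
    {"accession": "Q9YBD5", "start": 95, "end": 135, "score": 11.50},
    {"accession": "Q9YBD5", "start": 181, "end": 235, "score": 161.50},
]
ranked = rank_hits(hits)
for hit in ranked:
    print(hit["start"], hit["end"], hit["score"])
```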


11.0 RELATED APPLICATIONS

See also

Program name    Description
cathparse       Generate DCF file from raw CATH files
domainalign     Generate alignments (DAF file) for nodes in a DCF file
domainnr        Remove redundant domains from a DCF file
domainrep       Reorder DCF file to identify representative structures
domainseqs      Add sequence records to a DCF file
domainsse       Add secondary structure records to a DCF file
helixturnhelix  Identify nucleic acid-binding motifs in protein sequences
libgen          Generate discriminating elements from alignments
matgen3d        Generate a 3D-1D scoring matrix from CCF files
pepcoil         Predict coiled coil regions in protein sequences
rocplot         Perform ROC analysis on hits files
scopparse       Generate DCF file from raw SCOP files
seqalign        Extend alignments (DAF file) with sequences (DHF file)
seqfraggle      Remove fragment sequences from DHF files
seqsort         Remove ambiguous classified sequences from DHF files
seqwords        Generate DHF files from keyword search of UniProt
ssematch        Search a DCF file for secondary structure matches



12.0 DIAGNOSTIC ERROR MESSAGES

None.


13.0 AUTHORS

Jon Ison (jison@ebi.ac.uk)
The European Bioinformatics Institute Wellcome Trust Genome Campus Cambridge CB10 1SD UK


14.0 REFERENCES

Please cite the authors and EMBOSS.

Rice P, Longden I and Bleasby A (2000) "EMBOSS - The European Molecular Biology Open Software Suite" Trends in Genetics, 16(6):276-277.

See also http://emboss.sourceforge.net/

14.1 Other useful references