SEQSORT documentation


 


CONTENTS

1.0 SUMMARY
2.0 INPUTS & OUTPUTS
3.0 INPUT FILE FORMAT
4.0 OUTPUT FILE FORMAT
5.0 DATA FILES
6.0 USAGE
7.0 KNOWN BUGS & WARNINGS
8.0 NOTES
9.0 DESCRIPTION
10.0 ALGORITHM
11.0 RELATED APPLICATIONS
12.0 DIAGNOSTIC ERROR MESSAGES
13.0 AUTHORS
14.0 REFERENCES



1.0 SUMMARY

Remove ambiguous classified sequences from DHF files


2.0 INPUTS & OUTPUTS

SEQSORT reads a directory of DHF files (domain hits files) where each file contains hits to a single SCOP family, compares, processes and collates the hits and writes a directory of DHF files which contain only those hits that could be uniquely assigned to a SCOP family. Optionally, two further files of hits are written: (i) a domain families file, of ALL hits from the input files that could be uniquely assigned to a SCOP family and (ii) a domain ambiguities file, of hits from ALL the input files that are of ambiguous family assignment and are assigned as relatives to a SCOP superfamily or fold instead.
The path for the domain hits files (input and output) and the names of the output files are specified by the user. The file extension of the DHF files is set in the ACD file.


3.0 INPUT FILE FORMAT

The format of the domain hits file is described in the SEQSEARCH documentation.


4.0 OUTPUT FILE FORMAT

The format of the domain hits file is described in the SEQSEARCH documentation.

The domain families file and domain ambiguities file also use the DHF format. Whereas normally a DHF file contains hits for a single node from SCOP or CATH, the families and ambiguities files may contain domains from multiple different families (domain families file), or superfamilies or folds (ambiguities file). Domains of the same node (e.g. family) will be grouped together in blocks, i.e. all hits for domain A, then all hits for domain B and so on (see Figure 1).

Output files for usage example

File: fam.dhf

> Q9YBD5^.^1^95^SCOP^.^54894^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain^Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain^.^56.10^0.000e+00^9.000e+00
VRKIRSGVVIDHIPPGRAFTMLKALGLLPPRGYRWRIAVVINAESSKLGRKDILKIEGYKPRQRDLEVLGIIAPGATFNVIEDYKVVEKVKLKLP
> Q97FS4^.^1^90^SCOP^.^54894^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain^Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain^.^43.40^0.000e+00^6.000e+00
INSIKNGIVIDHIKAGHGIKIYNYLKLGEAEFPTALIMNAISKKNKAKDIIKIENVMDLDLAVLGFLDPNITVNIIEDEKIRQKIQLKLP
> Q7MX57^.^1^92^SCOP^.^54894^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain^Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain^.^73.80^0.000e+00^5.000e+00
VAAIRNGIVIDHIPPTKLFKVATLLQLDDLDKRITIGNNLRSRSHGSKGVIKIEDKTFEEEELNRIALIAPNVRLNIIRDYEVVEKRQVEVP
> P96111^.^1^98^SCOP^.^54894^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain^Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain^.^43.00^0.000e+00^9.000e+00
GIKPIENGTVIDHIAKGKTPEEIYSTILKIRKILRLYDVDSADGIFRSSDGSFKGYISLPDRYLSKKEIKKLSAISPNTTVNIIKNSTVVEKYRIKLP
> Q08462^.^1^167^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase catalytic domain^.^46.20^0.000e+00^4.000e+00
DCVCVMFASIPDFKEFYTESDVNKEGLECLRLLNEIIADFDDLLSKPKFSGVEKIKTIGSTYMAATGLSAVPSQEHSQEPERQYMHIGTMVEFAFALVGKLDAINKHSFNDFKLRVGINHGPVIAGVIGAQKPQYDIWGNTVNVASRMDSTGVLDKIQVTEETSLVL
> Q03101^.^1^149^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase catalytic domain^.^65.80^0.000e+00^4.000e+00
NNACVFFLDIAGFTRFSSIHSPEQVIQVLIKIFNSMDLLCAKHGIEKIKTIGDAYMATCGIFPKCDDIRHNTYKMLGFAMDVLEFIPKEMSFHLGLQVRVGIHCGPVISGVISGYAKPHFDVWGDTVNVASRMESTGIAGQIHVSDRVY
> Q02153^.^1^165^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase catalytic domain^.^68.90^0.000e+00^4.000e+00
HKRPVPAKRYDNVTILFSGIVGFNAFCSKHASGEGAMKIVNLLNDLYTRFDTLTDSRKNPFVYKVETVGDKYMTVSGLPEPCIHHARSICHLALDMMEIAGQVQVDGESVQITIGIHTGEVVTGVIGQRMPRYCLFGNTVNLTSRTETTGEKGKINVSEYTYRCL
> P46197^.^1^168^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase catalytic domain^.^78.50^0.000e+00^7.000e+00
VQAEAFDSVTIYFSDIVGFTALSAESTPMQVVTLLNDLYTCFDAIIDNFDVYKVETIGDAYMVVSGLPGRNGQRHAPEIARMALALLDAVSSFRIRHRPHDQLRLRIGVHTGPVCAGVVGLKMPRYCLFGDTVNTASRMESNGQALKIHVSSTTKDALDELGCFQLEL
> P40137^.^1^139^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase catalytic domain^.^48.50^0.000e+00^6.000e+00
VTLLFADIRDFTSLSERLRPEQVVTLLNEYYGRMVEVVFRHGGTLDKFIGDALMVYFGAPIADPAHARRGVQCALDMVQELETVNALRSARGEPCLRIGVGVHTGPAVLGNIGSATRRLEYTAIGDTVNLASRIESLTK
> P23466^.^1^154^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase catalytic domain^.^50.80^0.000e+00^1.000e+00
PTGNVAIVFTDIKNSTFLWELFPDAMRAAIKTHNDIMRRQLRIYGGYEVKTEGDAFMVAFPTPTSALVWCLSVQLKLLEAEWPEEITSIQDGCLITDNSGTKVYLGLSVRMGVHWGCPVPEIDLVTQRMDYLGPVVNKAARVSGVADGGQITLS
> O30820^.^1^149^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase catalytic domain^.^75.40^0.000e+00^6.000e+00
DEASVLFADIVGFTERASSTAPADLVRFLDRLYSAFDELVDQHGLEKIKVSGDSYMVVSGVPRPRPDHTQALADFALDMTNVAAQLKDPRGNPVPLRVGLATGPVVAGVVGSRRFFYDVWGDAVNVASRMESTDSVGQIQVPDEVYERL

File: oth.dhf


File: 54894.dhf

> Q9YBD5^.^1^95^SCOP^.^54894^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain^Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain^.^56.10^0.000e+00^9.000e+00
VRKIRSGVVIDHIPPGRAFTMLKALGLLPPRGYRWRIAVVINAESSKLGRKDILKIEGYKPRQRDLEVLGIIAPGATFNVIEDYKVVEKVKLKLP
> Q97FS4^.^1^90^SCOP^.^54894^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain^Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain^.^43.40^0.000e+00^6.000e+00
INSIKNGIVIDHIKAGHGIKIYNYLKLGEAEFPTALIMNAISKKNKAKDIIKIENVMDLDLAVLGFLDPNITVNIIEDEKIRQKIQLKLP
> Q7MX57^.^1^92^SCOP^.^54894^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain^Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain^.^73.80^0.000e+00^5.000e+00
VAAIRNGIVIDHIPPTKLFKVATLLQLDDLDKRITIGNNLRSRSHGSKGVIKIEDKTFEEEELNRIALIAPNVRLNIIRDYEVVEKRQVEVP
> P96111^.^1^98^SCOP^.^54894^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain^Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain^.^43.00^0.000e+00^9.000e+00
GIKPIENGTVIDHIAKGKTPEEIYSTILKIRKILRLYDVDSADGIFRSSDGSFKGYISLPDRYLSKKEIKKLSAISPNTTVNIIKNSTVVEKYRIKLP

File: 55074.dhf

> Q08462^.^1^167^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase catalytic domain^.^46.20^0.000e+00^4.000e+00
DCVCVMFASIPDFKEFYTESDVNKEGLECLRLLNEIIADFDDLLSKPKFSGVEKIKTIGSTYMAATGLSAVPSQEHSQEPERQYMHIGTMVEFAFALVGKLDAINKHSFNDFKLRVGINHGPVIAGVIGAQKPQYDIWGNTVNVASRMDSTGVLDKIQVTEETSLVL
> Q03101^.^1^149^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase catalytic domain^.^65.80^0.000e+00^4.000e+00
NNACVFFLDIAGFTRFSSIHSPEQVIQVLIKIFNSMDLLCAKHGIEKIKTIGDAYMATCGIFPKCDDIRHNTYKMLGFAMDVLEFIPKEMSFHLGLQVRVGIHCGPVISGVISGYAKPHFDVWGDTVNVASRMESTGIAGQIHVSDRVY
> Q02153^.^1^165^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase catalytic domain^.^68.90^0.000e+00^4.000e+00
HKRPVPAKRYDNVTILFSGIVGFNAFCSKHASGEGAMKIVNLLNDLYTRFDTLTDSRKNPFVYKVETVGDKYMTVSGLPEPCIHHARSICHLALDMMEIAGQVQVDGESVQITIGIHTGEVVTGVIGQRMPRYCLFGNTVNLTSRTETTGEKGKINVSEYTYRCL
> P46197^.^1^168^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase catalytic domain^.^78.50^0.000e+00^7.000e+00
VQAEAFDSVTIYFSDIVGFTALSAESTPMQVVTLLNDLYTCFDAIIDNFDVYKVETIGDAYMVVSGLPGRNGQRHAPEIARMALALLDAVSSFRIRHRPHDQLRLRIGVHTGPVCAGVVGLKMPRYCLFGDTVNTASRMESNGQALKIHVSSTTKDALDELGCFQLEL
> P40137^.^1^139^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase catalytic domain^.^48.50^0.000e+00^6.000e+00
VTLLFADIRDFTSLSERLRPEQVVTLLNEYYGRMVEVVFRHGGTLDKFIGDALMVYFGAPIADPAHARRGVQCALDMVQELETVNALRSARGEPCLRIGVGVHTGPAVLGNIGSATRRLEYTAIGDTVNLASRIESLTK
> P23466^.^1^154^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase catalytic domain^.^50.80^0.000e+00^1.000e+00
PTGNVAIVFTDIKNSTFLWELFPDAMRAAIKTHNDIMRRQLRIYGGYEVKTEGDAFMVAFPTPTSALVWCLSVQLKLLEAEWPEEITSIQDGCLITDNSGTKVYLGLSVRMGVHWGCPVPEIDLVTQRMDYLGPVVNKAARVSGVADGGQITLS
> O30820^.^1^149^SCOP^.^55074^Alpha and beta proteins (a+b)^.^.^Ferredoxin-like^Adenylyl and guanylyl cyclase catalytic domain^Adenylyl and guanylyl cyclase catalytic domain^.^75.40^0.000e+00^6.000e+00
DEASVLFADIVGFTERASSTAPADLVRFLDRLYSAFDELVDQHGLEKIKVSGDSYMVVSGVPRPRPDHTQALADFALDMTNVAAQLKDPRGNPVPLRVGLATGPVVAGVVGSRRFFYDVWGDAVNVASRMESTDSVGQIQVPDEVYERL
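
The example files above can be read with a few lines of Python. The following is a minimal sketch, not part of SEQSORT: it assumes only what is visible in the examples, namely that each record is a header line starting with '>' whose fields are separated by '^', followed by one or more sequence lines. The interpretation of field 1 as the accession, fields 3 and 4 as start and end, and field 7 as the SCOP node identifier is inferred from the examples; the authoritative field list is in the SEQSEARCH documentation. The function names read_dhf and group_by_node are hypothetical.

```python
def read_dhf(lines):
    """Yield (fields, sequence) for each record of a FASTA-like DHF stream.

    `fields` is the header split on '^' (assumed delimiter, as seen in the
    example files); `sequence` is the concatenation of the sequence lines.
    """
    fields, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if fields is not None:
                yield fields, "".join(seq)
            fields = line[1:].strip().split("^")
            seq = []
        else:
            seq.append(line)
    if fields is not None:
        yield fields, "".join(seq)


def group_by_node(records, node_field=6):
    """Group records by the node identifier (assumed to be the 7th field).

    Returns a dict mapping identifier -> list of
    (accession, start, end, sequence) tuples, in file order.
    """
    groups = {}
    for fields, seq in records:
        entry = (fields[0], int(fields[2]), int(fields[3]), seq)
        groups.setdefault(fields[node_field], []).append(entry)
    return groups
```

Under these assumptions, calling group_by_node(read_dhf(fh)) on an open handle to the fam.dhf shown above should give two groups, keyed '54894' and '55074', which correspond to the two per-family output files.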




5.0 DATA FILES

None.


6.0 USAGE

6.1 COMMAND LINE ARGUMENTS

Remove ambiguous classified sequences from DHF files.
Version: EMBOSS:6.6.0.0

   Standard (Mandatory) qualifiers (* if not always prompted):
  [-dhfindir]          directory  [./] This option specifies the location of
                                  DHF files (domain hits files) (input). A
                                  'domain hits file' contains database hits
                                  (sequences) with domain classification
                                  information, in the DHF format (FASTA or
                                  EMBL-like). The hits are relatives to a SCOP
                                  or CATH family and are found from a search
                                  of a sequence database. Files containing
                                  hits retrieved by PSIBLAST are generated by
                                  using SEQSEARCH.
   -overlap            integer    [10] This option specifies the number of
                                  overlapping residues required for merging of
                                  two hits. Each family is also processed so
                                  that ovlerapping hits (hits with identical
                                  accesssion number that overlap by at least a
                                  user-defined number of residues) are
                                  replaced by a hit that is produced from
                                  merging the two overlapping hits. (Any
                                  integer value)
   -dofamilies         toggle     [N] This option specifies to write a domain
                                  families file. If this option is set a
                                  domain families file is written.
   -doambiguities      toggle     [N] This option specifies whether to write a
                                  domain ambiguities file. If this option is
                                  set a domain ambiguities file is written.
  [-dhfoutdir]         outdir     [./] This option specifies the location of
                                  DHF files (domain hits files) (output). A
                                  'domain hits file' contains database hits
                                  (sequences) with domain classification
                                  information, in the DHF format (FASTA or
                                  EMBL-like). The hits are relatives to a SCOP
                                  or CATH family and are found from a search
                                  of a sequence database. Files containing
                                  hits retrieved by PSIBLAST are generated by
                                  using SEQSEARCH.
*  -hitsfile           outfile    [fam.dhf] This option specifies the name of
                                  domain families file (output). A 'domain
                                  families file' contains sequence relatives
                                  (hits) for each of a number of different
                                  SCOP or CATH families found from searching a
                                  sequence database, e.g. by using SEQSEARCH
                                  (psiblast). The file contains the collated
                                  search results for the indvidual families;
                                  only those hits of unambiguous family
                                  assignment are included. Hits of ambiguous
                                  family assignment are assigned as relatives
                                  to a SCOP or CATH superfamily or fold
                                  instead and are collated into a 'domain
                                  ambiguities file'. The domain families and
                                  ambiguities files are generated by using
                                  SEQSORT and use the same format as a DHF
                                  file (domain hits file).
*  -ambigfile          outfile    [oth.dhf] This option specifies the name of
                                  domain ambiguities file (output). A 'domain
                                  families file' contains sequence relatives
                                  (hits) for each of a number of different
                                  SCOP or CATH families found from searching a
                                  sequence database, e.g. by using SEQSEARCH
                                  (psiblast). The file contains the collated
                                  search results for the indvidual families;
                                  only those hits of unambiguous family
                                  assignment are included. Hits of ambiguous
                                  family assignment are assigned as relatives
                                  to a SCOP or CATH superfamily or fold
                                  instead and are collated into a 'domain
                                  ambiguities file'. The domain families and
                                  ambiguities files are generated by using
                                  SEQSORT and use the same format as a DHF
                                  file (domain hits file).

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-dhfindir" associated qualifiers
   -extension1         string     Default file extension

   "-dhfoutdir" associated qualifiers
   -extension2         string     Default file extension

   "-hitsfile" associated qualifiers
   -odirectory         string     Output directory

   "-ambigfile" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit

Qualifier | Type | Description | Allowed values | Default
Standard (Mandatory) qualifiers
[-dhfindir] (Parameter 1) | directory | This option specifies the location of DHF files (domain hits files) (input). A 'domain hits file' contains database hits (sequences) with domain classification information, in the DHF format (FASTA or EMBL-like). The hits are relatives to a SCOP or CATH family and are found from a search of a sequence database. Files containing hits retrieved by PSIBLAST are generated by using SEQSEARCH. | Directory | ./
-overlap | integer | This option specifies the number of overlapping residues required for merging of two hits. Each family is also processed so that overlapping hits (hits with identical accession number that overlap by at least a user-defined number of residues) are replaced by a hit that is produced from merging the two overlapping hits. | Any integer value | 10
-dofamilies | toggle | This option specifies whether to write a domain families file. If this option is set a domain families file is written. | Toggle value Yes/No | No
-doambiguities | toggle | This option specifies whether to write a domain ambiguities file. If this option is set a domain ambiguities file is written. | Toggle value Yes/No | No
[-dhfoutdir] (Parameter 2) | outdir | This option specifies the location of DHF files (domain hits files) (output). A 'domain hits file' contains database hits (sequences) with domain classification information, in the DHF format (FASTA or EMBL-like). The hits are relatives to a SCOP or CATH family and are found from a search of a sequence database. Files containing hits retrieved by PSIBLAST are generated by using SEQSEARCH. | Output directory | ./
-hitsfile | outfile | This option specifies the name of domain families file (output). A 'domain families file' contains sequence relatives (hits) for each of a number of different SCOP or CATH families found from searching a sequence database, e.g. by using SEQSEARCH (psiblast). The file contains the collated search results for the individual families; only those hits of unambiguous family assignment are included. Hits of ambiguous family assignment are assigned as relatives to a SCOP or CATH superfamily or fold instead and are collated into a 'domain ambiguities file'. The domain families and ambiguities files are generated by using SEQSORT and use the same format as a DHF file (domain hits file). | Output file | fam.dhf
-ambigfile | outfile | This option specifies the name of domain ambiguities file (output). A 'domain families file' contains sequence relatives (hits) for each of a number of different SCOP or CATH families found from searching a sequence database, e.g. by using SEQSEARCH (psiblast). The file contains the collated search results for the individual families; only those hits of unambiguous family assignment are included. Hits of ambiguous family assignment are assigned as relatives to a SCOP or CATH superfamily or fold instead and are collated into a 'domain ambiguities file'. The domain families and ambiguities files are generated by using SEQSORT and use the same format as a DHF file (domain hits file). | Output file | oth.dhf
Additional (Optional) qualifiers
(none)
Advanced (Unprompted) qualifiers
(none)
Associated qualifiers
"-dhfindir" associated directory qualifiers
-extension1, -extension_dhfindir | string | Default file extension | Any string | dhf
"-dhfoutdir" associated outdir qualifiers
-extension2, -extension_dhfoutdir | string | Default file extension | Any string | dhf
"-hitsfile" associated outfile qualifiers
-odirectory | string | Output directory | Any string |
"-ambigfile" associated outfile qualifiers
-odirectory | string | Output directory | Any string |
General qualifiers
-auto | boolean | Turn off prompts | Boolean value Yes/No | N
-stdout | boolean | Write first file to standard output | Boolean value Yes/No | N
-filter | boolean | Read first file from standard input, write first file to standard output | Boolean value Yes/No | N
-options | boolean | Prompt for standard and additional values | Boolean value Yes/No | N
-debug | boolean | Write debug output to program.dbg | Boolean value Yes/No | N
-verbose | boolean | Report some/full command line options | Boolean value Yes/No | Y
-help | boolean | Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose | Boolean value Yes/No | N
-warning | boolean | Report warnings | Boolean value Yes/No | Y
-error | boolean | Report errors | Boolean value Yes/No | Y
-fatal | boolean | Report fatal errors | Boolean value Yes/No | Y
-die | boolean | Report dying program messages | Boolean value Yes/No | Y
-version | boolean | Report version number and exit | Boolean value Yes/No | N

6.2 EXAMPLE SESSION

An example of interactive use of SEQSORT is shown below.


% seqsort 
Remove ambiguous classified sequences from DHF files.
Domain hits directory [./]: ../seqnr-keep/hitsnr
Number of overlapping residues required for merging of two hits. [10]: 10
Write domain families file. [N]: Y
Write domain ambiguities file. [N]: Y
Domain hits file output directory [./]: 

The output files for this example are shown in section 4.0.
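
The same run can also be scripted without prompts. The sketch below only assembles an argument list from the qualifiers documented in section 6.1; it does not run the program. It assumes the usual EMBOSS convention that a toggle qualifier given on its own (e.g. -dofamilies) is taken as set, and the helper name seqsort_command is hypothetical.

```python
import shlex


def seqsort_command(indir, outdir="./", overlap=10,
                    families=None, ambiguities=None):
    """Build a non-interactive seqsort command line as a list of arguments.

    `families` and `ambiguities` are optional output file names; when given,
    the corresponding toggle (-dofamilies / -doambiguities) is switched on.
    """
    cmd = ["seqsort",
           "-dhfindir", indir,
           "-dhfoutdir", outdir,
           "-overlap", str(overlap)]
    if families:
        cmd += ["-dofamilies", "-hitsfile", families]
    if ambiguities:
        cmd += ["-doambiguities", "-ambigfile", ambiguities]
    cmd.append("-auto")  # turn off prompts, as listed under General qualifiers
    return cmd


def as_shell_line(cmd):
    """Return the command as a single, safely quoted shell line."""
    return shlex.join(cmd)
```

For the session above, seqsort_command("../seqnr-keep/hitsnr", families="fam.dhf", ambiguities="oth.dhf") gives a list that can be passed to subprocess.run, provided seqsort is installed and on the PATH.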




7.0 KNOWN BUGS & WARNINGS

None.


8.0 NOTES

None.

8.1 GLOSSARY OF FILE TYPES

FILE TYPE | FORMAT | DESCRIPTION | CREATED BY | SEE ALSO
Domain hits file | DHF format (FASTA-like). | Database hits (sequences) with domain classification information. The hits are relatives to a SCOP or CATH family (or other node in the structural hierarchies) and are found from a search of a discriminating element (e.g. a protein signature, hidden Markov model, simple frequency matrix, Gribskov profile or Henikoff profile) against a sequence database. | SEQSEARCH (hits retrieved by PSIBLAST). SIGSCAN (hits retrieved by sparse protein signature). LIBSCAN (hits retrieved by various types of HMM and profile). | N.A.
Domain families & ambiguities file | | Contains sequence relatives (hits) for each of a number of different SCOP or CATH families found from PSIBLAST searches of a sequence database. The file contains the collated search results for the individual families; only those hits of unambiguous family assignment are included. Hits of ambiguous family assignment are assigned as relatives to a SCOP or CATH superfamily or fold instead and are collated into a 'domain ambiguities file'. The domain families and ambiguities files are generated by using SEQSORT and use the same format as a DHF file (domain hits file). | | N.A.
Domain validation file | | Contains sequence relatives (hits) for each of a number of different SCOP or CATH families, superfamilies and folds. The file contains the collated results from PSIBLAST searches of a sequence database for the individual families; hits of unambiguous family assignment are listed under their respective family, otherwise a hit is assigned as a relative to a superfamily or fold instead. The domain validation file is generated by using SEQNR and uses the same format as a DHF file (domain hits file). | | N.A.


9.0 DESCRIPTION

The results of multiple searches of a sequence database using a homology search tool such as BLAST may contain overlapping or identical hits, especially where the query sequences are related, for instance where they belong to different families of the same superfamily. For certain analyses it is desirable to assign a hit with confidence to a unique family, or otherwise to assign it as a member of a larger superfamily or fold instead. SEQSORT reads a directory of DHF files (domain hits files), where each file contains hits to a different SCOP family, then compares, processes and collates the hits and writes a directory of DHF files which contain only those hits that could be uniquely assigned to a SCOP family. Optionally, two further files are written: (i) a domain families file, of hits (from ALL the input files) that could be uniquely assigned to a SCOP family and (ii) a domain ambiguities file, of hits (from ALL the input files) of ambiguous family assignment, which are assigned as relatives to a SCOP superfamily or fold instead.


10.0 ALGORITHM

A rough outline of the algorithm follows; a better description will appear in a publication in preparation. Hits from searches for all domain families are collated into a single list and the list is sorted according to family name. The hits within each family are sorted by accession number, then hits within a family and with identical accession number are sorted by the start position of the hit relative to the full-length sequence in Swiss-Prot. In each family, identical hits (i.e. those with identical accession number and the same start and end points relative to the full-length sequence in Swiss-Prot) are removed, leaving only a single copy. Each family is also processed so that overlapping hits (hits with identical accession number that overlap by at least a user-defined number of residues) are replaced by a hit that is produced from merging the two overlapping hits. If two hits have the same accession number and overlap but are from searches for different families, the hits are merged and the merged hit is placed into a new list for hits to superfamilies (if the two families belong to the same superfamily) or for hits to folds (if the two families are in different superfamilies but the same fold). In this way hits that are unique to a particular family are identified, and hits of ambiguous family assignment are assigned as belonging to a superfamily or fold instead.
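
The outline above can be illustrated with a short Python sketch. This is not the SEQSORT source code: it is a simplified reading of the outline in which a hit is reduced to a family label, an accession number and start/end positions. The names Hit, overlap_len, process_family and split_unique_ambiguous are hypothetical. The sketch flags cross-family overlaps as ambiguous but does not merge them or place them into a superfamily or fold list, because that step needs classification data not modelled here.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Hit:
    family: str
    accession: str
    start: int  # positions relative to the full-length sequence
    end: int


def overlap_len(a, b):
    """Number of residues shared by two hits (zero or negative if disjoint)."""
    return min(a.end, b.end) - max(a.start, b.start) + 1


def process_family(hits, overlap=10):
    """Sort one family's hits, drop identical copies, merge overlapping hits."""
    # set() removes identical hits; sorting is by accession, then start.
    ordered = sorted(set(hits), key=lambda h: (h.accession, h.start, h.end))
    merged = []
    for hit in ordered:
        prev = merged[-1] if merged else None
        if (prev is not None and prev.accession == hit.accession
                and overlap_len(prev, hit) >= overlap):
            # Replace the previous hit by the union of the two ranges.
            merged[-1] = Hit(hit.family, hit.accession,
                             prev.start, max(prev.end, hit.end))
        else:
            merged.append(hit)
    return merged


def split_unique_ambiguous(hits, overlap=10):
    """Return (unique, ambiguous) hits after per-family processing.

    A hit is ambiguous if it overlaps a hit with the same accession
    number that came from a search for a different family.
    """
    by_family = lambda h: h.family
    flat = []
    for _, group in groupby(sorted(hits, key=by_family), key=by_family):
        flat.extend(process_family(list(group), overlap))
    ambiguous = set()
    for i, a in enumerate(flat):
        for b in flat[i + 1:]:
            if (a.family != b.family and a.accession == b.accession
                    and overlap_len(a, b) >= overlap):
                ambiguous.update((a, b))
    unique = [h for h in flat if h not in ambiguous]
    return unique, [h for h in flat if h in ambiguous]
```

The pairwise comparison is quadratic in the number of hits, which keeps the sketch short; the sorted order described in the outline would allow overlapping hits to be found in a single pass instead.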


11.0 RELATED APPLICATIONS

See also

Program name    Description
cathparse       Generate DCF file from raw CATH files
domainalign     Generate alignments (DAF file) for nodes in a DCF file
domainnr        Remove redundant domains from a DCF file
domainrep       Reorder DCF file to identify representative structures
domainseqs      Add sequence records to a DCF file
domainsse       Add secondary structure records to a DCF file
helixturnhelix  Identify nucleic acid-binding motifs in protein sequences
libgen          Generate discriminating elements from alignments
matgen3d        Generate a 3D-1D scoring matrix from CCF files
pepcoil         Predict coiled coil regions in protein sequences
rocon           Generate a hits file from comparing two DHF files
rocplot         Perform ROC analysis on hits files
scopparse       Generate DCF file from raw SCOP files
seqalign        Extend alignments (DAF file) with sequences (DHF file)
seqfraggle      Remove fragment sequences from DHF files
seqwords        Generate DHF files from keyword search of UniProt
ssematch        Search a DCF file for secondary structure matches



12.0 DIAGNOSTIC ERROR MESSAGES

None.


13.0 AUTHORS

Ranjeeva Ranasinghe

Jon Ison (jison@ebi.ac.uk)
The European Bioinformatics Institute Wellcome Trust Genome Campus Cambridge CB10 1SD UK


14.0 REFERENCES

Please cite the authors and EMBOSS.

Rice P, Longden I and Bleasby A (2000) "EMBOSS - The European Molecular Biology Open Software Suite" Trends in Genetics, 16:276-277.

See also http://emboss.sourceforge.net/

14.1 Other useful references