PIR Database File Structure and Format Specification

             P R O T E I N  S E Q U E N C E  D A T A B A S E
                           of PIR-International
                        PIR Document PRDBFS-1293
            Database File Structure and Format Specification
                       Version 6.0, December 1993
                  Protein Information Resource (PIR)*
                National Biomedical Research Foundation
                       3900 Reservoir Road, N.W.,
                       Washington, DC  20007, USA

 Japan International Protein           Munich Information Center for 
Information Database (JIPID)             Protein Sequences (MIPS)
       Amakubo 1-16-1          GSF-Forschungszentrum f. Umwelt und Gesundheit
  Tsukuba 305-0005, Japan            am Max-Planck-Institut f. Biochemie
                                Am Klopferspitz 18, D-82152 Martinsried, FRG

  This database may be redistributed without prior consent, provided
  that this notice be given to each user and that the words "Derived
  from" shall precede this notice if the database has been altered
  by the redistributor.
  We have made every effort to ensure proper functioning of the
  programs and cannot be held responsible for the consequences to
  users of any problems encountered during their operation.
                    *PIR is a registered mark of NBRF
      PIR is partially supported by the National Library of Medicine

This Document describes the files comprising the PIR-International Protein Sequence Database and the format of each. The format has been enhanced significantly for Release 39.00 to what is referred to as "enhanced NBRF" format. A Technical Development Bulletin is available upon request. Each data set contains primary files, index files and auxiliary files.

1.0 Protein Sequence Database Files

The Protein Sequence Database is divided into three Sections: Section 1: Annotated and Classified Entries, Section 2: Preliminary Entries, and Section 3: Unverified Entries. The files corresponding to these sections have the file names PIR1, PIR2, and PIR3, respectively. Each section is composed of the following files; however, because Section 3 (PIR3) contains only minimal information, it may lack certain index files.

Primary Database Files (required):

PIRn.SEQ: primary file containing the title and sequence for each entry
PIRn.REF: primary file containing the title and text information for each entry
PIRn.INX: primary index file for the SEQ and REF files; it allows PSQ to use the VAX-11 RMS RFA record access mode for random access of the primary files.

Information associated with one sequence database entry is split between the .REF and .SEQ files; the amino acid sequence is contained in the .SEQ file and all annotation is contained in the .REF file. The first two records of the .REF file are duplicated in the .SEQ file. This description concentrates on file structure and not on the conceptual view of an entry as seen when using PSQ or other software.

!!!!!!!!!!!!!!NOTE CHANGES IN FILES AVAILABLE !!!!!!!!!!!!!!!!!
IT IS PIR'S INTENTION TO STOP PROVIDING INDEX FILES FOR THE
NBRF-PIR FORMAT AND THE CODATA FORMAT AFTER RELEASE 66.00!!
THE *.REF, *.SEQ, *.NAM, AND *.DAT FLAT FILES WILL CONTINUE.
IF YOU WISH US TO CONTINUE TO PROVIDE ANY OF THE INDEX FILES
PLEASE LET US KNOW. CONTACT PIRMAIL@NBRF.GEORGETOWN.EDU .
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Database Index Files (optional):

PIRn.CAX: file to support the PSQ SELECT and REPORT commands
PIRn.CDX: file to support the PSQ REPORT, SELECT, SUPERFAMILY, and TAXONOMY commands
PIRn.ACX: file to support the PSQ ACCESSION command
PIRn.AUX: file to support the PSQ AUTHOR command
PIRn.CRX*: cross-reference index
PIRn.FTX: file to support the PSQ FEATURE command
PIRn.GNX*: gene name index
PIRn.JRX*: journal index
PIRn.RNX*: reference number index
PIRn.SFX: file to support the PSQ SUPERFAMILY/NAME command
PIRn.SNX*: superfamily number index
PIRn.SPX: file to support the PSQ SPECIES command
PIRn.TSC: file to support the PSQ SCAN command
PIRn.TTX: entry Title index
PIRn.WOX: file to support the PSQ KEYWORD command

(*) index files not supported by PSQ

Auxiliary Database Files (optional):

PIRn.NAM: file to support the PSQ SHOW command
PIRn.CHN: list of changed entry identification codes
PIRn.TTL: file containing title information for each entry (the table of contents) (optional)
PRINDX.LIS: short directory of PIR1 and PIR2 ordered by superfamily number
SUPFAMNUM.LIS: file of superfamily names ordered by superfamily number
JOURNALS.LIS: file of journal abbreviations
TAXONOMY.LIS: file of species classifications
SGC.LIS: file of special genetic codes
AVAIL.ENZ: restriction enzyme file used by PSQ ENZYME command
LONG.ENZ: restriction enzyme file used by PSQ ENZYME command
MERGED.ENZ: restriction enzyme file used by PSQ ENZYME command
NBRF.ENZ: restriction enzyme file used by PSQ ENZYME command
SHORT.ENZ: restriction enzyme file used by PSQ ENZYME command

1.1 Primary Database Files Description

The .SEQ and .REF files contain sequence and accompanying annotation for all entries in the data sets. These are ASCII flat files with characteristics such as sequential access and variable length records (with a maximum of 500 bytes). FORTRAN programs create this type of file by using the OPEN keyword values shown below. Files created by programs written in other languages should have similar characteristics.

    OPEN Keyword       Keyword value
    ------------       -------------
    ORGANIZATION       Sequential
    ACCESS             Sequential
    FORM               Formatted
    RECORDTYPE         Variable
    RECORDSIZE         500 (or less)
    CARRIAGECONTROL    List

The .INX file is the main index file containing byte offsets into the .SEQ, .REF and .TTL (if present) files; the PSQ program and other PIR-International software do quick entry lookup via this index. If the .SEQ, .REF, or .TTL file is altered in any way, the information in the .INX file becomes invalid and the database system programs will not operate. To recreate the .INX file, use the program CREATEINX (see the PIR document CREATEDB, Protein Database Creating Programs, for more details).

Below is a table of .INX file characteristics.

    OPEN Keyword       Keyword value
    ------------       -------------
    ORGANIZATION       Sequential
    ACCESS             Direct
    FORM               Unformatted
    RECORDTYPE         Fixed
    RECORDSIZE         (512/4)

PIR-supplied application programs generally use only the .SEQ and .INX files; see individual program documentation for specific details. The Protein Sequence Query (PSQ) program needs all the primary database files.

1.1.1 PIRn.SEQ - Primary Sequence File

The .SEQ file contains only amino acid sequence information for each entry of the data sets. Each entry consists of a variable number of consecutive records. The information contained in these lines is divided into three sections. The sections are listed below in the order in which they occur in the entry.

Section 1. The HEADER (exactly 1 record) contains information that marks the line as the first line of an entry and that identifies the sequence contained in the entry.

Format:
  Field      Length     Contents of field
  -----      ------     -----------------
  '>'           1       marks the line as an entry header
  TYPE          2       type of sequence in the entry:
                          'P1'  protein, complete
                          'F1'  protein, fragment
  ';'           1       field separator
  CODE        4 - 6     unique retrieval key (four to six
                          alphanumeric characters) assigned
                          to the entry

Examples:
  >P1;AZBR
  >P1;DENCEN
  >F1;XNECV
  >P1;LQBP37
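
The HEADER layout above can be checked with a regular expression. The following is a minimal sketch, not part of the PIR software; the names HEADER_RE and parse_header are hypothetical.

```python
import re

# '>' + two-character TYPE + ';' + CODE of four to six alphanumeric characters.
HEADER_RE = re.compile(r">(P1|F1);([A-Za-z0-9]{4,6})")

def parse_header(record):
    """Return (TYPE, CODE) for a HEADER record, or None if it does not match."""
    match = HEADER_RE.fullmatch(record.rstrip())
    if match is None:
        return None
    return match.group(1), match.group(2)

print(parse_header(">P1;AZBR"))    # ('P1', 'AZBR')
print(parse_header(">F1;XNECV"))   # ('F1', 'XNECV')
```

A record that is not a HEADER (for example a TITLE line) yields None, which makes the function usable for finding entry boundaries while scanning a file.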

Section 2. The TITLE (exactly 1 record) contains the protein name and species name.

Format:
  Field      Length     Contents of field
  -----      ------     -----------------
  SEQNAM     variable   name of the protein
  ' - '         3       field separator
  ORGNAM     variable   name of biological source of the protein

If the organism or organelle translates nucleic acid to protein by a special genetic code, this fact is noted in the ORGNAM field by the symbol '(SGCn)', where n denotes the special genetic code (see database file SGC.LIS).

Examples:
  finger protein zfpA - turnip fern chloroplast
  E6 protein - European elk papillomavirus
  glycogen (starch) synthase (EC 2.4.1.11) precursor - potato
  L-aspartate oxidase (EC 1.4.3.16) - Escherichia coli
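
A TITLE record can be split into its two fields as sketched below. This is an illustration only: the function name parse_title is hypothetical, and splitting at the last ' - ' is an assumption made here for names that might themselves contain the separator. The sample title carrying '(SGC1)' is made up to show the special genetic code symbol.

```python
import re

# The special genetic code symbol '(SGCn)' inside the ORGNAM field.
SGC_RE = re.compile(r"\(SGC(\d+)\)")

def parse_title(record):
    """Split a TITLE record into SEQNAM, ORGNAM and an optional SGC number."""
    # Assumption: the last ' - ' in the record is the field separator.
    seqnam, sep, orgnam = record.rstrip().rpartition(" - ")
    if not sep:
        return None
    sgc = SGC_RE.search(orgnam)
    return {
        "seqnam": seqnam,
        "orgnam": orgnam,
        "sgc": int(sgc.group(1)) if sgc else None,
    }

print(parse_title("L-aspartate oxidase (EC 1.4.3.16) - Escherichia coli"))
# Made-up title, for illustration of the SGC symbol only:
print(parse_title("cytochrome b - yeast mitochondrion (SGC1)"))
```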

Section 3. The SEQUENCE (variable number of records) section contains the amino acid sequence in one letter coding.

The format of the sequence section of an entry consists of a variable number of records. Each sequence record may be up to 500 characters long. The characters represent the amino acid sequence stored from the amino end to the carboxyl end. The symbols used for the amino acids are the one-letter abbreviations shown in Table I. The amino acid sequence is terminated by an asterisk, which is the last character on the last line of the sequence section. In addition, the sequence may contain punctuation symbols to indicate various degrees of reliability of the data; this coding is described in Table II. One punctuation symbol may precede any amino acid symbol or the terminating asterisk. The sequence lines contain no blank characters.

Examples:
  GDVE(G.K.G.I.F=T,M,C.S.Q,C.H.V,E.K.G.G.K.H)
  FTGPNLHGLFGRK.TGQAVGYSYTAANK.NK.GIIWGDDTLM
  EYLENPK.RYIPGTK.MVFTGLSK.YRE
  RTNLIAYLK.EK.TAA*
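
The sequence records of one entry can be joined and checked for the terminating asterisk as sketched below. The function names are hypothetical. Because the punctuation symbols of Table II are not listed in this section, the sketch simply treats every non-letter character as punctuation; that is an assumption.

```python
def join_sequence(records):
    """Concatenate the SEQUENCE records of one entry and drop the final '*'."""
    raw = "".join(record.strip() for record in records)
    if not raw.endswith("*"):
        raise ValueError("sequence section is not terminated by '*'")
    return raw[:-1]

def residues_only(raw):
    """Remove punctuation symbols, keeping only the one-letter residue codes.

    Assumption: anything that is not a letter is a punctuation symbol.
    """
    return "".join(ch for ch in raw if ch.isalpha())

lines = ["EYLENPK.RYIPGTK.MVFTGLSK.YRE", "RTNLIAYLK.EK.TAA*"]
raw = join_sequence(lines)
print(residues_only(raw))
```

Keeping the raw string separately from the stripped one preserves the reliability coding for programs that need it.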

1.1.2 PIRn.REF - Primary Reference and Text File

The .REF file contains all annotation information of an entry. The first two records are identical to the HEADER and TITLE records of the .SEQ file described above. Each entry consists of a variable number of consecutive records. The information contained in these records is divided into three sections. The sections are listed below in the order in which they occur in the entry.

Section 1. The HEADER (exactly 1 record) contains information that marks the beginning of an entry and identifies the sequence contained in the entry. The HEADER line is identical with that in the PIRn.SEQ file.

Section 2. The TITLE (exactly 1 record) contains the protein name and species name. The TITLE line is identical with that in the PIRn.SEQ file.

Section 3. The TEXT section (variable number of records) contains annotation information such as species, reference citations, genetic information, superfamily classification and search keywords.

The format of the TEXT section consists of a variable number of records. Each TEXT record may be up to 500 characters long. Each TEXT record, with the exception of the citation record, begins with a Tag that indicates the type of information contained on that record. Certain Tags mark the beginning of a block of data. The types of Tags or data items are listed below with examples. A more compact format specification is given in Table III.
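
As a rough illustration of how a program might separate the Tag from the rest of a TEXT record, consider the sketch below. It is not taken from the PIR software; the function name split_text_record is hypothetical, and the set of leading letters (N, C, A, R, F) is inferred from the record types described in this section.

```python
def split_text_record(record):
    """Return (kind, tag, value) for one TEXT record.

    kind is the letter before ';' (N, C, A, R or F), or None for an
    untagged record such as a citation or a continuation line.
    """
    record = record.rstrip()
    if len(record) >= 2 and record[1] == ";" and record[0] in "NCARF":
        kind, body = record[0], record[2:]
        if kind in "RF":
            # 'R;' and 'F;' records carry no 'name:' part after the prefix.
            return kind, None, body
        tag, sep, value = body.partition(":")
        if sep:
            return kind, tag, value.strip()
        return kind, None, body
    return None, None, record

print(split_text_record("C;Species: vaccinia virus"))
print(split_text_record("Yeast 8, 409-417, 1992"))
```

Because several record types share the 'A;Note:' Tag, a caller has to remember which block (Species, Reference, Accession or Genetics) was opened most recently in order to interpret such a record.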

> Alternate names record (optional)

An alternate name specification consists of a single text record that is identified by having 'N;Alternate names:' Tag as the first 18 characters. The remainder of the line contains a list of other common names for the protein. Alternate names are separated by semicolons.

Examples:
  N;Alternate names: soluble cytochrome f; cytochrome c553
  N;Alternate names: cusacyanin; plantacyanin
  N;Alternate names: component II; nitrogenase reductase

> Contains record (optional)

The Contains line is a single record identified by 'N;Contains:' as the first 11 characters. This record specifies other activities or functions that are included in the sequence shown in the entry. Multiple "contains" titles are separated by semicolons.

Examples:
  N;Contains: cytochrome b2 core
  N;Contains: ribonuclease (EC 3.1.-.-) activity
  N;Contains: Arg-vasopressin; neurophysin 2
  N;Contains: intestinal peptide PHM-27

> Species record (required; start of Species block)

The Species line is a single record identified by 'C;Species:' as the first 10 characters. This record describes the source or organism from which the sequence is derived. Each entry contains this record and it may mark the beginning of a Species data block.

> Species Note record (optional; contained in Species block)

The Species Note line is a single record identified by 'A;Note:' as the first 7 characters. This record describes special Species information and is contained in the Species block. All information in the 'C;Host:' records from previous versions of the database has been transferred to this record.

Examples of maximal Species block:
  C;Species: vaccinia virus
  A;Note: host Homo sapiens (man)
  C;Species: equine herpesvirus 1, equine abortion virus
  A;Note: host Equus caballus (domestic horse)

> Date record (required)

The Date line is a single record identified by 'C;Date:' as the first 7 characters. This record indicates the date the entry was added to the dataset, the date the SEQUENCE was modified and the date the TEXT was modified. Each of the dates (add, seq, text) is optional but at least one must appear.

Examples:
  C;Date: 31-Jul-1979  sequence_revision 30-Sep-1992     text_change 14-Oct-1993
  C;Date: 31-May-1979  sequence_revision 25-Feb-1985     text_change 14-Oct-1993
  C;Date:  text_change 30-Jun-1993
  C;Date:  sequence_revision 30-Sep-1991  text_change 02-Dec-1993
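
The three optional dates can be extracted as sketched below. The function name parse_date_record and the dictionary key 'added' are hypothetical; the labels sequence_revision and text_change are taken from the examples above. Month names are mapped by hand so that the result does not depend on the locale.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# An optional label followed by a DD-Mon-YYYY date.
DATE_RE = re.compile(
    r"(?:(sequence_revision|text_change)\s+)?(\d{2})-([A-Za-z]{3})-(\d{4})")

def parse_date_record(record):
    """Return a dict with any of the keys 'added', 'sequence_revision', 'text_change'."""
    if not record.startswith("C;Date:"):
        raise ValueError("not a Date record")
    dates = {}
    for label, day, month, year in DATE_RE.findall(record[7:]):
        # An unlabelled date is the date the entry was added.
        dates[label or "added"] = date(int(year), MONTHS[month], int(day))
    return dates

print(parse_date_record("C;Date:  text_change 30-Jun-1993"))
```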

> Entry-specific Accession record (required)

The Entry-specific Accession line is a single record identified by 'C;Accession:' as the first 12 characters. This record indicates a list of Accession numbers associated with the entry.

Examples:
  C;Accession: A92196; A92218; A00169
  C;Accession: A90383; A92053; B93774; A92231; A00170
  C;Accession: JS0745; A00172; S07960

> Author/citation records (required; start of Reference block)

The Author line is a single record identified by 'R;' as the first 2 characters. This record indicates a list of authors, separated by semicolons, associated with the Reference block. Immediately following is the citation record that specifies the source of the Reference. This pair of records is required and begins the Reference block. The Reference block may be repeated in the TEXT section.

Examples:
  R;Aigle, M.; Biteau, N.; Crouzet, M.
  submitted to the Protein Sequence Database, March 1992
  R;Cossart, P.; Katinka, M.; Yaniv, M.
  Nucleic Acids Res. 9, 339-347, 1981
  R;Skala, J.; Purnelle, B.; Goffeau, A.
  Yeast 8, 409-417, 1992

> Authors record (optional; repeating; contained in Reference block)

The Authors line is an optionally repeating record identified by 'A;Authors:' as the first 10 characters. This record is used as a supplement to the Author list in the 'R;' record. If the 'R;' record exceeds the maximum record length of 500 characters then additional authors are listed on the Authors line.

> Reference Title record (optional; contained in Reference block)

The Reference Title line is a single record identified by 'A;Title:' as the first 8 characters or 'A;Description:' as the first 14 characters. This record is the publication title (A;Title:) or description of a sequence submission (A;Description:). Either Tag may be present but not both.

Examples:
  A;Title: Amino acid sequence of ragweed allergen Ra3.
  A;Description: The amino acid sequence of a type I copper protein
      with an unusual serine- and hydroxyproline-rich C-terminal
      domain isolated from cucumber peelings.

> Reference number record (required; contained in Reference block)

The Reference number line is a single record identified by 'A;Reference number:' as the first 19 characters. This record contains standardized information relating a reference with a six character alphanumeric string (ref_num). Optionally a Medline reference number may appear in the record, identified by the 'MUID:' Tag and separated from the ref_num by a semicolon.

Examples:
  A;Reference number: A94561
  A;Reference number: A00100; MUID:82075747

> Contents record (optional; contained in Reference block)

The Contents line is a single record identified by 'A;Contents:' as the first 11 characters. This record specifies the source of the protein (species and/or strain), the portion of the sequence reported, the method of sequence determination, or the extent of experimental detail reported. The record may indicate that the reference is included as a source of ancillary information such as X-ray crystallography or active site identification.

Examples:
  A;Contents: ATCC 16455
  A;Contents: annotation; methylation
  A;Contents: X-ray crystallography, 2.8 angstroms
  A;Contents: Strain BALB/c

> Reference Note record (optional; repeating; contained in Reference block)

The Reference Note line is a repeating record identified by 'A;Note:' as the first 7 characters beneath the start of a Reference block. This record describes reference specific comments.

Examples:
  A;Note: This is the final paper in a series.
  A;Note: The nucleotide sequence is not given in this paper.

> Reference-specific Accession records (optional; start of Accession block; contained in Reference block)

The Reference-specific Accession line is a single record identified by 'A;Accession:' as the first 12 characters. This record indicates a single unique six character alphanumeric string (acc_num) associated with the shown sequence according to the sequence specification in the Residues record described below. Each of these numbers specifies a unique sequence. The presence of this Accession record implies the start of an Accession block that may be repeated beneath the Reference block; the Accession block(s) is contained in the Reference block.

Examples:
  A;Accession: A00086
  A;Accession: JT0008
  A;Accession: S13939

> Accession Status record (optional; contained in Accession block)

The Accession-specific Status line is a single record identified by 'A;Status:' as the first 9 characters. This record indicates the review status of the sequence referred to by the acc_num. Currently "preliminary" is the only value for this information.

Example:
  A;Status: preliminary

> Molecule type record (required if Accession present; contained in Accession block)

The molecule type line is a single record identified by 'A;Molecule type:' as the first 16 characters. This record indicates the type of molecule from which the sequence was determined. Valid values for this data item are: "protein", "DNA", "mRNA", "nucleic acid" and "genomic RNA."

Examples:
  A;Molecule type: protein
  A;Molecule type: DNA; mRNA

> Residues record (required if Accession present; contained in Accession block)

The Residues line is a single record identified by 'A;Residues:' as the first 11 characters. This record specifies a reported sequence according to the amino acid sequence as depicted in PIRn.SEQ.

Examples:
  A;Residues: 1-85,'SK',88-92,'N',94-100,'K',102-103,'A' 
  A;Residues: 2-57,'IV',60-61,'ZZ',64-66,'Z',68-69,'ZB',72-105 
  A;Residues: 1-107
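
One way to read a Residues specification is sketched below. The interpretation is an assumption, not something this document states explicitly: ranges are taken as 1-based positions in the PIRn.SEQ sequence, and quoted strings are taken as the residues the report gives in place of the positions that are skipped. The function name reported_sequence is hypothetical.

```python
import re

# A quoted literal such as 'SK', or a position or range such as 93 or 1-85.
TOKEN_RE = re.compile(r"'([A-Z]+)'|(\d+)(?:-(\d+))?")

def reported_sequence(spec, entry_sequence):
    """Rebuild the reported sequence from a Residues specification."""
    parts = []
    for literal, start, end in TOKEN_RE.findall(spec):
        if literal:
            parts.append(literal)          # residues as given by the report
        else:
            first = int(start)
            last = int(end) if end else first
            parts.append(entry_sequence[first - 1:last])  # 1-based, inclusive
    return "".join(parts)

# Toy sequence, for illustration only.
print(reported_sequence("1-3,'XY',6-7,'Z'", "ABCDEFGHIJ"))  # ABCXYFGZ
```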

> Cross-references record (optional; contained in Accession block)

The Cross-references line is a single record identified by 'A;Cross-references:' as the first 19 characters. This record specifies a list of database/identifier pairs that indicate related sequence information in another database. Current cross referenced databases are: "GB", "EMBL", "PDB", "DDBJ", "CAS", "NCBIP" and "NCBIN."

Examples:
  A;Cross-references: GB:J04618; GB:J04619
  A;Cross-references: CAS:124041-95-8
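
The database/identifier pairs can be separated as in the following sketch. The function name parse_cross_references is hypothetical.

```python
TAG = "A;Cross-references:"

def parse_cross_references(record):
    """Return a list of (database, identifier) pairs from a Cross-references record."""
    if not record.startswith(TAG):
        raise ValueError("not a Cross-references record")
    pairs = []
    for item in record[len(TAG):].split(";"):
        item = item.strip()
        if item:
            # Split at the first ':' only; the identifier may contain '-' or digits.
            database, _, identifier = item.partition(":")
            pairs.append((database, identifier))
    return pairs

print(parse_cross_references("A;Cross-references: GB:J04618; GB:J04619"))
# [('GB', 'J04618'), ('GB', 'J04619')]
```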

> Accession Genetics record (optional; contained in Accession block)

The Accession Genetics line is a single record identified by 'A;Genetics:' as the first 11 characters. This record contains a Tag that specifies which 'C;Genetics:' block (defined below) describes the sequence report depicted by the Accession block. This record is present only when more than one 'C;Genetics:' block is present in the entry.

Examples:
  A;Genetics: ST1
  A;Genetics: ST2
  A;Genetics: HBA1

> Accession Note record (optional; repeating; contained in Accession block)

The Accession Note line is a repeating record identified by 'A;Note:' as the first 7 characters beneath the start of an Accession block. This record describes sequence specific comments.

Examples:
  A;Note: The authors translated the codon CTG for residue 169 as Ile.
  A;Note: 175-Ala was also found.
  A;Note: The difference at the carboxyl end is due to a frameshift.

> Comment records (optional; repeating)

The Comment line is a repeating record identified by 'C;Comment:' as the first 10 characters. This record contains general information in a free format, natural language form about the protein sequence entry. Some Comment records can be decomposed and the information moved to more appropriate records; this is an ongoing standardization project.

Examples:
  C;Comment: Met preceding 1-Gly is removed after translation.
  C;Comment: The sequence shown is iso-1-cytochrome c.
  C;Comment: Euglena is a genus of green algae.

> Genetics record (optional; start of Genetics block)

The Genetics line is a single record identified by 'C;Genetics:' as the first 11 characters. This record contains no information except if more than one Genetics block exists in the entry. In the case of multiple gene information this record will contain a unique Tag that points to an Accession block; this indicates which sequence report is related to the genetic data. Presence of this record implies the start of a Genetics block and is required if other genetic information exists such as that defined below. The Genetics block may be repeated within an entry.

Examples:
  C;Genetics:
  C;Genetics: 
  C;Genetics: 
  C;Genetics:

> Gene record (optional; contained in Genetics block)

The Gene line is a single record identified by 'A;Gene:' as the first 7 characters. This record specifies the gene symbol used to denote the gene. Some symbols may contain "GDB" as a Tag; this indicates a cross reference to the Genome Database.

Examples:
  A;Gene: psbE
  A;Gene: GDB:CYP2D6
  A;Gene: CYP2B1

> Map position record (optional; contained in Genetics block)

The Map position line is a single record identified by 'A;Map position:' as the first 15 characters. This record specifies a map position on which the gene is located. For viruses this may indicate a segment number.

Examples:
  A;Map position: 71.6-76.2
  A;Map position: 85 min
  A;Map position: 4q21-q23

> Genome record (optional; contained in the Genetics block)

The Genome line is a single record identified by 'A;Genome:' as the first 9 characters. This record specifies which type of genome is described. Current values for this record are: "mitochondrion", "chloroplast", "cyanelle" and "plasmid."

Examples:
  A;Genome: chloroplast
  A;Genome: cyanelle
  A;Genome: plasmid
  A;Genome: mitochondrion

> Genetic code record (optional; contained in Genetics block)

The Genetic code line is a single record identified by 'A;Genetic code:' as the first 15 characters. This record indicates which Special Genetic Code table is used by the specified organism for nucleic acid translation. Current values for this record are "SGC1" - "SGC9." For more information see the file SGC.LIS described below.

Examples:
  A;Genetic code: SGC1

> Start codon record (optional; contained in Genetics block)

The Start codon line is a single record identified by 'A;Start codon:' as the first 14 characters. This record indicates the codon in the nucleic acid sequence where translation is initiated.

Examples:
  A;Start codon: ATT
  A;Start codon: ATC
  A;Start codon: GTG

> Introns record (optional; contained in Genetics block)

The Introns line is a single record identified by 'A;Introns:' as the first 10 characters. This record specifies the intron segments needed to code for the gene product. Segments are separated by semicolons.

Examples:
  A;Introns: 253/3; 270/3
  A;Introns: 139/1; 143/2; 169/2; 253/3; 270/3

> Genetics Note record (optional; repeating; contained in Genetics block)

The Genetics Note line is a repeating record identified by 'A;Note:' as the first 7 characters beneath the start of a Genetics block. This record describes genetics specific comments.

Examples:
  A;Note: strain D273-10B/A21
  A;Note: strain 777-3A

> Function record (optional; start of Function block)

The Function line is a single record identified by 'C;Function:' as the first 11 characters. This record contains no information except if more than one Function block exists in the entry. In the case of multiple function information this record will contain a unique Tag to distinguish each block. Presence of this record implies the start of a Function block and is required if other function information exists such as that defined below. The Function block may be repeated within an entry.

> Function Description record (optional; contained in Function block)

The Function Description line is a single record identified by 'A;Description:' as the first 14 characters and immediately following the Function record. This record indicates the type of function the specified protein may have.

Example:
  C;Function:
  A;Description: This protein functions as a molecular chaperone in the
      endosymbiont.

> Superfamily record (optional)

The Superfamily line is a single record identified by 'C;Superfamily:' as the first 14 characters. This record indicates which Superfamily(s) has the protein as a member. Individual Superfamily names are separated by semicolons.

Examples:
  C;Superfamily: phosphorylase
  C;Superfamily: sucrose synthase; sucrose/sucrose-phosphate synthase homology

> Keywords record (optional)

The Keywords line is a single record identified by 'C;Keywords:' as the first 11 characters. This record indicates which keyword(s) are associated with the entry. Terms in the list are separated by semicolons and are used as a retrieval key.

Examples:
  C;Keywords: homodimer; NAD; oxidoreductase
  C;Keywords: oxidoreductase; pentose phosphate pathway

> Feature record (optional; repeating)

The Feature line is a repeating record identified by 'F;' as the first 2 characters. This record may be one of many comprising a Feature table. Each line of the Feature table has the following format. The text from position 3 up to the first '/' character is a range or site specification. Multiple segments corresponding to a single feature are separated by commas. The feature title appears after the '/' and consists of a feature descriptor, a title, and an optional code extension enclosed by the symbols < and >. A feature descriptor is a word or short phrase followed by a colon that defines the type of feature (refer to Table IV for a complete listing and explanation of feature descriptors). The code extension consists of a short character string (alphanumerics only) that when associated with the entry identification code defines a logical address for the subsequence that is unique throughout the database. For example, the path DEECK->DKI uniquely defines the feature with code extension DKI in entry DEECK.

Examples:
  F;316/Active site: Asp
  F;483/Binding site: carbohydrate
  F;19-69,28-52,44-65/Disulfide bonds:
  F;1-249/Domain: aspartokinase I <DKI>
  F;1-24/Domain: signal sequence  <SIG>
  F;50-54/Peptide: Met-enkephalin 1  <ME1>
  F;15-72/Protein: basic protease inhibitor  <MAT>
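
A Feature record can be taken apart as sketched below. This is only an illustration of the layout described above; the names FEATURE_RE and parse_feature are hypothetical, and the sketch assumes that every segment is either a single position or a 'start-end' pair of integers.

```python
import re

# 'F;' + range or site specification + '/' + descriptor + ':' + title + optional <code>
FEATURE_RE = re.compile(
    r"^F;([^/]+)/([^:]+):\s*(.*?)\s*(?:<([A-Za-z0-9]+)>)?\s*$")

def parse_feature(record):
    """Return the segments, descriptor, title and code extension of a Feature record."""
    match = FEATURE_RE.match(record)
    if match is None:
        return None
    spec, descriptor, title, code = match.groups()
    segments = []
    for segment in spec.split(","):
        start, _, end = segment.strip().partition("-")
        # A single site such as '316' becomes the range (316, 316).
        segments.append((int(start), int(end) if end else int(start)))
    return {"segments": segments, "descriptor": descriptor.strip(),
            "title": title, "code": code}

print(parse_feature("F;1-249/Domain: aspartokinase I <DKI>"))
print(parse_feature("F;19-69,28-52,44-65/Disulfide bonds:"))
```

The code extension is returned as None when the record has none, and the title is an empty string when nothing follows the descriptor.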

1.1.3 PIRn.INX - Primary Database Index File

This file is an index to the three primary database files: PIRn.SEQ, PIRn.REF, PIRn.TTL. The index provides information that allows PSQ to use the VAX-11 RMS RFA record access mode to locate and access entry records in the primary database files. Entries are placed in the same order in the .REF, .SEQ and .TTL files; indexing is based on this order. Database records are accessed using the ISN_CODE, LOC_ENTRY, GET_SEQUENCE, GET_TEXT, and GET_TITLE routines located in module DBMS of the PSQ text library, PSQ.TLB. Refer to these routines for more specific information. NOTE: If changes are made to any of the primary database files, the index file will no longer correctly point to the database entries and the PIR programs will not operate.

1.2 Database Index Files Description

The Database Index files contain information present in the Primary Database files but in a manner that facilitates quick access. If any of the auxiliary index files are absent, the PSQ commands supported by these files will not operate. These files can be created using the INDEXER/SORTTMP programs except for the .CAX and .CDX files, which must be supplied, and the .TSC file, which may be created using the CREATETSC program. See PIR document CREATEDB for more information.

1.2.1 PIRn.CAX - Sequence Composition Data File

This file contains the values of the molecular weight, sequence length, and percentage composition of each entry; it is used by the SELECT and REPORT commands of PSQ. There is one record for each entry and the order is the same as in the Primary Database files. The file attributes are as follows:

    OPEN Keyword       Keyword value
    ------------       -------------
    ORGANIZATION       Sequential
    ACCESS             Direct
    FORM               Unformatted
    RECORDTYPE         Fixed
    RECORDSIZE         25

Each record contains LENGTH, WEIGHT and DATA variables where: LENGTH is a real variable (sequence length); WEIGHT is a real variable (molecular weight); and DATA is an array of 23 real variables (% composition of amino acids A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,B,Z,X)
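
For orientation, the sketch below computes a sequence length and the 23 percentage values in the order listed above. It is an assumption about how such values could be derived, not a description of how PIR computes them; in particular it counts only the 23 listed symbols and ignores everything else. The molecular weight is omitted because this document gives no residue masses. The function name composition is hypothetical.

```python
# Order of the DATA array as listed above.
SYMBOLS = "ACDEFGHIKLMNPQRSTVWYBZX"

def composition(sequence):
    """Return (length, percentages) for the 23 symbols, in SYMBOLS order."""
    # Assumption: punctuation and any symbol not in SYMBOLS are not counted.
    residues = [ch for ch in sequence if ch in SYMBOLS]
    length = len(residues)
    percentages = [
        100.0 * residues.count(symbol) / length if length else 0.0
        for symbol in SYMBOLS
    ]
    return float(length), percentages

length, percentages = composition("AACD")
print(length, percentages[:3])   # 4.0 [50.0, 25.0, 25.0]
```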

1.2.2 PIRn.CDX - Ancillary Data File

This file contains ancillary character data associated with each database entry; it is used by the SELECT, REPORT, SUPERFAMILY, and TAXONOMY commands of PSQ. There is one record for each entry and they are ordered in the same sequence as the Primary Database files. The file attributes are as follows:

    OPEN Keyword       Keyword value
    ------------       -------------
    ORGANIZATION       Sequential
    ACCESS             Direct
    FORM               Formatted
    RECORDTYPE         Fixed
    RECORDSIZE         81

Each record is organized according to the following table:

Format:
  Field      Length     Contents of field
  -----      ------     -----------------
  CODE          6       retrieval code
  SUPER         6       superfamily number
  FAMILY        4       family number
  SUBFAM        4       subfamily number
  ENTRY         4       entry number
  SUBENT        4       subentry number
  ADDED_DATE    6       date entry was added to the database
  SEQ_DATE      6       date of last sequence update
  TEXT_DATE     6       date of last text update
  TYPE          1       sequence type (P=complete; F=fragment)
  NULL          1       unspecified field
  GRP           3       primary taxonomic group
  GRP1          3       auxiliary taxonomic group 1
   .            .         .
   .            .         .
   .            .         .
  GRP10         3       auxiliary taxonomic group 10

The dates have the format: YYMMDD; YY-year, MM-month, DD-day, where YY, MM, DD are numeric values. Dates prior to 1982 are not reliable; entries added prior to 1980 may contain blank characters in these fields. The primary taxonomic group code is associated with the species; each record may contain up to 10 auxiliary taxonomic group codes.
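
The fixed-width layout of a .CDX record can be sliced as in the sketch below. The field names and widths come from the table above; the function name parse_cdx_record is hypothetical, and the sample record is made up, since this document does not say how values are justified within a field.

```python
# (field name, width) pairs in record order; the widths add up to 81.
CDX_FIELDS = [
    ("CODE", 6), ("SUPER", 6), ("FAMILY", 4), ("SUBFAM", 4),
    ("ENTRY", 4), ("SUBENT", 4), ("ADDED_DATE", 6), ("SEQ_DATE", 6),
    ("TEXT_DATE", 6), ("TYPE", 1), ("NULL", 1), ("GRP", 3),
] + [("GRP%d" % i, 3) for i in range(1, 11)]

def parse_cdx_record(record):
    """Slice one 81-character .CDX record into a dict of stripped strings."""
    record = record.rstrip("\r\n").ljust(81)[:81]
    fields = {}
    position = 0
    for name, width in CDX_FIELDS:
        fields[name] = record[position:position + width].strip()
        position += width
    return fields

# Made-up record, for illustration only.
sample = ("AZBR  " + "   123" + "   1" * 4 +
          "791231" + "920930" + "931014" + "P" + " " + "EUK")
print(parse_cdx_record(sample)["ADDED_DATE"])   # 791231
```

Dates are left as YYMMDD strings because, as noted above, older entries may contain blanks in these fields.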

The text retrieval programs of the PIR-International use an inverted database model. For each text field in the database an index file containing an alphabetically sorted list of the terms found in that field and the logical addresses of the corresponding entries is compiled. These text index files are identically formatted and contain ASCII data and ISN numbers. These files are accessed by the INDEXS module of PSQ; refer to this routine for specific information. The file attributes are as follows:

    OPEN Keyword       Keyword value
    ------------       -------------
    ORGANIZATION       Sequential
    ACCESS             Sequential
    FORM               Formatted
    RECORDTYPE         Variable
    RECORDSIZE         500 (or less)
    CARRIAGECONTROL    List

The following files are currently available. Note that none can be displayed by the DCL TYPE, PRINT, or EDIT commands. The CONVINDX program will produce an ASCII version of these files. See PIR document CREATEDB for more information.

    PIRn.ACX      Accession number index
    PIRn.AUX      Author name index
    PIRn.CRX      Cross-reference index
    PIRn.FTX      Feature title index
    PIRn.GNX      Gene symbol index
    PIRn.JRX      Citation index
    PIRn.RNX      Reference number index
    PIRn.SFX      Superfamily name index
    PIRn.SNX      Superfamily number index
    PIRn.SPX      Species index
    PIRn.TTX      Entry title index
    PIRn.WOX      Keyword index

1.2.4 PIRn.TSC - Tripeptide Catalogue File

This file contains a lookup table of all tripeptides in the data set and the corresponding locations of the sequences and positions in the sequence where the tripeptides occur; it is used by the SCAN command of PSQ. Refer to module SCAN of the PSQ text library, PSQ.TLB, for more specific information. The file attributes are as follows:

    OPEN Keyword       Keyword value
    ------------       -------------
    ORGANIZATION       Sequential
    ACCESS             Sequential
    FORM               Unformatted
    RECORDTYPE         Fixed
    RECORDSIZE         (512/4)

1.3 Auxiliary Database Files Description

1.3.1 PIRn.NAM - Database Citation File

This file contains free format text that gives the database citation. The citation for each data set gives the release description, version, and date, together with the number of entries and the total number of residues. All collaborating sites are also listed. The .NAM file is an ASCII flat file used by the SHOW command of PSQ.

1.3.2 PIRn.CHN - Entry Identification Code Change File

This file is provided to keep a brief record of changes to the entry identification codes. The first six characters of each record in this file contain the new entry identification code; positions 10 to 15 contain the old code. The file is an ASCII flat file that may be displayed using the DCL TYPE or PRINT commands.
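
The column positions above translate directly into string slices. The sketch below is only an illustration: parse_chn_line is a hypothetical helper name, and positions are taken to be 1-based, so positions 10 to 15 correspond to the Python slice [9:15].

```python
def parse_chn_line(line):
    """Return (new_code, old_code) from one .CHN record."""
    # Pad short lines so that both slices are always defined.
    padded = line.rstrip("\r\n").ljust(15)
    new_code = padded[0:6].strip()    # positions 1-6
    old_code = padded[9:15].strip()   # positions 10-15
    return new_code, old_code


def old_to_new(lines):
    """Build a lookup from old entry codes to their replacements."""
    mapping = {}
    for line in lines:
        if line.strip():
            new_code, old_code = parse_chn_line(line)
            mapping[old_code] = new_code
    return mapping
```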

1.3.3 PIRn.TTL - Primary Title File

The Title file lists the code (unique entry retrieval identifier) and the TITLE (protein name - organism name) of each of the sequence entries. There is one primary TITLE record for each database entry; this line is immediately followed by a variable number of Alternate names and/or Contains lines. These additional lines contain blanks in the first six spaces in place of the CODE; this distinguishes them from the primary title lines. Continuation lines are indicated by a dash as the last character on the preceding line.

Format:
  Field      Length     Contents of field
  -----      ------     -----------------
  CODE         6         retrieval code (blank characters are used to
                           fill the code field to 6 characters)
  TITLE      variable    sequence title (see PIRn.SEQ file description for
                           TITLE format information)

Examples:
  CCBYBCCytochrome c, iso-2 - Baker's yeast
  CCML6 Cytochrome c6 - Monochrysis lutheri-
        Alternate names: soluble cytochrome f; cytochrome c553
  NRMS  Ribonuclease (EC 3.1.27.5) - Mouse
  MYSH  Myoglobin - Sheep and red deer
  CCHECCCytochrome c - Hemp
  AXBO  Adrenodoxin - Bovine

Note that the .TTL file is not supplied with the PIR-International Protein Sequence Database. This optional file is useful only for older software; no version of the PIR programs uses it. The program CREATETTL, which is included on the Tape Release, will create this file. See PIR document CREATEDB for more information.
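
The title record layout described in this section can be read with a few lines of Python. This is a sketch under stated assumptions, not the logic of any PIR program: parse_ttl is a hypothetical name, lines are assumed to start in column 1, and a trailing dash is treated purely as a continuation marker and removed from the stored text.

```python
def parse_ttl(lines):
    """Group .TTL lines into entries with a code, a title and extra lines."""
    entries = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        code = line[:6].strip()   # blank on Alternate names / Contains lines
        text = line[6:].strip()
        if text.endswith("-"):
            # Assumption: the dash only signals that another line follows.
            text = text[:-1].rstrip()
        if code:
            entries.append({"code": code, "title": text, "extra": []})
        elif entries:
            entries[-1]["extra"].append(text)
    return entries
```

Applied to the CCML6 example above, the entry title becomes "Cytochrome c6 - Monochrysis lutheri" and the Alternate names line is stored in the entry's "extra" list.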

1.3.4 PRINDX.LIS - Database Short Directory; SUPFAMNUM.LIS - Superfamily name File

The PRINDX.LIS file is a database listing (PIR1 and PIR2) of classified sequences ordered by Superfamily number. SUPFAMNUM.LIS contains a listing of the Superfamily names found for each Superfamily number represented in Section 1 (PIR1) and Section 2 (PIR2).

1.3.5 JOURNALS.LIS - Journal Abbreviation File

The JOURNALS.LIS file is compiled from all reference citations of the Protein Sequence Database. The abbreviations are determined and maintained by the National Library of Medicine (NLM). Journal names followed by an asterisk are no longer in use but continue to exist in the database.

1.3.6 TAXONOMY.LIS - Species Classification File; SGC.LIS - Special Genetic Code File

The TAXONOMY.LIS and SGC.LIS files are compiled and maintained by Andrzej Elzanowski of MIPS at the Max-Planck-Institut Fuer Biochemie in Martinsried, Germany. These files represent data as it appears in Section 1 (PIR1) and Section 2 (PIR2) of the Protein Sequence Database.

The TAXONOMY.LIS file is a hierarchical classification of source organisms in an order of increasing complexity. The PIR2 and PIR3 data sets are ordered according to the hierarchy of this file.

The SGC.LIS file contains the genetic code tables corresponding to the special genetic codes recognized by PSQ. There is an entry in the file for each special genetic code; if not present, the corresponding genetic code will not be recognized by PSQ. Each entry begins with a header line, which must be immediately followed by 16 codon table lines. Codons are ordered from the uppermost left corner of the table in the nucleotide order U, C, A, G (this order is not mandatory). Blank lines between the header line and the codon table lines or between codon table lines are ignored and may be used as separators within the table. All lines that occur between entries in the file are ignored. The file is an ASCII flat file.

Table Header Format:
  >SGC1 - Mammalian mitochondrial genetic code
  Field      Length     Contents of field
  -----      ------     -----------------
  '>'          1        begins a new SGC entry
  SGCn       variable   designates special genetic code number;
                          must contain SGC as first three characters
                          followed by the integer special genetic code
                          number
  ' - '        3        field separator
  SGCNAM     variable   special genetic code name

Each record contains four codon-amino acid field groups separated by 6 blank spaces.

Table Data Format:
  UUU  F Phe      UCU  S Ser      UAU  Y Tyr      UGU  C Cys

  Field      Length     Contents of field
  -----      ------     -----------------
  CODON        3        codon consisting of three single-
                          character nucleotide symbols
  ' '          1        field separator
  AAONE        2        one-letter amino acid code
                          '+' in the first space indicates
                          an initiation codon; otherwise the
                          first space should be blank.
  ' '          1        field separator
  AATHR        3        three-letter amino acid code
  '      '     6        field separator
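
The header and data formats above are regular enough to parse by column position. The following Python sketch is illustrative only; parse_sgc and its constants are hypothetical names, and it assumes that table lines begin in column 1 with no leading indentation. It records the one-letter code for each codon and collects the codons flagged with '+' as initiation codons.

```python
import re

HEADER = re.compile(r"^>SGC(\d+) - (.*)$")
ROWS_PER_TABLE = 16
GROUP_WIDTH = 10   # CODON(3) + ' ' + AAONE(2) + ' ' + AATHR(3)
GROUP_PITCH = 16   # group width plus the 6-blank separator


def parse_sgc(lines):
    """Return {code number: {"name", "codons", "starts"}} from SGC.LIS lines."""
    tables, current, rows = {}, None, 0
    for raw in lines:
        line = raw.rstrip("\r\n")
        header = HEADER.match(line)
        if header:
            current = {"name": header.group(2).strip(),
                       "codons": {}, "starts": set()}
            tables[int(header.group(1))] = current
            rows = 0
            continue
        # Skip text between entries and blank separator lines inside a table.
        if current is None or rows == ROWS_PER_TABLE or not line.strip():
            continue
        for i in range(4):
            group = line[i * GROUP_PITCH:i * GROUP_PITCH + GROUP_WIDTH]
            if len(group) < GROUP_WIDTH:
                raise ValueError(f"short codon line: {line!r}")
            codon, flag, one_letter = group[0:3], group[4], group[5]
            current["codons"][codon] = one_letter
            if flag == "+":
                current["starts"].add(codon)
        rows += 1
    return tables
```

Because the codons are stored in a dictionary keyed by the codon itself, the sketch does not depend on the U, C, A, G ordering, which the description above notes is not mandatory.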

1.3.7 *.ENZ - Restriction Endonuclease Files

These files contain the restriction endonuclease lists utilized by the PSQ ENZYME command. The recognition site is given in ambiguous nucleotide code (see Table V); the apostrophe designates the cut-site. If the cut-site is not symmetrical, the complementary recognition and cut-site are given, separated by a colon.

Format:
  Field      Length     Contents of field
  -----      ------     -----------------
  CODE         13       restriction enzyme code
  SPEC       variable   recognition and cut-site specification
  ' - '        3        field separator
  ORGAN      variable   biological source, or comment

Examples:
  AvaI         C'YCGRG - Anabaena variabilis
  EcoRI'       RRATYY - Escherichia coli (strain RY13) plasmid RTF1
  HhaI         GCG'C - Haemophilus haemolyticus
  MboII        GAAGANNNNNNNN':'NNNNNNNTCTTC - Moraxella bovis
  XbaI         T'CTAGA - Xanthomonas badrii
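
A line in this format can be taken apart as sketched below. The helper names parse_site and parse_enz_line are hypothetical, and the code assumes that lines start in column 1 and that the first " - " after the 13-character code field is the field separator. Each strand is returned as the recognition site without the apostrophe, together with the cut position (the number of bases before the cut), or None when no cut-site is marked.

```python
def parse_site(spec):
    """Return (site, cut) where cut is the offset of the apostrophe or None."""
    cut = spec.find("'")
    return spec.replace("'", ""), (cut if cut >= 0 else None)


def parse_enz_line(line):
    """Split one .ENZ record into code, strand specifications and source."""
    text = line.rstrip("\r\n")
    code = text[:13].strip()                    # CODE field, 13 characters
    spec, _, organ = text[13:].partition(" - ")
    # A colon separates the two strands of an asymmetric site.
    strands = [parse_site(part) for part in spec.strip().split(":")]
    return {"code": code, "strands": strands, "organism": organ.strip()}
```

For the MboII example above this yields two strands, one cut after 13 bases and the complementary one cut before the first base. An apostrophe in the code field itself, as in EcoRI', is left untouched because only the SPEC field is scanned for the cut-site.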

Table I: One- and Three-letter Amino Acid Abbreviations

The following abbreviations conform to those suggested by the IUPAC-IUB
Commission on Biochemical Nomenclature, J. Biol. Chem. 243, 3557-3559,
1968.
    A   Ala   Alanine
    C   Cys   Cysteine
    D   Asp   Aspartic acid
    E   Glu   Glutamic acid
    F   Phe   Phenylalanine
    G   Gly   Glycine
    H   His   Histidine
    I   Ile   Isoleucine
    K   Lys   Lysine
    L   Leu   Leucine
    M   Met   Methionine
    N   Asn   Asparagine
    P   Pro   Proline
    Q   Gln   Glutamine
    R   Arg   Arginine
    S   Ser   Serine
    T   Thr   Threonine
    V   Val   Valine
    W   Trp   Tryptophan
    Y   Tyr   Tyrosine
    B   Asx   Asp or Asn,
              not distinguished
    Z   Glx   Glu or Gln,
              not distinguished
    X    X    Undetermined or atypical
              amino acid

Table II: Punctuation Description in Protein Sequences

XX   Two adjacent amino acids, with no punctuation between, indicates
       that they are connected, as determined experimentally.
()   Encloses a region, the composition but not the complete sequence
       of which has been determined experimentally, or encloses a
       single residue that has been tentatively identified.
 =   Indicates ")(", the juxtaposition of two regions of indeterminate
       sequence, while preserving proper spacing between amino acids.
 /   Indicates that the adjacent amino acids are from different
       peptides, not necessarily connected. When the amino end of a
       protein has not been determined, "/" precedes the first residue.
       When the carboxyl end has not been determined, "/" follows the
       last residue. When ")/", "/(", or ")/(" are needed, only "/" is
       used.
 .   Outside of parentheses, indicates the ends of sequence fragments.
       The relative order of these fragments was not determined
       experimentally but is clear from homology or other indirect
       evidence.
 .   Within parentheses, indicates that the amino acid to the left
       has been placed with at least 90% confidence by homology with
       known sequences.
 ,   Indicates that the amino acid to its left could not be
       positioned with confidence by homology.
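
Programs that need only the residues, for example to count them, can drop the punctuation listed in Table II. The short Python sketch below is one way to do this; residues_only is a hypothetical name, and the code assumes that whitespace in a sequence line carries no meaning.

```python
PUNCTUATION = set("()=/.,")   # the symbols described in Table II


def residues_only(sequence):
    """Return the amino acid letters with punctuation and spacing removed."""
    return "".join(
        ch for ch in sequence
        if ch not in PUNCTUATION and not ch.isspace()
    )
```

Note that this discards the experimental-evidence information the punctuation encodes, so it is suitable for composition or length calculations but not for reproducing an entry.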

Table III: Format Specification of Entry TEXT Records

>..;code                       HEADER
title - trivial name           TITLE
N;Alternate names:             TEXT
N;Contains:
C;Species:                       Species block
  A;Variety:
  A;Note:
C;Date:
C;Accession:
R;                               Reference block (may repeat)
  citation
  A;Authors:
  A;Title:
  A;Description:
  A;Reference number:
  A;Contents:
  A;Note:
  A;Accession:                   Accession block (may repeat)
    A;Status:
    A;Molecule type:
    A;Residues:
    A;Cross-references:
    A;Experimental source:
    A;Genetics:
    A;Note:
C;Comment:                       Comments (may repeat)
C;Genetics:                      Genetics block (may repeat)
  A;Gene:
  A;Map position:
  A;Genome:
  A;Gene origin:
  A;Genetic code:
  A;Start codon:
  A;Introns:
  A;Other products:
  A;Note:
C;Complex:                       Complex
C;Function:                      Function block
  A;Description:
  A;Pathway:
  A;Note:
C;Superfamily:
C;Keywords:
F;                               Features block (may repeat)

Table IV: Feature Table Descriptors

The following descriptors denote single residue sites. These sites may be represented using the full feature residue specification conventions; however, the specification should be interpreted to specify a collection of single sites.

    Active site:
    Binding site:
    Cleavage site:
    Inhibitory site:
    Modified site:

The following descriptors denote residues connected by covalent bonds. The pairs of residues linked by hyphens should be interpreted as being connected by bonds. Single residues may be specified if the bond is between an amino acid in the sequence and another protein chain or molecule. The Cross-links: descriptor specifically denotes bonds linking the sequence shown to an adjacent protein chain.

    Disulfide bonds:
    Cross-links:

The following descriptors denote sequence regions. Pairs of residues linked by hyphens denote a region extending from the first to the second position inclusive. The Domain: descriptor indicates distinct functional regions that may be of separate evolutionary origin. The Duplication: descriptor indicates regions that have evolved by sequence duplication. The Peptide: and Protein: descriptors indicate regions corresponding to the mature protein or peptides derived from the sequence shown. The Region: descriptor is used to indicate all other regions.

    Domain:
    Duplication:
    Peptide:
    Protein:
    Region:

Table V: Ambiguous Nucleotide Codes

These abbreviations conform to those suggested by the IUPAC-IUB Commission on Biochemical Nomenclature, Nucl. Acids Res. 13, 3021-3031, 1985.

    Symbol   Meaning
    ------   -------
      A      Adenine
      C      Cytosine
      G      Guanine
      T      Thymine
      U      Uracil
      R      puRine (A or G)
      Y      pYrimidine (C or T/U)
      M      A or C
      W      A or T/U
      S      C or G
      K      G or T/U
      D      A, G, or T/U
      H      A, C, or T/U
      V      A, C, or G
      B      C, G, or T/U
      N      A, C, G, or T/U
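
The codes in Table V map naturally onto regular-expression character classes, which gives a simple way to locate a recognition site written in ambiguous code. The Python sketch below is an illustration only: CODES and find_sites are hypothetical names, T and U are kept as distinct symbols exactly as the table lists them, and the site is assumed to have had any cut-site apostrophe removed beforehand.

```python
import re

# Each symbol from Table V mapped to the set of bases it can stand for.
CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CTU", "M": "AC", "W": "ATU", "S": "CG",
    "K": "GTU", "D": "AGTU", "H": "ACTU", "V": "ACG",
    "B": "CGTU", "N": "ACGTU",
}


def find_sites(site, sequence):
    """Return 0-based start positions where the ambiguous site matches."""
    classes = "".join(f"[{CODES[symbol]}]" for symbol in site.upper())
    # A lookahead keeps matches zero-width so overlapping sites are found.
    pattern = re.compile(f"(?={classes})")
    return [match.start() for match in pattern.finditer(sequence.upper())]
```

The search covers only the strand given; finding sites on the complementary strand would require searching the reverse complement as well.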