P R O T E I N   S E Q U E N C E   D A T A B A S E
of PIR-International

PIR Document PRDBFS-1293
Database File Structure and Format Specification
Version 6.0, December 1993

Protein Information Resource (PIR)*
National Biomedical Research Foundation
3900 Reservoir Road, N.W., Washington, DC 20007, USA

Japan International Protein Information Database (JIPID)
Amakubo 1-16-1
Tsukuba 305-0005, Japan

Munich Information Center for Protein Sequences (MIPS)
GSF-Forschungszentrum f. Umwelt und Gesundheit
am Max-Planck-Institut f. Biochemie
Am Klopferspitz 18, D-82152 Martinsried, FRG

This database may be redistributed without prior consent, provided that this notice be given to each user and that the words "Derived from" shall precede this notice if the database has been altered by the redistributor.

We have made every effort to ensure proper functioning of the programs and cannot be held responsible for the consequences to users of any problems encountered during their operation.

*PIR is a registered mark of NBRF
PIR is partially supported by the National Library of Medicine

This document describes the files comprising the PIR-International Protein Sequence Database and the format of each. The format has been enhanced significantly for Release 39.00 to what is referred to as "enhanced NBRF" format. A Technical Development Bulletin is available upon request. Each data set contains primary files, index files and auxiliary files.
The Protein Sequence Database is divided into three Sections: Section 1: Annotated and Classified Entries, Section 2: Preliminary Entries, and Section 3: Unverified Entries. The files corresponding to these sections have the file names PIR1, PIR2, and PIR3, respectively. Each section is composed of the following files; however, since Section 3 (PIR3) contains only minimal information, it may not contain certain index files.
Information associated with one sequence database entry is split between the .REF and .SEQ files; the amino acid sequence is contained in the .SEQ file and all annotation is contained in the .REF file. The first two records of each entry in the .REF file are duplicated in the .SEQ file. This description concentrates on file structure and not on the conceptual view of an entry as seen when using PSQ or other software.
!!!!!!!!!!!!!!NOTE CHANGES IN FILES AVAILABLE !!!!!!!!!!!!!!!!! IT IS PIR'S INTENTION TO STOP PROVIDING INDEX FILES FOR THE NBRF-PIR FORMAT AND THE CODATA FORMAT AFTER RELEASE 66.00!! THE *.REF, *.SEQ, *.NAM, AND *.DAT FLAT FILES WILL CONTINUE. IF YOU WISH US TO CONTINUE TO PROVIDE ANY OF THE INDEX FILES PLEASE LET US KNOW. CONTACT PIRMAIL@NBRF.GEORGETOWN.EDU . !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
OPEN Keyword      Keyword value
------------      -------------
ORGANIZATION      Sequential
ACCESS            Sequential
FORM              Formatted
RECORDTYPE        Variable
RECORDSIZE        500 (or less)
CARRIAGECONTROL   List
The .INX file is the main index file containing byte offsets into the .SEQ, .REF and .TTL (if present) files; the PSQ program and other PIR-International software do quick entry lookup via this index. If the .SEQ, .REF, or .TTL file is altered in any way, the information in the .INX file becomes invalid and the database system programs will not operate. To recreate the .INX file use the program CREATEINX (see the PIR document CREATEDB, Protein Database Creating Programs for more details).
Below is a table of .INX file characteristics.
OPEN Keyword      Keyword value
------------      -------------
ORGANIZATION      Sequential
ACCESS            Direct
FORM              Unformatted
RECORDTYPE        Fixed
RECORDSIZE        (512/4)
PIR-supplied applications programs generally use only the .SEQ and .INX files; see individual program documentation for specific details. The Protein Sequence Query (PSQ) program needs all the primary database files.
Section 1. The HEADER (exactly 1 record) contains information that marks the line as the first line of an entry and that identifies the sequence contained in the entry.
Format:
Field   Length   Contents of field
-----   ------   -----------------
'>'     1        marks the line as an entry header
TYPE    2        type of sequence in the entry:
                   'P1' protein, complete
                   'F1' protein, fragment
';'     1        field separator
CODE    4 - 6    unique retrieval key (four to six alphanumeric
                 characters) assigned to the entry
Examples:
    >P1;AZBR
    >P1;DENCEN
    >F1;XNECV
    >P1;LQBP37
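The HEADER layout above can be checked mechanically. The following is a minimal Python sketch, not part of the PIR software; the function name parse_header and the returned dictionary keys are illustrative assumptions.

```python
import re

# '>' + TYPE ('P1' or 'F1') + ';' + CODE (four to six alphanumeric characters)
_HEADER_RE = re.compile(r">(P1|F1);([A-Za-z0-9]{4,6})")


def parse_header(line):
    """Split a HEADER record into its TYPE and CODE fields (hypothetical helper)."""
    match = _HEADER_RE.fullmatch(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a valid HEADER record: {line!r}")
    seq_type, code = match.groups()
    # 'P1' marks a complete protein, 'F1' a fragment.
    return {"type": seq_type, "complete": seq_type == "P1", "code": code}
```

Calling parse_header(">F1;XNECV") yields the type 'F1', the code 'XNECV' and a false "complete" flag; any line that does not match the layout raises ValueError.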
Section 2. The TITLE (exactly 1 record) contains the protein name and species name.
Format:
Field    Length     Contents of field
-----    ------     -----------------
SEQNAM   variable   name of the protein
' - '    3          field separator
ORGNAM   variable   name of biological source of the protein
If the organism or organelle translates nucleic acid to protein by a special genetic code, this fact is noted in the ORGNAM field by the symbol '(SGCn)', where n denotes the special genetic code (see database file SGC.LIS).
Examples: finger protein zfpA - turnip fern chloroplast E6 protein - European elk papillomavirus glycogen (starch) synthase (EC 2.4.1.11) precursor - potato L-aspartate oxidase (EC 1.4.3.16) - Escherichia coli
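A TITLE record can be split at the ' - ' separator and checked for the '(SGCn)' symbol. This is a minimal Python sketch under two assumptions that the specification does not state: the last ' - ' on the line is taken as the field separator, and n is a decimal integer. The name parse_title is hypothetical.

```python
import re

# '(SGCn)' in the ORGNAM field marks a special genetic code.
_SGC_RE = re.compile(r"\(SGC(\d+)\)")


def parse_title(line):
    """Split a TITLE record into SEQNAM and ORGNAM (hypothetical helper)."""
    # Assumption: the LAST ' - ' on the line is the field separator.
    seqnam, sep, orgnam = line.rstrip("\r\n").rpartition(" - ")
    if not sep:
        raise ValueError(f"no ' - ' separator in TITLE record: {line!r}")
    match = _SGC_RE.search(orgnam)
    sgc = int(match.group(1)) if match else None
    return {"seqnam": seqnam, "orgnam": orgnam, "sgc": sgc}
```

Hyphens without surrounding blanks, as in "L-aspartate oxidase", are left untouched because only the three-character separator is matched.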
Section 3. The SEQUENCE (variable number of records) section contains the amino acid sequence in one letter coding.
The format of the sequence section of an entry consists of a variable number of records. Each sequence record may be up to 500 characters long. The characters represent the amino acid sequence stored from the amino end to the carboxyl end. The symbols used for the amino acids are the one-letter abbreviations shown in Table I. The amino acid sequence is terminated by an asterisk, which is the last character on the last line of the sequence section. In addition, the sequence may contain punctuation symbols to indicate various degrees of reliability of the data; this coding is described in Table II. One punctuation symbol may precede any amino acid symbol or the terminating asterisk. The sequence lines contain no blank characters.
Examples:
    GDVE(G.K.G.I.F=T,M,C.S.Q,C.H.V,E.K.G.G.K.H)
    FTGPNLHGLFGRK.TGQAVGYSYTAANK.NK.GIIWGDDTLM
    EYLENPK.RYIPGTK.MVFTGLSK.YRE
    RTNLIAYLK.EK.TAA*
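The rules above (one-letter symbols, optional punctuation, a terminating asterisk) can be applied when reading the sequence records of one entry. The following Python sketch is illustrative only; read_sequence is a hypothetical name, and the symbol sets are taken from Tables I and II of this document.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYBZX")  # one-letter symbols, Table I
PUNCTUATION = set("()=/.,")                   # reliability symbols, Table II


def read_sequence(records):
    """Join the SEQUENCE records of one entry (hypothetical helper).

    Returns (annotated, residues): the sequence with its punctuation,
    and the bare amino acid string with punctuation removed.
    """
    text = "".join(record.strip() for record in records)
    if not text.endswith("*"):
        raise ValueError("sequence is not terminated by an asterisk")
    annotated = text[:-1]
    unexpected = set(annotated) - AMINO_ACIDS - PUNCTUATION
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    residues = "".join(ch for ch in annotated if ch in AMINO_ACIDS)
    return annotated, residues
```

The bare residue string is what length and composition calculations would use; the annotated string preserves the reliability coding.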
The .REF file contains all annotation information of an entry. The first two records are identical to the HEADER and TITLE records of the .SEQ file described above. Each entry consists of a variable number of consecutive records. The information contained in these records is divided into three sections. The sections are listed below in the order in which they occur in the entry.
Section 1. The HEADER (exactly 1 record) contains information that marks the beginning of an entry and identifies the sequence contained in the entry. The HEADER line is identical with that in the PIRn.SEQ file.
Section 2. The TITLE (exactly 1 record) contains the protein name and species name. The TITLE line is identical with that in the PIRn.SEQ file.
Section 3. The TEXT section (variable number of records) contains annotation information such as species, reference citations, genetic information, superfamily classification and search keywords.
The format of the TEXT section consists of a variable number of records. Each TEXT record may be up to 500 characters long. Each TEXT record, with the exception of the citation record, begins with a Tag that indicates the type of information contained on that record. Certain Tags mark the beginning of a block of data. The types of Tags or data items are listed below with examples. A more compact format specification is given in Table III.
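As a rough illustration of how a reader might separate the Tag from the data on a TEXT record, here is a minimal Python sketch. It assumes the prefixes used in this document ('N;', 'C;', 'A;', 'R;', 'F;') and treats any other line as an untagged citation record; split_text_record is a hypothetical name, not a PIR routine.

```python
def split_text_record(line):
    """Return (prefix, tag, value) for one TEXT record (hypothetical helper).

    prefix is 'N', 'C', 'A', 'R' or 'F'; it is None for an untagged
    record, which is assumed here to be a citation record.
    """
    line = line.rstrip("\r\n")
    if line[:2] in ("R;", "F;"):
        # Author and Feature records carry no named Tag after the prefix.
        return (line[0], None, line[2:])
    if line[:2] in ("N;", "C;", "A;"):
        tag, sep, value = line[2:].partition(":")
        if sep:
            return (line[0], tag, value.strip())
    return (None, None, line)
```

Only the first colon is used as the Tag delimiter, so values that themselves contain colons (for example database:identifier pairs) stay intact.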
> Alternate names record (optional)
An alternate name specification consists of a single text record that is identified by having 'N;Alternate names:' Tag as the first 18 characters. The remainder of the line contains a list of other common names for the protein. Alternate names are separated by semicolons.
Examples:
    N;Alternate names: soluble cytochrome f; cytochrome c553
    N;Alternate names: cusacyanin; plantacyanin
    N;Alternate names: component II; nitrogenase reductase
> Contains record (optional)
A contains line consists of a single text line that is identified by having 'N;Contains:' as the first 11 characters. This record specifies other activities or functions that are included in the sequence shown in the entry. Multiple "contains" titles are separated by semicolons.
Examples:
    N;Contains: cytochrome b2 core
    N;Contains: ribonuclease (EC 3.1.-.-) activity
    N;Contains: Arg-vasopressin; neurophysin 2
    N;Contains: intestinal peptide PHM-27
> Species record (required; start of Species block)
The Species line is a single record identified by 'C;Species:' as the first 10 characters. This record describes the source or organism from which the sequence is derived. Each entry contains this record and it may mark the beginning of a Species data block.
> Species Note record (optional; contained in Species block)
The Species Note line is a single record identified by 'A;Note:' as the first 7 characters. This record describes special Species information and is contained in the Species block. All information in the 'C;Host:' records from previous versions of the database has been transferred to this record.
Examples of maximal Species block:
    C;Species: vaccinia virus
    A;Note: host Homo sapiens (man)

    C;Species: equine herpesvirus 1, equine abortion virus
    A;Note: host Equus caballus (domestic horse)
> Date record (required)
The Date line is a single record identified by 'C;Date:' as the first 7 characters. This record indicates the date the entry was added to the dataset, the date the SEQUENCE was modified and the date the TEXT was modified. Each of the dates (add, seq, text) is optional but at least one must appear.
Examples:
    C;Date: 31-Jul-1979 sequence_revision 30-Sep-1992 text_change 14-Oct-1993
    C;Date: 31-May-1979 sequence_revision 25-Feb-1985 text_change 14-Oct-1993
    C;Date: text_change 30-Jun-1993
    C;Date: sequence_revision 30-Sep-1991 text_change 02-Dec-1993
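The three optional dates on a Date record can be recovered by scanning the tokens after the Tag. The sketch below is a minimal Python illustration; the function name parse_date_record and the key "added" for the unlabelled first date are assumptions made for this example.

```python
from datetime import date

_MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}
_KEYWORDS = ("sequence_revision", "text_change")


def _parse_day(token):
    """Convert 'DD-Mon-YYYY' into a date object."""
    day, month, year = token.split("-")
    return date(int(year), _MONTHS[month], int(day))


def parse_date_record(line):
    """Return the dates found on a 'C;Date:' record (hypothetical helper)."""
    prefix = "C;Date:"
    if not line.startswith(prefix):
        raise ValueError(f"not a Date record: {line!r}")
    dates = {}
    key = "added"  # a date with no preceding keyword is the add date
    for token in line[len(prefix):].split():
        if token in _KEYWORDS:
            key = token
        else:
            dates[key] = _parse_day(token)
    if not dates:
        raise ValueError("at least one date must appear")
    return dates
```

A record that carries only "text_change 30-Jun-1993" returns a single-key dictionary, which reflects the rule that each date is optional but at least one must appear.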
> Entry-specific Accession record (required)
The Entry-specific Accession line is a single record identified by 'C;Accession:' as the first 12 characters. This record indicates a list of Accession numbers associated with the entry.
Examples:
    C;Accession: A92196; A92218; A00169
    C;Accession: A90383; A92053; B93774; A92231; A00170
    C;Accession: JS0745; A00172; S07960
> Author/citation records (required; start of Reference block)
The Author line is a single record identified by 'R;' as the first 2 characters. This record indicates a list of authors, separated by semicolons, associated with the Reference block. Immediately following is the citation record that specifies the source of the Reference. This pair of records is required and begins the Reference block. The Reference block may be repeated in the TEXT section.
Examples:
    R;Aigle, M.; Biteau, N.; Crouzet, M.
    submitted to the Protein Sequence Database, March 1992

    R;Cossart, P.; Katinka, M.; Yaniv, M.
    Nucleic Acids Res. 9, 339-347, 1981

    R;Skala, J.; Purnelle, B.; Goffeau, A.
    Yeast 8, 409-417, 1992
> Authors record (optional; repeating; contained in Reference block)
The Authors line is an optionally repeating record identified by 'A;Authors:' as the first 10 characters. This record is used as a supplement to the Author list in the 'R;' record. If the 'R;' record exceeds the maximum record length of 500 characters then additional authors are listed on the Authors line.
> Reference Title record (optional; contained in Reference block)
The Reference Title line is a single record identified by 'A;Title:' as the first 8 characters or 'A;Description:' as the first 14 characters. This record is the publication title (A;Title) or description of a sequence submission (A;Description:). Either Tag may be present but not both.
Examples:
    A;Title: Amino acid sequence of ragweed allergen Ra3.
    A;Description: The amino acid sequence of a type I copper protein with an unusual serine- and hydroxyproline-rich C-terminal domain isolated from cucumber peelings.
> Reference number record (required; contained in Reference block)
The Reference number line is a single record identified by 'A;Reference number:' as the first 19 characters. This record contains standardized information relating a reference with a six character alphanumeric string (ref_num). Optionally a Medline reference number may appear in the record identified by the 'MUID:' Tag and separated from the ref_num by a semicolon.
Examples:
    A;Reference number: A94561
    A;Reference number: A00100; MUID:82075747
> Contents record (optional; contained in Reference block)
The Contents line is a single record identified by 'A;Contents:' as the first 11 characters. This record specifies the source of the protein (species and/or strain), the portion of the sequence reported, the method of sequence determination, or the extent of experimental detail reported. The record may indicate that the reference is included as a source of ancillary information such as X-ray crystallography or active site identification.
Examples:
    A;Contents: ATCC 16455
    A;Contents: annotation; methylation
    A;Contents: X-ray crystallography, 2.8 angstroms
    A;Contents: Strain BALB/c
> Reference Note record (optional; repeating; contained in Reference block)
The Reference Note line is a repeating record identified by 'A;Note:' as the first 7 characters beneath the start of a Reference block. This record describes reference specific comments.
Examples:
    A;Note: This is the final paper in a series.
    A;Note: The nucleotide sequence is not given in this paper.
> Reference-specific Accession records (optional; start of Accession block; contained in Reference block)
The Reference-specific Accession line is a single record identified by 'A;Accession:' as the first 12 characters. This record indicates a single unique six character alphanumeric string (acc_num) associated with the shown sequence according to the sequence specification in the Residues record described below. These unique numbers specify a unique sequence. The presence of this Accession record implies the start of an Accession block that may be repeated beneath the Reference block; the Accession block(s) is contained in the Reference block.
Examples:
    A;Accession: A00086
    A;Accession: JT0008
    A;Accession: S13939
> Accession Status record (optional; contained in Accession block)
The Accession-specific Status line is a single record identified by 'A;Status:' as the first 9 characters. This record indicates the review status of the sequence referred to by the acc_num. Currently "preliminary" is the only value for this information.
Example: A;Status: preliminary
> Molecule type record (required if Accession present; contained in Accession block)
The molecule type line is a single record identified by 'A;Molecule type:' as the first 16 characters. This record indicates the type of molecule from which the sequence was determined. Valid values for this data item are: "protein", "DNA", "mRNA", "nucleic acid" and "genomic RNA."
Examples:
    A;Molecule type: protein
    A;Molecule type: DNA; mRNA
> Residues record (required if Accession present; contained in Accession block)
The Residues line is a single record identified by 'A;Residues:' as the first 11 characters. This record specifies a reported sequence according to the amino acid sequence as depicted in PIRn.SEQ.
Examples:
    A;Residues: 1-85,'SK',88-92,'N',94-100,'K',102-103,'A'
    A;Residues: 2-57,'IV',60-61,'ZZ',64-66,'Z',68-69,'ZB',72-105
    A;Residues: 1-107
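The Residues specification can be read as a comma-separated list of position ranges and quoted residue strings. The Python sketch below rests on an interpretation inferred from the examples rather than stated in the text: ranges are taken as 1-based inclusive positions in the PIRn.SEQ sequence, and quoted strings as residues reported in place of the positions skipped. The names parse_residues and reported_sequence are hypothetical.

```python
import re

# Either a range 'start-end' or a quoted literal such as 'SK'.
_SEGMENT_RE = re.compile(r"(\d+)-(\d+)|'([A-Z]+)'")


def parse_residues(line):
    """Parse an 'A;Residues:' record into a list of segments."""
    prefix = "A;Residues:"
    if not line.startswith(prefix):
        raise ValueError(f"not a Residues record: {line!r}")
    segments = []
    for item in line[len(prefix):].strip().split(","):
        match = _SEGMENT_RE.fullmatch(item.strip())
        if match is None:
            raise ValueError(f"unrecognized segment: {item!r}")
        if match.group(3) is not None:
            segments.append(("literal", match.group(3)))
        else:
            segments.append(("range", int(match.group(1)), int(match.group(2))))
    return segments


def reported_sequence(segments, residues):
    """Rebuild the reported sequence from the entry's bare residue string.

    Assumption: ranges are 1-based and inclusive; literals are inserted as given.
    """
    parts = []
    for segment in segments:
        if segment[0] == "literal":
            parts.append(segment[1])
        else:
            parts.append(residues[segment[1] - 1:segment[2]])
    return "".join(parts)
```

In the first example above, the literal 'SK' falls between ranges 1-85 and 88-92, which is consistent with reading it as the report's residues at positions 86 and 87.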
> Cross-references record (optional; contained in Accession block)
The Cross-references line is a single record identified by 'A;Cross-references:' as the first 19 characters. This record specifies a list of database/identifier pairs that indicate related sequence information in another database. Current cross referenced databases are: "GB", "EMBL", "PDB", "DDBJ", "CAS", "NCBIP" and "NCBIN."
Examples:
    A;Cross-references: GB:J04618; GB:J04619
    A;Cross-references: CAS:124041-95-8
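The database/identifier pairs on a Cross-references record can be separated as follows. This is a minimal Python sketch; parse_cross_references is a hypothetical name, and the set of database codes is copied from the paragraph above.

```python
KNOWN_DATABASES = {"GB", "EMBL", "PDB", "DDBJ", "CAS", "NCBIP", "NCBIN"}


def parse_cross_references(line):
    """Return (database, identifier) pairs from an 'A;Cross-references:' record."""
    prefix = "A;Cross-references:"
    if not line.startswith(prefix):
        raise ValueError(f"not a Cross-references record: {line!r}")
    pairs = []
    for item in line[len(prefix):].split(";"):
        item = item.strip()
        if not item:
            continue
        # Split at the first colon only; identifiers may contain hyphens.
        database, sep, identifier = item.partition(":")
        if not sep:
            raise ValueError(f"missing database code in {item!r}")
        pairs.append((database.strip(), identifier.strip()))
    return pairs
```

A caller could compare each database code against KNOWN_DATABASES to flag codes introduced after this version of the specification.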
> Accession Genetics record (optional; contained in Accession block)
The Accession Genetics line is a single record identified by 'A;Genetics:' as the first 11 characters. This record contains a Tag that specifies which 'C;Genetics:' block (defined below) describes the sequence report depicted by the Accession block. This record is present only when more than one 'C;Genetics:' block is present in the entry.
Examples:
    A;Genetics: ST1
    A;Genetics: ST2
    A;Genetics: HBA1
> Accession Note record (optional; repeating; contained in Reference block)
The Accession Note line is a repeating record identified by 'A;Note:' as the first 7 characters beneath the start of an Accession block. This record describes sequence specific comments.
Examples:
    A;Note: The authors translated the codon CTG for residue 169 as Ile.
    A;Note: 175-Ala was also found.
    A;Note: The difference at the carboxyl end is due to a frameshift.
> Comment records (optional; repeating)
The Comment line is a repeating record identified by 'C;Comment:' as the first 10 characters. This record contains general information in a free format, natural language form about the protein sequence entry. Some Comment records can be decomposed and the information moved to more appropriate records; this is an ongoing standardization project.
Examples:
    C;Comment: Met preceding 1-Gly is removed after translation.
    C;Comment: The sequence shown is iso-1-cytochrome c.
    C;Comment: Euglena is a genus of green algae.
> Genetics record (optional; start of Genetics block)
The Genetics line is a single record identified by 'C;Genetics:' as the first 11 characters. This record contains no information except if more than one Genetics block exists in the entry. In the case of multiple gene information this record will contain a unique Tag that points to an Accession block; this indicates which sequence report is related to the genetic data. Presence of this record implies the start of a Genetics block and is required if other genetic information exists such as that defined below. The Genetics block may be repeated within an entry.
Examples:
    C;Genetics:
    C;Genetics:
    C;Genetics:
    C;Genetics:
> Gene record (optional; contained in Genetics block)
The Gene line is a single record identified by 'A;Gene:' as the first 7 characters. This record specifies the gene symbol used to denote the gene. Some symbols may contain "GDB" as a Tag; this indicates a cross reference to the Genome Database.
Examples:
    A;Gene: psbE
    A;Gene: GDB:CYP2D6
    A;Gene: CYP2B1
> Map position record (optional; contained in Genetics block)
The Map position line is a single record identified by 'A;Map position:' as the first 15 characters. This record specifies a map position on which the gene is located. For viruses this may indicate a segment number.
Examples:
    A;Map position: 71.6-76.2
    A;Map position: 85 min
    A;Map position: 4q21-q23
> Genome record (optional; contained in the Genetics block)
The Genome line is a single record identified by 'A;Genome:' as the first 9 characters. This record specifies which type of genome is described. Current values for this record are: "mitochondrion", "chloroplast", "cyanelle" and "plasmid."
Examples:
    A;Genome: chloroplast
    A;Genome: cyanelle
    A;Genome: plasmid
    A;Genome: mitochondrion
> Genetic code record (optional; contained in Genetics block)
The Genetic code line is a single record identified by 'A;Genetic code:' as the first 15 characters. This record indicates which Special Genetic Code table is used by the specified organism for nucleic acid translation. Current values for this record are "SGC1" - "SGC9." For more information see the file SGC.LIS described below.
Examples:
    A;Genetic code: SGC1
> Start codon record (optional; contained in Genetics block)
The Start codon line is a single record identified by 'A;Start codon:' as the first 14 characters. This record indicates the codon in the nucleic acid sequence where translation is initiated.
Examples:
    A;Start codon: ATT
    A;Start codon: ATC
    A;Start codon: GTG
> Introns record (optional; contained in Genetics block)
The Introns line is a single record identified by 'A;Introns:' as the first 10 characters. This record specifies the intron segments needed to code for the gene product. Segments are separated by semicolons.
Examples:
    A;Introns: 253/3; 270/3
    A;Introns: 139/1; 143/2; 169/2; 253/3; 270/3
> Genetics Note record (optional; repeating; contained in Genetics block)
The Genetics Note line is a repeating record identified by 'A;Note:' as the first 7 characters beneath the start of a Genetics block. This record describes genetics specific comments.
Examples:
    A;Note: strain D273-10B/A21
    A;Note: strain 777-3A
> Function record (optional; start of Function block)
The Function line is a single record identified by 'C;Function:' as the first 11 characters. This record contains no information except if more than one Function block exists in the entry. In the case of multiple function information this record will contain a unique Tag to distinguish each block. Presence of this record implies the start of a Function block and is required if other function information exists such as that defined below. The Function block may be repeated within an entry.
> Function Description record (optional; contained in Function block)
The Function Description line is a single record identified by 'A;Description:' as the first 14 characters and immediately following the Function record. This record indicates the type of function the specified protein may have.
Example:
    C;Function:
    A;Description: This protein functions as a molecular chaperone in the endosymbiont.
> Superfamily record (optional)
The Superfamily line is a single record identified by 'C;Superfamily:' as the first 14 characters. This record indicates which Superfamily(s) has the protein as a member. Individual Superfamily names are separated by semicolons.
Examples:
    C;Superfamily: phosphorylase
    C;Superfamily: sucrose synthase; sucrose/sucrose-phosphate synthase homology
> Keywords record (optional)
The Keywords line is a single record identified by 'C;Keywords:' as the first 11 characters. This record indicates which keyword(s) are associated with the entry. Terms in the list are separated by semicolons and are used as a retrieval key.
Examples:
    C;Keywords: homodimer; NAD; oxidoreductase
    C;Keywords: oxidoreductase; pentose phosphate pathway
> Feature record (optional; repeating)
The Feature line is a repeating record identified by 'F;' as the first 2 characters. This record may be one of many comprising a Feature table. Each line of the Feature table has the following format. The text from position 3 up to the occurrence of a '/' character is a range or site specification. Multiple segments corresponding to a single feature are separated by commas. The feature title appears after the '/' and consists of a feature descriptor, a title, and an optional code extension enclosed by the symbols < and >. A feature descriptor is a word or short phrase followed by a colon that defines the type of feature (refer to Table IV for a complete listing and explanation of feature descriptors). The code extension consists of a short character string (alphanumerics only) that when associated with the entry identification code defines a logical address for the subsequence that is unique throughout the database. For example, the path DEECK->DKI uniquely defines the feature with code extension DKI in entry DEECK.
Examples:
    F;316/Active site: Asp
    F;483/Binding site: carbohydrate
    F;19-69,28-52,44-65/Disulfide bonds:
    F;1-249/Domain: aspartokinase I <DKI>
    F;1-24/Domain: signal sequence <SIG>
    F;50-54/Peptide: Met-enkephalin 1 <ME1>
    F;15-72/Protein: basic protease inhibitor <MAT>
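A Feature line can be decomposed into its site specification, descriptor, title and optional code extension. The following Python sketch is one possible reading of the layout described above; parse_feature is a hypothetical name, and each segment is returned as a (first, last) pair without interpreting what the pair means for a given descriptor (for disulfide bonds, for instance, a pair presumably names the two bonded residues rather than a span).

```python
import re

_FEATURE_RE = re.compile(
    r"F;(?P<spec>[^/]+)/"                 # range or site specification
    r"(?P<descriptor>[^:]+):\s*"          # feature descriptor, ends with ':'
    r"(?P<title>.*?)\s*"                  # free-text title (may be empty)
    r"(?:<(?P<ext>[A-Za-z0-9]+)>)?"       # optional code extension
)


def parse_feature(line):
    """Parse one 'F;' record of a Feature table (hypothetical helper)."""
    match = _FEATURE_RE.fullmatch(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a Feature record: {line!r}")
    segments = []
    for item in match.group("spec").split(","):
        first, sep, last = item.strip().partition("-")
        # A single site such as '316' becomes the pair (316, 316).
        segments.append((int(first), int(last) if sep else int(first)))
    return {
        "segments": segments,
        "descriptor": match.group("descriptor").strip(),
        "title": match.group("title"),
        "extension": match.group("ext"),
    }
```

Combining the entry CODE with the returned extension gives the logical address described above, for example DEECK->DKI.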
This file is an index to the three primary database files: PIRn.SEQ, PIRn.REF, PIRn.TTL. The index provides information that allows PSQ to use the VAX-11 RMS RFA record access mode to locate and access entry records in the primary database files. Entries are placed in the same order in the .REF, .SEQ and .TTL files; indexing is based on this order. Database records are accessed using the ISN_CODE, LOC_ENTRY, GET_SEQUENCE, GET_TEXT, and GET_TITLE routines located in module DBMS of the PSQ text library, PSQ.TLB. Refer to these routines for more specific information. NOTE: If changes are made to any of the primary database files, the index file will no longer correctly point to the database entries and the PIR programs will not operate.
The Database Index files contain information present in the Primary Database files but in a manner that facilitates quick access. If any of the auxiliary index files are absent, the PSQ commands supported by these files will not operate. These files can be created using the INDEXER/SORTTMP programs except for the .CAX and .CDX files which must be supplied and the .TSC file which may be created using the CREATETSC program. See PIR document CREATEDB for more information.
This file contains the values of the molecular weight, sequence length, and percentage composition of each entry; it is used by the SELECT and REPORT commands of PSQ. There is one record for each entry and the order is the same as in the Primary Database files. The file attributes are as follows:
OPEN Keyword      Keyword value
------------      -------------
ORGANIZATION      Sequential
ACCESS            Direct
FORM              Unformatted
RECORDTYPE        Fixed
RECORDSIZE        25
Each record contains LENGTH, WEIGHT and DATA variables where:
    LENGTH is a real variable (sequence length);
    WEIGHT is a real variable (molecular weight); and
    DATA is an array of 23 real variables (% composition of amino acids
         A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,B,Z,X)
This file contains ancillary character data associated with each database entry; it is used by the SELECT, REPORT, SUPERFAMILY, and TAXONOMY commands of PSQ. There is one record for each entry and they are ordered in the same sequence as the Primary Database files. The file attributes are as follows:
OPEN Keyword      Keyword value
------------      -------------
ORGANIZATION      Sequential
ACCESS            Direct
FORM              Formatted
RECORDTYPE        Fixed
RECORDSIZE        81
Each record is organized according to the following table:
Format:
Field        Length   Contents of field
-----        ------   -----------------
CODE         6        retrieval code
SUPER        6        superfamily number
FAMILY       4        family number
SUBFAM       4        subfamily number
ENTRY        4        entry number
SUBENT       4        subentry number
ADDED_DATE   6        date entry was added to the database
SEQ_DATE     6        date of last sequence update
TEXT_DATE    6        date of last text update
TYPE         1        sequence type (P=complete; F=fragment)
NULL         1        unspecified field
GRP          3        primary taxonomic group
GRP1         3        auxiliary taxonomic group 1
  .          .          .
  .          .          .
  .          .          .
GRP10        3        auxiliary taxonomic group 10
The dates have the format: YYMMDD; YY-year, MM-month, DD-day, where YY, MM, DD are numeric values. Dates prior to 1982 are not reliable; entries added prior to 1980 may contain blank characters in these fields. The primary taxonomic group code is associated with the species; each record may contain up to 10 auxiliary taxonomic group codes.
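The field lengths in the table sum to the 81-character record size, so a record can be cut at fixed offsets. The Python sketch below illustrates this; parse_record is a hypothetical name, field values are returned as stripped strings, and the YYMMDD dates are left unconverted because the two-digit year is ambiguous.

```python
# (field name, length) pairs in record order, from the table above
FIELDS = [
    ("CODE", 6), ("SUPER", 6), ("FAMILY", 4), ("SUBFAM", 4),
    ("ENTRY", 4), ("SUBENT", 4), ("ADDED_DATE", 6), ("SEQ_DATE", 6),
    ("TEXT_DATE", 6), ("TYPE", 1), ("NULL", 1), ("GRP", 3),
] + [(f"GRP{i}", 3) for i in range(1, 11)]

RECORD_SIZE = sum(length for _, length in FIELDS)  # 81 characters


def parse_record(record):
    """Cut one fixed-length record into named fields (hypothetical helper)."""
    if len(record) != RECORD_SIZE:
        raise ValueError(f"expected {RECORD_SIZE} characters, got {len(record)}")
    fields = {}
    position = 0
    for name, length in FIELDS:
        fields[name] = record[position:position + length].strip()
        position += length
    return fields
```

Blank fields, such as unused auxiliary taxonomic groups or dates on early entries, come back as empty strings.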
The text retrieval programs of the PIR-International use an inverted database model. For each text field in the database an index file containing an alphabetically sorted list of the terms found in that field and the logical addresses of the corresponding entries is compiled. These text index files are identically formatted and contain ASCII data and ISN numbers. These files are accessed by the INDEXS module of PSQ; refer to this routine for specific information. The file attributes are as follows:
OPEN Keyword      Keyword value
------------      -------------
ORGANIZATION      Sequential
ACCESS            Sequential
FORM              Formatted
RECORDTYPE        Variable
RECORDSIZE        500 (or less)
CARRIAGECONTROL   List
The following files are currently available. Note that none can be displayed by the DCL TYPE, PRINT, or EDIT commands. The CONVINDX program will produce an ASCII version of these files. See PIR document CREATEDB for more information.
PIRn.ACX   Accession number index
PIRn.AUX   Author name index
PIRn.CRX   Cross-reference index
PIRn.FTX   Feature title index
PIRn.GNX   Gene symbol index
PIRn.JRX   Citation index
PIRn.RNX   Reference number index
PIRn.SFX   Superfamily name index
PIRn.SNX   Superfamily number index
PIRn.SPX   Species index
PIRn.TTX   Entry title index
PIRn.WOX   Keyword index
This file contains a lookup table of all tripeptides in the data set and the corresponding locations of the sequences and positions in the sequence where the tripeptides occur; it is used by the SCAN command of PSQ. Refer to module SCAN of the PSQ text library, PSQ.TLB, for more specific information. The file attributes are as follows:
OPEN Keyword      Keyword value
------------      -------------
ORGANIZATION      Sequential
ACCESS            Sequential
FORM              Unformatted
RECORDTYPE        Fixed
RECORDSIZE        (512/4)
The Title file lists the code (unique entry retrieval identifier) and the TITLE (protein name - organism name) of each of the sequence entries. There is one primary TITLE record for each database entry; this line is immediately followed by a variable number of Alternate names and/or Contains lines. These additional lines contain blanks in the first six spaces in place of the CODE; this distinguishes them from the primary title lines. Continuation lines are indicated by a dash as the last character on the preceding line.
Format:
Field   Length     Contents of field
-----   ------     -----------------
CODE    6          retrieval code (blank characters are used to fill
                   the code field to 6 characters)
TITLE   variable   sequence title (see PIRn.SEQ file description for
                   TITLE format information)
Examples:
CCBYBCCytochrome c, iso-2 - Baker's yeast
CCML6 Cytochrome c6 - Monochrysis lutheri-
      Alternate names: soluble cytochrome f; cytochrome c553
NRMS  Ribonuclease (EC 3.1.27.5) - Mouse
MYSH  Myoglobin - Sheep and red deer
CCHECCCytochrome c - Hemp
AXBO  Adrenodoxin - Bovine
Note that the .TTL file is not supplied with the PIR-International Protein Sequence Database. This optional file is useful only for older software; current versions of the PIR programs do not use this file. The program CREATETTL will create this file and is included on the Tape Release. See PIR document CREATEDB for more information.
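Because the CODE occupies a fixed six-character field, primary title lines and the additional lines that follow them can be told apart by testing whether that field is blank. The Python sketch below illustrates the idea; parse_ttl is a hypothetical name, and stripping a trailing dash is an assumption about how the continuation marker should be treated.

```python
def parse_ttl(lines):
    """Group .TTL lines into entries (hypothetical helper).

    A line with a non-blank 6-character CODE field starts a new entry;
    lines with a blank CODE field are attached to the preceding entry.
    """
    entries = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        code = line[:6].strip()
        text = line[6:].strip()
        # Assumption: a trailing dash only signals that more lines follow.
        if text.endswith("-"):
            text = text[:-1].rstrip()
        if code:
            entries.append({"code": code, "title": text, "extra": []})
        elif entries:
            entries[-1]["extra"].append(text)
    return entries
```

Codes that fill all six characters, such as CCBYBC, run directly into the title text, which is why the split is done by column rather than by whitespace.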
The PRINDX.LIS file is a database listing (PIR1 and PIR2) of classified sequences ordered by Superfamily number. SUPFAMNUM.LIS contains a listing of the Superfamily names found for each Superfamily number represented in Section 1 (PIR1) and Section 2 (PIR2).
The TAXONOMY.LIS and SGC.LIS files are compiled and maintained by Andrzej Elzanowski of MIPS at the Max-Planck-Institut Fuer Biochemie in Martinsried, Germany. These files represent data as it appears in Section 1 (PIR1) and Section 2 (PIR2) of the Protein Sequence Database.
The TAXONOMY.LIS file is a hierarchical classification of source organisms in an order of increasing complexity. The PIR2 and PIR3 data sets are ordered according to the hierarchy of this file.
The SGC.LIS file contains the genetic code tables corresponding to the special genetic codes recognized by PSQ. There is an entry in the file for each special genetic code; if not present, the corresponding genetic code will not be recognized by PSQ. Each entry begins with a header line, which must be immediately followed by 16 codon table lines. Codons are ordered from the uppermost left corner of the table in the nucleotide order U, C, A, G (this order is not mandatory). Blank lines between the header line and the codon table lines or between codon table lines are ignored and may be used as separators within the table. All lines that occur between entries in the file are ignored. The file is an ASCII flat file.
Table Header Format:
>SGC1 - Mammalian mitochondrial genetic code
Field    Length     Contents of field
-----    ------     -----------------
'>'      1          begins a new SGC entry
SGCn     variable   designates special genetic code number; must
                    contain SGC as first three characters followed
                    by the integer special genetic code number
' - '    3          field separator
SGCNAM   variable   special genetic code name
Each record contains four codon-amino acid field groups separated by 6 blank spaces.
Table Data Format:
UUU  F Phe      UCU  S Ser      UAU  Y Tyr      UGU  C Cys
Field    Length   Contents of field
-----    ------   -----------------
CODON    3        codon consisting of three single-character
                  nucleotide symbols
' '      1        field separator
AAONE    2        one-letter amino acid code; '+' in the first
                  space indicates an initiation codon; otherwise
                  the first space should be blank.
' '      1        field separator
AATHR    3        three-letter amino acid code
'      ' 6        field separator
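Each codon group is 10 characters wide (3 + 1 + 2 + 1 + 3) and groups are separated by 6 blanks, so consecutive groups start 16 columns apart. The Python sketch below cuts a codon table line on that basis; parse_codon_line is a hypothetical name, and it assumes the line starts in column 1 with no leading blanks.

```python
GROUP_WIDTH = 10   # CODON(3) + ' ' + AAONE(2) + ' ' + AATHR(3)
GROUP_STRIDE = 16  # group width plus the 6-blank separator


def parse_codon_line(line):
    """Cut one codon table line into its four field groups (hypothetical helper)."""
    line = line.rstrip("\r\n")
    groups = []
    for index in range(4):
        start = index * GROUP_STRIDE
        chunk = line[start:start + GROUP_WIDTH]
        if len(chunk) < GROUP_WIDTH:
            raise ValueError(f"line too short for group {index + 1}")
        flag = chunk[4]  # first space of AAONE: '+' or blank
        if flag not in " +":
            raise ValueError(f"unexpected initiation flag {flag!r}")
        groups.append({
            "codon": chunk[0:3],
            "one_letter": chunk[5],
            "three_letter": chunk[7:10],
            "initiator": flag == "+",
        })
    return groups
```

Reading 16 such lines after a header line would give the full 64-codon table for one special genetic code.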
These files contain the restriction endonuclease lists utilized by the PSQ ENZYME command. The recognition site is given in ambiguous nucleotide code (see Table V); the apostrophe designates the cut-site. If the cut-site is not symmetrical, the complementary recognition and cut-site are given, separated by a colon.
Format:

Field    Length     Contents of field
-----    ------     -----------------
CODE     13         restriction enzyme code
SPEC     variable   recognition and cut-site specification
' - '    3          field separator
ORGAN    variable   biological source, or comment
Examples:

AvaI         C'YCGRG - Anabaena variabilis
EcoRI'       RRATYY - Escherichia coli (strain RY13) plasmid RTF1
HhaI         GCG'C - Haemophilus haemolyticus
MboII        GAAGANNNNNNNN':'NNNNNNNTCTTC - Moraxella bovis
XbaI         T'CTAGA - Xanthomonas badrii
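As an illustration of the record layout, the Python sketch below splits one enzyme line into its fields. It is a minimal example under stated assumptions, not the routine used by PSQ: the names parse_site and parse_enzyme_line are hypothetical, and the enzyme code is assumed to be padded with blanks to the 13-column width given in the Format table.

```python
def parse_site(spec):
    """Split a site such as C'YCGRG into (bases, cut_offset).

    cut_offset is the number of bases before the apostrophe,
    or None when no cut-site is marked."""
    cut = spec.find("'")
    return spec.replace("'", ""), (cut if cut >= 0 else None)

def parse_enzyme_line(line):
    """Parse CODE (13 columns), SPEC, ' - ' separator, ORGAN."""
    code = line[:13].strip()
    spec, _, organ = line[13:].partition(" - ")
    # A colon separates the site from its complement for asymmetric cuts.
    forward, _, complement = spec.strip().partition(":")
    return {
        "code": code,
        "site": parse_site(forward),
        "complement": parse_site(complement) if complement else None,
        "source": organ.strip(),
    }

# Usage with a line built from the AvaI example:
record = parse_enzyme_line("AvaI".ljust(13) + "C'YCGRG - Anabaena variabilis")
```

Slicing the first 13 columns before searching for the apostrophe keeps an apostrophe that is part of an enzyme code from being mistaken for a cut-site.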
The following abbreviations conform to those suggested by the IUPAC-IUB Commission on Biochemical Nomenclature, J. Biol. Chem. 243, 3557-3559, 1968.

A    Ala    Alanine
C    Cys    Cysteine
D    Asp    Aspartic acid
E    Glu    Glutamic acid
F    Phe    Phenylalanine
G    Gly    Glycine
H    His    Histidine
I    Ile    Isoleucine
K    Lys    Lysine
L    Leu    Leucine
M    Met    Methionine
N    Asn    Asparagine
P    Pro    Proline
Q    Gln    Glutamine
R    Arg    Arginine
S    Ser    Serine
T    Thr    Threonine
V    Val    Valine
W    Trp    Tryptophan
Y    Tyr    Tyrosine
B    Asx    Asp or Asn, not distinguished
Z    Glx    Glu or Gln, not distinguished
X    X      Undetermined or atypical amino acid
XX   Two adjacent amino acids, with no punctuation between,
     indicates that they are connected, as determined
     experimentally.
()   Encloses a region, the composition but not the complete
     sequence of which has been determined experimentally, or
     encloses a single residue that has been tentatively
     identified.
=    Indicates ")(", the juxtaposition of two regions of
     indeterminate sequence, while preserving proper spacing
     between amino acids.
/    Indicates that the adjacent amino acids are from different
     peptides, not necessarily connected. When the amino end of
     a protein has not been determined, "/" precedes the first
     residue. When the carboxyl end has not been determined, "/"
     follows the last residue. When ")/", "/(", or ")/(" are
     needed, only "/" is used.
.    Outside of parentheses, indicates the ends of sequence
     fragments. The relative order of these fragments was not
     determined experimentally but is clear from homology or
     other indirect evidence.
.    Within parentheses, indicates that the amino acid to the
     left has been placed with at least 90% confidence by
     homology with known sequences.
,    Indicates that the amino acid to its left could not be
     positioned with confidence by homology.
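Software that needs only the residues must discard this punctuation. The short Python sketch below shows one way to do so and to report undetermined termini; the function name summarize_sequence and the returned keys are hypothetical, and the removal of embedded whitespace is an assumption about how sequence lines are laid out.

```python
import re

# Punctuation symbols listed above, plus whitespace (assumed layout only).
PUNCTUATION = re.compile(r"[()=/.,\s]")

def summarize_sequence(seq):
    """Return the bare residues plus flags for undetermined termini."""
    seq = seq.strip()
    return {
        "residues": PUNCTUATION.sub("", seq),
        # A leading "/" means the amino end has not been determined.
        "amino_end_undetermined": seq.startswith("/"),
        # A trailing "/" means the carboxyl end has not been determined.
        "carboxyl_end_undetermined": seq.endswith("/"),
    }
```

Note that this discards the experimental-confidence information carried by the punctuation, so it is suitable for composition or length calculations rather than for redisplaying an entry.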
>..;code                    HEADER
title - trivial name        TITLE
N;Alternate names:          TEXT
N;Contains:
C;Species:                  Species block
A;Variety:
A;Note:
C;Date:
C;Accession:
R;                          Reference block (may repeat)
citation
A;Authors:
A;Title:
A;Description:
A;Reference number:
A;Contents:
A;Note:
A;Accession:                Accession block (may repeat)
A;Status:
A;Molecule type:
A;Residues:
A;Cross-references:
A;Experimental source:
A;Genetics:
A;Note:
C;Comment:                  Comments (may repeat)
C;Genetics:                 Genetics block (may repeat)
A;Gene:
A;Map position:
A;Genome:
A;Gene origin:
A;Genetic code:
A;Start codon:
A;Introns:
A;Other products:
A;Note:
C;Complex:                  Complex
C;Function:                 Function block
A;Description:
A;Pathway:
A;Note:
C;Superfamily:
C;Keywords:
F;                          Features block (may repeat)
The following descriptors denote single residue sites. These sites may be represented using the full feature residue specification conventions; however, the specification should be interpreted to specify a collection of single sites.
Active site:
Binding site:
Cleavage site:
Inhibitory site:
Modified site:
The following descriptors denote residues connected by covalent bonds. The pairs of residues linked by hyphens should be interpreted as being connected by bonds. Single residues may be specified if the bond is between an amino acid in the sequence and another protein chain or molecule. The Cross-links: descriptor specifically denotes bonds linking the sequence shown to an adjacent protein chain.
Disulfide bonds:
Cross-links:
The following descriptors denote sequence regions. Pairs of residues linked by hyphens denote a region extending from the first to the second position inclusive. The Domain: descriptor indicates distinct functional regions that may be of separate evolutionary origin. The Duplication: descriptor indicates regions that have evolved by sequence duplication. The Peptide: and Protein: descriptors indicate regions corresponding to the mature protein or peptides derived from the sequence shown. The Region: descriptor is used to indicate all other regions.
Domain:
Peptide:
Protein:
Region:
These abbreviations conform to those suggested by the IUPAC-IUB Commission on Biochemical Nomenclature, Nucl. Acids Res. 13, 3021-3031, 1985.
Symbol    Meaning
------    -------
A         Adenine
C         Cytosine
G         Guanine
T         Thymine
U         Uracil
R         puRine (A or G)
Y         pYrimidine (C or T/U)
M         A or C
W         A or T/U
S         C or G
K         G or T/U
D         A, G, or T/U
H         A, C, or T/U
V         A, C, or G
B         C, G, or T/U
N         A, C, G, or T/U
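To show how these symbols apply to a recognition site written in ambiguous nucleotide code, the Python sketch below expands each symbol into a character class and searches a sequence with the resulting pattern. It is an illustrative example only; the names AMBIGUITY and site_to_regex are hypothetical, and the site is assumed to have had its cut-site apostrophe removed already.

```python
import re

# Each symbol from the table above mapped to the bases it may stand for.
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CTU",
    "M": "AC", "W": "ATU", "S": "CG", "K": "GTU",
    "D": "AGTU", "H": "ACTU", "V": "ACG", "B": "CGTU",
    "N": "ACGTU",
}

def site_to_regex(site):
    """Compile an ambiguous-code site (no apostrophe) into a regular expression."""
    return re.compile("".join("[" + AMBIGUITY[s] + "]" for s in site.upper()))

# Usage: locate the AvaI recognition site CYCGRG in a short sequence.
match = site_to_regex("CYCGRG").search("AACTCGAGTT")
```

A symbol outside the table raises KeyError, which surfaces malformed site specifications instead of silently matching nothing.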