Indexing and configuring FASTA databases


The FASTA specification defines a sequence file simply as a header line that begins with '>' followed by lines containing the sequence. The header line can take an almost unlimited variety of formats, several of which EMBOSS can process. EMBOSS attempts to determine the accession number and/or ID for each sequence. For indexing purposes there is no semantic difference between an accession number and an ID. In the real world, accession numbers are immutable, i.e. they do not change with subsequent releases of the database, whereas IDs may change. In any case, IDs and accession numbers are unique, and that is all that matters for database indexing in EMBOSS.
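Since indexing relies on identifiers being unique, it can be worth checking a file before running the indexer. The following is a minimal sketch, not part of EMBOSS: it assumes the ID is the first whitespace-delimited token after '>', and the function name duplicate_ids and the sample records are made up for illustration.

```python
from collections import Counter
import io


def duplicate_ids(handle):
    """Return the sorted IDs that occur more than once in a FASTA stream.

    Assumption: the ID is the first whitespace-delimited token after '>'.
    """
    counts = Counter()
    for line in handle:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                counts[tokens[0]] += 1
    return sorted(name for name, seen in counts.items() if seen > 1)


# Made-up three-record file in which "seq1" appears twice.
example = io.StringIO(">seq1 first\nACGT\n>seq2 second\nGGCC\n>seq1 again\nTTAA\n")
print(duplicate_ids(example))
```

For a real file, pass an open file handle instead of the in-memory example.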

The program used to process FASTA format databases is dbifasta. It can recognise the following header line formats, specified on the command line:

simple
>id ...

idacc
>id accno ...

gcgid
>db:id ...

gcgidacc
>db:id acc ...

dbid
>db id ...

ncbi
>...[|accno]|id ...
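To make the header layouts concrete, here is a rough Python sketch that pulls the database name, ID and accession number out of a header line for each named format. It is a literal reading of the patterns listed above, not a copy of dbifasta's own parsing rules; the function name parse_header, the regular expressions and the sample headers are all assumptions made for illustration.

```python
import re

# One regular expression per header layout. These follow the patterns as
# written in the list above; dbifasta itself may be stricter or more lenient.
PATTERNS = {
    "simple": re.compile(r"^>(?P<id>\S+)"),
    "idacc": re.compile(r"^>(?P<id>\S+)\s+(?P<acc>\S+)"),
    "gcgid": re.compile(r"^>(?P<db>[^:\s]+):(?P<id>\S+)"),
    "gcgidacc": re.compile(r"^>(?P<db>[^:\s]+):(?P<id>\S+)\s+(?P<acc>\S+)"),
    "dbid": re.compile(r"^>(?P<db>\S+)\s+(?P<id>\S+)"),
}


def parse_header(line, idformat):
    """Return a dict with 'db', 'id' and 'acc', or None if the line does not fit."""
    if idformat == "ncbi":
        # Assumed reading of ">...[|accno]|id": the ID is the last '|' field of
        # the first token, and the accession number is the field before it when
        # there are at least three fields.
        if not line.startswith(">"):
            return None
        tokens = line[1:].split()
        fields = tokens[0].split("|") if tokens else []
        if len(fields) < 2 or not fields[-1]:
            return None
        acc = fields[-2] if len(fields) >= 3 and fields[-2] else None
        return {"db": None, "id": fields[-1], "acc": acc}
    match = PATTERNS[idformat].match(line)
    if match is None:
        return None
    found = match.groupdict()
    return {"db": found.get("db"), "id": found["id"], "acc": found.get("acc")}


# Made-up headers, one per layout family.
print(parse_header(">MYSEQ_1 AC000001 example protein", "idacc"))
print(parse_header(">mydb:MYSEQ_1 AC000001 example protein", "gcgidacc"))
print(parse_header(">blah|AC000001|MYSEQ_1 example protein", "ncbi"))
```

A header that returns None under the chosen format is a hint that indexing with that format is likely to fail for that entry.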

Other header formats will not be recognised by dbifasta and will cause indexing and/or database lookup to fail. If you have a different header format that dbifasta cannot yet handle you have two options:

1. (The preferred option) Get a C programmer to modify the source code for dbifasta and recompile. If you are community-spirited, you will also contribute these changes to the main EMBOSS source tree (email emboss-dev@emboss.open-bio.org for more information on contributing changes to the EMBOSS source code, and/or read the EMBOSS developers' documentation).
2. (The quick hack) Write a custom script (using e.g. BioPerl, http://www.bioperl.org) to access your database and use
```
method: external
```
to configure it. This is less desirable as you may be limited in the access modes you can use.

To index a FASTA format database, run dbifasta.

% dbifasta
Index a fasta database
    simple : >ID
     idacc : >ID ACC
     gcgid : >db:ID
  gcgidacc : >db:ID ACC
      ncbi : >blah|...[|ACC]|ID
ID line format [idacc]: 
Database name: mydb
Database directory [.]: 
Wildcard database filename [*.dat]: mydb.fasta
Release number [0.0]: 
Index date [00/00/00]:

dbifasta will chug along for a little while and then produce the index files. You can use the same indexdir options as for dbiflat, dbigcg and dbiblast to place the indices in a different directory.

Place the following entry in your .embossrc:

DB mydb [
        type: P
        method: emblcd
        format: fasta
        dir: $emboss_db_dir/mydb
        file: mydb.fasta
        comment: "My database"
]

format: should be dbid or ncbi if the database was indexed with that ID line format, and fasta otherwise (that is, for every format except dbid and ncbi). The same file: and include: tags can be used as for the other database indexing programs.
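The choice of format: value can be sketched as a small helper that writes out a database entry. This is an illustration only: it assumes the rule is "dbid and ncbi keep their own name, everything else uses fasta", it uses only the attributes shown in the example entry, and the function name embossrc_entry and its parameters are hypothetical.

```python
def embossrc_entry(name, idformat, directory, filename, comment, seqtype="P"):
    """Build a DB entry for .embossrc as a string.

    Assumption: format is 'dbid' or 'ncbi' when the index was built with that
    ID line format, and 'fasta' for every other ID line format.
    """
    fmt = idformat if idformat in ("dbid", "ncbi") else "fasta"
    lines = [
        f"DB {name} [",
        f"        type: {seqtype}",
        "        method: emblcd",
        f"        format: {fmt}",
        f"        dir: {directory}",
        f"        file: {filename}",
        f'        comment: "{comment}"',
        "]",
    ]
    return "\n".join(lines)


# Reproduces the layout of the example entry for an index built with idacc.
print(embossrc_entry("mydb", "idacc", "$emboss_db_dir/mydb", "mydb.fasta", "My database"))
```

Passing "ncbi" or "dbid" as the second argument changes only the format: line.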


Peter Rice 2007-04-26
