The FASTA specification defines a sequence file only as a header line that begins with > followed by lines containing the sequence. The header line can take an almost unlimited number of forms, several of which can be processed by EMBOSS. EMBOSS attempts to determine the accession number and/or ID for each sequence. For indexing purposes there is no semantic difference between an accession number and an ID. In practice, accession numbers are immutable, i.e. they do not change with subsequent releases of the database, but IDs may change. In any case, IDs and accession numbers are unique, and that is all that matters for EMBOSS database indexing.
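Because indexing relies on each ID being unique, it can be worth checking a FASTA file before indexing it. The following is a minimal Python sketch, not part of EMBOSS; it assumes the ID is the first whitespace-delimited token after the > character, and the function names and example data are hypothetical.

```python
from collections import Counter
import io


def fasta_ids(handle):
    """Yield the first whitespace-delimited token of each '>' header line."""
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            # A bare '>' has no ID; report it as an empty string.
            yield fields[0] if fields else ""


def duplicate_ids(handle):
    """Return a sorted list of IDs that occur more than once."""
    counts = Counter(fasta_ids(handle))
    return sorted(name for name, n in counts.items() if n > 1)


# Made-up example data: 'seq1' appears twice.
example = io.StringIO(
    ">seq1 A00001 first\nACGT\n>seq2 A00002\nGGCC\n>seq1 A00003\nTTAA\n"
)
print(duplicate_ids(example))
```

An empty list means every header carries a distinct ID under this assumption. Pass an open file handle in place of the StringIO object to check a real file.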
The program used to process FASTA format databases is dbifasta. It can recognise the following header line formats, specified on the command line:
>id accno ...
>db:id acc ...
>db id ...
Other header formats will not be recognised by dbifasta and will cause indexing and/or database lookup to fail. If you have a different header format that dbifasta cannot yet handle you have two options:
method: external to configure it. This is less desirable as you may be limited in the access modes you can use.
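To see which of your header lines fit a supported layout, the layouts can be expressed as patterns. The Python sketch below is illustrative only. The regular expressions are assumptions modelled on the format names shown in the dbifasta prompt (simple, idacc, gcgid, gcgidacc) and are not dbifasta's own parsing rules. The ncbi format is left out because its field layout is more variable. The function name and example headers are hypothetical.

```python
import re

# One pattern per ID line format name. These are illustrative assumptions,
# not the rules dbifasta itself applies.
PATTERNS = {
    "simple":   re.compile(r"^>(?P<id>\S+)"),
    "idacc":    re.compile(r"^>(?P<id>\S+)\s+(?P<acc>\S+)"),
    "gcgid":    re.compile(r"^>(?P<db>[^\s:]+):(?P<id>\S+)"),
    "gcgidacc": re.compile(r"^>(?P<db>[^\s:]+):(?P<id>\S+)\s+(?P<acc>\S+)"),
}


def parse_header(line, idformat="idacc"):
    """Return the named fields of a header line, or None if it does not fit."""
    match = PATTERNS[idformat].match(line)
    if match is None:
        return None
    return match.groupdict()


# Made-up header lines for demonstration.
print(parse_header(">myseq1 A00001 example protein", "idacc"))
print(parse_header(">mydb:myseq1 A00001 example protein", "gcgidacc"))
print(parse_header(">myseq1", "idacc"))  # no accession field, so no match
```

Running such a check over every header line before indexing shows up entries that would not fit the chosen format.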
To index a FASTA format database, run dbifasta.
% dbifasta
Index a fasta database
      simple : >ID
       idacc : >ID ACC
       gcgid : >db:ID
    gcgidacc : >db:ID ACC
        ncbi : >blah|...[|ACC]|ID
ID line format [idacc]:
Database name: mydb
Database directory [.]:
Wildcard database filename [*.dat]: mydb.fasta
Release number [0.0]:
Index date [00/00/00]:
dbifasta will run for a little while and produce the index files. You can use the same indexdir options as for dbiflat, dbigcg and dbiblast to place the indices in a different directory.
Place the following entry in your EMBOSS configuration file:
DB mydb [
   type: P
   method: emblcd
   format: fasta
   dir: \$emboss_db_dir/mydb
   file: mydb.fasta
   comment: "My database"
]
Use format: fasta for every ID line format except ncbi. The same include: tags can be used as for the other database indexing programs.
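If you maintain several indexed FASTA databases, the definitions can be generated rather than typed by hand. The Python sketch below only builds the text of an entry using the fields shown in the example above. The function name, its parameters and the directory path are hypothetical, and the output should be checked before it is added to your configuration.

```python
def render_db_entry(name, directory, filename, comment, seqtype="P", fmt="fasta"):
    """Build the text of a DB definition using the fields from the example entry."""
    lines = [
        f"DB {name} [",
        f"   type: {seqtype}",
        "   method: emblcd",
        f"   format: {fmt}",
        f"   dir: {directory}",
        f"   file: {filename}",
        f'   comment: "{comment}"',
        "]",
    ]
    return "\n".join(lines)


# The directory below is a made-up path; substitute your own index location.
print(render_db_entry("mydb", "/data/emboss/mydb", "mydb.fasta", "My database"))
```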