We will begin with two EMBL sequences whose identifiers are XL23808 and XLRHODOP; these sequences are the genomic and the corresponding cDNA sequence for Xenopus laevis rhodopsin.
You need to tell EMBOSS where to read the sequence(s) you want to analyse. EMBOSS can read sequences either from text files or directly from a sequence database. The easiest way to see this is with examples.
As an example, here are the first few databases available using EMBOSS at the HGMP. Your local site will probably have a different selection of databases depending on what the local EMBOSS maintainer has set up.
unix % showdb
Displays information on the currently available databases
#Name | Type | ID | Qry | All | Comment |
#==== | ==== | == | === | === | ======= |
nbrf | P | OK | OK | OK | PIR/NBRF |
pir | P | OK | OK | OK | PIR/NBRF |
remtrembl | P | OK | OK | OK | REMTREMBL sequences |
sptrembl | P | OK | OK | OK | SPTREMBL sequences |
sw | P | OK | OK | OK | SWISSPROT sequences |
swissprot | P | OK | OK | OK | SWISSPROT sequences |
trarc | P | OK | OK | OK | TREMBL ARC sequences |
trembl | P | OK | OK | OK | TREMBL sequences |
tremblnew | P | OK | OK | OK | New TREMBL sequences |
You can access EMBL by either identifier eg xlrhodop, or accession number eg L07770. Let's try these now.
unix % seqret
Reads and writes (returns) a sequence
Input sequence: embl:xlrhodop
Output sequence [xlrhodop.fasta]:
unix % more xlrhodop.fasta
>XLRHODOP L07770 Xenopus laevis rhodopsin mRNA, complete cds. ggtagaacagcttcagttgggatcacaggcttctagggatcctttgggcaaaaaagaaac acagaaggcattctttctatacaagaaaggactttatagagctgctaccatgaacggaac
unix % seqret
Reads and writes (returns) a sequence
Input sequence: embl:L07770
Output sequence [xlrhodop.fasta]: xlrhodop2.fasta
unix % more xlrhodop2.fasta
>XLRHODOP L07770 Xenopus laevis rhodopsin mRNA, complete cds. ggtagaacagcttcagttgggatcacaggcttctagggatcctttgggcaaaaaagaaac acagaaggcattctttctatacaagaaaggactttatagagctgctaccatgaacggaac
unix % seqret embl:xlrhodop -outseq xlrhodop.fasta
By default, seqret writes the sequence in fasta
format. You can also tell it to use a different output format:
unix % seqret embl:L07770 -outseq gcg::xlrhodop.gcg
As an alternative to specifying the format in the output sequence USA you
can use the -osformat qualifier. This command is identical in action to the previous one:
unix % seqret embl:L07770 -outseq xlrhodop.gcg -osformat gcg
unix % more xlrhodop.gcg
!!NA_SEQUENCE 1.0 Xenopus laevis rhodopsin mRNA, complete cds. XLRHODOP Length: 1684 Type: N Check: 9453 .. 1 ggtagaacag cttcagttgg gatcacaggc ttctagggat cctttgggca 51 aaaaagaaac acagaaggca ttctttctat acaagaaagg actttataga
unix % seqret xlrhodop.fasta -outseq gcg::myseq.gcg
or
unix % seqret xlrhodop.fasta -outseq myseq.gcg -osformat gcg
unix % infoseq embl:xlrhodop
Displays some simple information about sequences
# USA | Name | Accession | Type | Length | GC | Description |
embl-id:XLRHODOP | XLRHODOP | L07770 | N | 1684 | 45.72 | X.laevis rhodopsin |
Sequence databases do not just contain sequences, they also contain a great deal of associated information (annotation) about the sequence entries. By default EMBOSS does not return all this information when you run seqret.
To retrieve the full entry for a sequence in it's original database form you can use the utility entret.
unix % entret embl:xl23808
Reads and writes (returns) flatfile entries
Output file [xl23808.entret]:
unix % more xl23808.entret
ID XL23808 standard; DNA; VRT; 4734 BP.
XX
AC U23808;
XX
SV U23808.1
XX
DT 23-APR-1995 (Rel. 43, Created)
DT 04-MAR-2000 (Rel. 63, Last updated, Version 7)
XX
DE Xenopus laevis rhodopsin gene, complete cds.
XX
KW .
XX
OS Xenopus laevis (African clawed frog)
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Amphibia;
OC Batrachia; Anura; Mesobatrachia; Pipoidea; Pipidae; Xenopodinae; Xenopus.
XX
There is a lot of information here. Near the bottom, just above
the sequence itself is a list of features associated with the
sequence. A feature is any defined region of the sequence that
has a particular description associated with it. We can view a
simple graphical overview of the sequence features using the
utility showfeat:
unix % showfeat embl:xl23808
Show features of a sequence.
Output file [xl23808.showfeat]:
unix % more xl23808.showfeat
XL23808 Xenopus laevis rhodopsin gene, complete cds. |==========================================================| 4734 |----------------------------------------------------------> source |-----> mRNA |---> CDS |-> CDS |-> mRNA |-> CDS |-> mRNA |--> CDS |--> mRNA |> CDS |-------> mRNATo retrieve a sequence with all its features we can use the program seqret with the option seqret -feature. This has some properties of which to be aware.
unix % seqret -feature embl:xl23808
Reads and writes (returns) one or more sequences
Output sequence [xl23808.fasta]:
This looks pretty much like running seqret except that a second file has been created, unknown.gff. Take a look at this file:
unix % more unknown.gff
##gff-version 2.0 ##date 2003-02-21 ##Type DNA XL23808 XL23808 EMBL source 1 4734 0.000 + . Sequence "XL23808.1" ; db_xref \ "taxon:8355" ; organism "Xenopus laevis" XL23808 EMBL mRNA 1181 1650 0.000 + . Sequence "XL23808.2" ; FeatFlags \ "0x100" ; product "rhodopsin" XL23808 EMBL mRNA 1899 2067 0.000 + . Sequence "XL23808.2" ; FeatFlags \ "0x104" XL23808 EMBL mRNA 2669 2834 0.000 + . Sequence "XL23808.2" ; FeatFlags \ "0x104"
This is a list of the features in the database entry in GFF (General Feature Format).
You can find out more about feature formats on the EMBOSS web site.
In order to change the feature formats and filename we need to use
associated qualifiers when running seqret -feature. Lets save the
features in EMBL format in the file rhodop.features:
unix % seqret -feature embl:xl23808 -offormat embl -ofname rhodop.features
Reads and writes (returns) one or more sequences
Output sequence [xl23808.fasta]:
And the file appears as expected
.
We could use a Uniform Feature Object (UFO) instead of
the -offormat and -ofname qualifiers.
unix % seqret -feature embl:xl23808 -oufo embl::rhodop.features
EMBOSS programs can also deal with multiple sequences. A quick search using SRS will tell you that the SwissProt sequence corresponding to the EMBL sequence we've been looking at has the identifier OPSD_XENLA. To retrieve the information about all the other OPSD sequences in SwissProt we can use the wild card character:
unix % infoseq
Displays some simple information about sequences
Input sequence(s): sw:opsd_*
# USA | Name | Accession | Type | Length | Description |
sw-id:OPSD_ABYKO | OPSD_ABYKO | O42294 | P | 289 | RHODOPSIN (FRAGMENT). |
sw-id:OPSD_ALLMI | OPSD_ALLMI | P52202 | P | 352 | RHODOPSIN. |
sw-id:OPSD_AMBTI | OPSD_AMBTI | Q90245 | P | 354 | RHODOPSIN. |
sw-id:OPSD_ANGAN | OPSD_ANGAN | Q90214 | P | 352 | RHODOPSIN, DEEP-SEA |
sw-id:OPSD_ANOCA | OPSD_ANOCA | P41591 | P | 352 | RHODOPSIN. |
sw-id:OPSD_APIME | OPSD_APIME | Q17053 | P | 377 | RHODOPSIN. |
sw-id:OPSD_ASTFA | OPSD_ASTFA | P41590 | P | 352 | RHODOPSIN. |
sw-id:OPSD_BATMU | OPSD_BATMU | O42300 | P | 289 | RHODOPSIN (FRAGMENT). |
sw-id:OPSD_BATNI | OPSD_BATNI | O42301 | P | 289 | RHODOPSIN (FRAGMENT). |
sw-id:OPSD_BOVIN | OPSD_BOVIN | P02699 | P | 348 | RHODOPSIN. |
We can also use the wild card character on the command line, but here we must enclose the specification in quotation marks:
unix % infoseq ``sw:opsd_*''
You can use seqret to retrieve multiple sequences into a file; for exmaple:
unix % seqret ``sw:opsd_a*'' -outseq opsd_a.seqs
retrieves all the sequences whose identifiers start ``opsd_a'' into a file called opsd_a.seqs. If we wanted to have each sequence in a separate file, we could type:
unix % seqret ``sw:opsd_a*'' -ossingle
Filenames are generated based on the identifiers of the sequences.
Here's an example of a valid list file, called seq.list:
opsd_abyko.fasta sw:opsd_xenla sw:opsd_c* @another_list
If you have created this file then you can read it using:
unix % more seq.list
This may look a bit odd, but it's really very straightforward; the file contains:
Notice the @ in front of the last entry. This is the way you tell EMBOSS that this file is a list file, not a regular sequence file. As an alternative to using the @ you can write list:filename instead. Let's demonstrate this by using this file as the input to seqret and get the sequences into a new file, perhaps for use in a multiple sequence alignment (see Section 5.3).
First of all, we'll make the file opsd_abyko.fasta using seqret:
unix % seqret sw:opsd_abyko -outseq opsd_abyko.fasta
Now let's look at another_list. Note that its structure is very similar to that of seq.list but this time only contains database references:
sw:opsd_anoca sw:opsd_apime sw:opsd_astfa
After you have created this file you will be able to view it using
unix % more another_list
Finally, let's run seqret with seq.list (not forgetting the @ sign) and look at the results:
unix % seqret @seq.list -outseq outfile
unix % more outfile
>OPSD_ABYKO O42294 RHODOPSIN (FRAGMENT). YLVNPAAYAALGAYMFLLILIGFPINFLTLYVTLEHKKLRTPLNYILLNLAVANLFMVLG GFTTTMYTSMHGYFVLGRLGCNLEAFFATLGGEIALWSLVVLAIERWIVVCKPISNFRFT EDHAIMGLAFTWVMALACAVPPLVGWSRYIPEGMQCSCGVDYYTRAEGFNNESFVIYMFI VHFLIPLSVIFFCYGRLLCAVKEAPAAQQESETTQRAEKEVSRMVVIMVIGFLVCWLPYA SVAWWIFCNQGSDFGPIFMTLPSFFAKSAAIYNPMIYICMNKQFRHCMI >OPSD_XENLA P29403 RHODOPSIN. MNGTEGPNFYVPMSNKTGVVRSPFDYPQYYLAEPWQYSALAAYMFLLILLGLPINFMTLF VTIQHKKLRTPLNYILLNLVFANHFMVLCGFTVTMYTSMHGYFIFGPTGCYIEGFFATLG GEVALWSLVVLAVERYIVVCKPMANFRFGENHAIMGVAFTWIMALSCAAPPLFGWSRYIP EGMQCSCGVDYYTLKPEVNNESFVIYMFIVHFTIPLIVIFFCYGRLLCTVKEAAAQQQES LTTQKAEKEVTRMVVIMVVFFLICWVPYAYVAFYIFTHQGSNFGPVFMTVPAFFAKSSAI YNPVIYIVLNKQFRNCLITTLCCGKNPFGDEDGSSAATSKTEASSVSSSQVSPA >OPSD_CAMAB Q17292 RHODOPSIN. MMSIASGPSHAAYTWASQGGGFGNQTVVDKVPPEMLHMVDAHWYQFPPMNPLWHALLGFV IGVLGVISVIGNGMVIYIFTTTKSLRTPSNLLVVNLAISDFLMMLCMSPAMVINCYYETW VLGPLFCELYGLAGSLFGCASIWTMTMIAFDRYNVIVKGLSAKPMTINGALIRILTIWFF TLAWTIAPMFGWNRYVPEGNMTACGTDYLTKDLFSRSYILIYSIFVYFTPLFLIIYSYFF IIQAVAAHEKNMREQAKKMNVASLRSAENQSTSAECKLAKVALMTISLWFMAWTPYLVIN YSGIFETTKISPLFTIWGSLFAKANAVYNPIVYGISHPKYRAALFQKFPSLACTTEPTGA DTMSTTTTVTEGNEKPAA >OPSD_CAMHU O18312 RHODOPSIN (FRAGMENT). LHMIHLHWYQYPPMNPMMYPLLLIFMLFTGILCLAGNFVTIWVFMNTKSLRTPANLLVVN LAMSDFLMMFTMFPPMMVTCYYHTWTLGPTFCQVYAFLGNLCGCASIWTMVFITFDRYNV IVKGVAGEPLSTKKASLWILSVWVLSTAWCIAPFFGWNHYVPEGNLTGCGTDYLSEDILS RSYLYIYSTWVYFLPLAITIYCYVFIIKAVAAHEKGMRDQAKKMGIKSLRNEEAQKTSAE CRLAKNAMTTVALWFIAWTPCLLINWVGMFARSYLSPVYTIWGYVFAKANAVYNPIVYAI S