backtranambig

 

Wiki

The master copies of EMBOSS documentation are available at http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.

Please help by correcting and extending the Wiki pages.

Function

Back-translate a protein sequence to ambiguous nucleotide sequence

Description

backtranambig reads a protein sequence and writes the nucleic acid sequence it could have come from. It does this by using nucleotide ambiguity codes that represent all possible codons for each amino acid.

Algorithm

backtranambig needs a genetic code to generate an ambiguous codon for each amino acid. The default genetic code is the standard ("Universal") code, although many others are available via the '-table' qualifier. The codon usage tables correspdonding to these codes must exist in the EMBOSS data directory. See the section on "Data Files" below for more information.

Usage

Here is a sample session with backtranambig


% backtranambig 
Back-translate a protein sequence to ambiguous nucleotide sequence
Input (gapped) protein sequence(s): tsw:opsd_human
(gapped) nucleotide output sequence(s) [opsd_human.fasta]: 

Go to the input files for this example
Go to the output files for this example

Command line arguments

   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     (Gapped) protein sequence(s) filename and
                                  optional format, or reference (input USA)
  [-outfile]           seqoutall  [.] (Aligned) nucleotide
                                  sequence set(s) filename and optional format
                                  (output USA)

   Additional (Optional) qualifiers:
   -table              menu       [0] Genetic code to use (Values: 0
                                  (Standard); 1 (Standard (with alternative
                                  initiation codons)); 2 (Vertebrate
                                  Mitochondrial); 3 (Yeast Mitochondrial); 4
                                  (Mold, Protozoan, Coelenterate Mitochondrial
                                  and Mycoplasma/Spiroplasma); 5
                                  (Invertebrate Mitochondrial); 6 (Ciliate
                                  Macronuclear and Dasycladacean); 9
                                  (Echinoderm Mitochondrial); 10 (Euplotid
                                  Nuclear); 11 (Bacterial); 12 (Alternative
                                  Yeast Nuclear); 13 (Ascidian Mitochondrial);
                                  14 (Flatworm Mitochondrial); 15
                                  (Blepharisma Macronuclear); 16
                                  (Chlorophycean Mitochondrial); 21 (Trematode
                                  Mitochondrial); 22 (Scenedesmus obliquus);
                                  23 (Thraustochytrium Mitochondrial))

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outfile" associated qualifiers
   -osformat2          string     Output seq format
   -osextension2       string     File name extension
   -osname2            string     Base file name
   -osdirectory2       string     Output directory
   -osdbname2          string     Database name to add
   -ossingle2          boolean    Separate file for each entry
   -oufo2              string     UFO features
   -offormat2          string     Features format
   -ofname2            string     Features file name
   -ofdirectory2       string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Standard (Mandatory) qualifiers Allowed values Default
[-sequence]
(Parameter 1)
(Gapped) protein sequence(s) filename and optional format, or reference (input USA) Readable sequence(s) Required
[-outfile]
(Parameter 2)
(Aligned) nucleotide sequence set(s) filename and optional format (output USA) Writeable sequence(s) <*>.format
Additional (Optional) qualifiers Allowed values Default
-table Genetic code to use
0 (Standard)
1 (Standard (with alternative initiation codons))
2 (Vertebrate Mitochondrial)
3 (Yeast Mitochondrial)
4 (Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma)
5 (Invertebrate Mitochondrial)
6 (Ciliate Macronuclear and Dasycladacean)
9 (Echinoderm Mitochondrial)
10 (Euplotid Nuclear)
11 (Bacterial)
12 (Alternative Yeast Nuclear)
13 (Ascidian Mitochondrial)
14 (Flatworm Mitochondrial)
15 (Blepharisma Macronuclear)
16 (Chlorophycean Mitochondrial)
21 (Trematode Mitochondrial)
22 (Scenedesmus obliquus)
23 (Thraustochytrium Mitochondrial)
0
Advanced (Unprompted) qualifiers Allowed values Default
(none)

Input file format

Any DNA sequence USA.

Input files for usage example

'tsw:opsd_human' is a sequence entry in the example protein database 'tsw'

Database entry: tsw:opsd_human

ID   OPSD_HUMAN              Reviewed;         348 AA.
AC   P08100; Q16414; Q2M249;
DT   01-AUG-1988, integrated into UniProtKB/Swiss-Prot.
DT   01-AUG-1988, sequence version 1.
DT   16-JUN-2009, entry version 116.
DE   RecName: Full=Rhodopsin;
DE   AltName: Full=Opsin-2;
GN   Name=RHO; Synonyms=OPN2;
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
OC   Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA].
RX   MEDLINE=84272729; PubMed=6589631; DOI=10.1073/pnas.81.15.4851;
RA   Nathans J., Hogness D.S.;
RT   "Isolation and nucleotide sequence of the gene encoding human
RT   rhodopsin.";
RL   Proc. Natl. Acad. Sci. U.S.A. 81:4851-4855(1984).
RN   [2]
RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA].
RA   Suwa M., Sato T., Okouchi I., Arita M., Futami K., Matsumoto S.,
RA   Tsutsumi S., Aburatani H., Asai K., Akiyama Y.;
RT   "Genome-wide discovery and analysis of human seven transmembrane helix
RT   receptor genes.";
RL   Submitted (JUL-2001) to the EMBL/GenBank/DDBJ databases.
RN   [3]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA].
RC   TISSUE=Retina;
RX   PubMed=17974005; DOI=10.1186/1471-2164-8-399;
RA   Bechtel S., Rosenfelder H., Duda A., Schmidt C.P., Ernst U.,
RA   Wellenreuther R., Mehrle A., Schuster C., Bahr A., Blocker H.,
RA   Heubner D., Hoerlein A., Michel G., Wedler H., Kohrer K.,
RA   Ottenwalder B., Poustka A., Wiemann S., Schupp I.;
RT   "The full-ORF clone resource of the German cDNA consortium.";
RL   BMC Genomics 8:399-399(2007).
RN   [4]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA].
RX   PubMed=15489334; DOI=10.1101/gr.2596504;
RG   The MGC Project Team;
RT   "The status, quality, and expansion of the NIH full-length cDNA
RT   project: the Mammalian Gene Collection (MGC).";
RL   Genome Res. 14:2121-2127(2004).
RN   [5]
RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA] OF 1-120.
RX   PubMed=8566799; DOI=10.1016/0378-1119(95)00688-5;
RA   Bennett J., Beller B., Sun D., Kariko K.;
RT   "Sequence analysis of the 5.34-kb 5' flanking region of the human
RT   rhodopsin-encoding gene.";


  [Part of this file has been deleted for brevity]

FT                                /FTId=VAR_004816.
FT   VARIANT     209    209       V -> M (effect not known).
FT                                /FTId=VAR_004817.
FT   VARIANT     211    211       H -> P (in RP4).
FT                                /FTId=VAR_004818.
FT   VARIANT     211    211       H -> R (in RP4).
FT                                /FTId=VAR_004819.
FT   VARIANT     216    216       M -> K (in RP4).
FT                                /FTId=VAR_004820.
FT   VARIANT     220    220       F -> C (in RP4).
FT                                /FTId=VAR_004821.
FT   VARIANT     222    222       C -> R (in RP4).
FT                                /FTId=VAR_004822.
FT   VARIANT     255    255       Missing (in RP4).
FT                                /FTId=VAR_004823.
FT   VARIANT     264    264       Missing (in RP4).
FT                                /FTId=VAR_004824.
FT   VARIANT     267    267       P -> L (in RP4).
FT                                /FTId=VAR_004825.
FT   VARIANT     267    267       P -> R (in RP4).
FT                                /FTId=VAR_004826.
FT   VARIANT     292    292       A -> E (in CSNBAD1).
FT                                /FTId=VAR_004827.
FT   VARIANT     296    296       K -> E (in RP4).
FT                                /FTId=VAR_004828.
FT   VARIANT     297    297       S -> R (in RP4).
FT                                /FTId=VAR_004829.
FT   VARIANT     342    342       T -> M (in RP4).
FT                                /FTId=VAR_004830.
FT   VARIANT     345    345       V -> L (in RP4).
FT                                /FTId=VAR_004831.
FT   VARIANT     345    345       V -> M (in RP4).
FT                                /FTId=VAR_004832.
FT   VARIANT     347    347       P -> A (in RP4).
FT                                /FTId=VAR_004833.
FT   VARIANT     347    347       P -> L (in RP4; common variant).
FT                                /FTId=VAR_004834.
FT   VARIANT     347    347       P -> Q (in RP4).
FT                                /FTId=VAR_004835.
FT   VARIANT     347    347       P -> R (in RP4).
FT                                /FTId=VAR_004836.
FT   VARIANT     347    347       P -> S (in RP4).
FT                                /FTId=VAR_004837.
SQ   SEQUENCE   348 AA;  38893 MW;  6F4F6FCBA34265B2 CRC64;
     MNGTEGPNFY VPFSNATGVV RSPFEYPQYY LAEPWQFSML AAYMFLLIVL GFPINFLTLY
     VTVQHKKLRT PLNYILLNLA VADLFMVLGG FTSTLYTSLH GYFVFGPTGC NLEGFFATLG
     GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLAGWSRYIP
     EGLQCSCGID YYTLKPEVNN ESFVIYMFVV HFTIPMIIIF FCYGQLVFTV KEAAAQQQES
     ATTQKAEKEV TRMVIIMVIA FLICWVPYAS VAFYIFTHQG SNFGPIFMTI PAFFAKSAAI
     YNPVIYIMMN KQFRNCMLTT ICCGKNPLGD DEASATVSKT ETSQVAPA
//

Output file format

The output is a nucleotide sequence containing the most favoured back translation of the specified protein, and using the specified translation table (which defaults to human).

Output files for usage example

File: opsd_human.fasta

>OPSD_HUMAN P08100 Rhodopsin (Opsin-2)
ATGAAYGGNACNGARGGNCCNAAYTTYTAYGTNCCNTTYWSNAAYGCNACNGGNGTNGTN
MGNWSNCCNTTYGARTAYCCNCARTAYTAYYTNGCNGARCCNTGGCARTTYWSNATGYTN
GCNGCNTAYATGTTYYTNYTNATHGTNYTNGGNTTYCCNATHAAYTTYYTNACNYTNTAY
GTNACNGTNCARCAYAARAARYTNMGNACNCCNYTNAAYTAYATHYTNYTNAAYYTNGCN
GTNGCNGAYYTNTTYATGGTNYTNGGNGGNTTYACNWSNACNYTNTAYACNWSNYTNCAY
GGNTAYTTYGTNTTYGGNCCNACNGGNTGYAAYYTNGARGGNTTYTTYGCNACNYTNGGN
GGNGARATHGCNYTNTGGWSNYTNGTNGTNYTNGCNATHGARMGNTAYGTNGTNGTNTGY
AARCCNATGWSNAAYTTYMGNTTYGGNGARAAYCAYGCNATHATGGGNGTNGCNTTYACN
TGGGTNATGGCNYTNGCNTGYGCNGCNCCNCCNYTNGCNGGNTGGWSNMGNTAYATHCCN
GARGGNYTNCARTGYWSNTGYGGNATHGAYTAYTAYACNYTNAARCCNGARGTNAAYAAY
GARWSNTTYGTNATHTAYATGTTYGTNGTNCAYTTYACNATHCCNATGATHATHATHTTY
TTYTGYTAYGGNCARYTNGTNTTYACNGTNAARGARGCNGCNGCNCARCARCARGARWSN
GCNACNACNCARAARGCNGARAARGARGTNACNMGNATGGTNATHATHATGGTNATHGCN
TTYYTNATHTGYTGGGTNCCNTAYGCNWSNGTNGCNTTYTAYATHTTYACNCAYCARGGN
WSNAAYTTYGGNCCNATHTTYATGACNATHCCNGCNTTYTTYGCNAARWSNGCNGCNATH
TAYAAYCCNGTNATHTAYATHATGATGAAYAARCARTTYMGNAAYTGYATGYTNACNACN
ATHTGYTGYGGNAARAAYCCNYTNGGNGAYGAYGARGCNWSNGCNACNGTNWSNAARACN
GARACNWSNCARGTNGCNCCNGCN

Data files

The codon usage table is read by default from "Ehum.cut" in the 'data/CODONS' directory of the EMBOSS distribution. If the name of a codon usage file is specified on the command line, then this file will first be searched for in the current directory and then in the 'data/CODONS' directory of the EMBOSS distribution.

EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA.

To see the available EMBOSS data files, run:

% embossdata -showall

To fetch one of the data files (for example 'Exxx.dat') into your current directory for you to inspect or modify, run:


% embossdata -fetch -file Exxx.dat

Users can provide their own data files in their own directories. Project specific files can be put in the current directory, or for tidier directory listings in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory, or again in a subdirectory called ".embossdata".

The directories are searched in the following order:

Notes

The ambiguous nucleotide sequence generated by backtranambig can be translated to the original protein using transeq, which will recognise highly redundant codons (for example "WSN" for serine) as being produced by a program such as backtranambig.

References

None.

Warnings

None.

Diagnostic Error Messages

"Corrupt codon index file" - the codon usage file is incomplete or empty.

"The file 'drosoph.cut' does not exist" - the codon usage file cannot be opened.

Exit status

This program always exits with a status of 0, unless the codon usage table cannot be opened.

Known bugs

None.

See also

Program name Description
backtranseq Back-translate a protein sequence to a nucleotide sequence
charge Draw a protein charge plot
checktrans Reports STOP codons and ORF statistics of a protein
coderet Extract CDS, mRNA and translations from feature tables
compseq Calculate the composition of unique words in sequences
emowse Search protein sequences by digest fragment molecular weight
freak Generate residue/base frequency table or plot
iep Calculate the isoelectric point of proteins
mwcontam Find weights common to multiple molecular weights files
mwfilter Filter noisy data from molecular weights file
octanol Draw a White-Wimley protein hydropathy plot
pepinfo Plot amino acid properties of a protein sequence in parallel
pepstats Calculates statistics of protein properties
pepwindow Draw a hydropathy plot for a protein sequence
pepwindowall Draw Kyte-Doolittle hydropathy plot for a protein alignment
plotorf Plot potential open reading frames in a nucleotide sequence
prettyseq Write a nucleotide sequence and its translation to file
remap Display restriction enzyme binding sites in a nucleotide sequence
showorf Display a nucleotide sequence and translation in pretty format
showseq Displays sequences with features in pretty format
sixpack Display a DNA sequence with 6-frame translation and ORFs
transeq Translate nucleic acid sequences
wordcount Count and extract unique words in DNA sequence(s)

Author(s)

Alan Bleasby (ajb © ebi.ac.uk)
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

History

None

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments

None