tranalign
tranalign is a re-implementation in EMBOSS of the program mrtrans by Bill Pearson. It reads a set of (unaligned) nucleotide sequences and a corresponding set of aligned protein sequences, which are their translations, and writes the coding regions to file as a nucleotide sequence alignment. The sequences must be in the same order in both input sets. Each nucleotide sequence is translated in all three forward frames using the specified genetic code, and the translations are compared to the corresponding protein sequence from the input alignment. The contiguous nucleotide sequence that codes for the protein is written to file (tranalign will not splice together different exons to produce a coding sequence).
The protein sequences will typically include gap (-) characters. These are ignored during sequence comparison but replaced by --- in the nucleotide sequence alignment output.
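The procedure described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the EMBOSS source: the names `translate` and `back_align` are hypothetical, only the standard genetic code is built in, and unknown codons are translated as `X`.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotide, frame):
    """Translate one forward frame (0, 1 or 2); unknown codons become 'X'."""
    seq = nucleotide.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(frame, len(seq) - 2, 3)
    )


def back_align(nucleotide, aligned_protein):
    """Return the coding region with '---' wherever the protein has '-'."""
    ungapped = aligned_protein.replace("-", "")  # gaps are ignored when matching
    for frame in range(3):
        hit = translate(nucleotide, frame).find(ungapped)
        if hit < 0:
            continue
        start = frame + 3 * hit
        coding = nucleotide[start:start + 3 * len(ungapped)]
        codons = (coding[i:i + 3] for i in range(0, len(coding), 3))
        return "".join(
            "---" if aa == "-" else next(codons) for aa in aligned_protein
        )
    raise ValueError("guide protein not found in nucleic sequence")


# The first residues of the example data, with one gap added for illustration.
print(back_align("ttcctctttctcgactccatc", "PLS-RLH"))
```

With the first bases of HSFAU1 and a gapped guide `PLS-RLH`, the match is found in the third forward frame and the gap becomes `---` in the nucleotide output.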
% tranalign ../data/tranalign.pep tranalign2.seq
Generate an alignment of nucleic coding regions from aligned proteins
Generate an alignment of nucleic coding regions from aligned proteins
Version: EMBOSS:6.6.0.0

Standard (Mandatory) qualifiers:
[-asequence] seqall Nucleotide sequence(s) filename and optional format, or reference (input USA)
[-bsequence] seqset (Aligned) protein sequence set filename and optional format, or reference (input USA)
[-outseq] seqoutset [
Qualifier | Type | Description | Allowed values | Default |
---|---|---|---|---|
Standard (Mandatory) qualifiers | | | | |
[-asequence] (Parameter 1) | seqall | Nucleotide sequence(s) filename and optional format, or reference (input USA) | Readable sequence(s) | Required |
[-bsequence] (Parameter 2) | seqset | (Aligned) protein sequence set filename and optional format, or reference (input USA) | Readable set of sequences | Required |
[-outseq] (Parameter 3) | seqoutset | (Aligned) nucleotide sequence set filename and optional format (output USA) | Writeable sequences | <*>.format |
Additional (Optional) qualifiers | | | | |
-table | list | Code to use | | 0 |
Advanced (Unprompted) qualifiers | | | | |
(none) | | | | |
Associated qualifiers | | | | |
"-asequence" associated seqall qualifiers | | | | |
-sbegin1 -sbegin_asequence | integer | Start of each sequence to be used | Any integer value | 0 |
-send1 -send_asequence | integer | End of each sequence to be used | Any integer value | 0 |
-sreverse1 -sreverse_asequence | boolean | Reverse (if DNA) | Boolean value Yes/No | N |
-sask1 -sask_asequence | boolean | Ask for begin/end/reverse | Boolean value Yes/No | N |
-snucleotide1 -snucleotide_asequence | boolean | Sequence is nucleotide | Boolean value Yes/No | N |
-sprotein1 -sprotein_asequence | boolean | Sequence is protein | Boolean value Yes/No | N |
-slower1 -slower_asequence | boolean | Make lower case | Boolean value Yes/No | N |
-supper1 -supper_asequence | boolean | Make upper case | Boolean value Yes/No | N |
-scircular1 -scircular_asequence | boolean | Sequence is circular | Boolean value Yes/No | N |
-squick1 -squick_asequence | boolean | Read id and sequence only | Boolean value Yes/No | N |
-sformat1 -sformat_asequence | string | Input sequence format | Any string | |
-iquery1 -iquery_asequence | string | Input query fields or ID list | Any string | |
-ioffset1 -ioffset_asequence | integer | Input start position offset | Any integer value | 0 |
-sdbname1 -sdbname_asequence | string | Database name | Any string | |
-sid1 -sid_asequence | string | Entryname | Any string | |
-ufo1 -ufo_asequence | string | UFO features | Any string | |
-fformat1 -fformat_asequence | string | Features format | Any string | |
-fopenfile1 -fopenfile_asequence | string | Features file name | Any string | |
"-bsequence" associated seqset qualifiers | | | | |
-sbegin2 -sbegin_bsequence | integer | Start of each sequence to be used | Any integer value | 0 |
-send2 -send_bsequence | integer | End of each sequence to be used | Any integer value | 0 |
-sreverse2 -sreverse_bsequence | boolean | Reverse (if DNA) | Boolean value Yes/No | N |
-sask2 -sask_bsequence | boolean | Ask for begin/end/reverse | Boolean value Yes/No | N |
-snucleotide2 -snucleotide_bsequence | boolean | Sequence is nucleotide | Boolean value Yes/No | N |
-sprotein2 -sprotein_bsequence | boolean | Sequence is protein | Boolean value Yes/No | N |
-slower2 -slower_bsequence | boolean | Make lower case | Boolean value Yes/No | N |
-supper2 -supper_bsequence | boolean | Make upper case | Boolean value Yes/No | N |
-scircular2 -scircular_bsequence | boolean | Sequence is circular | Boolean value Yes/No | N |
-squick2 -squick_bsequence | boolean | Read id and sequence only | Boolean value Yes/No | N |
-sformat2 -sformat_bsequence | string | Input sequence format | Any string | |
-iquery2 -iquery_bsequence | string | Input query fields or ID list | Any string | |
-ioffset2 -ioffset_bsequence | integer | Input start position offset | Any integer value | 0 |
-sdbname2 -sdbname_bsequence | string | Database name | Any string | |
-sid2 -sid_bsequence | string | Entryname | Any string | |
-ufo2 -ufo_bsequence | string | UFO features | Any string | |
-fformat2 -fformat_bsequence | string | Features format | Any string | |
-fopenfile2 -fopenfile_bsequence | string | Features file name | Any string | |
"-outseq" associated seqoutset qualifiers | | | | |
-osformat3 -osformat_outseq | string | Output seq format | Any string | |
-osextension3 -osextension_outseq | string | File name extension | Any string | |
-osname3 -osname_outseq | string | Base file name | Any string | |
-osdirectory3 -osdirectory_outseq | string | Output directory | Any string | |
-osdbname3 -osdbname_outseq | string | Database name to add | Any string | |
-ossingle3 -ossingle_outseq | boolean | Separate file for each entry | Boolean value Yes/No | N |
-oufo3 -oufo_outseq | string | UFO features | Any string | |
-offormat3 -offormat_outseq | string | Features format | Any string | |
-ofname3 -ofname_outseq | string | Features file name | Any string | |
-ofdirectory3 -ofdirectory_outseq | string | Output directory | Any string | |
General qualifiers | | | | |
-auto | boolean | Turn off prompts | Boolean value Yes/No | N |
-stdout | boolean | Write first file to standard output | Boolean value Yes/No | N |
-filter | boolean | Read first file from standard input, write first file to standard output | Boolean value Yes/No | N |
-options | boolean | Prompt for standard and additional values | Boolean value Yes/No | N |
-debug | boolean | Write debug output to program.dbg | Boolean value Yes/No | N |
-verbose | boolean | Report some/full command line options | Boolean value Yes/No | Y |
-help | boolean | Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose | Boolean value Yes/No | N |
-warning | boolean | Report warnings | Boolean value Yes/No | Y |
-error | boolean | Report errors | Boolean value Yes/No | Y |
-fatal | boolean | Report fatal errors | Boolean value Yes/No | Y |
-die | boolean | Report dying program messages | Boolean value Yes/No | Y |
-version | boolean | Report version number and exit | Boolean value Yes/No | N |
The ID names of the nucleic acid and protein sequences are NOT checked to see if they correspond to each other. They can have any names.
There must be at least as many protein sequences as nucleic acid sequences; any extra protein sequences are ignored.
Each of the nucleic acid sequences must have a corresponding protein sequence which is derived from the coding region of that nucleic acid sequence. The two sets of sequences must be in the same order.
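The pairing rules above (pair by position only, ignore names, ignore surplus proteins) can be expressed as a small pre-flight check. The sketch below is an assumption-laden illustration in Python: `parse_fasta` and `pair_by_position` are hypothetical helper names, and only plain FASTA input is handled.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples, in file order."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the ID.
            name = (line[1:].split() or [""])[0]
            records.append([name, ""])
        elif records:
            records[-1][1] += line
    return [(name, seq) for name, seq in records]


def pair_by_position(nucleotides, proteins):
    """Pair the n-th nucleotide record with the n-th protein record.

    Names are deliberately not compared; surplus proteins are dropped.
    """
    if len(proteins) < len(nucleotides):
        raise ValueError(
            f"{len(nucleotides)} nucleic sequences but only "
            f"{len(proteins)} protein sequences"
        )
    return list(zip(nucleotides, proteins))


nucs = parse_fasta(">n1\nacgt\nacgt\n>n2\ntttt\n")
prots = parse_fasta(">anything\nTY\n>other\nF\n>extra\nM\n")
for (nuc_id, _), (prot_id, _) in pair_by_position(nucs, prots):
    print(nuc_id, "<->", prot_id)
```

Because nothing but position links the two sets, a check like this only confirms the counts; whether each pair really corresponds still has to be verified separately.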
>HSFAU1
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggccccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa
>HSFAU2
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgcactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa
>HSFAU3
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaagggggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa
>HSFAU4
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgaaatagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttgcccgaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa
>HSFAU5
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtaggccgcatgctttttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa
>HSFAU1_3
PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHVAS-LEGIAPEDQVV
LLAG-PLEDEATLGQCGVEALTTLEVAGRMLG-GKVHGSLARAGKVRGQTPKVAKQEKKK
KKTGRAKRRMQYNRRFVNVVPTFGKKKGPNANS
>HSFAU2_3
PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHVAS-LEGIAPEDQVV
LLAGAPLEDALWASAGWRP
>HSFAU3_3
PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHVAS-LEGIAPEDQVV
LLAGAPLEDEATLGQCGVEALTTLEVAGRMLG-GKVHGSLARAGKVRGQTPKGAKQEKKK
KKTGRAKRRMQYNRRFVNVVPTFGKKKGPNANS
>HSFAU4_3
PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHEIASLEGIAPEDQVV
LLAGAPLEDEATLGQCGVEALTTLEVAGRMLARGKVHGSLARAGKVRGQTPKVAKQEKKK
KKTGRAKRRMQYNRRFVNVVPTFGKKKGPNANS
>HSFAU5_3
PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHVAS-LEGIAPEDQVV
LLAGAPLEDEATLGQCGVEALTTLEVGRMLFG-GKVHGSLARAGKVRGQTPKVAKQEKKK
KKTGRAKRRMQYNRRFVNVVPTFGKKKGPNANS
>HSFAU1
cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag
ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc
cagatcaaggctcatgtagcctca---ctggagggcattgccccggaagatcaagtcgtg
ctcctggcaggc---cccctggaggatgaggccactctgggccagtgcggggtggaggcc
ctgactaccctggaagtagcaggccgcatgcttgga---ggtaaagttcatggttccctg
gcccgtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaag
aagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtg
cccacctttggcaagaagaagggccccaatgccaactct
>HSFAU2
cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag
ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc
cagatcaaggctcatgtagcctca---ctggagggcattgccccggaagatcaagtcgtg
ctcctggcaggcgcgcccctggaggatgcactctgggccagtgcggggtggaggccc---
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
---------------------------------------
>HSFAU3
cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag
ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc
cagatcaaggctcatgtagcctca---ctggagggcattgccccggaagatcaagtcgtg
ctcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggcc
ctgactaccctggaagtagcaggccgcatgcttgga---ggtaaagttcatggttccctg
gcccgtgctggaaaagtgagaggtcagactcctaagggggccaaacaggagaagaagaag
aagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtg
cccacctttggcaagaagaagggccccaatgccaactct
>HSFAU4
cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag
ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc
cagatcaaggctcatgaaatagcctcactggagggcattgccccggaagatcaagtcgtg
ctcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggcc
ctgactaccctggaagtagcaggccgcatgcttgcccgaggtaaagttcatggttccctg
gcccgtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaag
aagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtg
cccacctttggcaagaagaagggccccaatgccaactct
>HSFAU5
cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag
ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc
cagatcaaggctcatgtagcctca---ctggagggcattgccccggaagatcaagtcgtg
ctcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggcc
ctgactaccctggaagtaggccgcatgctttttgga---ggtaaagttcatggttccctg
gcccgtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaag
aagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtg
cccacctttggcaagaagaagggccccaatgccaactct
The output is the regions of the nucleic acid sequences which code for the corresponding protein sequence, with gap characters ('-') introduced so that they have the same alignment as the corresponding protein sequences.
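One way to sanity-check an output alignment against its guide protein is to confirm that every protein column maps to exactly one codon-sized column and that gaps fall in the same places. The Python sketch below does only that structural check (it does not re-translate the codons); `codon_columns` and `gaps_match` are hypothetical names introduced for illustration.

```python
def codon_columns(aligned_nucleotide):
    """Split an aligned nucleotide sequence into consecutive 3-character columns."""
    return [
        aligned_nucleotide[i:i + 3]
        for i in range(0, len(aligned_nucleotide), 3)
    ]


def gaps_match(aligned_protein, aligned_nucleotide):
    """True if the nucleotide row is 3x the protein row and gaps coincide."""
    if len(aligned_nucleotide) != 3 * len(aligned_protein):
        return False
    return all(
        (aa == "-") == (codon == "---")
        for aa, codon in zip(aligned_protein, codon_columns(aligned_nucleotide))
    )


# A fragment taken from the HSFAU1 example: protein "VAS-LEG" and its codons.
print(gaps_match("VAS-LEG", "gtagcctca---ctggagggc"))
```

A `False` result would suggest that the nucleotide row was edited or truncated after tranalign wrote it, or that it is being compared with the wrong protein row.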
In general, it is better to use protein sequences for multiple alignment, but to use DNA sequences for phylogeny, for example when using the programs dnadist, dnapars, dnaml, etc. in the PHYLIP package. Given a protein sequence alignment, it would be time-consuming to transfer its gaps onto the corresponding nucleotide sequences by hand. tranalign helps by generating aligned cDNA sequences from a protein sequence alignment.
tranalign finds the coding regions for contiguous sequences only. It will not splice together different exons to produce a coding sequence. You should therefore use either mRNA sequences, or nucleic sequences which you have constructed to hold a contiguous coding region (possibly using extractseq, or yank and union).
The sequences must be in the same order in both input sets of sequences. Some alignment programs (including clustalw/emma) will re-order their input sequences so as to group similar sequences together, so check the order of the aligned proteins before running tranalign.
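If an alignment program has shuffled the proteins, they can be put back into the nucleotide order before running tranalign. The Python sketch below assumes, purely for illustration, that each protein ID is the nucleotide ID plus a suffix such as `_3` (as in the example files, HSFAU1 and HSFAU1_3); tranalign itself makes no such assumption, and `reorder_proteins` and `strip_suffix` are hypothetical helpers.

```python
def strip_suffix(protein_id):
    """Assumed naming convention: 'HSFAU1_3' belongs to nucleotide 'HSFAU1'."""
    return protein_id.rsplit("_", 1)[0]


def reorder_proteins(nucleotide_ids, protein_records, key=strip_suffix):
    """Return protein (id, sequence) records in the order of nucleotide_ids."""
    by_key = {key(pid): (pid, seq) for pid, seq in protein_records}
    missing = [nid for nid in nucleotide_ids if nid not in by_key]
    if missing:
        raise ValueError(f"no protein found for: {', '.join(missing)}")
    return [by_key[nid] for nid in nucleotide_ids]


# Proteins as an aligner might return them, grouped by similarity.
shuffled = [("HSFAU3_3", "PLS-R"), ("HSFAU1_3", "PLSAR"), ("HSFAU2_3", "PL--R")]
for pid, seq in reorder_proteins(["HSFAU1", "HSFAU2", "HSFAU3"], shuffled):
    print(pid, seq)
```

Passing a different `key` function adapts the sketch to other naming schemes; when the IDs carry no usable relationship, the order has to be restored by hand.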
"Guide protein sequence xxx not found in nucleic sequence xxx" - the region of the nucleic sequence which codes for the protein was not found. The coding region in the nucleic acid sequence must be a single contiguous sequence. The protein sequence might not be the corresponding one for this nucleic acid sequence if they are out of order.
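When this message appears, it can help to see how much of the guide protein matches in any frame. The following Python sketch is a rough diagnostic, not part of EMBOSS: `longest_prefix_match` is a hypothetical helper, only the standard genetic code is used, and it reports the frame in which the longest leading stretch of the ungapped protein is found.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotide, frame):
    """Translate one forward frame (0, 1 or 2); unknown codons become 'X'."""
    seq = nucleotide.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(frame, len(seq) - 2, 3)
    )


def longest_prefix_match(nucleotide, protein):
    """Return (frame, residues) for the longest protein prefix found in a frame."""
    ungapped = protein.replace("-", "")
    best = (0, 0)
    for frame in range(3):
        translation = translate(nucleotide, frame)
        matched = 0
        # Grow the prefix one residue at a time while it still occurs.
        while matched < len(ungapped) and ungapped[:matched + 1] in translation:
            matched += 1
        if matched > best[1]:
            best = (frame, matched)
    return best


frame, residues = longest_prefix_match("ggatgaaatttgggtaa", "MKWW")
print(f"frame {frame + 1}: first {residues} residues match")
```

A match that covers most of the protein and then stops may point to an intron or a sequence difference at that position, whereas a match of only a residue or two suggests that the two input sets are out of order.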
Program name | Description |
---|---|
edialign | Local multiple alignment of sequences |
emma | Multiple sequence alignment (ClustalW wrapper) |
infoalign | Display basic information about a multiple sequence alignment |
plotcon | Plot conservation of a sequence alignment |
prettyplot | Draw a sequence alignment with pretty formatting |
showalign | Display a multiple sequence alignment in pretty format |
tranalign was written in EMBOSS code, using the description of mrtrans as a guide, by Gary Williams, formerly at:
MRC Rosalind Franklin Centre for Genomics Research, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SB, UK
Please report all bugs to the EMBOSS bug team (emboss-bug © emboss.open-bio.org) not to the original author.
tranalign written (March 2002) - Gary Williams