Next: Patterns, profiles and multiple Up: Introduction to Sequence Analysis Previous: Pairwise sequence alignment

Subsections

Protein analysis

This chapter will introduce you to a few of the EMBOSS applications that can be used to analyse protein sequences. Obviously, the pairwise sequence comparison methods illustrated in the previous chapter with nucleic acid sequences can also be used with protein sequences.

Identifying the ORF

In this section we'll show you some simple EMBOSS applications for translating your cDNA sequence into protein. You should be aware that gene structure prediction is a tough problem, and recognising exon/intron boundaries in genomic sequence is not easy; for now, rather than deal with that aspect of prediction, we'll use the cDNA sequence in our practical. First, we need to identify our open reading frame. We can get a rapid visual overview of the distribution of ORFs in the six frames of our sequence using the EMBOSS program plotorf.

Exercise: `plotorf`

unix % plotorf
Plot potential open reading frames
Input sequence: embl:xlrhodop
Graph type [x11]:

You will see a graphical output that shows the potential open reading frames (ORF) in all six frames:

$\begin{figure} \begin{center} \epsfig{figure=plotorf.ps,width=3in,angle=90}\end{center}\end{figure}$

The longest ORF is in frame 2 from around position 100 to 1200. We will now identify the exact start and end points for our translation. To do this, we can use the EMBOSS program getorf.

Exercise: `getorf`

unix % getorf -opt
Finds and extracts open reading frames (ORFs)
Input sequence: embl:xlrhodop
Output sequence [xlrhodop.orf]:
Genetic codes
0 : Standard
1 : Standard (with alternative initiation codons)
2 : Vertebrate Mitochondrial
3 : Yeast Mitochondrial
4 : Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma
5 : Invertebrate Mitochondrial
6 : Ciliate Macronuclear and Dasycladacean
9 : Echinoderm Mitochondrial
10 : Euplotid Nuclear
11 : Bacterial
12 : Alternative Yeast Nuclear
13 : Ascidian Mitochondrial
14 : Flatworm Mitochondrial
15 : Blepharisma Macronuclear
Code to use [0]:
Minimum nucleotide size of ORF to report [30]:
Type of sequence to output
0 : Translation of regions between STOP codons
1 : Translation of regions between START and STOP codons
2 : Nucleic sequences between STOP codons
3 : Nucleic sequences between START and STOP codons
4 : Nucleotides flanking START codons
5 : Nucleotides flanking initial STOP codons
6 : Nucleotides flanking ending STOP codons
Type of output [0]: 3

Notice that you can specify the organism whose codon usage table is most appropriate for your sequence, and you can also choose the type of information that is reported to you. In our case, we are simply interested in the positions of the start and stop codons for this sequence.

plotorf is just a graphical representation of the textual information produced by getorf. Since we asked for all ORFs above a minimum size to be reported, getorf is telling us about a number of potential ORFs. We know from plotorf that our ORF will be in the region 100 to 1200, so scroll through the output file, xlrhodop.orf, until you identify this. What are the actual start and end positions?

unix % more xlrhodop.orf

>XLRHODOP_7 [110 - 1171] Xenopus laevis	rhodopsin mRNA,	complete cds.
atgaacggaacagaaggtccaaatttttatgtccccatgtccaacaaaactggggtggta
cgaagcccattcgattaccctcagtattacttagcagagccatggcaatattcagcactg

$\vdots$

Translating the sequence

From the previous exercise you should have found that the region to be translated is from 110 to 1171 in our cDNA sequence. Now we can use transeq to translate that region and use the translated peptide for some further analyses.

Exercise: `transeq`

Let's practice using command line flags (qualifiers) again. The new ones here are -sbegin and -send. These allow you to specify a subregion of your sequence; in this case we will ask transeq to translate only the part of embl:xlrhodop that we have identified as the coding region. You should remember -outseq from Chapter 2.

unix % transeq embl:xlrhodop -sbegin 110 -send 1171 -outseq xlrhodop.pep
Translate nucleic acid sequences

unix % more xlrhodop.pep

>XLRHODOP+1 Xenopus laevis rhodopsin mRNA, complete cds.
MNGTEGPNFYVPMSNKTGVVRSPFDYPQYYLAEPWQYSALAAYMFLLILLGLPINFMTLF
VTIQHKKLRTPLNYILLNLVFANHFMVLCGFTVTMYTSMHGYFIFGQTGCYIEGFFATLG
GEVALWSLVVLAVERYMVVCKPMANFRFGENHAIMGVAFTWIMALSCAAPPLFGWSRYIP
EGMQCSCGVDYYTLKPEVNNESFVIYMFIVHFTIPLIVIFFCYGRLLCTVKEAAAQQQES
ATTQKAEKEVTRMVVIMVVFFLICWVPYAYVAFYIFTHQGSNFGPVFMTVPAFFAKSSAI
YNPVIYIVLNKQFRNCLITTLCCGKNPFGDEDGSSAATSKTEASSVSSSQVSPA

We saw earlier that the SwissProt entry for this protein has the identifier opsd_xenla; test your understanding of EMBOSS so far by using needle to compare your translated product with the database sequence. Compare your findings with the SwissProt entry using SRS.

USA for partial sequences

As an alternative to -sbegin and -send you can specify start, end and whether to reverse complement as part of the sequence USA. The format to use is db:sequence[start:end] (or db:sequence[start:end:r] to reverse complement). Start must be smaller than end. If you want to use the actual start and end then use the value 0 instead of positions. If you want to count from the end of the sequence rather than the beginning then use negative numbers.

Examples

Residues 10-20	sw:opsd_xenla[10:20]
The last ten residues	sw:opsd_xenla[-10:0]
The last twenty residues bar 5	sw:opsd_xenla[-20:-6]
bases 134-458 reverse complement	embl:xlrhodop[134:458:r]

Secondary structure prediction

The question of how DNA sequence determines specific protein structure has been a constant source of fascination and speculation since the problem was identified. It remains an extremely difficult area; generally referred to as the ``folding problem'', it is one of the major outstanding questions in molecular biology. Many attempts have been made to predict the tertiary structure of a protein from its sequence. These fall into two distinct approaches:

One approach is to set up a realistic mechanical model of the protein chain and simulate the folding process.
Other approaches are empirical as they proceed by inference from known tertiary structures.

The approach to structure prediction based on mechanical models has the innate (possibly fatal) attraction that, in theory, it requires no prior knowledge of protein tertiary structure. If successful it could be applied uniformly to all sequences. By contrast, all methods based on inference from known structures are inherently limited in their applicability. They will only be appropriate for predicting structures similar to those which were used in the inference process. Fortunately there are often biophysical or biochemical clues that help make this decision and these are often integrated in the methods for structure prediction.

Currently the best way to achieve reasonable secondary structure predictions is to run a variety of prediction algorithms over your sequence and determine a consensus among the results. There are various web servers that will do these multiple analyses for you, including PIX at the HGMP and Jpred at the University of Dundee:

http://www.hgmp.mrc.ac.uk/Registered/Webapp/pix/
http://www.compbio.dundee.ac.uk/ $\tilde{}$ www-jpred

As yet, coverage of secondary structure prediction within EMBOSS is limited. More algorithms will be added to enable the conesensus approach described above. We'll take a look now at some of the predictions you can currently perform using EMBOSS.

pepinfo

pepinfo produces information on amino acid properties (size, polarity, aromaticity, charge etc). Hydrophobicity profiles are also available and are useful for locating turns, potential antigenic peptides and transmembrane helices. Various algorithms are employed including the Kyte and Doolittle hydropathy measure - this curve is the average of a residue-specific hydrophobicity index over a window of nine residues. When the line is in the upper half of the frame, it indicates a hydrophobic region, and when it is in the lower half, a hydrophilic region.

Exercise: `pepinfo`

unix % pepinfo xlrhodop.pep
Plots simple amino acid properties in parallel
Graph type [x11]:
Output file [pepinfo.out]:

You will see two screens (press return to move from the first to the second screen) that look like this:

$\begin{figure} \begin{center} \epsfig{figure=pepinfo1.ps,width=5in, height=3in}\end{center}\end{figure}$

$\begin{figure} \begin{center} \epsfig{figure=pepinfo2.ps,width=5in, height=3in}\end{center}\end{figure}$

Predicting transmembrane regions

The results from the pepinfo hydropathy plot showed seven highly hydrophobic regions within xlrhodop.pep. Could these be transmembrane domains? We can use the EMBOSS program tmap to investigate this possibility:

Exercise: `tmap`

unix % tmap
Displays membrane spanning regions
Sequences file to be read in: xlrhodop.pep
Graph type [x11]:

You will see a window that looks like this:

$\begin{figure} \begin{center} \epsfig{figure=tmap.ps,height=5in,angle=90}\end{center}\end{figure}$

The bars across the top represent areas where transmembrane segments are predicted. Taken in combination with the results from pepinfo, we can see that there may be seven transmembrane helices in this protein. This corresponds well with both the SwissProt entry for this sequence (opsd_xenla) and with some information we will gather about patterns and profiles in the next chapter.

There are various other programs you can use to analyse your peptide sequence - to find out what is available, try rerunning wossname as we did in the first chapter.

Next: Patterns, profiles and multiple Up: Introduction to Sequence Analysis Previous: Pairwise sequence alignment

Gary Williams
2003-04-29