If we can identify a relationship to a protein of known structure, it is possible to infer that the new protein shares a common structure with its relative and to assign its general fold. However, what if the homologue has no known structure? If its function has been identified then we might expect our unknown protein to have a similar or related function. However, exceptions do exist. A classic example is lysozyme, which shares around 50% sequence identity and 70% sequence similarity with -lactalbumin. The two proteins also share similar folds, but their functions are entirely different: the two key catalytic residues of lysozyme are not conserved in -lactalbumin, and the acidic calcium binding motif important to the function of -lactalbumin is not present in most lysoszymes. It is essential that you confirm any computer based predictions with benchwork.
What can you do if sequence similarity alone does not satisfactrily identify a relative? In this chapter we will show you a few more applications within EMBOSS that can help you predict the function of your sequence.
The EMBOSS program patmatmotifs looks for sequence motifs by searching with a pattern search algorithm through the given protein sequence for the patterns defined in the PROSITE database, compiled by Dr. Amos Bairoch at the University of Geneva. PROSITE is a database of protein families and domains, based on the observation that, while there are a huge number of different proteins, most of them can be grouped, on the basis of similarities in their sequences, into a limited number of families. Proteins or protein domains belonging to a particular family generally share functional attributes and are derived from a common ancestor.
unix % patmatmotifs
Search a motif database with a protein sequence
Input sequence: xlrhodop.pep
Output file [xlrhodop_1.patmatmotifs]: xlrhodop.patmatmotifs
unix % more xlrhodop.patmatmotifs
Number of matches found in this Sequence = 1 Length of the sequence = 354 basepairs Start of match = position 123 of sequence End of match = position 139 of sequence Length of motif = 17 patmatmotifs of G_PROTEIN_RECEPTOR with XLRHODOP+1 from 123 to 139\\ TLGGEVALWSLVVLAVERYMVVCKPMA | | 123 139 Number of matches found in this Sequence = 1 Length of the sequence = 354 basepairs Start of match = position 290 of sequence End of match = position 306 of sequence Length of motif = 17 patmatmotifs of OPSIN with XLRHODOP+1 from 290 to 306 PVFMTVPAFFAKSSAIYNPVIYIVLNK | | 290 306
In our case we already know that our sequence is a rhodopsin. However, if you had an unknown sequence, we hope you can see that identifying motifs might provide you with information to help you plan further experiments.
Many of the EMBOSS programs produce reports as output. These can take
various formats (and are user selectable). So if, instead of the somewhat
pictorial display of motifs seen in the previous exercise, one wanted a
listfile so that the individual sequence matches could be retrieved
for some later purpose, the report format can be specified using the
-rformat qualifier. Illustrating this with the previous example:
unix % patmatmotifs xlrhodop.pep -rformat listfile
Search a PROSITE motif database with a protein sequence
Output report [xlrhodop_1.patmatmotifs]:
unix % more xlrhodop_1.patmatmotifs
######################################## # Program: patmatmotifs # Rundate: Fri Feb 21 13:37:58 2003 # Report_format: listfile # Report_file: xlrhodop_1.patmatmotifs ######################################## #======================================= # # Sequence: sw-id:OPSD_XENLA from: 1 to: 354 # HitCount: 2 # # Full: No # Prune: Yes # Data_file: /site/share/EMBOSS/data/PROSITE/prosite.lines # #======================================= sw-id:OPSD_XENLA[123:139] sw-id:OPSD_XENLA[290:306] #--------------------------------------- #---------------------------------------
You can now retrieve these sequences using e.g. seqret with xlrhodop_1.patmatmotifs as a listfile.
There are other report formats (including feature table style formats). The EMBOSS web page has up to date documentation on the formats that are available.
unix % pscan
Scans proteins using PRINTS
Input sequence: xlrhodop.pep
Minimum number of elements per fingerprint [2]:
Maximum number of elements per fingerprint [20]:
Output file [xlrhodop_1.pscan]: xlrhodop.pscan
Scanning XLRHODOP+1...
unix % more xlrhodop.pscan
CLASS 1 Fingerprints with all elements in order Fingerprint GPCRRHODOPSN Elements 7 Accession number PR00237 Rhodopsin-like GPCR superfamily signature Element 1 Threshold 54% Score 61% Start position 39 Length 25 Element 2 Threshold 49% Score 49% Start position 72 Length 22 Element 3 Threshold 48% Score 55% Start position 117 Length 23 Element 4 Threshold 50% Score 69% Start position 152 Length 22 Element 5 Threshold 51% Score 82% Start position 204 Length 24 Element 6 Threshold 42% Score 72% Start position 250 Length 25 Element 7 Threshold 46% Score 68% Start position 288 Length 27 CLASS 2 All elements match but not all in the correct order Fingerprint RHODOPSIN Elements 6 Accession number PR00579 Rhodopsin signature Element 1 Threshold 80% Score 100% Start position 3 Length 19 Element 2 Threshold 76% Score 94% Start position 22 Length 17 Element 3 Threshold 53% Score 90% Start position 85 Length 17 Element 4 Threshold 71% Score 100% Start position 191 Length 17 Element 5 Threshold 56% Score 97% Start position 271 Length 19 Element 6 Threshold 81% Score 95% Start position 319 Length 14 CLASS 3 Not all elements match but those that do are in order CLASS 4 Remaining partial matches
One of the most popular programs for performing multiple sequence alignments is clustalw ( []). EMBOSS has an interface to clustal called emma clustal (and thus emma) creates a multiple sequence alignment from a group of related sequences using progressive pairwise alignments. It can also produce a dendogram showing the clustering relationships used to create the alignment. The dendogram shows the order of the pairwise alignments of sequences and clusters of sequences that together generate the final alignment, but it is not an evolutionary tree, although the length of the branches is related to the relative distance of the sequences. clustal finds global optimal alignments. The alignment procedure begins with the pairwise alignment of the two most similar sequences, producing a cluster of two aligned sequences. This cluster can then be aligned to the next most related sequence or cluster of aligned sequences.Two clusters of sequences can be aligned by a simple extension of the pairwise alignment of two individual sequences. The final alignment is achieved by a series of progressive, pairwise alignments that include increasingly dissimilar sequences and clusters, until all sequences have been included in the final pairwise alignment. When gaps are inserted into a sequence to produce an alignment, they are inserted at the same position in all the sequences of the cluster. Each pairwise alignment uses the method of Needleman and Wunsch extended for use with clusters of aligned sequences.
pscan has told us that our sequence belongs to the
rhodopsin family. This is a very large family of sequences - for
example, you can see the Pfam entry for rhodopsin by doing a keyword
search at
http://www.sanger.ac.uk/Software/Pfam
We will now retrieve some further members of the family from SwissProt and produce a multiple alignment; we'll then use this multiple alignment to produce a profile of this group of sequences and use that to align them all to our original sequence.
First, let's retrieve the sequences using seqret:
unix % seqret
Reads and writes (returns) a set of sequences all at once
Input sequence: sw:ops2_*
Output sequence [ops2_drome.fasta]: ops2.fasta
Note our use of the wild card character * to retrieve all swissprot sequences whose identifiers begin ops2_.
unix % emma
Multiple alignment program - interface to ClustalW program
Input sequence: ops2.fasta
Output sequence [ops2_drome.aln]: ops2.aln
Output file [ops2_drome.dnd]: ops2.dnd
..clustalw -infile=21665A -outfile=21665B -align
-type=protein -output=gcg -pwmatrix=blosum -pwgapopen=10.000
-pwgapext=0.100 -newtree=21665C -matrix=blosum -gapopen=10.000
-gapext=5.000 -gapdist=8 -hgapresidues=GPSNDQEKR -maxdiv=30..
CLUSTAL W (1.74) Multiple Sequence Alignments
Sequence type explicitly set to Protein
Sequence format is Pearson
Sequence 1: OPS2_DROME 381 aa
Sequence 2: OPS2_DROPS 381 aa
Sequence 3: OPS2_HEMSA 377 aa
Sequence 4: OPS2_LIMPO 376 aa
Sequence 5: OPS2_PATYE 399 aa
Sequence 6: OPS2_SCHGR 380 aa
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 91
Sequences (1:3) Aligned. Score: 37
Sequences (1:4) Aligned. Score: 48
Sequences (1:5) Aligned. Score: 20
Sequences (1:6) Aligned. Score: 32
Sequences (2:3) Aligned. Score: 37
Sequences (2:4) Aligned. Score: 48
Sequences (2:5) Aligned. Score: 22
Sequences (2:6) Aligned. Score: 31
Sequences (3:4) Aligned. Score: 40
Sequences (3:5) Aligned. Score: 23
Sequences (3:6) Aligned. Score: 32
Sequences (4:5) Aligned. Score: 20
Sequences (4:6) Aligned. Score: 34
Sequences (5:6) Aligned. Score: 18
Guide tree file created: [21665C]
Start of Multiple Alignment
There are 5 groups
Aligning...
Group 1: Sequences: 2 Score:6084
Group 2: Sequences: 3 Score:3046
Group 3: Sequences: 4 Score:2772
Group 4: Sequences: 5 Score:2489
Group 5: Delayed
Sequence:5 Score:2819
Alignment Score 11778
GCG-Alignment file created [21665B]
We have aligned ops2 sequences from two fruit fly species, two crab species, locust and scallop. Let's see what emma made of them:
unix % more ops2.aln
>OPS2_DROME MERSHLPETPFDLAHSGPRFQAQSSGNGSVLD-NVLPDMAHLVNPYWSRFAPMDPMMSKI LGLFTLAIMIISCCGNGVVVYIFGGTKSLRTPANLLVLNLAFSDFCMMASQSPVMIINFY Y-ETWVLGPLWCDIYAGCGSLFGCVSIWSMCMIAFDRYNVIVKGINGTPMTIKTSIMKIL FIWMMAVFWTVMPLIGWSAYVPEGNLTACSIDYMTRMWNPRSYLITYSLFVYYTPLFLIC YSYWFIIAAVAAHEKAMREQAKKMNVKSLRSSEDCDK-SAEGKLAKVALTTISLWFMAWT PYLVICYFGLFKIDG-LTPLTTIWGATFAKTSAVYNPIVYGISHPKYRIVLKEKCPMCVF GNTDEPKPDAPASDTETTSEADSKA----------------------------------- --------------------------- >OPS2_DROPS MERSLLPEPPLAMALLGPRFEAQTGGNRSVLD-NVLPDMAPLVNPHWSRFAPMDPTMSKI LGLFTLVILIISCCGNGVVVYIFGGTKSLRTPANLLVLNLAFSDFCMMASQSPVMIINFY Y-ETWVLGPLWCDIYAACGSLFGCVSIWSMCMIAFDRYNVIVKGINGTPMTIKTSIMKIA FIWMMAVFWTIMPLIGWSSYVPEGNLTACSIDYMTRQWNPRSYLITYSLFVYYTPLFMIC YSYWFIIATVAAHEKAMRDQAKKMNVKSLRSSEDCDK-SAENKLAKVALTTISLWFMAWT PYLIICYFGLFKIDG-LTPLTTIWGATFAKTSAVYNPIVYGISHPNDRLVLKEKCPMCVC GTTDEPKPDAPPSDTETTSEAESKD----------------------------------- --------------------------- >OPS2_LIMPO ----------MANQLSYSSLGWPYQPNASVVD-TMPKEMLYMIHEHWYAFPPMNPLWYSI LGVAMIILGIICVLGNGMVIYLMMTTKSLRTPTNLLVVNLAFSDFCMMAFMMPTMASNCF A-ETWILGPFMCEVYGMAGSLFGCASIWSMVMITLDRYNVIVRGMAAAPLTHKKATLLLL FVWIWSGGWTILPFFGWSRYVPEGNLTSCTVDYLTKDWSSASYVIIYGLAVYFLPLITMI YCYFFIVHAVAEHEKQLREQAKKMNVASLRANADQQKQSAECRLAKVAMMTVGLWFMAWT PYLIIAWAGVFSSGTRLTPLATIWGSVFAKANSCYNPIVYGISHPRYKAALYQRFPSLAC GSGESGSDVKSEASATMTMEEKPKSPEA-------------------------------- --------------------------- >OPS2_HEMSA ---MTNATGPQMAYYGAASMDFGYPEGVSIVD-FVRPEIKPYVHQHWYNYPPVNPMWHYL LGVIYLFLGTVSIFGNGLVIYLFNKSAALRTPANILVVNLALSDLIMLTTNVPFFTYNCF SGGVWMFSPQYCEIYACLGAITGVCSIWLLCMISFDRYNIICNGFNGPKLTTGKAVVFAL ISWVIAIGCALPPFFGWGNYILEGILDSCSYDYLTQDFNTFSYNIFIFVFDYFLPAAIIV FSYVFIVKAIFAHEAAMRAQAKKMNVSTLRSNEADAQ-RAEIRIAKTALVNVSLWFICWT PYALISLKGVMGDTSGITPLVSTLPALLAKSCSCYNPFVYAISHPKYRLAITQHLPWFCV HETETKSNDDSQSNSTVAQDKA-------------------------------------- --------------------------- >OPS2_SCHGR ------MVNTTDFYPVPAAMAYESSVGLPLLGWNVPTEHLDLVHPHWRSFQVPNKYWHFG LAFVYFMLMCMSSLGNGIVLWIYATTKSIRTPSNMFIVNLALFDVLMLLEMPMLVVSSLF Y-QRPVGWELGCDIYAALGSVAGIGSAINNAAIAFDRYRTISCPIDGRLTQGQVLALIAG TWVWTLPFTLMPLLRIWSRFTAEGFLTTCSFDYLTDDEDTKVFVGCIFAWSYAFPLCLIC CFYYRLIGAVREHEKMLRDQAKKMNVKSLQSNADTEAQSAEIRIAKVALTIFFLFLCSWT PYAVVAMIGAFGNRAALTPLSTMIPAVTAKIVSCIDPWVYAINHPRFRAEVQKRMKWLHL GEDARSSKSDTSSTATDRTVGNVSASA--------------------------------- --------------------------- >OPS2_PATYE ---------------------------------------MPFPLNRTDTALVISPSEFRI IGIFISICCIIGVLGNLLIIIVFAKRRSVRRPINFFVLNLAVSDLIVALLGYPMTAASAF S-NRWIFDNIGCKIYAFLCFNSGVISIMTHAALSFCRYIIICQYGYRKKITQTTVLRTLF SIWSFAMFWTLSPLFGWSSYVIEVVPVSCSVNWYGHGLGDVSYTISVIVAVYVFPLSIIV FSYGMIL-----QEKVCKDSRKNGIRAQQRYTPRFIQ-DIEQRVTFISFLMMAAFMVAWT PYAIMSALAIGSFNV--ENSFAALPTLFAKASCAYNPFIYAFTNANFRDTVVEIMAPWTT RRVGVSTLPWPQVTYYPRRRTSAVNTTDIEFPDDNIFIVNSSVNGPTVKREKIVQRNPIN VRLGIKIEPRDSRAATENTFTADFSVI
The sequences are very similar, but there are some differences - note the gaps that have been inserted. Also note that since this is a global alignment algorithm, gaps have been inserted to make all the sequences the same length.
Differences in alignment can be very difficult to see in this format. The program prettyplot can enhance visualisation of your results, by aligning the sequences on top of one another.
unix % prettyplot
Displays aligned sequences, with colouring and boxing
Input sequence set: ops2.aln
Graph type [x11]:
A graphic display will appear on your screen detailing your alignment. Identical residues are shown in red, and similar residues in green. This type of display can given you a first impression regions of conservation.
As with all EMBOSS graphical programs you can capture the output in a file rather than just viewing it on screen. The output is controlled by the -graph family of associated qualifiers (type prettyplot -help -verbose to get a complete listing of options.
We will save our pretty plot to a file rhodopsin.ps in colour postscript format.
To do this we use -graph cps and -goutfile rhodopsin.
unix % prettyplot ops2.aln -goutfile rhodopsin -graph cps
Displays aligned sequences, with colouring and boxing
Created rhodopsin.ps
This has created a file rhodopsin.ps that can be printed on a postscript printer or turned into a PDF document with ps2pdf (not an EMBOSS program but commonly found on many UNIX/Linux systems). PDF documents can then be viewed with a PDF viewer such as Acrobat Reader.
To adjust the output of prettyplot (e.g to increase the number of residues per line) there are a number of options that can be set. Read the help file and try to plot with/without a consensus, different numbers of residues per line and so on. (hint: prettyplot -help)
prophecy is an EMBOSS program for creating a profile from a set of multiply aligned sequences. We'll use our ops2 alignment to show you prophecy
unix % prophecy
Creates matrices/profiles from multiple alignments
Input sequence: ops2.aln
Profile type
F : Frequency
G : Gribskov
H : Henikoff
Select type [F]: g
Enter a name for the profile [My matrix]: ops2 sequences
Scoring matrix [Epprofile]:
Gap opening penalty [3.0]:
Gap extension penalty [0.3]:
Output file [outfile.prophecy]: ops2.prophecy
Now let's use the profile we just created to align xlrhodop.pep to our opsin2 sequences.
unix % prophet
Gapped alignment for profiles
Input sequence(s): xlrhodop.pep
Profile or matrix file: ops2.prophecy
Gap opening coefficient [1.0]:
Gap extension coefficient [0.1]:
Output file [ops2.prophet]:
unix % more ops2.prophet
Local: Consensus vs OPSD_XENLA Score: 2189.00 Consensus 1 M.ERS.HLPEG.PFAAALSGARFAAQSSGN.ASVL..DWNVLP.E 38 | : : : || : ::::: : |: | ::|: : | : OPSD_XENLA 1 MNG.GTE..EGPN.NFYVP.PMS...SN.NKTGVVRSP.P..PFD 33 Consensus 39 MAPLVHPHWSRF.APMNPMWHKILGLFTLILGII.SCLG.NGLVI 80 ::: ::: : : :|: :::|: ::::|::: |: | ::: OPSD_XENLA 34 YPQ.Q.QYYL.LAE..EPWQYSALAAYMFLLILLGL.LPINFMTL 72 Consensus 81 YI.FA.GTKSLRTPANLLVLNLAFSD..FCMMASMSPV.MAINCF 120 :: :: :: |||| |:::|||:|:: : |:: :::| | :::: OPSD_XENLA 73 FVTIQHKKL.LRTPLNYILLNLVFANHFM.MVLCGFTVTMYTSMH 115 Consensus 121 YGETWVLGPLGC..D.IYAAL.GSLFGCVSIWSMCMIAFDRYNVI 161 : :::|| || : ::|:| | | |::||::::|::|| |: OPSD_XENLA 116 G.GYFIFGPTGCYIEGFFATLGG...GEVALWSLVVLAVERYIVV 156 Consensus 162 VKGINGTPLTIKTAILKALFIWMM.AVFW.TIMPLFGWSRYVPEG 204 :|:::: ::::: ||: ::|:|:| : :: : :||||||||:||| OPSD_XENLA 157 CKPMANFRFGENHAIMGVAFTWIMAL.LSCAAPPLFGWSRYIPEG 200 Consensus 205 NLTSCSIDYLT.R.DWNPRSYL.ITYFLFV.YFFPLFIICYSY.W 244 : :||::||:| : : |: |:: |:::| : :||::|:::| : OPSD_XENLA 201 MQCSCGVDYYTLKPEVNNESFVIY.YMFIVHFTIPLIVIFFCYGR 244 Consensus 245 FIIAAVAAHEKAMRDQAKKMNVKSLRSNEDCDKQSAEI.R.LAKV 287 ::::: :|:|:::|:: : :::::::: : |: | :: | OPSD_XENLA 245 LLCTVK..KEAAAQQQESLT..TTQKAEKE..E...EVTRMVV.V 279 Consensus 288 ALTTISLWFMAWTPYAIIAY.FGLFGIDGA.LTP.LTT.IWGALF 328 :::: :::::|:|||::|: : :|: :|: ::| ::| :|:| OPSD_XENLA 280 IMVVF.FFLICWVPYAYVAFYI.IFTHQGSNFGPVFMTVP.PAFF 321 Consensus 329 AKASSCYNPIVYAISHPKYRA.ALKEKCPMCVCGETD.EPSPDAP 371 ||:|::|||::| : ::::| :: :: ::||::: ::::::: OPSD_XENLA 322 AKSSAIYNPVIYIVLNKQFRNCLI...ITTLCCGKNPFGDEDGSS 363 Consensus 372 QSDATTTSEAAS..KAPAAI.EFPD 393 |:||:::||:| ::: :: : |: OPSD_XENLA 364 .SAATSKTEASSVSSSQ.QVSP.PA 385
The vertical bars (|) represent residues that are identical between the ops2 consensus and our rhodopsin, while the colons (:) represent conservative substitutions. We hope you can see that aligning members of a family can reveal conserved regions that may be important for structure and/or function.