Next: Conclusion Up: Introduction to Sequence Analysis Previous: Protein analysis

Subsections

Patterns, profiles and multiple sequence alignment

We have not covered BLAST or FASTA searching in this tutorial because they are not currently part of EMBOSS; these searches are offered at many web sites worldwide. However, database searches are an important part of the bioinformatician's arsenal. When we screen a new sequence against a database of known sequences, we are trying to answer the following questions:

Is there any protein of known structure that has sufficient similarity to the sequence of the unknown protein to suggest a familial relationship?
If not, which sequence of any known proteins is most similar to the sequence of the unknown protein?

If we can identify a relationship to a protein of known structure, it is possible to infer that the new protein shares a common structure with its relative and to assign its general fold. However, what if the homologue has no known structure? If its function has been identified then we might expect our unknown protein to have a similar or related function. However, exceptions do exist. A classic example is lysozyme, which shares around 50% sequence identity and 70% sequence similarity with $\alpha$ -lactalbumin. The two proteins also share similar folds, but their functions are entirely different: the two key catalytic residues of lysozyme are not conserved in $\alpha$ -lactalbumin, and the acidic calcium binding motif important to the function of $\alpha$ -lactalbumin is not present in most lysoszymes. It is essential that you confirm any computer based predictions with benchwork.

What can you do if sequence similarity alone does not satisfactrily identify a relative? In this chapter we will show you a few more applications within EMBOSS that can help you predict the function of your sequence.

Pattern matching

In a number of cases, the active site of a protein can be recognized by a specific ``fingerprint'' or ``template'', a fairly small set of residues that are unique to a family of proteins. An example is the sequence GXGXXG (where G=glycine and X=any amino acid) which defines a GTP binding site. Searching for a (rather loose) predefined string of characters in a sequence is called Pattern Matching.

The EMBOSS program patmatmotifs looks for sequence motifs by searching with a pattern search algorithm through the given protein sequence for the patterns defined in the PROSITE database, compiled by Dr. Amos Bairoch at the University of Geneva. PROSITE is a database of protein families and domains, based on the observation that, while there are a huge number of different proteins, most of them can be grouped, on the basis of similarities in their sequences, into a limited number of families. Proteins or protein domains belonging to a particular family generally share functional attributes and are derived from a common ancestor.

Exercise: `patmatmotifs`

unix % patmatmotifs
Search a motif database with a protein sequence
Input sequence: xlrhodop.pep
Output file [xlrhodop_1.patmatmotifs]: xlrhodop.patmatmotifs

unix % more xlrhodop.patmatmotifs

Number of matches found	in this	Sequence = 1

Length of the sequence = 354 basepairs
Start of match = position 123 of sequence
End of match = position	139 of sequence
Length of motif	= 17

patmatmotifs of	G_PROTEIN_RECEPTOR with	XLRHODOP+1 from	123 to 139\\
TLGGEVALWSLVVLAVERYMVVCKPMA
     |		     |
   123		     139

Number of matches found	in this	Sequence = 1

Length of the sequence = 354 basepairs
Start of match = position 290 of sequence
End of match = position	306 of sequence
Length of motif	= 17

patmatmotifs of	OPSIN with XLRHODOP+1 from 290 to 306
PVFMTVPAFFAKSSAIYNPVIYIVLNK
     |		     |
   290		     306

In our case we already know that our sequence is a rhodopsin. However, if you had an unknown sequence, we hope you can see that identifying motifs might provide you with information to help you plan further experiments.

Report formats

Many of the EMBOSS programs produce reports as output. These can take various formats (and are user selectable). So if, instead of the somewhat pictorial display of motifs seen in the previous exercise, one wanted a listfile so that the individual sequence matches could be retrieved for some later purpose, the report format can be specified using the -rformat qualifier. Illustrating this with the previous example:

unix % patmatmotifs xlrhodop.pep -rformat listfile
Search a PROSITE motif database with a protein sequence
Output report [xlrhodop_1.patmatmotifs]:
unix % more xlrhodop_1.patmatmotifs

########################################
# Program: patmatmotifs
# Rundate: Fri Feb 21 13:37:58 2003
# Report_format: listfile
# Report_file: xlrhodop_1.patmatmotifs
########################################

#=======================================
#
# Sequence: sw-id:OPSD_XENLA     from: 1   to: 354
# HitCount: 2
#
# Full: No
# Prune: Yes
# Data_file: /site/share/EMBOSS/data/PROSITE/prosite.lines
#
#=======================================

sw-id:OPSD_XENLA[123:139]
sw-id:OPSD_XENLA[290:306]

#---------------------------------------
#---------------------------------------

You can now retrieve these sequences using e.g. seqret with xlrhodop_1.patmatmotifs as a listfile.

There are other report formats (including feature table style formats). The EMBOSS web page has up to date documentation on the formats that are available.

Protein fingerprints

PRINTS is a database that defines functional protein families, identifying each domain by a number of short, particularly well conserved sequences. A full match to one of these "fingerprints" will match all the relevant short sequences in the correct order. A partial match is recorded if some are missing or if they occur in an incorrect order. The PRINTS database can be searched using the pscan program which is available within EMBOSS.

Exercise: `pscan`

unix % pscan
Scans proteins using PRINTS
Input sequence: xlrhodop.pep
Minimum number of elements per fingerprint [2]:
Maximum number of elements per fingerprint [20]:
Output file [xlrhodop_1.pscan]: xlrhodop.pscan

Scanning XLRHODOP+1...

unix % more xlrhodop.pscan

CLASS 1
Fingerprints with all elements in order

Fingerprint GPCRRHODOPSN Elements 7
    Accession number PR00237
    Rhodopsin-like GPCR	superfamily signature
  Element 1 Threshold 54% Score	61%
	     Start position 39 Length 25
  Element 2 Threshold 49% Score	49%
	     Start position 72 Length 22
  Element 3 Threshold 48% Score	55%
	     Start position 117	Length 23
  Element 4 Threshold 50% Score	69%
	     Start position 152	Length 22
  Element 5 Threshold 51% Score	82%
	     Start position 204	Length 24
  Element 6 Threshold 42% Score	72%
	     Start position 250	Length 25
  Element 7 Threshold 46% Score	68%
	     Start position 288	Length 27

CLASS 2
All elements match but not all in the correct order

Fingerprint RHODOPSIN Elements 6
    Accession number PR00579
    Rhodopsin signature
  Element 1 Threshold 80% Score	100%
	     Start position 3 Length 19
  Element 2 Threshold 76% Score	94%
	     Start position 22 Length 17
  Element 3 Threshold 53% Score	90%
	     Start position 85 Length 17
  Element 4 Threshold 71% Score	100%
	     Start position 191	Length 17
  Element 5 Threshold 56% Score	97%
	     Start position 271	Length 19
  Element 6 Threshold 81% Score	95%
	     Start position 319	Length 14

CLASS 3
Not all	elements match but those that do are in	order


CLASS 4
Remaining partial matches

Multiple Sequence Analysis

The simultaneous alignment of many nucleotide or amino acid sequences is now an essential tool in molecular biology. Multiple alignments are used to find diagnostic patterns to characterize protein families; to detect or demonstrate homology between new sequences and existing families of sequences; to help predict the secondary and tertiary structures of the new sequences; to suggest oligonucleotide primers for PCR; and as an essential prelude to molecular evolutionary analysis.

One of the most popular programs for performing multiple sequence alignments is clustalw ( []). EMBOSS has an interface to clustal called emma clustal (and thus emma) creates a multiple sequence alignment from a group of related sequences using progressive pairwise alignments. It can also produce a dendogram showing the clustering relationships used to create the alignment. The dendogram shows the order of the pairwise alignments of sequences and clusters of sequences that together generate the final alignment, but it is not an evolutionary tree, although the length of the branches is related to the relative distance of the sequences. clustal finds global optimal alignments. The alignment procedure begins with the pairwise alignment of the two most similar sequences, producing a cluster of two aligned sequences. This cluster can then be aligned to the next most related sequence or cluster of aligned sequences.Two clusters of sequences can be aligned by a simple extension of the pairwise alignment of two individual sequences. The final alignment is achieved by a series of progressive, pairwise alignments that include increasingly dissimilar sequences and clusters, until all sequences have been included in the final pairwise alignment. When gaps are inserted into a sequence to produce an alignment, they are inserted at the same position in all the sequences of the cluster. Each pairwise alignment uses the method of Needleman and Wunsch extended for use with clusters of aligned sequences.

pscan has told us that our sequence belongs to the rhodopsin family. This is a very large family of sequences - for example, you can see the Pfam entry for rhodopsin by doing a keyword search at
http://www.sanger.ac.uk/Software/Pfam

We will now retrieve some further members of the family from SwissProt and produce a multiple alignment; we'll then use this multiple alignment to produce a profile of this group of sequences and use that to align them all to our original sequence.

First, let's retrieve the sequences using seqret:

Exercise: Retrieving a set of sequences

unix % seqret
Reads and writes (returns) a set of sequences all at once
Input sequence: sw:ops2_*
Output sequence [ops2_drome.fasta]: ops2.fasta

Note our use of the wild card character * to retrieve all swissprot sequences whose identifiers begin ops2_.

Exercise: `emma`

unix % emma
Multiple alignment program - interface to ClustalW program
Input sequence: ops2.fasta
Output sequence [ops2_drome.aln]: ops2.aln
Output file [ops2_drome.dnd]: ops2.dnd
..clustalw -infile=21665A -outfile=21665B -align
-type=protein -output=gcg -pwmatrix=blosum -pwgapopen=10.000
-pwgapext=0.100 -newtree=21665C -matrix=blosum -gapopen=10.000
-gapext=5.000 -gapdist=8 -hgapresidues=GPSNDQEKR -maxdiv=30..

CLUSTAL W (1.74) Multiple Sequence Alignments

Sequence type explicitly set to Protein
Sequence format is Pearson
Sequence 1: OPS2_DROME 381 aa
Sequence 2: OPS2_DROPS 381 aa
Sequence 3: OPS2_HEMSA 377 aa
Sequence 4: OPS2_LIMPO 376 aa
Sequence 5: OPS2_PATYE 399 aa
Sequence 6: OPS2_SCHGR 380 aa
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 91
Sequences (1:3) Aligned. Score: 37
Sequences (1:4) Aligned. Score: 48
Sequences (1:5) Aligned. Score: 20
Sequences (1:6) Aligned. Score: 32
Sequences (2:3) Aligned. Score: 37
Sequences (2:4) Aligned. Score: 48
Sequences (2:5) Aligned. Score: 22
Sequences (2:6) Aligned. Score: 31
Sequences (3:4) Aligned. Score: 40
Sequences (3:5) Aligned. Score: 23
Sequences (3:6) Aligned. Score: 32
Sequences (4:5) Aligned. Score: 20
Sequences (4:6) Aligned. Score: 34
Sequences (5:6) Aligned. Score: 18
Guide tree file created: [21665C]
Start of Multiple Alignment
There are 5 groups
Aligning...
Group 1: Sequences: 2 Score:6084
Group 2: Sequences: 3 Score:3046
Group 3: Sequences: 4 Score:2772
Group 4: Sequences: 5 Score:2489
Group 5: Delayed
Sequence:5 Score:2819
Alignment Score 11778
GCG-Alignment file created [21665B]

We have aligned ops2 sequences from two fruit fly species, two crab species, locust and scallop. Let's see what emma made of them:

unix % more ops2.aln

>OPS2_DROME
MERSHLPETPFDLAHSGPRFQAQSSGNGSVLD-NVLPDMAHLVNPYWSRFAPMDPMMSKI
LGLFTLAIMIISCCGNGVVVYIFGGTKSLRTPANLLVLNLAFSDFCMMASQSPVMIINFY
Y-ETWVLGPLWCDIYAGCGSLFGCVSIWSMCMIAFDRYNVIVKGINGTPMTIKTSIMKIL
FIWMMAVFWTVMPLIGWSAYVPEGNLTACSIDYMTRMWNPRSYLITYSLFVYYTPLFLIC
YSYWFIIAAVAAHEKAMREQAKKMNVKSLRSSEDCDK-SAEGKLAKVALTTISLWFMAWT
PYLVICYFGLFKIDG-LTPLTTIWGATFAKTSAVYNPIVYGISHPKYRIVLKEKCPMCVF
GNTDEPKPDAPASDTETTSEADSKA-----------------------------------
---------------------------
>OPS2_DROPS
MERSLLPEPPLAMALLGPRFEAQTGGNRSVLD-NVLPDMAPLVNPHWSRFAPMDPTMSKI
LGLFTLVILIISCCGNGVVVYIFGGTKSLRTPANLLVLNLAFSDFCMMASQSPVMIINFY
Y-ETWVLGPLWCDIYAACGSLFGCVSIWSMCMIAFDRYNVIVKGINGTPMTIKTSIMKIA
FIWMMAVFWTIMPLIGWSSYVPEGNLTACSIDYMTRQWNPRSYLITYSLFVYYTPLFMIC
YSYWFIIATVAAHEKAMRDQAKKMNVKSLRSSEDCDK-SAENKLAKVALTTISLWFMAWT
PYLIICYFGLFKIDG-LTPLTTIWGATFAKTSAVYNPIVYGISHPNDRLVLKEKCPMCVC
GTTDEPKPDAPPSDTETTSEAESKD-----------------------------------
---------------------------
>OPS2_LIMPO
----------MANQLSYSSLGWPYQPNASVVD-TMPKEMLYMIHEHWYAFPPMNPLWYSI
LGVAMIILGIICVLGNGMVIYLMMTTKSLRTPTNLLVVNLAFSDFCMMAFMMPTMASNCF
A-ETWILGPFMCEVYGMAGSLFGCASIWSMVMITLDRYNVIVRGMAAAPLTHKKATLLLL
FVWIWSGGWTILPFFGWSRYVPEGNLTSCTVDYLTKDWSSASYVIIYGLAVYFLPLITMI
YCYFFIVHAVAEHEKQLREQAKKMNVASLRANADQQKQSAECRLAKVAMMTVGLWFMAWT
PYLIIAWAGVFSSGTRLTPLATIWGSVFAKANSCYNPIVYGISHPRYKAALYQRFPSLAC
GSGESGSDVKSEASATMTMEEKPKSPEA--------------------------------
---------------------------
>OPS2_HEMSA
---MTNATGPQMAYYGAASMDFGYPEGVSIVD-FVRPEIKPYVHQHWYNYPPVNPMWHYL
LGVIYLFLGTVSIFGNGLVIYLFNKSAALRTPANILVVNLALSDLIMLTTNVPFFTYNCF
SGGVWMFSPQYCEIYACLGAITGVCSIWLLCMISFDRYNIICNGFNGPKLTTGKAVVFAL
ISWVIAIGCALPPFFGWGNYILEGILDSCSYDYLTQDFNTFSYNIFIFVFDYFLPAAIIV
FSYVFIVKAIFAHEAAMRAQAKKMNVSTLRSNEADAQ-RAEIRIAKTALVNVSLWFICWT
PYALISLKGVMGDTSGITPLVSTLPALLAKSCSCYNPFVYAISHPKYRLAITQHLPWFCV
HETETKSNDDSQSNSTVAQDKA--------------------------------------
---------------------------
>OPS2_SCHGR
------MVNTTDFYPVPAAMAYESSVGLPLLGWNVPTEHLDLVHPHWRSFQVPNKYWHFG
LAFVYFMLMCMSSLGNGIVLWIYATTKSIRTPSNMFIVNLALFDVLMLLEMPMLVVSSLF
Y-QRPVGWELGCDIYAALGSVAGIGSAINNAAIAFDRYRTISCPIDGRLTQGQVLALIAG
TWVWTLPFTLMPLLRIWSRFTAEGFLTTCSFDYLTDDEDTKVFVGCIFAWSYAFPLCLIC
CFYYRLIGAVREHEKMLRDQAKKMNVKSLQSNADTEAQSAEIRIAKVALTIFFLFLCSWT
PYAVVAMIGAFGNRAALTPLSTMIPAVTAKIVSCIDPWVYAINHPRFRAEVQKRMKWLHL
GEDARSSKSDTSSTATDRTVGNVSASA---------------------------------
---------------------------
>OPS2_PATYE
---------------------------------------MPFPLNRTDTALVISPSEFRI
IGIFISICCIIGVLGNLLIIIVFAKRRSVRRPINFFVLNLAVSDLIVALLGYPMTAASAF
S-NRWIFDNIGCKIYAFLCFNSGVISIMTHAALSFCRYIIICQYGYRKKITQTTVLRTLF
SIWSFAMFWTLSPLFGWSSYVIEVVPVSCSVNWYGHGLGDVSYTISVIVAVYVFPLSIIV
FSYGMIL-----QEKVCKDSRKNGIRAQQRYTPRFIQ-DIEQRVTFISFLMMAAFMVAWT
PYAIMSALAIGSFNV--ENSFAALPTLFAKASCAYNPFIYAFTNANFRDTVVEIMAPWTT
RRVGVSTLPWPQVTYYPRRRTSAVNTTDIEFPDDNIFIVNSSVNGPTVKREKIVQRNPIN
VRLGIKIEPRDSRAATENTFTADFSVI

The sequences are very similar, but there are some differences - note the gaps that have been inserted. Also note that since this is a global alignment algorithm, gaps have been inserted to make all the sequences the same length.

Differences in alignment can be very difficult to see in this format. The program prettyplot can enhance visualisation of your results, by aligning the sequences on top of one another.

Exercise: `prettyplot`

unix % prettyplot
Displays aligned sequences, with colouring and boxing
Input sequence set: ops2.aln
Graph type [x11]:

A graphic display will appear on your screen detailing your alignment. Identical residues are shown in red, and similar residues in green. This type of display can given you a first impression regions of conservation.

As with all EMBOSS graphical programs you can capture the output in a file rather than just viewing it on screen. The output is controlled by the -graph family of associated qualifiers (type prettyplot -help -verbose to get a complete listing of options.

We will save our pretty plot to a file rhodopsin.ps in colour postscript format. To do this we use -graph cps and -goutfile rhodopsin.

unix % prettyplot ops2.aln -goutfile rhodopsin -graph cps
Displays aligned sequences, with colouring and boxing
Created rhodopsin.ps

This has created a file rhodopsin.ps that can be printed on a postscript printer or turned into a PDF document with ps2pdf (not an EMBOSS program but commonly found on many UNIX/Linux systems). PDF documents can then be viewed with a PDF viewer such as Acrobat Reader.

To adjust the output of prettyplot (e.g to increase the number of residues per line) there are a number of options that can be set. Read the help file and try to plot with/without a consensus, different numbers of residues per line and so on. (hint: prettyplot -help)

Profiles

A very powerful technique for characterizing the putative structure and function of a sequence is the Profile Analysis ( []). Profile analysis is a sequence comparison method for finding and aligning distantly related sequences. The comparison allows a new sequence to be aligned optimally to a family of similar sequences. The comparison uses a scoring matrix and an existing optimal alignment of two or more similar protein sequences. The group or ``family'' of similar sequences are first aligned together to create a multiple sequence alignment. The information in the multiple sequence alignment is then represented as a table of position-specific symbol comparison values and gap penalties. This table is called a profile. The similarity of new sequences to an existing profile can be tested by comparing each new sequence to the profile using a modification of the Smith/Waterman algorithm.

Exercise: `prophecy`

prophecy is an EMBOSS program for creating a profile from a set of multiply aligned sequences. We'll use our ops2 alignment to show you prophecy

unix % prophecy
Creates matrices/profiles from multiple alignments
Input sequence: ops2.aln
Profile type
F : Frequency
G : Gribskov
H : Henikoff
Select type [F]: g
Enter a name for the profile [My matrix]: ops2 sequences
Scoring matrix [Epprofile]:
Gap opening penalty [3.0]:
Gap extension penalty [0.3]:
Output file [outfile.prophecy]: ops2.prophecy

Exercise: `prophet`

Now let's use the profile we just created to align xlrhodop.pep to our opsin2 sequences.

unix % prophet
Gapped alignment for profiles
Input sequence(s): xlrhodop.pep
Profile or matrix file: ops2.prophecy
Gap opening coefficient [1.0]:
Gap extension coefficient [0.1]:
Output file [ops2.prophet]:

unix % more ops2.prophet

Local: Consensus vs OPSD_XENLA
Score: 2189.00

Consensus	1	 M.ERS.HLPEG.PFAAALSGARFAAQSSGN.ASVL..DWNVLP.E 38
			 | : : :  || : ::::: :	   |: |	::|:  :	   | :
OPSD_XENLA	1	 MNG.GTE..EGPN.NFYVP.PMS...SN.NKTGVVRSP.P..PFD 33

Consensus	39	 MAPLVHPHWSRF.APMNPMWHKILGLFTLILGII.SCLG.NGLVI 80
			 :::   :::  : :	 :|: :::|: ::::|:::   |: | :::
OPSD_XENLA	34	 YPQ.Q.QYYL.LAE..EPWQYSALAAYMFLLILLGL.LPINFMTL 72

Consensus	81	 YI.FA.GTKSLRTPANLLVLNLAFSD..FCMMASMSPV.MAINCF 120
			 :: :: ::  ||||	|:::|||:|::  : |:: :::|	| ::::
OPSD_XENLA	73	 FVTIQHKKL.LRTPLNYILLNLVFANHFM.MVLCGFTVTMYTSMH 115

Consensus	121	 YGETWVLGPLGC..D.IYAAL.GSLFGCVSIWSMCMIAFDRYNVI 161
			   : :::|| ||  : ::|:| |   | |::||::::|::|| |:
OPSD_XENLA	116	 G.GYFIFGPTGCYIEGFFATLGG...GEVALWSLVVLAVERYIVV 156

Consensus	162	 VKGINGTPLTIKTAILKALFIWMM.AVFW.TIMPLFGWSRYVPEG 204
			 :|::::	::::: ||: ::|:|:| : :: : :||||||||:|||
OPSD_XENLA	157	 CKPMANFRFGENHAIMGVAFTWIMAL.LSCAAPPLFGWSRYIPEG 200

Consensus	205	 NLTSCSIDYLT.R.DWNPRSYL.ITYFLFV.YFFPLFIICYSY.W 244
			 : :||::||:| : : |: |::	  |:::|	: :||::|:::| :
OPSD_XENLA	201	 MQCSCGVDYYTLKPEVNNESFVIY.YMFIVHFTIPLIVIFFCYGR 244

Consensus	245	 FIIAAVAAHEKAMRDQAKKMNVKSLRSNEDCDKQSAEI.R.LAKV 287
			 :::::	 :|:|:::|:: :  ::::::::	 :   |:	| :: |
OPSD_XENLA	245	 LLCTVK..KEAAAQQQESLT..TTQKAEKE..E...EVTRMVV.V 279

Consensus	288	 ALTTISLWFMAWTPYAIIAY.FGLFGIDGA.LTP.LTT.IWGALF 328
			  :::: :::::|:|||::|: :	:|: :|:	::| ::|	  :|:|
OPSD_XENLA	280	 IMVVF.FFLICWVPYAYVAFYI.IFTHQGSNFGPVFMTVP.PAFF 321

Consensus	329	 AKASSCYNPIVYAISHPKYRA.ALKEKCPMCVCGETD.EPSPDAP 371
			 ||:|::|||::| :	::::|  ::   :: ::||::: :::::::
OPSD_XENLA	322	 AKSSAIYNPVIYIVLNKQFRNCLI...ITTLCCGKNPFGDEDGSS 363

Consensus	372	 QSDATTTSEAAS..KAPAAI.EFPD		       393
			  |:||:::||:|  ::: :: :	|:
OPSD_XENLA	364	 .SAATSKTEASSVSSSQ.QVSP.PA		       385

The vertical bars (|) represent residues that are identical between the ops2 consensus and our rhodopsin, while the colons (:) represent conservative substitutions. We hope you can see that aligning members of a family can reveal conserved regions that may be important for structure and/or function.

Next: Conclusion Up: Introduction to Sequence Analysis Previous: Protein analysis

Gary Williams
2003-04-29