PMDG-1

Astra Charnwood

18 October 1995

10.30 Introduction
Welcome by Dr Phil Marshall

10.45 PRINTS: a database of protein sequence fingerprints
Dr T K Attwood UCL

11.30 Approaches to the modelling of GPCRs
Dr F E Blaney SKB

12.15 Structural instances of low complexity sequence segments
Dr M A S Saqi GLAXO WELLCOME

1.00 LUNCH

2.00 The Birkbeck Principles of Protein Structure Course
Dr A Mills BIRKBECK COLLEGE

2.30 The Internet and Molecules
Dr P Murray-Rust GLAXO WELLCOME

3.00 Discussion

3.30 Close of Meeting
Tea
Tour of Chemistry Building


The PRINTS protein fingerprint database

T.K.Attwood

Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London, London WC1E 6BT, UK

Abstract


PRINTS is a compendium of protein `fingerprints' derived from the OWL database. Fingerprints are groups of motifs within sequence alignments whose conserved nature allows them to be used as signatures of family membership. To date, 400 fingerprints have been deposited in PRINTS, version 9.0 encoding ~2000 motifs, covering a range of globular and membrane proteins, modular polypeptides, and so on. Fingerprinting inherently offers improved diagnostic reliability over single-motif methods by virtue of the mutual context provided by motif neighbours. PRINTS thus provides a useful adjunct to the PROSITE dictionary of patterns. The database is now accessible via the Database Browser on the UCL Bioinformatics server: http://www.biochem.ucl.ac.uk/bsm/dbbrowser.

Introduction

A first step in the analysis of protein sequences is usually to search the primary databases using a pairwise similarity search algorithm (e.g., BLAST [1]). This frequently allows outright identification of the query, but often this is impossible, either because there are no related sequences in the databases, or because the target sequences are only partially similar and the relationship is lost in the `twilight zone' [2]. In such cases, it is important to bring a range of techniques to bear on the analysis in order to improve the chances of making a meaningful identification.

To this end, it is becoming standard practice also to search a range of secondary databases, which distill sequence information into a variety of potent family descriptors (including patterns, profiles, etc.). Of these, regular expression patterns are the easiest to derive, involving the reduction of conserved motifs into single consensus expressions. This is the basis of PROSITE, which has thus become the most comprehensive and widely-used database of its kind, version 12.2 encoding 785 patterns, rules and profiles [3].

A drawback of patterns is their binary nature - i.e., a sequence will either match a pattern or not, regardless of how similar it may be. Thus, more powerful discriminators (profiles) are being incorporated into PROSITE in order to handle the more divergent protein families for which the derivation of patterns is not practicable. Profiles are highly complex descriptors, usually encoding the full sequence and allowing gap insertion in pairwise alignments between profile and target sequence. Such is their complexity that, to date, only 4 profiles have been included in PROSITE.
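The binary behaviour of patterns can be illustrated with a minimal sketch. The pattern and both sequences below are invented for illustration and are not taken from PROSITE; the converter handles only the basic elements of PROSITE-style syntax (x, [..], {..} and repeat counts) and is an assumption, not the software used by either database.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat count such as x(2) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except those listed
        # [ABC] and single letters are already valid regex syntax.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)

# Hypothetical pattern and sequences, for illustration only.
regex = prosite_to_regex("C-x(2)-[ST]-{P}-G")

close_relative = "AACKLSAGDD"   # satisfies every position of the pattern
one_mismatch = "AACKLAAGDD"     # differs at a single position (S -> A)

print(bool(re.search(regex, close_relative)))  # True
print(bool(re.search(regex, one_mismatch)))    # False
```

The second sequence differs from the first at one residue only, yet the pattern reports no match at all; nothing in the result says how close the miss was.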

We use a different pattern recognition method, which is simple to apply. Groups of conserved motifs are excised from alignments and used as fingerprints of family membership. Sequence information is maximised through iterative database scanning, so diagnostic performance increases with each cycle. The advantage of this approach is that residue mismatches are tolerated within motifs, and where a motif is not matched, the framework provided by neighbouring motifs still allows reliable identification. To facilitate sequence analysis and complement the PROSITE pattern/profile resource, we have recently made a range of unique fingerprints available in the PRINTS database [4,5].

Source Database and Methods

The database used to derive fingerprints is OWL [6], a non-redundant composite of SWISS-PROT [7], PIR [8], GenBank (translation) [9] and NRL-3D [10]. Fingerprinting begins with sequence alignment and excision of conserved motifs using SOMAP [11]. The individual motifs are used to dredge OWL using the ADSP analysis package, a suite of procedures for iterative database scanning and hit-list correlation [12]. The scanning algorithm interprets the aligned motifs essentially as a series of frequency matrices - i.e., identity searches are made, with no mutation or other similarity data to weight the results.
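The idea of treating an aligned motif as a frequency matrix can be sketched as follows. This is a minimal illustration under stated assumptions, not the ADSP algorithm: the function names, the toy motif and the normalisation to a percentage of the maximum attainable score are all hypothetical choices.

```python
from collections import Counter

def frequency_matrix(motifs):
    """Build one Counter per alignment column: residue -> number of occurrences."""
    width = len(motifs[0])
    if any(len(m) != width for m in motifs):
        raise ValueError("motif sequences must all have the same length")
    return [Counter(column) for column in zip(*motifs)]

def scan(sequence, matrix):
    """Score every window of the sequence as a percentage of the best possible score.

    Only observed residue counts contribute (identity scoring); no mutation
    or similarity table is used to weight mismatches.
    """
    width = len(matrix)
    best = sum(max(column.values()) for column in matrix)
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        raw = sum(column[residue] for column, residue in zip(matrix, window))
        scores.append(100.0 * raw / best)
    return scores

# Toy aligned motif (hypothetical), three sequences of five residues each.
matrix = frequency_matrix(["GNLLV", "GNALV", "GNLMV"])
scores = scan("AAGNLLVAA", matrix)
print(scores.index(max(scores)), max(scores))
```

Because a residue seen in only some of the aligned sequences still contributes to the score, a window with a mismatch scores lower rather than being rejected outright.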

Applications

Fingerprinting evolved from a study of G-protein-coupled receptors (GPCRs). The vast growth of the rhodopsin-like family (>800 members are now known) created a need for a more reliable analysis method than regular expression pattern searching. Exploiting the fact that the transmembrane (TM) regions are highly conserved, all 7 TM motifs were used to build a characteristic signature.

From an alignment of opsins, the TM regions were excised and used to scan OWL iteratively, resulting in the unambiguous identification of all known rhodopsin-like GPCRs in that version of OWL [13]. In subsequent releases, however, the fingerprint identified numerous partial matches: some of these were fragments, but others were full sequences that failed to match one or more motifs. In particular, the olfactory receptors showed clear differences in TM domains 4 and 6 [14]. This result was important for 2 reasons: (i) the method could clearly identify partial matches (i.e., no information was thrown away); (ii) it could pinpoint the specific elements of the fingerprint that differed, allowing these regions to be selected for later, more detailed analyses.

The power of fingerprints is readily appreciated in relation to plots of their profiles against given query sequences. In a profile, the x-axis denotes the sequence and the y-axis the percent score of each fingerprint element, a peak marking a match between a motif and that sequence. Sharp peaks appearing in a systematic order, above the level of noise, indicate a positive hit. Even when some fingerprint elements are not well-matched, the context provided by their neighbours still allows diagnosis: e.g., the sequence YN84_CAEEL matches the rhodopsin-like GPCR fingerprint even though 2 motifs score only at the level of noise - this relationship is not identified by PROSITE's GPCR pattern because it hinges on one of the poorly-matched TM domains.
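The reasoning behind such a diagnosis, i.e. peaks above noise appearing in the expected order with some elements allowed to be missing, can be sketched in a few lines. The noise threshold, the minimum number of matched motifs and the hit values below are invented for illustration; they are not PRINTS parameters or real results.

```python
def diagnose(hits, threshold=50.0, min_matched=5):
    """Decide whether per-motif best hits look like a fingerprint match.

    hits: list of (position, percent_score), one entry per motif, given in
    the order the motifs occur in the fingerprint.
    """
    # Keep only the motifs whose best peak rises above the assumed noise level.
    matched = [(number, pos)
               for number, (pos, score) in enumerate(hits, start=1)
               if score >= threshold]
    positions = [pos for _, pos in matched]
    # The matched peaks must appear along the sequence in fingerprint order.
    in_order = all(a < b for a, b in zip(positions, positions[1:]))
    matched_numbers = {number for number, _ in matched}
    missing = [n for n in range(1, len(hits) + 1) if n not in matched_numbers]
    return {
        "positive": in_order and len(matched) >= min_matched,
        "matched": len(matched),
        "missing_motifs": missing,
    }

# Hypothetical 7-motif fingerprint in which motifs 3 and 6 score only at noise level.
hits = [(30, 85.0), (65, 78.0), (100, 22.0), (145, 90.0),
        (190, 81.0), (240, 25.0), (280, 88.0)]
result = diagnose(hits)
print(result)
```

The returned list of missing motifs corresponds to the second benefit noted earlier: the elements that failed to match are pinpointed and can be examined separately.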

Database format

An important aspect of PRINTS is the manner in which fingerprints are stored - i.e., as aligned motifs. This means that the alignments themselves may be analysed further. Thus, for the GPCRs, if variability is plotted for each position of the 7 TM motifs, 2 clear structural signals are given: (i) the most conserved positions appear with a periodicity consistent with an alpha-helical arrangement; (ii) the most conserved region of the molecule is at its cytoplasmic end, presumably denoting the location of the ligand-binding site. A wealth of structural information is thus retrievable from PRINTS.
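Because the motifs are stored as alignments, a per-position variability plot of this kind is straightforward to compute. The sketch below uses Shannon entropy as the variability measure, which is an assumption (the measure used in the original analysis is not stated here), and the toy motif is invented.

```python
import math
from collections import Counter

def column_entropy(motifs):
    """Shannon entropy (in bits) of each alignment column; 0.0 means fully conserved."""
    entropies = []
    for column in zip(*motifs):
        total = len(column)
        counts = Counter(column)
        entropies.append(
            sum((n / total) * math.log2(total / n) for n in counts.values())
        )
    return entropies

# Hypothetical aligned motif of four sequences, three residues wide.
entropies = column_entropy(["LAV", "LGV", "LSV", "LAV"])
print(entropies)
```

Plotting these values against position along a TM motif would show any regular spacing of the low-entropy (conserved) positions.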

WWW Access

To provide interactive access to OWL, PRINTS and ALIGN (the compendium of alignments used to create fingerprints), we have launched a Database Browser at http://www.biochem.ucl.ac.uk/bsm/dbbrowser. Facilities are available to interrogate OWL and PRINTS by keyword searching of database code, accession number, text, sequence, etc., or more complex queries can be made using the logical operator functions provided by their query languages.

Perhaps more important is the facility to search PRINTS and PROSITE simultaneously, offering an instant diagnosis of any query sequence: the user either supplies a known database code or cuts and pastes a sequence from a file, and the result is returned as a fingerprint profile.

Where results are of particular interest, the full database entry may be retrieved from PRINTS to discover more about the matched fingerprint. Each entry contains many links to related databases (including PROSITE, BLOCKS [15], ProDom [16], SBASE [17], GCRDb [18], etc.), so further information can be retrieved at the click of a mouse button.

Current contents

Release 9.0 of PRINTS (which is about half the size of PROSITE) contains 400 entries, one third of which do not have PROSITE equivalents. Searching both databases is thus more comprehensive, and in some cases more effective, since an alternative means of analysis is provided where regular expressions fail. The complete contents list is available from the distribution sites and on the PRINTS home page.

Conclusion

Fingerprinting offers a powerful approach to the analysis of protein sequences: it inherently offers improved reliability over single-motif methods by virtue of the mutual context provided by motif neighbours, and it allows rapid and striking visual diagnosis. In creating PRINTS, we recognised the importance of multiple sequence information and, accordingly, results are stored in the form of multiply aligned motifs - these can be the subject of subsequent structure/function analyses, in a manner that is not possible with abstractions of alignments such as patterns, profiles and weight matrices.

Acknowledgements

PRINTS is built and maintained at UCL with support from the Royal Society; it is compiled with assistance from Michael Beck and Kirill Degtyarenko in Leeds. The sequence analysis software was written by David Parry-Smith (Leeds), and the indexing software and query language by Alan Bleasby at Daresbury.

References

1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) JMB, 215, 403-410.
2. Doolittle, R.F. (1985) Sci.Am., 253, 88-99.
3. Bairoch, A. and Bucher, P. (1994) NAR, 22, 3583-3589.
4. Attwood, T.K., Beck, M.E., Bleasby, A.J. and Parry-Smith, D.J. (1994) NAR, 22, 3590-3596.
5. Attwood, T.K. and Beck, M.E. (1994) Prot.Engng, 7, 841-848.
6. Bleasby, A.J., Akrigg, D. and Attwood, T.K. (1994) NAR, 22, 3574-3577.
7. Bairoch, A. and Boeckmann, B. (1994) NAR, 22, 3578-3580.
8. George, D.G., Barker, W.C., Mewes, H.-W., Pfeiffer, F. and Tsugita, A. (1994) NAR, 22, 3569-3573.
9. Benson, D.A., Boguski, M., Lipman, D.J. and Ostell, J. (1994) NAR, 22, 3441-3444.
10. Pattabiraman, N., Namboodiri, K., Lowrey, A. and Gaber, B.P. (1990) Protein Seq. Data Anal., 3, 387-405.
11. Parry-Smith, D.J. and Attwood, T.K. (1991) CABIOS, 7, 233-235.
12. Parry-Smith, D.J. and Attwood, T.K. (1992) CABIOS, 8, 451-459.
13. Attwood, T.K. and Findlay, J.B.C. (1993) Prot.Engng, 6, 167-176.
14. Attwood, T.K. and Findlay, J.B.C. (1994) Prot.Engng, 7, 195-203.
15. Henikoff, S. and Henikoff, J.G. (1991) NAR, 19, 6565-6572.
16. Sonnhammer, E.L.L. and Kahn, D. (1994) Protein Science, 3, 482-492.
17. Pongor, S., Hatsagi, Z., Degtyarenko, K., Fabian, P., Skerl, V., Hegyi, H., Murvai, J. and Bevilacqua, V. (1994) NAR, 22, 3610-3615.
18. Kolakowski, L.F. (1994) Receptors and Channels, 2, 1-7.


Approaches to the modelling of GPCRs

Dr Frank Blaney

SmithKline Beecham Pharmaceuticals
New Frontiers Science Park
Third Avenue
Harlow, Essex, CM19 5AW

The G-protein coupled receptor superfamily comprises a large number of membrane-bound proteins which have been implicated in many neurotransmitter and hormone related disease states. They are therefore of immense importance to the pharmaceutical industry as therapeutic targets. Although no 3-dimensional structure is available from experimental data, there is ample evidence that the transmembrane regions of the GPCRs exist as seven alpha-helical bundles (from hydropathic analysis, sequence conservation and limited homology with the bacteriorhodopsin structure).

A brief review of the published approaches to the construction of GPCR models was presented. These have been based either on a perceived structural or sequence homology with bacteriorhodopsin or on de novo methods of packing the 7-helical bundle. In our own laboratory both approaches have been tried.

An early method based on hydrophobic energy calculations was found to give helical bundles which had extensive cavities, and this was eventually discarded as being unrealistic. Models of neurotransmitter GPCRs based on bacteriorhodopsin have been used for several years to rationalise existing SARs and to design novel ligands. However, the recent low resolution electron diffraction map of bovine rhodopsin has cast doubt on the veracity of these models.

We have therefore in recent years gone back to a de novo method. Using multiple sequence alignments, an algorithm has been developed which allows the graphical depiction of the 7-helical bundle as coded 2D helical wheels. Information is also displayed about the conservation moment (defined by Donnelly) and the exposed hydrophobic surface. These 2D helices are manually packed on the screen and the resulting array can then be converted to a 3D helical bundle. The tilt of the helices can be further adjusted manually by visual maximisation of the hydrophobic potential which is mapped on to the idealised helices.
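The kind of quantity displayed on a helical wheel can be sketched as a vector sum of per-residue values placed at 100 degree intervals (an ideal alpha-helix has 3.6 residues per turn). This is a generic periodicity moment in the style of the hydrophobic moment; the exact definition of the conservation moment used in the work described is not given in this abstract, so the function and the values below are illustrative assumptions.

```python
import math

def helical_moment(values, degrees_per_residue=100.0):
    """Vector sum of per-residue values arranged around a helical wheel.

    Returns (magnitude, direction_in_degrees). The values may be conservation
    scores or hydrophobicities; 100 degrees per residue assumes an ideal
    alpha-helix.
    """
    x = y = 0.0
    for i, value in enumerate(values):
        angle = math.radians(i * degrees_per_residue)
        x += value * math.cos(angle)
        y += value * math.sin(angle)
    return math.hypot(x, y), math.degrees(math.atan2(y, x)) % 360.0

# Hypothetical 18-residue helix with conserved positions clustered on one face.
one_face = [0.0] * 18
for i in (0, 4, 7, 11):
    one_face[i] = 1.0

magnitude, direction = helical_moment(one_face)
uniform_magnitude, _ = helical_moment([1.0] * 18)
print(round(magnitude, 2), round(direction, 1), round(uniform_magnitude, 6))
```

A large moment indicates that the scored property is concentrated on one face of the helix, and its direction suggests which face that is; evenly distributed values over whole turns give a moment close to zero.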

These techniques have been built into a general computer program for the rapid construction of GPCR models based on templates. This was briefly described.


Structural Instances of Low Complexity Sequence Segments

Dr Mansoor Saqi

Bioinformatics Group
Dept. of Biomolecular Structure
Glaxo Medicines Research Centre
Gunnels Wood Road,
Stevenage, Herts, SG1 2NY.

mass15599@ggr.co.uk
+44 (0)1438 763231

Amino acid sequence databases contain many low complexity, compositionally biased sequence segments, but only a limited number of relatively short instances of these segments occur in proteins of known structure. An analysis is presented of structural instances of low complexity sequence segments in the Brookhaven protein databank with regard to preferences for sequence composition, secondary structure conformation and local atomic environment. The complexity varies almost linearly with segment length, reflecting the absence of very long low complexity segments in the structural database. The low complexity segments that are identified are not disordered and have temperature factors that are generally the same as the rest of the protein. It is observed that these segments are predominantly exposed and either helical or coil in excess of what would be expected by chance. Secondary structure prediction methods perform well in correctly predicting those low complexity segments which are helix but poorly in correctly predicting segments that are strand.
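One common way to locate compositionally biased segments is to slide a window along the sequence and flag windows whose residue composition has low Shannon entropy. The sketch below illustrates that general idea only; the abstract does not state which complexity measure was used, and the window length, cutoff and test sequence here are hypothetical.

```python
import math
from collections import Counter

def window_complexity(segment):
    """Shannon entropy (bits) of the residue composition of a segment."""
    n = len(segment)
    return sum((c / n) * math.log2(n / c) for c in Counter(segment).values())

def low_complexity_segments(sequence, window=12, cutoff=2.2):
    """Return merged half-open (start, end) ranges covered by low-entropy windows."""
    segments = []
    for start in range(len(sequence) - window + 1):
        if window_complexity(sequence[start:start + window]) <= cutoff:
            end = start + window
            if segments and start <= segments[-1][1]:
                # Overlaps or touches the previous range: extend it.
                segments[-1] = (segments[-1][0], end)
            else:
                segments.append((start, end))
    return segments

# Invented sequence: a glutamine-rich stretch flanked by diverse residues.
sequence = "MKTLVDERWYHC" + "QQQQQQAQQQQQ" + "GFNPIRSTVLKE"
segments = low_complexity_segments(sequence)
print(segments)
```

Each reported range could then be mapped onto a known structure to examine its secondary structure, solvent exposure and temperature factors.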


Principles of Protein Structure - An Internet-based Course


Alan Mills

Venus Internet at Birkbeck College
email: alan@venus.bbk.ac.uk
Tel: 0171 631 6810 Fax: 0171 924 1266

Alan Mills of Birkbeck College's Crystallography Department (and latterly Venus Internet Ltd) described the distance education course on The Principles of Protein Structure that he coordinated together with visiting Professor Peter Murray-Rust of Glaxo-Wellcome during the first half of 1995. This experiment in collaborative, open teaching, undertaken together with the Globewide Network Academy, employed several innovative technologies, with all the course material delivered on the World Wide Web, and most interaction taking place "virtually" using email and discussion lists. Following on-line registration, about 250 participants from 25 countries were enrolled, and about 70 Certificates of Participation were awarded at the end of the course in June. It is believed to be the first global multimedia distance education course.

MIME-activated RasMol and MAGE molecular viewer software was automagically invoked to encourage students to manipulate molecular structures on their screens. A collaborative hyper-glossary was developed as the course progressed. The course also used the BioMOO virtual meeting place run from a computer at the Weizmann Institute. The whole enterprise was conducted in public cyberspace.

The course is being run again as an enhanced formally accredited Advanced Certificate Course from January 1996, and PMDG members are invited to involve themselves as consultants (email offers to pps2@www.cryst.bbk.ac.uk).

See URL http://www.cryst.bbk.ac.uk/PPS2/index.html

Attention of members was also drawn to another virtual course that is being planned by Peter Murray-Rust and John Overington. Structure-Based Drug Design is to be sponsored by pharmaceutical companies in the UK, and inquiries are invited.


Molecules and Internet

Dr Peter Murray-Rust

GLAXO WELLCOME
Medicines Research Centre
Gunnels Wood Road
Stevenage
Herts SG1 2NY

It is proposed to develop a distance-education short course on the World Wide Web & Internet on the subject of Structure-Based Drug Design, for the benefit of, and with the participation of, UK pharmaceutical and biochemical research communities, both commercial and academic. The primary focus of the Course will be to equip industrial organisations with the expertise and training required to successfully implement structure-based drug design projects. The course will be run from late Spring '96.

This 3-month course will cover many of the topics and advances in this rapidly developing interface between the disciplines of Pharmacy, Chemistry, Computing, Bioinformatics, Biochemistry and Structural Biology. Prospective students will be expected to have a basic grounding in some of these areas and have access to the Internet. Interactivity will be built in, with active on-line collaboration and participation being the norm. The course will be coordinated and run from Birkbeck College. The whole thing will be run on the Internet and World Wide Web.

The purpose of this initial announcement is to attract further interest during these planning stages, so that the coordination may begin. If you or your company / organisation might wish to be involved, or you simply wish to receive further postings relating to this Course, please contact Peter Murray-Rust at Glaxo Wellcome or John Overington at Pfizer.
