The local physical environment of an amino acid in a folded polypeptide chain primarily determines the acceptability of mutations at that site. The physical restraints have been characterised by the statistical analysis of a large database of aligned homologous structures. The solvent accessibility of a residue was found to be the primary determinant of exchange, and this was particularly acute for buried polar residues, which can be considered as 'kinetically trapped' in the core of the protein.
The applications of these data are then considered, with particular emphasis placed on improving current comparative modelling techniques. The data can improve the discrimination of very distantly related sequences given a structure as a template for database searching, and the substitution matrices are amongst the most sensitive available, irrespective of any additional structural information. When structural data is added to the alignment process, significantly more accurate sequence-structure alignments are obtained. The alignment of the 'modelled sequence' to the 'template structure' is one of the key stages of any comparative modelling study, since errors will never be corrected by subsequent model refinement.
A simplified geometry model of proteins and peptides has been developed for ab initio protein folding studies based on the C-alpha virtual bond approximation.
This model has been parameterised using a variety of techniques to develop a suitable force field approximation for protein folding. The current model and potential are capable of sustaining a native protein fold at 300 K for over half a million steps (2.5 nanoseconds). We have now developed a new molecular dynamics method, Selectively Enhanced Molecular Dynamics (SEMD), which moves energy into the low frequency conformational modes of motion using real-time filtering techniques.
This has been tested on all-atom peptides and has been shown to generate a wide range of different conformations compared to standard high temperature dynamics conformational searching. This method is now being applied to ab initio protein folding using the model and potentials derived above.
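As an illustration of the C-alpha virtual bond representation mentioned above, the following minimal sketch computes the two internal coordinates that such reduced models typically work with: the virtual bond angle between three consecutive C-alpha atoms and the virtual dihedral between four. The function names and the coordinate layout (plain 3-tuples) are assumptions made for this example and are not taken from the model described in the abstract.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _scale(a, s):
    return (a[0] * s, a[1] * s, a[2] * s)

def virtual_bond_angle(ca1, ca2, ca3):
    """Angle in degrees at ca2 between the virtual bonds ca2-ca1 and ca2-ca3."""
    u = _sub(ca1, ca2)
    v = _sub(ca3, ca2)
    cos_theta = _dot(u, v) / math.sqrt(_dot(u, u) * _dot(v, v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))

def virtual_dihedral(ca1, ca2, ca3, ca4):
    """Dihedral in degrees about the ca2-ca3 virtual bond, in (-180, 180]."""
    b0 = _sub(ca1, ca2)
    b1 = _sub(ca3, ca2)
    b2 = _sub(ca4, ca3)
    b1 = _scale(b1, 1.0 / math.sqrt(_dot(b1, b1)))
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, _scale(b1, _dot(b0, b1)))
    w = _sub(b2, _scale(b1, _dot(b2, b1)))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))
```

A chain of N residues is then described by N-2 virtual bond angles and N-3 virtual dihedrals, which is what makes the representation attractive for long folding simulations.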
A strategy is presented for protein fold recognition from secondary structure assignments. The method, MAP, can detect similarities between protein folds in the absence of sequence similarity. MAP first determines all matches (maps) between a query string of secondary structures and the secondary structures of protein domains of known 3D structure. The maps are then passed through a series of structural filters to remove those that do not obey simple rules of protein structure. The surviving maps are ranked by scores from the alignment of predicted and experimental accessibilities. Searches made with secondary structure assignments for a test set of eleven fold-families show a significant improvement over THREADER in the ability to place a correct fold in the first rank, with comparable sequence-to-structure alignment accuracy. Searches performed with published secondary structure predictions, and making use of experimental information, show how the method can be used with human insight to provide accurate predictions of protein folds.
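The first two stages described above (enumerating maps, then filtering them) can be sketched as follows. This is only an illustrative toy, not the MAP implementation: secondary structure elements are reduced to single letters ('H' for helix, 'E' for strand), a map is taken to be any order-preserving assignment of query elements to domain elements of the same type, and the `max_skipped` filter is a hypothetical stand-in for the structural rules, which the abstract does not specify.

```python
from itertools import combinations

def enumerate_maps(query, domain):
    """Return every order-preserving map of query elements onto domain elements.

    Each map is a tuple of domain indices, one per query element, such that
    the element types agree. Exhaustive enumeration is exponential in the
    worst case and is used here only for clarity.
    """
    return [
        indices
        for indices in combinations(range(len(domain)), len(query))
        if all(domain[j] == q for j, q in zip(indices, query))
    ]

def filter_maps(maps, max_skipped):
    """Hypothetical filter: drop maps that skip too many domain elements
    between the first and last mapped element."""
    kept = []
    for indices in maps:
        span = indices[-1] - indices[0] + 1
        if span - len(indices) <= max_skipped:
            kept.append(indices)
    return kept

# Example: a helix-strand query against a helix-strand-helix-strand domain.
maps = enumerate_maps("HE", "HEHE")
compact = filter_maps(maps, max_skipped=1)
```

The surviving maps would then be ranked by an accessibility-based score, which is omitted here because the abstract gives no details of the scoring function.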
A common problem in the conversion of molecular data file formats is the annotation of amino acid and nucleic acid residues not explicitly represented in 'small molecule' file formats describing only element type and 3D co-ordinates or atomic connectivity. This problem has limited the interoperability between chemical information processing programs and has led to the situation where molecular graphics programs currently treat the same molecule differently depending upon the file format that it is stored in. An algorithm has been developed to rapidly identify polypeptides and nucleic acids from simple connectivity that can assign standard atom names, residue names, residue numbers and chain identifiers to each atom. It can also be used to assign bond orders if only simple connectivity is known. One of the features of the developed algorithm is a very efficient method for identifying a sidechain from a set of rooted graphs, which has running time linear in the number of atoms of the sidechain. Because this method is independent of the size of the monomer set, it has obvious applications in the field of combinatorial chemistry and chemical subgraph matching.
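One way to picture sidechain identification from connectivity alone is sketched below. It is not the algorithm described in the abstract, whose details are not given; it substitutes a simple breadth-first "layer signature" (the sorted heavy-atom elements at each bond distance from C-beta) and looks that signature up in a hash table, so the lookup cost does not grow with the number of known monomers. The names `layer_signature`, `identify_sidechain` and `KNOWN_SIDECHAINS` are hypothetical, the table covers only a few residues, and a layer signature is not a full canonical form, so distinct graphs could in principle collide.

```python
def layer_signature(adjacency, elements, root, blocked):
    """Breadth-first walk from `root` that never enters `blocked` atoms.

    Returns a tuple with one entry per bond distance from the root, each
    entry being the sorted tuple of element symbols found at that distance.
    Every atom and bond is visited once; only the small per-layer sort is
    not strictly linear.
    """
    seen = set(blocked)
    seen.add(root)
    frontier = [root]
    layers = []
    while frontier:
        layers.append(tuple(sorted(elements[a] for a in frontier)))
        next_frontier = []
        for atom in frontier:
            for neighbour in adjacency[atom]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return tuple(layers)

# Heavy-atom signatures rooted at C-beta for a few standard sidechains.
KNOWN_SIDECHAINS = {
    (("C",),): "ALA",
    (("C",), ("O",)): "SER",
    (("C",), ("C", "C")): "VAL",
    (("C",), ("C",), ("C", "C")): "LEU",
    (("C",), ("C", "C"), ("C",)): "ILE",
    (("C",), ("C",), ("O", "O")): "ASP",
    (("C",), ("C",), ("N", "O")): "ASN",
    (("C",), ("C",), ("C", "C"), ("C", "C"), ("C",)): "PHE",
}

def identify_sidechain(adjacency, elements, c_beta, c_alpha):
    """Return a residue name for the sidechain rooted at c_beta, or None."""
    signature = layer_signature(adjacency, elements, c_beta, [c_alpha])
    return KNOWN_SIDECHAINS.get(signature)

# Leucine heavy atoms: 0 = CA, 1 = CB, 2 = CG, 3 = CD1, 4 = CD2.
leu_adjacency = {0: [1], 1: [0, 2], 2: [1, 3, 4], 3: [2], 4: [2]}
leu_elements = {i: "C" for i in range(5)}
leu_name = identify_sidechain(leu_adjacency, leu_elements, c_beta=1, c_alpha=0)
```

Ring closures such as the phenyl ring are handled by the `seen` set, which ensures the closing atom is counted once.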
Glycolate oxidase is one of the enzymes involved in the photorespiration cycle in plants. Photorespiration is known to be an essential metabolic pathway: inhibitors of other photorespiration enzymes are known to be phytotoxic, so it is reasonable to believe that inhibition of glycolate oxidase might also have a herbicidal effect.
In 1989, the crystal structure of glycolate oxidase isolated from spinach was published by Lindqvist and Branden. This was one of the first enzymes of potential agrochemical interest to have its structure solved. Work to design inhibitors using this structure began at Jealott's Hill soon after it became available in 1990.
Initial work focused on assessing the suitability of the structure for ligand design work, and using the structure to rationalise how the substrate and existing inhibitors of the enzyme might bind. This work enabled us to postulate the key interaction sites that needed to be satisfied for ligand binding. A series of structurally diverse molecules were then designed by manually docking candidate structures into the active site using a graphics terminal, and refining the structures as necessary. Some of the more synthetically tractable designs arising from this process were then synthesised, and tested for in vitro activity on an enzyme assay of spinach glycolate oxidase. Two of these designs were found to be active on the enzyme assay with sub-micromolar Ki values - better than the best previously known glycolate oxidase inhibitors. Further optimisation of the designs yielded a compound with a Ki value of 15 nM.
At this point in the project, a collaboration was set up with Ylva Lindqvist at Uppsala University to try to co-crystallise the designed inhibitors with glycolate oxidase. This at first proved to be problematic, but eventually a crystal structure of a protein-inhibitor complex was solved. This structure reveals that the inhibitor does indeed make the interactions with the protein envisaged at the design stage, but interestingly, some movement of the protein residues has occurred to allow other favourable interactions to take place.
Although herbicidal interest in glycolate oxidase inhibitors has now ceased at Jealott's Hill, we are still collaborating with Ylva Lindqvist's group in an attempt to produce other inhibitor complexes of this enzyme.
The proliferation of all kinds of DNA and protein sequence information, both in public domain and proprietary sequence databases, has led to a new perspective on the approach best used to evaluate the data. Traditionally, only fully sequenced and validated cDNAs or genomic DNAs were thought fit to deposit in databases such as Genbank, and the responsibility for the quality of the data rested with the author. Now rapid gene discovery approaches have led to the deposition of ~300,000 short cDNA sequences that act as similarity tags for many of the genes that are expressed in various cells and tissues in a number of organisms. These data are single-pass sequence data that are not error corrected. Some of the issues involved in dealing with these data are discussed during the talk.
The use of the fingerprinting technique (exemplified by the PRINTS database [1]) has been extended to the analysis of ESTs at Pfizer. A fingerprint can be defined as a set of position-based weight matrices, generated from sequence alignment data only, that can be used to predict the function of an unknown sequence. ESTs, being typically short (<150 residues in translation), present special problems in interpretation, which will be the principal focus of this presentation.
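To make the definition above concrete, the sketch below builds one position-based weight matrix per aligned motif and scans a translated sequence for the best-scoring window of each. It is a generic log-odds construction under assumptions chosen for this example (a uniform background, a pseudocount of 1, and the function names `build_pwm`, `best_hit` and `fingerprint_hits`); it is not the PRINTS scoring scheme. A motif wider than the sequence simply yields no hit, which is one way the shortness of ESTs shows up in practice.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_pwm(aligned, pseudocount=1.0):
    """Build a log-odds weight matrix from equal-length aligned motif segments."""
    background = 1.0 / len(ALPHABET)  # assumed uniform background frequency
    pwm = []
    for pos in range(len(aligned[0])):
        column = [seq[pos] for seq in aligned]
        total = len(column) + pseudocount * len(ALPHABET)
        pwm.append({
            aa: math.log(((column.count(aa) + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pwm

def best_hit(pwm, sequence):
    """Return (score, start) of the best window, or None if the sequence is too short."""
    width = len(pwm)
    best = None
    for start in range(len(sequence) - width + 1):
        # Residues outside the alphabet (e.g. 'X') contribute a neutral 0.0.
        score = sum(pwm[i].get(sequence[start + i], 0.0) for i in range(width))
        if best is None or score > best[0]:
            best = (score, start)
    return best

def fingerprint_hits(fingerprint, sequence):
    """Scan a sequence with every matrix of a fingerprint, in motif order."""
    return [best_hit(pwm, sequence) for pwm in fingerprint]

# A toy fingerprint made of two motifs.
fingerprint = [
    build_pwm(["GDSL", "GDSL", "GDTL"]),
    build_pwm(["HWKR", "HWKR", "HFKR"]),
]
hits = fingerprint_hits(fingerprint, "AAAGDSLAAAAHWKRAA")
```

A full match would require all motifs to score well and to appear in the expected order; a short fragment may contain only a subset, so partial matches have to be interpreted with care.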
1. Attwood, T.K., Beck, M.E., Bleasby, A.J., Degtyarenko, K. and Parry-Smith, D.J. (1996) Nucleic Acids Research, 24, 182-188.