Modeling  of  loops  in  protein  structures


                      Andras Fiser, Richard Kinh Gian Do and Andrej Sali


                                 Laboratories of Molecular Biophysics


                 Pels Family Center for Biochemistry and Structural Biology


                                         The Rockefeller University


                             1230 York Ave, New York, NY 10021, USA


                                         Correspondence to Andrej Sali


                                    tel: (212) 327 7550; fax: (212) 327 7540


                                            e-mail: sali@rockefeller.edu


                           Running title: Modeling of loops in protein structures


                         Manuscript length: 60 pages incl. 2 tables and 13 figures


                 Electronic version: http://guitar.rockefeller.edu/ftp/protsci/


                  Word processing: The manuscript is prepared in LaTex under Linux


           Abbreviations used: 3D, three-dimensional; DRMS, distance root-mean-square;


NMR, nuclear magnetic resonance; PDB, Protein Data Bank; RMSD, root-mean-square deviation.


                                                   May 25, 2000


                                                           1


                                                       Abstract


       Comparative protein structure prediction is limited mostly by the errors in alignment and

   loop modeling.  We describe here a new automated modeling technique that significantly im-

   proves the accuracy of loop predictions in protein structures. The positions of all non-hydrogen

   atoms of the loop are optimized in a fixed environment with respect to a pseudo energy func-

   tion.  The energy is a sum of many spatial restraints that include the bond length, bond angle,

   and improper dihedral angle terms from the Charmm      -22 force field, statistical preferences for

   the mainchain and sidechain dihedral angles, and statistical preferences for non-bonded atomic

   contacts that depend on the two atom types, their distance through space, and separation in

   sequence.  The energy function is optimized with the method of conjugate gradients combined

   with molecular dynamics and simulated annealing.  Typically, the predicted loop conformation

   corresponds to the lowest energy conformation among 500 independent optimizations.  Predic-

   tions were made for 40 loops of known structure at each length from 1 to 14 residues.  The

   accuracy of loop predictions is evaluated as a function of thoroughness of conformational sam-

   pling, loop length, and structural properties of native loops.  When accuracy is measured by

   local superposition of the model on the native loop, 100, 90 and 30% of 4, 8 and 12-residue

   loop predictions, respectively, had less than 2A  RMSD     error for the mainchain N, C ff, C, and

   O atoms; the average accuracies were 0:59  0:05, 1:16  0:10 and 2:61  0:16 A , respectively.

   To simulate real comparative modeling problems, the method was also evaluated by predict-

   ing loops of known structure in only approximately correct environments with errors typical

   of comparative modeling without misalignment.  When the RMSD     distortion of the mainchain

   stem atoms is 2.5A , the average loop prediction error increased by 180, 25 and 3% for 4, 8 and

   12-residue loops, respectively.  The accuracy of the lowest energy prediction for a given loop

   can be estimated from the structural variability among a number of low energy predictions.

   The relative value of the present method is gauged by (i) comparing it with one of the most

   successful previously described methods, and (ii) describing its accuracy in recent blind predic-

   tions of protein structure. Finally, it is shown that the average accuracy of prediction is limited

   primarily by the accuracy of the energy function rather than by the extent of conformational

   sampling.


Keywords: loop modeling / comparative or homology protein structure modeling


                                                            1


Introduction


Functional characterization of a protein sequence is one of the most frequent and challenging prob-


lems in biology.  This task is usually facilitated by accurate three-dimensional (3D) structures of


the studied protein and corresponding ligand complexes.  In the absence of an experimentally de-


termined structure, comparative or homology modeling can sometimes provide a useful 3D model


for a protein (target) that is related to at least one known protein structure (template) (Browne


et al., 1969; Blundell et al., 1987; Marti-Renom et al., 2000).


     Comparative modeling is limited in functional studies by its ability to predict accurately struc-


tural determinants of protein function, as well as by the conformational changes induced by ligand


binding (Wang et al., 1999).  In the absence of an induced fit, the function of a protein is generally


determined by shape, dynamics and physiochemical properties of its solvent exposed molecular sur-


face.  Likewise, functional differences between the members of the same protein family are usually


a  consequence  of  the  structural  differences  on  the  protein  surface.  In  a  given  fold  family,  struc-


tural variability is a result of substitutions, insertions and deletions of residues between members


of the family.  Such changes frequently correspond to exposed loop regions that connect elements


of secondary structure in the protein fold.  Thus, loops often determine the functional specificity of


a given protein framework.  They contribute to active and binding sites.  Examples include bind-


ing of metal ions by metal-binding proteins (Lu & Valentine, 1997), small protein toxins by their


receptors (Wu & Dean, 1996), antigens by immunoglobulins (Bajorath & Sheriff, 1996), mononu-


cleotides  by  a  variety  of  proteins  (Kinoshita  et  al.,  1999),  protein  substrates  by  serine  proteases


(Perona & Craik, 1995), and DNA by DNA-binding proteins (Jones et  al., 1999).  Consequently,


the accuracy of loop modeling is a major factor determining the usefulness of comparative models


in studying interactions between the protein and its ligands.  This includes the use of models for


recognizing ligand binding sites (Jones & Thornton, 1997; Fetrow et al., 1998; Russell et al., 1998;


Kleywegt, 1999; Wei et al., 1999; Kasuya & Thornton, 1999) and for ligand docking computations


(Kick  et  al.,  1997).  Unfortunately,  as  was  concluded  at  the  meetings  on  Critical  Assessment  of


Techniques for Protein Structure Prediction (CASP), no generally reliable method is available for


constructing loops longer than 5 residues (Mosimann et  al.,  1995;  Martin et  al.,  1997),  although


recently some progress has been made (van Vlijmen & Karplus,  1997;  Rufino et  al.,  1997;  Oliva


et al., 1997; Samudrala & Moult, 1998; Rapp & Friesner, 1999).


                                                                 2


     The impact of an accurate loop modeling method would be large.  Currently, approximately 40%


of all protein sequences can have at least one domain modeled on a related, known protein structure


(Rychlewski et al., 1998; Huynen et al., 1998; Jones, 1999;aSnchez & Sali, 1999).  At least two thirds


of the comparative modeling cases are based on less than 40% sequence identity between the target


and the templates, and thus generally require loop modeling (Sanchez & Sali, 1998).  Since there


are over 500,000 protein sequences deposited in GENBANK and only approximately 12,000 protein


structures in the Protein Data Bank (PDB) (Abola et al., 1987; http://www.rcsb.org/pdb), the


number of proteins whose structure can be modeled by comparative modeling is more than an order


of magnitude larger than the number of currently known protein structures (Sali, 1998).  This gap


is likely to increase because the genome sequencing projects are producing a few hundred thousand


protein sequences each year, while only a few thousand of them have their structures determined


by X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy.


     Loop modeling can be seen as a mini protein folding problem.  The correct conformation of a


given segment of a polypeptide chain has to be calculated mainly from the sequence of the segment


itself.  However,  loops  are  generally  too  short  to  provide  sufficient  information  about  their  local


fold.  Segments  of  up  to  9  residues  sometimes  have  entirely  unrelated  conformations  in  different


proteins (Sander & Schneider, 1991; Cohen et al., 1993; Mezei, 1998).  Thus, the conformation of a


given segment is also influenced by the core stem regions that span the loop and by the structure


of the rest of a protein that cradles the loop.


     Many  loop  modeling  procedures  have  been  described.   Similarly  to  the  prediction  of  whole


protein structures, there are both the ab initio    methods (Moult & James, 1986; Fine et al., 1986;


Bruccoleri  &  Karplus,  1987)  and  the  database  search  techniques  (Greer,  1980;  Jones  &  Thirup,


1986;  Chothia & Lesk,  1987).  There are also procedures that combine the two basic approaches


(Chothia et al., 1986; Martin et al., 1989; Mas et al., 1992; van Vlijmen & Karplus, 1997).


     The ab initio loop prediction is based on a conformational search or enumeration of conforma-


tions in a given environment, guided by a scoring or energy function.  There are many such methods,


exploiting different protein representations, energy function terms, and optimization or enumera-


tion algorithms.  The search algorithms include sampling of mainchain dihedral angles biased by


their  distributions  in  known  protein  structures  (Moult  &  James,  1986),  minimum  perturbation


random  tweak  method  (Fine  et  al.,  1986;  Shenkin  et  al.,  1987;  Smith  &  Honig,  1994),  system-


                                                                 3


atic conformational search (Bruccoleri & Karplus, 1987; Bruccoleri et al., 1988; Bruccoleri, 1993;


Brower et al., 1993), global energy minimization by mapping a trajectory of local minima (Dudek


&  Scheraga,  1990;  Dudek  et  al.,  1998),  importance-sampling  by  local  minimization  of  randomly


generated  conformations  (Lambert  &  Scheraga,  1989a;  Lambert  &  Scheraga,  1989b;  Lambert  &


Scheraga, 1989c), local energy minimization (Mattos et al., 1994), molecular dynamics simulations


(Bruccoleri  &  Karplus,  1990;  Tanner  et  al.,  1992;  Rao  &  Teeter,  1993;  Nakajima  et  al.,  2000),


genetic  algorithms  (Ring  &  Cohen,  1994;  McGarrah  &  Judson,  1993),  biased  probability  Monte


Carlo search (Abagyan & Totrov, 1994; Evans et al., 1995; Thanki et al., 1997), Monte Carlo with


simulated annealing (Higo et al., 1992; Collura et al., 1993; Carlacci & Englander, 1993; Vasmatzis


et al., 1994; Carlacci & Englander, 1996), Monte Carlo and molecular dynamics (Rapp & Friesner,


1999), extended-scaled-collective-variable Monte Carlo (Kidera, 1995), scaling relaxation and mul-


tiple copy sampling (Zheng et al., 1993a; Zheng et al., 1993b; Rosenfeld et al., 1993; Zheng & Kyle,


1994; Zheng et al., 1994; Rosenbach & Rosenfeld, 1995; Zheng & Kyle, 1996), searching through


discrete conformations by dynamic programming (Vajda & DeLisi, 1990; Finkelstein & Reva, 1992),


random sampling of conformations relying on dimers from known protein structures (Sudarsanam


et al., 1995), self-consistent field optimization (Koehl & Delarue, 1995), and an enumeration based


on the graph theory (Samudrala & Moult, 1998).  A variety of representations were used, such as


unified atoms, all non-hydrogen atoms, non-hydrogen and "polar" hydrogen atoms, all atoms, as


well as implicit and explicit solvent models.  The optimized degrees of freedom include Cartesian


coordinates and internal coordinates, such as dihedral angles, optimized in continuous or discrete


spaces.


     The second, database approach to loop prediction consists of finding a segment of mainchain


that fits the two stem regions of a loop (Greer, 1980;  Jones & Thirup, 1986;  Cohen et  al., 1986;


Chothia & Lesk, 1987; Tramontano et al., 1989; Chothia et al., 1989; Summers & Karplus, 1990;


Levitt,  1992;  Tramontano  &  Lesk,  1992;  Topham  et  al.,  1993;  Koehl  &  Delarue,  1995;  Reczko


et al., 1995; Lessel & Schomburg, 1994; Fechteler et al., 1995; Mandal et al., 1996; Wintjens et al.,


1996;  Kwasigroch  et  al.,  1996;  Martin  &  Thornton,  1996;  Pellequer  &  Chen,  1997;  Oliva  et  al.,


1997; Donate et al., 1996; Rufino et al., 1997; Debnath, 1997; Wojcik et al., 1999; Shepherd et al.,


1999; Deane & Blundell, 2000).  The stems are defined as the mainchain atoms that precede and


follow the loop,  but are not part of it.  They span the loop and are part of the core of the fold.


                                                                 4


The search is performed through a database of many known protein structures, not only homologs


of  the  modeled  protein.   Usually,  many  different  alternative  segments  that  fit  the  stem  residues


are  obtained,  and  possibly  sorted  according  to  geometric  criteria  or  sequence  similarity  between


the template and target loop sequences.  The selected segments are then superposed and annealed


on the stem regions.  These initial crude models are often refined by optimization of some energy


function.


     The database search approach to loop modeling is accurate and efficient when a specific set of


loops is created to address the modeling of that class of loops, such as fi-hairpins (Sibanda et al.,


1989)  and  the  hypervariable  regions  in  immunoglobulins  (Chothia  &  Lesk,  1987;  Chothia  et  al.,


1989).  For immunoglobulins,  an analysis of the hypervariable regions in known immunoglobulin


structures resulted in rules with high prediction accuracy for other members of the family.  These


rules are possible because of the relatively small number of conformations for each loop and because


of the dependence of loop conformation on loop length and certain key residues.  The accuracy of


the approach was demonstrated by a blind, validated prediction of most BR96 antibody residues


involved in antigen binding (Bajorath & Sheriff, 1996).  There are attempts to classify loop confor-


mations into more general categories, thus extending the impressive performance of the key residues


approach  to  more  cases  (Ring  et  al.,  1992;  Rufino  et  al.,  1997;  Oliva  et  al.,  1997;  Wojcik  et  al.,


1999).


     The database methods are limited by the exponential increase in the number of geometrically


possible  conformations  as  a  function  of  loop  length.   Consequently,  only  segments  of  7  residues


or  less  had  most  of  their  conceivable  conformations  present  in  the  database  of  known  protein


structures (Fidelis et al., 1994).  In contrast, 8 and 9-residue segments occurred more than once in


less than 70% and 40% of the cases, respectively.  These estimates depend strongly on the criteria


for selecting matching conformations.  When slightly stricter criteria are used, only segments of up


to 4 residues have most of their conformations defined in a database of known protein structures


(Lessel & Schomburg,  1994).  This limits the applicability of the database search methods.  The


limitation is made worse by the requirement for an overlap between at least one residue in each


stem and the database segment used for loop modeling.  Thus, the completeness of the database for


7-residue segments allows the modeling of only up to 5-residue insertions (Claessens et al., 1989).


While only few insertions in a family of homologous proteins are longer than 9 residues, there are


                                                                 5


many  insertions  that  are  longer  than  5  residues  (Pascarella  &  Argos,  1992;  Benner  et  al.,  1993;


Flores et al., 1993).


     The  problem  of  database  completeness  has  recently  been  ameliorated  by  restrained  energy


minimization  of  the  candidate  loops  obtained  from  a  database  search  (van  Vlijmen  &  Karplus,


1997).  Both  the  internal  conformation  and  global  orientation  relative  to  the  rest  of  the  protein


were  optimized.   It  was  concluded  that  the  candidate  segments  from  a  database  were  suitable


starting points for modeling loops up to 9 residues long, but extensive optimization was required


for loops longer than 4 residues.


     In this paper,  we take the optimization-based approach to loop modeling.  The main reasons


are the generality and conceptual simplicity of energy minimization, as well as the limitations on


the database approach imposed by a relatively small number of known protein structures.  Loop


prediction by optimization is in principle applicable to simultaneous modeling of several loops and


loops  interacting  with  ligands,  which  is  not  straightforward  for  the  database  search  approaches.


Also,  the  optimization  approach  in  principle  allows  for  an  improvement  based  on  the  physics  of


protein structure, rather than on the growth of the database.  Moreover, even the database approach


requires  both  a  scoring  function  to  sift  through  the  many  alternative  loop  conformations  fitting


the  stems  (Tramontano  &  Lesk,  1992)  and  an  optimization  procedure  for  relaxing  the  annealed


database  segments.   Thus,  loop  prediction  may  as  well  rely  solely  on  optimization  of  an  energy


function, without any dependence on loop segments from a database.


     We  describe  and  extensively  evaluate  a  loop  modeling  protocol  that  optimizes  the  positions


of all non-hydrogen atoms of a loop in a fixed environment.  The optimization relies on conjugate


gradients and molecular dynamics with simulated annealing.  The optimized pseudo energy function


is  a  sum  of  many  terms,  including  some  terms  from  the  Charmm      -22  molecular  mechanics  force


field (MacKerell et al., 1998) and spatial restraints based on distributions of distances (Sippl, 1990)


and dihedral angles (Cheng et al., 1996) in known protein structures.  The paper is organized as


follows.  In Theory and Algorithms, the technique is described in detail, as are the loops and criteria


selected for testing the method.  The Results and Discussion begin by evaluating the accuracy of


loop predictions as a function of thoroughness of conformational sampling, loop length, environment


distortion,  and  structural  properties  of  native  loops.  Evaluation  is  based  on  the  modeling  of  40


loops  of  known  structure  at  each  length  from  1  to  14  residues.   A  way  to  predict  the  accuracy


                                                                 6


of  the  best  loop  prediction  is  also  described.  This  is  followed  by  an  evaluation  of  the  modeling


method by (i) comparing its predictions with those by a recently published, extensively evaluated


and powerful loop modeling method (van Vlijmen & Karplus, 1997), and (ii) by summarizing its


performance in recent blind predictions of protein structure at CASP3.  Next, it is showed that the


method is limited mostly by the accuracy of the energy function rather than the thoroughness of


the optimizer.  Finally, the implications for future work are discussed (Conclusion).


Theory  and  Algorithms


Modeling  of  a  loop


The method for modeling a loop in a given environment is described here by specifying its three


main components:  (i) the representation of a protein, (ii) the restraints that define the objective or


"energy" function, and (iii) the method for optimizing the energy function.  The modeling method is


entirely automated and is implemented in the program Modeller       -5 (URL http://guitar.rocke-


feller.edu).  While the most frequent application of the method is to predict single loops, it is


also  technically  suitable  for  modeling  any  set  of  contiguous  or  non-contiguous  residues  or  atoms


(e.g. , several loops, a loop with a ligand, a cluster of sidechains) in the fixed environment created


by the rest of the protein.


Representation of a protein


The protein is represented by all non-hydrogen atoms.  An explicit treatment of all hydrogen atoms


as encoded in Charmm        (MacKerell et al., 1998) was also tested, but did not improve prediction


accuracy in our hands.  No explicit solvent molecules or ligands are included in general, although


they  could  be  added  in  special  cases.   The  degrees  of  freedom  in  model  optimization  are  the


Cartesian coordinates of the loop atoms.  The loop atoms "feel" the other atoms in the protein,


but the atoms in the "environment" of the loop do not move during optimization.


Energy function


The main aim was to maximize the accuracy of loop prediction, not to describe the physics of loop


structures in proteins.  Thus, the emphasis during development of the "energy" or scoring function


                                                                 7


was  on  statistical  preferences  of  atoms  for  different  geometries  as  obtained  from  the  database


of  known  protein  structures,  rather  than  on  a  reductionist  model  of  physical  interactions.   The


stereochemical features (i.e., chemical bonds, bond angles, etc ) are captured through the use of the


Charmm        molecular mechanics force field (MacKerell et  al.,  1998).  The non-bonded interactions


and solvation are approximated by a statistical potential of mean force for pairs of protein atoms


(Sippl,  1990).  In  addition,  the  accuracy  of  the  scoring  function  is  improved  by  using  statistical


preferences for the mainchain and sidechain dihedral angles (Sali & Blundell, 1993).  The formalism


for  combining  and  using  these  diverse  types  of  information  from  both  physics  and  statistics  is


provided by protein structure modeling by satisfaction of spatial restraints, where each individual


energy term or a statistical preference is represented by a conditional probability density function


for a restrained spatial feature, such as a distance between two atoms (Sali & Blundell, 1993).


     The energy function for loop modeling is a sum of simple restraints or pseudo-energy terms,


each one of which depends on a distance, angle, dihedral angle, improper dihedral angle, or a pair


of  dihedral  angles  defined  by  two,  three,  four,  or  eight  atoms.   Many  combinations  of  different


restraint types were evaluated for their performance in loop modeling.  The best energy function so


far is described in detail below.  The energy function is


                         X                     X                         X
              F    =           kb(b - b)2  +         kff(ff - ff)2  +            |kOE| - bOEcos (nOE + ffi)  +

                        bonds                angles                   dihedrals


                            X                        X                        X
                                    ki( -  )2  -              ln ps(O=R)  -           ln p  (=R)  -
                        impropers                 sidechain                 residues
                                                   torsions


                           X                            X
                                  ln pm  (; =R)  +                   E(a; a0; d; i) + S(r; r0; d)                               (1)

                        residues                    non-bonded
                                                    atom pairs


where b is a bond length, ff is a covalent bond angle, OE is a dihedral angle other than the mainchain


, ,  and sidechain O dihedral angles,  is an improper dihedral angle, and R is the residue type.


For the non-bonded atom pairs,  is a scaling factor (usually 1), a and a0 are the types of the atoms


in the pair, d is the distance between them, i is the difference between the corresponding residue


indices, and r  and r0 are the atomic van der Waals radii (Sali & Blundell, 1993).  The sums run


                                                                 8


over all bonds, angles, dihedral angles, improper dihedral angles, and non-bonded distances that


involve at least one of the loop atoms.  The non-bonded atom pairs at a distance larger than 4A


are ignored.  The force constants kx  and mean values x  for bonds b, angles ff, dihedral angles OE,


and improper dihedral angles  were obtained from the May 1993 version of the Charmm       -22 force


field  (Brooks  et  al.,  1983;  MacKerell  et  al.,  1998),  as  were  the  phase  shift  ffi  and  the  periodicity


parameter n for dihedral angles OE.  The improper dihedral angles are used to restrain the planarity


of peptide bonds and sidechain rings, as well as the chirality of chiral and pro-chiral centers (e.g. ,


C ff atoms of all residues but Gly, C fi atoms of Val and Thr).


     The probability distributions for all sidechain dihedral angles (up to four per residue), ps(O=R),


depend on the residue type (Ponder & Richards, 1987) and were obtained from a non-redundant


set of known protein structures (Sali & Blundell, 1993).  They are modeled by a weighted sum of


Gaussian functions as described previously (Eq. 26 and Table 5 in Sali & Blundell, 1993):


                                             X            1              1   (O; Oi )
                                   ps(O=R) =       !i _________p_exp  [- __( ____________)2 ]                                   (2)
                                               i      oeb   2ss          2       oei

                                                                                        P
where !i is the probability that the restrained sidechain dihedral angle is in class i (   i !i = 1) and


(x; y) is the shortest path around the 360O circle from angle x to angle y.  Most sidechain dihedral


angles can be in up to three classes, depending on the residue type.  Some residues, however, have


a smaller number of possible classes; for example, O2  in His has only two classes.


     Similarly,  the restraints on the mainchain dihedral angle ,  p  ,  were represented by a single


Gaussian function centered on 180O  with the standard deviation of 5O .  The cis -proline states have


not been modeled.  In the set of 40 test loops of 8 residues, there are 2 and 13 cis - and trans  -proline


residues, respectively.


     For each loop residue, the restraint on the mainchain dihedral angles  and  is


                                                          Xm             Xm                  1
                                             pm  (; =R) =      !i pi  =       !i ____________________________q_.                (3)
                                                          i=1            i=1     2ssoe;i  oe;i    (1 - ae2i)


            (              "                                                                                     #  )
                    1        1 - cos ( - i  )            sin( - i  )     sin( - i  )       1 - cos ( - i  )
       exp    - ___________  _____________________ - aei ________________________________+ ______________________
                (1 - ae2i)           oe2;i                    oe;i            oe;i                 oe2;i


where m is the number of mainchain conformation classes in the Ramachandran plot (Figure 1a),


! is the weight of the corresponding conformation class (Figure 1b), the bar indicates the average


                                                                 9


  and    values,  oe  is  the  standard  deviation,  and  ae  is  the  correlation  coefficient  between    and


  (-1    ae    1).   The  parameters  !,    ,    ,  oe,  and  ae  were  obtained  from  a  representative  set


of  1,000  protein  structures  that  shared  less  than  60%  sequence  identity  to  each  other  and  were


determined by X-ray crystallography at resolution of 2.3A   or better.  The total number of residues


was 217,807.  The Ramachandran plot spanned by the  and  dihedral angles (Ramachandran


et  al.,  1963)  was  divided  into  5O x 5O  bins.  The  frequency  of  residues  in  each  bin  was  obtained


separately for each of the 20 standard residue types.  The peaks and valleys in the Ramachandran


plots  guided  the  partitioning  of  each  plot  into  a  few  conformation  classes  (2-5  classes),  again


separately for each residue type; the residue types differ mostly in the boundary between the two


"fi"  classes  with  the  negative    values  and  positive    values.  The  weights  !  correspond  to  the


relative frequencies of each residue in each of its classes.  Next,  the analytic model in Eq. 3 was


fitted to the Ramachandran plots by a least-squares method (Press et al., 1992), resulting into the


optimal  values  for    ,    ,  oe,  and  ae.  A  comparison  between  the  Ramachandran  plot  and  a  fitted


model for four representative residue types is shown in Figure 2.  The values of all parameters can


be obtained from the Modeller         library files mnch.lib, mnch1.lib, and af_mnchdef.lib.


     The first part of the non-bonded term, E, is taken from Melo & Feytmans, 1997, where it was


derived as described by Sippl, 1990.  E  is an atomistic, distance-dependent statistical potential of


mean  force.  Atoms  in  amino  acid  residues  are  classified  into  one  of  40  atom  type  groups.  The


potential was obtained from those atom pairs in known protein structures that were separated by


11 or more residues in sequence.  The original potential in the histogram form (Melo & Feytmans,


1997) was converted into cubic splines (Press et  al., 1992) to allow the use of first derivatives in


optimization (Sali et al., 1999).  The second part of the non-bonded term, S, is a harmonic lower


bound on non-bonded atom-atom distances (Eq. 22 in Sali & Blundell, 1993):


                                              8
                                              ><  k               0     2                 0
                                                   n [d - fl(r + r )]      ;   d  fl(r + r )
                                S(r; r0; d) = >                                                                                 (4)
                                              :   0 ;                          d > fl(r + r0)


where d is the distance between the two atoms, kn  is the force constant (usually 59 kcal/mol), and


fl is a constant (usually 0.83).  S is used to compensate for noise in E at short distances of less than


3A  .


                                                                 10


Optimization of the energy function


Optimization begins by generating an initial loop conformation (Figure 3).  The atoms of the loop


are positioned with uniform spacing on the line that connects the mainchain carbonyl oxygen and


amide  nitrogen  atoms  of  the  N-  and  C-terminal  anchor  regions,  respectively.   Next,  the  atomic


positions are randomized by adding a random number distributed uniformly from -5 to 5A  to each


of the Cartesian coordinates.  One loop prediction consists of optimizing independently a number


of  such  randomized  initial  structures,  and  picking  as  the  final  model  the  conformation  that  has


the lowest value of the energy function.  A good compromise between efficiency and performance is


achieved by 50-500 independent optimizations (Results and Discussion).


     It is tempting to generate initial conformations for loop atoms that are more protein-like and


not random.  However, the accuracy of the loop models obtained from random initial conformations


is limited mostly by the accuracy of the scoring function, not the power of the optimizer (Figure 12).


Thus, starting a loop prediction with random positions for loop atoms does not decrease the final


accuracy of the loop models.  Nevertheless, more realistic starting conformations may increase the


efficiency of the loop modeling method by decreasing the need for exhaustive optimization of each


starting conformation.


     The procedure for optimizing a single initial conformation begins with a conjugate gradients


minimization, continues with molecular dynamics with simulated annealing, and finishes by conju-


gate gradients again (Figure 3).  The details about the optimization schedule can be found in the


Modeller         file  __loop.top.  Briefly,  the  first  conjugate  gradients  phase  is  designed  to  relax  the


system and consists of five successive minimizations of up to 200 steps each, gradually increasing


the scaling factors   for the non-bonded restraints from 0,  0.01,  0.1,  0.5,  to 1.0,  respectively.  In


this phase, the atoms are allowed to pass very near each other without having to surmount large


energy barriers.  This stage is followed by a relatively fast heating up of the system consisting of


200  4  fs  steps  of  "molecular  dynamics"  at  150,  250,  400,  700,  and  1000  K.  The  heating  stage  is


followed by the main optimization stage that consists of gradual cooling by molecular dynamics of


600 4 fs steps at 1000, 800, 600, 500, 400, and 300 K. Finally, the optimization is completed by a


conjugate gradients relaxation consisting of up to 1000 steps.  There are in fact two cycles of the


conjugate gradients, molecular dynamics with simulated annealing, and conjugate gradients phases:


In the first cycle, only those non-bonded atom pairs are considered that contain loop atoms alone


                                                                 11


(i.e., the loop does not "feel" its environment).  In the second cycle, the atom pairs that contain


up to one environment atom are also included in the energy function (i.e., the loop does "feel" its


environment).  It was found empirically that neglecting the environment in the first cycle results in


lower final energy values than including the environment from the beginning of the optimization.


     500 independent optimizations of an 8-residue loop takes from 8 to 30 hours of CPU time on


an R10000-190 SGI workstation.  Predictions for many loops under many different conditions were


performed.  This was made possible by running the computations in parallel on a cluster of SGI and


PC Linux computers.  An efficient and robust use of the cluster of processors was made possible


by the program CLUSTOR (TurboLinux, San Francisco; http://www.turbolinux.com).  With 30


processors,  a  single  8-residue  loop  prediction  involving  500  independent  optimizations  takes  less


than one hour.


Test  loop  sets


Protein structures that share less than 60% sequence identity to each other and were determined


by  X-ray  crystallography  at  resolution  of  2.0A   or  better  were  extracted  from  PDB.  Helices  and


strands were assigned by the DSSP program (Kabsch & Sander, 1983).  The regions outside helices


and  strands  were  defined  as  loops  (Figure  4).   These  loops  were  filtered  to  obtain  14  test  sets


of  loops.  Each  test  set  contained  40  loops  of  the  same  length,  spanning  the  range  from  1  to  14


residues.  The test sets were obtained by random selection applying the following criteria:  (i) no


two loops in the test set for a given length are from the same structure; (ii) there are no overlaps


between any two loops from any two sets;  (iii) the N- and C-termini are not used as test loops;


and (iv) also excluded are loops that span more than 9A   between their terminal C  ffatoms.  The


last criterion was applied to maximize the number of geometrically feasible conformations for each


test loop (Chan & Dill, 1989), thus maximizing the difficulty of loop modeling.  The distribution of


the end-to-end C ff-C ff distance for all 8-residue loops is approximately Gaussian with the mean of


13A   and standard deviation of 4A  .  Only 1% of 8-residue loop-loop alignments have more than 2


positions with the same residue type in both loops in our test set.


                                                                 12


Criteria  for  accuracy  of  loop  predictions


There will generally be a wide spread in the accuracy of the predictions for different loops.  Thus,


it is necessary to evaluate the accuracy of a method by testing it on many different loops.  We use


test sets of 40 loops for each length from 1 to 14 residues, resulting in 560 test loops in total.  The


average and standard deviation of the accuracy for 40 test loops of the same length are the most


frequently used measures of the method accuracy in this paper.


     The accuracy of a single loop prediction is evaluated by comparing it with the native conforma-


tion.  A large variety of reasonable criteria for comparing loop conformations exist.  They include


RMSD      and DRMS      measures (Levitt, 1983) for different sets of atoms, such as C  ff, mainchain, and


all atoms.  The RMSD      error can be calculated from the superposition of the whole structures ex-


cluding the loop ("global" superposition) or from the superposition of the compared loop atoms


only ("local" superposition).  In addition to Cartesian coordinates, dihedral angles, dihedral angle


classes, and mainchain conformation classes can also be compared.  It is not practical to use all of


these criteria.  Fortunately, it is also unnecessary because there is a statistical correlation between


reasonable loop comparison measures (see below).  Thus, a statistical description of the accuracy


by  one  measure  and  the  statistical  relationship  between  that  measure  and  all  other  interesting


measures provides a description of the accuracy for all the measures.


     We calculate RMSD      for Cartesian coordinates only, and annotate the symbol by subscripts and


arguments for exact definition,  RMSD    part;atom_types (part; atom_types     ).  The two subscripts indicate


the part of the protein structure and the atom types that were used for least-squares superposition,


and the two arguments indicate the part of the structure and the atom types that were compared


to calculate the RMSD     .  For example, the "global" RMSD      for the loop mainchain atoms (N, C  ff,


C,  O)  after  superposition  of  the  mainchain  atoms  in  the  stem  residues  on  each  side  of  the  loop


(usually  three  residues)  is  indicated  by  RMSD    global   =  RMSD    stem;NCffCO   (loop; NCffCO     ).  The  "lo-


cal" RMSD      for the mainchain loop atoms superposed on themselves is indicated by RMSD     local  =


RMSD     loop;NCffCO  (loop; NCffCO     ).  Distortion  of  the  two  stems  is  indicated  by  the  global  and  local


superposition of the mainchain atoms in the stems, RMSD     core;NCffCO  (stem; NCffCO      ) and


RMSD     stem;NCffCO  (stem; NCffCO      ), respectively.  "Core" refers to the whole protein excluding the loop,


the stems and the distorted environment.  RMSD     global   for a loop in the native structure is obvi-


ously the same for the superposition of either the stem atoms, core atoms, or all non-loop atoms.


                                                                 13


However, in the case of evaluating loop models in the context of approximate structures, the choice


of the "core" influences the RMSD     global   value.  In this paper, the superposition of the loop stem


residues is used to obtain RMSD     global  .


     The  RMSD    local   was  picked  as  the  primary  measure  of  loop  accuracy.  Its  correlation  with  a


number of other measures is strong, corresponding to Pearson correlation coefficients larger than 0.8


(Figure 5).  The RMSD     local for local superposition of all atoms in an 8-residue loop is approximately


twice that for the superposition of the mainchain atoms only.  The RMSD     global   for the mainchain


atoms is about 1.5 times the RMSD     local .  The DRMS      for the mainchain atoms is approximately


0.8  times  the  RMSD    local .   The  RMSD    local   for  C ff atoms  is  almost  the  same  as  that  for  all  the


mainchain  atoms.  The  RMSD    local   for  N,  C ff,  and  C  is  always  slightly  smaller  than  that  for  all


the mainchain atoms.  These correlations were obtained from a comparison of the loop modeling


predictions with the corresponding native loop structures.  Thus, the correlations can be used to


asses  the  present  method  (for  8-residue  long  loops)  by  all  the  criteria  plotted  in  Figure  5,  even


when only the RMSD     local  numbers are available.


     Loop  modeling  errors  can  be  approximately  deconvoluted  into  two  contributions:  (i)  errors


in conformation and (ii) errors in orientation of the loop relative to the rest of the protein (van


Vlijmen & Karplus, 1997; Martin et al., 1997).  While RMSD     local  is a good measure of the accuracy


of conformation,  it does not depend on the relative orientation of the loop.  On the other hand,


RMSD     global  depends on both the conformational and orientational accuracies.  However, the choice


of the superposed atoms before the RMSD     global   calculation is generally arbitrary.  For example,


when a small domain containing the loop is shifted relative to the rest of the protein, the RMSD     global


of the loop will indicate a large error even when the loop is modeled perfectly in the context of the


small domain.  For this reason, RMSD     local  was chosen as the primary evaluation measure, although


the main evaluation results are also reported using RMSD     global  (Figures 8-11, Table 2).  In practical


terms,  for the present method applied to 8-residues long loops in their native environments,  the


RMSD     global   is approximately 1.5 times the RMSD     local .


     Another useful characterization of accuracy of a method is the fraction of loop predictions that


fall  in  the  good,  medium  and  bad  accuracy  class  (Figure  6).  A  good  prediction  has  RMSD    local


smaller than 1A  , a bad prediction has RMSD     local  larger than 2A  , and a medium prediction falls


between  these  two  extremes.   Examples  of  an  8-residue  loop  model  in  each  of  the  three  classes


                                                                 14


are shown in Figure 6.  In the good class, the mainchain carbonyl oxygen atoms almost invariably


point in the correct direction and most of the residue mainchain conformation classes are predicted


correctly.   The  error  approaches  that  in  the  medium  resolution  X-ray  analysis.   In  the  medium


class,  there  are  occasional  flips  of  the  mainchain  oxygen  atoms  as  well  as  errors  in  the  residue


mainchain conformation classes.  However, the error is not larger than the dynamic fluctuations of


most loops at room temperature (Fushman et al., 1997).  Both good and medium loop predictions


are informative when using comparative protein structure models.


Data  deposition


The  modeling  program  Modeller        as  well  as  the  list  of  the  loops  in  each  of  the  test  sets  are


available  at  the  URL  http://guitar.rockefeller.edu/.   The  test  sets  contain  40  loops  each.


There are 14 test sets corresponding to the loop lengths from 1 to 14 residues.


Results  and  Discussion


In  Theory  and  Algorithms,  we  describe  a  technique  for  modeling  of  loops  in  protein  structures.


The method predicts the positions of all non-hydrogen atoms of a given loop in a fixed environment


by optimizing a scoring or "energy" function.  Many different energy functions and optimization


schedules were explored.  The current energy function contains terms from a molecular mechanics


force field as well as restraints based on statistical distributions derived from known protein struc-


tures (Sippl, 1990; Cheng et al., 1996).  Bonds, angles, some dihedral angles, and improper dihedral


angles are restrained by the corresponding terms in the Charmm       -22 potential function (MacKerell


et al., 1998).  The mainchain and sidechain dihedral angles as well as non-bonded atom pairs are


restrained by statistical potentials.  The energy function is optimized by a combination of conjugate


gradients and molecular dynamics with simulated annealing.


     This section begins by a description of the average accuracy of the method as a function of degree


of conformational sampling, loop length from 1 to 14 residues, and properties of the native loop, such


as the mobility and compactness.  To simulate real comparative modeling problems, the method was


also evaluated by predicting loops of known structure in only approximately correct environments


with errors typical of comparative modeling without misalignment.  Next, the accuracy of the lowest


energy prediction for a given loop is estimated from the structural variability among a number of


                                                                 15


low  energy  predictions.  The  relative  value  of  the  present  method  is  gauged  by  (i)  comparing  it


with one of the most successful previously described methods, and (ii) describing its accuracy in


recent  blind  predictions  of  protein  structures.  The  latter  evaluation  also  revealed  limitations  in


practical loop modeling, originating from problems other than those of the loop modeling method


itself.  Finally, to indicate future developments, it is shown that the average accuracy of prediction is


limited primarily by the accuracy of the energy function rather than by the extent of conformational


sampling.


Accuracy  of  loop  predictions  as  a  function  of  thoroughness  of  conformational


sampling


As described in Theory and Algorithms, loop modeling consists of independent energy optimizations


of many random initial loop conformations.  The final loop prediction is the optimized conformation


that  has  the  lowest  energy  among  all  the  independent  loop  optimizations.   The  scatter  plots  of


the energy and the RMSD     local  error (Theory and Algorithms) for independent optimizations in a


successful  and  unsuccessful  prediction  are  shown  in  Figures  7a  and  b,  respectively.  The  success


corresponds to a strong correlation between the energy and RMSD     .  In such cases, most of the low


energy predictions are accurate.  On the other hand, the failure corresponds to a lack of a positive


correlation between the energy and RMSD     .  Most of the low energy predictions are different not


only from the native loop, but also from each other.  Even if a geometrically good conformation is


encountered during sampling, its energy is high and is thus not predicted to be the native loop.


     An important methodological consideration is the amount of conformational sampling performed


for each loop prediction.  This is directly proportional to the number of independent loop optimiza-


tions.  The larger the number of loop optimizations, the better is the average accuracy (Figure 8).


For 4-residue loops, improvement beyond 50 independent optimizations is negligible (Figure 8a).


On the other hand, for 8-residue loops, the average RMSD     local  error decreases from 1:40  0:12A


to 1:16  0:10A   when the number of independent optimizations is increased from 50 to 500.  The


average accuracy of loops longer than approximately 6 residues is likely to improve marginally even


beyond 500 independent optimizations (Figures 8b and c).  To be able to sample the 14 test sets of


40 loops each, it was necessary to limit most subsequent evaluations of the method accuracy to 50


independent optimizations.


                                                                 16


Accuracy  of  loop  predictions  as  a  function  of  loop  length


The  difficulty  of  the  loop  modeling  problem  increases  with  loop  length.  For  longer  loops,  there


are more incorrect conformations that increase the demands on the optimizer to generate a good


model and on the energy function to identify it.  The average RMSD     local  error and its standard


deviation rise almost linearly with loop length (Figure 9).  For 50 independent optimizations, the


average RMSD     local  error is 0.59, 1.40, and 2.96A   for 4, 8, and 12-residue loops, respectively.  All


predictions of 4-residue loops fall into either the good or medium accuracy class (Table 1, Figure 4).


For 8-residue loops, 50, 40, and 10% of models are in the good, medium, and bad class, respectively,


when 500 independent optimizations are performed.  For 12-residue loops, only 30% of predictions


are in the good or medium class when 500 independent optimizations are performed.  In conclusion,


it  makes  sense  to  model  even  loops  12  residues  long,  if  the  environment  error  is  small  and  30%


chance of obtaining either a good or a medium loop model is acceptable.


     During the course of this project, the optimization method and the energy function were im-


proved iteratively by relying on the test set of 40 8-residue loops.  Thus, it could be that the final


reported results for these loops are misleadingly favorable.  To check for such bias, additional 40


8-residue  loops  were  selected  as  described  in  Theory  and  Algorithms  and  predicted  by  the  final


version of the method only.  The average accuracy for 50 independent optimizations differed from


the  first  test  set  only  in  the  second  decimal  place.  Also,  the  accuracy  for  8-residue  loops  is  not


an outlier relative to other lengths (Figure 9).  Further confidence in the statistical robustness of


the presented results is provided by comparing our method with another method (van Vlijmen &


Karplus, 1997) and by the accuracy of the loop models submitted to CASP3 (see below).  In both


cases, the accuracy of the method is consistent with the evaluation based on the test loops.


Accuracy  of  loop  predictions  as  a  function  of  environment  distortion


In real comparative modeling when the native structure of the target sequence is not known, loops


are  not  modeled  in  the  perfectly  correct  environment.   At  best,  only  an  approximately  correct


environment  is  available.   This  complication  in  loop  modeling  mimics  the  situation  in  sidechain


modeling,  which  needs  to  be  performed  on  an  approximate,  not  exact  backbone.  In  the  case  of


sidechain modeling, the accuracy of the predicted sidechain packing drops rapidly when the core


backbone distortion increases beyond 1A   (Chung & Subbiah,  1996).  Thus,  it was expected that


                                                                 17


the  average  accuracy  of  the  loop  modeling  method  would  also  be  worse  than  indicated  by  the


evaluations in the native environment described in the preceding section.  To quantify the impact


of  the  environment  errors  on  the  accuracy  of  loop  modeling,  we  prepared  test  sets  for  4,  8,  and


12-residue  loops  in  distorted  environments.   Each  set  contained  the  40  test  loops  in  5  distorted


environments each.  The environment of a loop is defined in this section to include six stem residues


on each side of the loop as well as all the atoms that are within 15A   of at least one loop or stem


atom  in  the  native  structure.   The  size  of  the  distortion  RMSD    core;NCffCO  (stem; NCffCO      )  was  up


to almost 3A   (Fig 10).  Environment errors of less than 3A   are typical of comparative models of


exposed regions based on alignments without errors.  The test sets were obtained by subjecting the


environment atoms in the native structure to a molecular dynamics simulation at 600 K guided by


the energy function in Eq. 1 and restrained by the rest of the native protein structure.


     For 8-residue loops in an environment with the stem mainchain RMSD      distortion of 2.5A  , the


average error of loop prediction increases for 25% from 1.40A   to 1.75A  , as measured by the least-


squares  line  through  the  data  points  (Figure  10).   Nevertheless,  75%  of  the  predictions  remain


in  the  good  and  medium  classes,  even  when  only  50  independent  optimizations  are  performed


(Table  1,  Figure  4).   The  impact  of  environment  errors  is  smaller  for  long  loops  than  it  is  for


short loops, presumably because the prediction accuracy for long loops is already low in the native


environment.  For example, for 12-residue loops, almost no decrease in prediction accuracy occurs


with environment distortion.  On the other hand, for 4-residue loops, the increase in the error is


large, from 0.43A   to 1.21A  .  However, all the 4-residue loop predictions are still in the good and


medium accuracy classes, presumably due to a relatively small number of different conformations


for such short loops.


     The  negative  impact  of  the  environment  errors  on  the  loop  accuracy  might  be  decreased  by


optimizing the positions of the environment atoms as well as the loop atoms.  However, our pre-


liminary attempts to do so have not improved the average accuracy of the loop models (data not


shown).  It appears that the optimization algorithm and/or the energy function are overwhelmed by


the large number of degrees of freedom corresponding to the loop combined with its environment.


For example, even when a short distance of 6A   is used to define the loop environment, there are


from 17 to 36 environment residues for the 40 8-residue loops.  This might correspond in difficulty


at  least  to  the  standard  25  residue  "loop"  modeling  problem  (8  +  17),  far  beyond  the  range  of


                                                                 18


reliable performance for any modeling method (Table 1, Figure 4).


Accuracy  of  loop  predictions  as  a  function  of  loop  properties


Loops  usually  occur  on  the  surface  of  a  protein  globule  and  have  relatively  few  contacts  with


the  rest  of  the  fold.   Consequently,  some  of  the  loops  are  structurally  least  well  defined  parts


of  the  protein  mainchain.  Conformational  heterogeneity  of  loops  includes  multiple  local  minima


with  rare  transitions  between  the  minima  (static  disorder)  as  well  as  large  fluctuations  around


a  single  minimum  (dynamic  disorder).  Both  of  these  phenomena  are  reflected  in  relatively  high


crystallographic  isotropic  temperature  factors,  Biso.   The  atomic  Biso  are  generally  determined


accurately in protein structures refined at resolution of 2A   or better, such as the structures from


which the test loops were collected.  It was expected that a loop with a high average Biso will tend


to be predicted less accurately than a loop with a low average Biso.  However, no such correlation


is observed in a comparison of the normalized average Biso  values for different loops as a function


of the prediction accuracy (Figure 11a).  One possible explanation is that most of the loops with


high  average  temperature  factors  fluctuate  around  their  equilibrium  conformations,  not  between


several  different  local  minima.   Thus,  the  average  structure  determined  by  crystallography  and


the  equilibrium  structure  approximated  by  the  prediction  will  tend  to  correspond  to  the  same


conformation, eliminating the correlation between prediction accuracy and high mobility.


     There is also no correlation between the prediction error and the number of atomic contacts


within the loop (Figure 11b).  The compact loops are predicted equally well as the non-compact


loops.  This might indicate problems with the non-bonded terms in the energy function.


     It was expected that the prediction accuracy will increase with the number of atomic contacts


between the loop atoms and the environment in the native conformation, due to the moulding of


the loop by the fixed environment.  However, there is almost no such correlation (Figure 11c).  This


indicates problems with the non-bonded terms in the energy function, resulting in relatively little


useful information provided by the environment.  Another explanation, not mutually exclusive with


the first one, is that the internal mainchain preferences of the loop residues tend to be consistent


with the conformation preferred by the environment (G"o & Abe, 1984).


     The  number  of  contacts  between  the  loop  atoms  and  the  neighboring  molecules  in  the  crys-


tal  also  does  not  have  a  significant  impact  on  the  prediction  accuracy  (Figure  11c).  This  was  a


                                                                 19


surprise:  Loops with a higher number of inter-molecular contacts were expected to be predicted


less accurately because the inter-molecular contacts are completely ignored in loop modeling.  It


appears that the native loop conformations tend to be consistent with the inter-molecular packing


(G"o & Abe, 1984).


Estimating  the  accuracy  of  a  loop  prediction


In  the  absence  of  a  perfect  loop  modeling  method,  it  is  useful  to  have  an  estimate  of  the  error


of  a  given  loop  prediction.   Methods  for  detecting  errors  in  protein  structure  models  have  been


reviewed (Sanchez & Sali, 1997b).  One popular approach is based on energy profiles (L"uthy et al.,


1992; Sippl, 1993).  In this approach, a region is predicted to be in error when its energy is above a


certain cutoff.  This rule is not expected to work well in the case of our loop modeling method since


the best available energy function is optimized to obtain the model in the first place.  There is an


additional concern about the theoretical validity of the energy profiles for detecting regional errors


in models.  It is likely that the contributions of the individual residues to the overall free energy


of  folding  vary  widely,  even  when  normalized  by  the  number  of  atoms  or  interactions  made.   If


this is correct, the correlation between the prediction errors and energy peaks is greatly weakened,


resulting in the loss of predictive power of the energy profile.  Despite these concerns, error profiles


have been useful in some applications (Sanchez & Sali, 1997a; Guenther et al., 1997).


     The following rational is used here to estimate an error of a given loop prediction.  We take a


case of a loop with one dominant native conformation.  In such a case, there are degrees of freedom


for which the true energy function has a deep, global free energy minimum, in addition to many


local minima.  The global free energy minimum is the native conformation.  The energy function


and degrees of freedom that are explored in modeling are only an approximation of the true energy


function and its degrees of freedom.  However, it is plausible that the more pronounced is the free


energy minimum in the modeling function, the less likely it is that the errors in the function moved


the minimum away from that in the true energy function.  In other words, a given fractional error


in the energy surface may not move a deep minimum, while it is likely to move a shallow minimum.


The many independently optimized loop conformations in a single loop prediction make it possible


to  estimate  how  pronounced  the  lowest  free  energy  minimum  is.   If  the  free  energy  surface  has


multiple comparable minima without a dominant minimum, the loop modeling method will result


                                                                 20


in multiple, significantly different conformations.  When there is a dominant free energy minimum,


the loop modeling method produces similar low energy conformations if the optimizer works well.


Conversely, the more similar are the lowest energy conformations, the more pronounced must the


corresponding minimum be, and the less likely it is that the best prediction has a large error.  The


more different are the lowest energy conformations, the more rugged is the modeling energy surface


and the less confidence one has in the lowest energy solution.


     This idea was tested by analyzing the "ensembles" of 500 independent optimizations of 40 8-


residue loops.  The structural variation among the lowest energy solutions was defined as the average


RMSD     local  for all pairs of the 20 lowest energy conformations, out of the 500 conformations in total.


Similar results are obtained when the best 5 to 50 conformations are used (data not shown).  The


scatter  plot  of  the  RMSD    local   error  of  the  lowest  energy  prediction  and  the  structural  variation


of  the  lowest  energy  solutions  is  shown  for  the  40  test  loops  in  Figure  12.  When  the  structural


variation among the lowest energy models is low, the error of the lowest energy prediction is indeed


small.  When the variation is large,  the error can be either small or large.  In this case,  it is not


possible to distinguish between the failure of the optimizer, errors in the energy function, or a truly


promiscuous loop.  There are no cases of low variation and large error, indicating that the optimizer


does not often get trapped into a local minimum.  For 8-residue loops, a good upper bound on the


error of loop prediction is the variation among the lowest energy models multiplied by two.


Comparison  with  previous  results


The accuracy of any new prediction method has to be compared with previous results.  There are


a  great  many  existing  loop  modeling  methods  (Introduction)  and  it  is  not  practical  to  consider


all of them.  Thus, we chose to compare the present results with only one, but carefully selected


previous  study.   The  requirements  were  that  the  reference  method  be  recent,  well  documented,


automated, tested under realistic conditions, evaluated with a reasonable number of test loops of


varying length, and that it compared favorably with other prior methods.  One such method is that


of van Vlijmen & Karplus, 1997 (the "VK" method).  The VK method first selects approximately


1,000  loop  candidates  from  PDB,  based  on  stem  fitting,  and  then  subjects  these  candidates  to


independent energy minimizations in vacuo to obtain the final prediction.  The Charmm       -19 energy


function without electrostatics, applied to the mainchain atoms N, C  ff, C, and the first sidechain


                                                                 21


atom  C fi,  is  supplemented  by  strong  mainchain  dihedral  angle  restraints  to  focus  minimization


on the conformations relatively close to the template segments.  Optimization results in both the


global  movement  of  the  loop  relative  to  the  rest  of  the  protein  and  local  relaxation  of  the  loop


conformation.  The VK method compared favorably with several other optimization and database


approaches (Moult & James, 1986; Fidelis et al., 1994; Zheng & Kyle, 1996).


     14  loops  from  4  to  9  residues  long  were  predicted  by  the  VK  method  in  a  manner  that  is


appropriate for comparison with the present method (Table 2).  Because the original evaluation of


the VK method was given in terms of RMSD     stem;NCffC  (loop; NCffC    ), the comparison here also had


to rely on this measure, not RMSD     global   or RMSD     local .  In any case, RMSD     stem;NCffC  (loop; NCffC    )


tends  to  be  only  slightly  smaller  than  RMSD    global   (Figure  5f,  Table  2).   For  3  out  of  14  loops


(2act, 2fb4, 2fbj), the VK method produced a more accurate model than the present method.  For


two  loops  (8abp  and  3dfr_120-124),  both  methods  produced  indistinguishably  good  models.  For


the remaining 9 loops, the present method produced more accurate models than the VK method.


In  the  three  cases  where  the  VK  method  produced  a  better  model,  a  good  model  was  sampled


by the present method, but was not recognized as such by the energy function.  This comparison


may not be entirely fair to the VK method.  The reason is that the VK method would probably


perform better at this point in time than it did a few years ago because the database of known


protein structures, on which the VK method relies, is now significantly larger.  On the other hand,


as  the  authors  themselves  pointed  out,  the  published  results  of  the  VK  method  were  perhaps  a


little better than would be obtained in realistic applications because the candidate loops for energy


minimization were 50 loops with minimal RMSD     loop;NCffC (loop; NCffC    ) from the target conformation,


not 1,000 loops with best fitting stems.


     It  is  informative  to  examine  the  rank  of  the  lowest  RMSD    local   conformation  in  the  list  of


500  independently  optimized  conformations  sorted  by  energy,  as  well  as  the  rank  of  the  lowest


energy conformation in the list of conformations sorted by RMSD     local  (Table 2).  Conformational


sampling almost always generates an extremely good solution.  In only one of the 14 cases is the


lowest RMSD     global   significantly larger than 1A   (2act).  The lowest energy prediction is frequently


(i.e., in 8 out 14 cases) within 10% of the geometrically best conformations.  Even when the lowest


energy structure has a low rank in the list of structures sorted by RMSD     global , it can still be a good


prediction (e.g. , 3dfr_20-23), indicating that most independent optimizations ended in the vicinity


                                                                 22


of the native structure.


Loop  modeling  at  CASP3


Protein structure modeling methods, including comparative modeling, are tested at the bi-annual


CASP  meetings  (Moult  et  al.,  1999).  The  last  meeting,  CASP3,  was  held  in  December  of  1998


(Jones & Kleywegt, 1999).  Protein modelers are challenged to model sequences with unknown 3D


structure  and  to  submit  their  predictions  to  the  organizers  before  the  meeting.  In  parallel,  the


3D  structures  of  the  prediction  targets  are  being  determined  by  X-ray  crystallography  or  NMR


methods.  They only become available after the models are calculated and submitted.  Thus, a bona


fide evaluation of protein structure modeling methods is possible.


     One major limitation of comparative modeling is that it is frequently not possible to lower the


mainchain RMSD      error of a model below the RMSD      between the template and target structures


(Sanchez & Sali, 1997a).  There are several ways to overcome this limitation in some cases.  The


first way is to use more than one template structure (Sanchez & Sali, 1997a).  The second way of


"moving" the model closer from the template to the actual target structure is to calculate good


models for insertions (i.e., loops).  The third, ambitious way is to rely on a loop modeling method to


predict relatively small shifts and distortions in a segment that is aligned to the template structure.


Even in correct alignments, the aligned regions can have an RMSD     global  of approximately 3A  .  Thus,


a good loop modeling method applied to such segments is expected to improve the model relative to


the template.  For our loop modeling method, for example, 90% of loops of 8 residues are modeled


with  less  than  2A   RMSD    local   error  when  the  environment  is  correct.  The  application  of  a  loop


modeling method to a fully aligned segment is referred to as "distortion" modeling,  rather than


loop modeling.


     A precursor of the present loop prediction method was used to model both insertions and fully


aligned segments in our submissions to CASP3.  A detailed description of our experience will be


published elsewhere (A. Fiser,  R.aSnchez,  F. Melo,  and A. Sali,  in preparation).  Here,  we only


summarize the main lessons as they pertain to the modeling of loops and distortions.


     The method applied to insertions performed according to the evaluation in this paper,  when


errors in the loop environment were relatively small (i.e., not caused by the alignment errors).  A


successful prediction of an 8-residue loop is shown in Figure 13a.  However, for most of the CASP3


                                                                 23


targets, the alignment was the main challenge, frequently resulting in misaligned stem residues or


pieces of protein structure occupying loop's space.  Even a perfect loop modeling method will of


course  fail  in  such  circumstances.   The  implication  for  comparative  modeling  users  is  that  loop


modeling  methods  are  currently  most  valuable  for  modeling  insertions  in  the  easy  and  medium


difficulty cases, corresponding to target-template sequence identity of at least 30%.


     The  method  applied  to  distortion  modeling  also  performed  as  evaluated  in  this  paper.   A


"successful"  prediction  of  a  6-residue  segment  is  shown  in  Figure  13b.  It  was  modeled  with  the


RMSD     local  error of 1.09A  .  However, although the prediction was accurate by the loop modeling


standards, it was a failure by the CASP3 standards.  The reason was that the template segment


was in fact not divergent structurally, but closer to the actual target structure than our prediction,


despite a large difference between their sequences (Figure 13b).  Thus, the risk taken to move the


model  from  the  template  structure  closer  to  the  target  structure  backfired,  even  though  a  good


model of the segment was produced.  Clearly, distortion modeling is useful only if aligned segments


that diverged beyond the error of the loop modeling method can be identified.  This is currently


an unsolved problem (Abagyan et al., 1997).  It may be as difficult to predict that a region is in


error as it is to calculate its correct model.  The implication for comparative modeling users is that


distortion modeling is probably not worth the risk at this time.


Is  accuracy  limited  by  the  optimizer  or  the  energy  function?


Any protein structure prediction consists of two main parts.  The first part is generation of candidate


structures, either by enumeration or conformational sampling.  The second part is selection of the


best  candidate  structure  based  on  some  scoring  function.   In  protein  structure  prediction  based


on "energy" minimization, which includes the present loop modeling method, structure generation


and selection are intertwined because the energy function partly determines what conformations


are sampled by the optimization algorithm.  In principle, such optimization methods can fail for one


or both of the following reasons:  (i) the optimizer samples the conformational space inadequately


and never generates a correct conformation; or (ii) the energy function does not identify a correct


conformation even if it is generated by the optimizer.  In the case of our loop modeling method,


the optimizer was expected to contribute significantly to the modeling errors; it is the stochastic


behavior of the optimizer that results in the need for many independent optimizations of the same


                                                                 24


loop,  each  one  starting  with  a  different  initial  conformation.   We  show  here,  however,  that  the


present loop modeling method is limited primarily by the accuracy of the energy function rather


than the robustness of the optimizer.


     The first indication of an inadequate energy function is provided by examples of a successful


and unsuccessful loop prediction (Figure 7).  For the 500 independent optimizations used to make


the correct prediction in Figure 7a, the correlation between the energy and the RMSD     local  error


is strong.  Any of the approximately 30 lowest energy conformations are close to the native loop.


In contrast, there is no correlation between energy and error in the case of an incorrect prediction


(Figure 7b).  Many conformations different from each other and from the native loop have an equally


low  energy  score.  Even  though  a  conformation  that  is  close  to  the  native  structure  (RMSD     local


= 0.84A  ) was sampled, the lowest energy model had one of the largest errors among the sampled


models (3.17A  ).


     The second indication of a reasonable performance of the optimizer is the data on its convergence


properties (Figure 8).  Loop models were predicted from 1 to 500 independent optimizations for sets


of 40 4, 8, and 12-residue loops each.  There is essentially no improvement in the average accuracy


of loop prediction when more than 50 independent optimizations are performed for 4-residue loops.


For 8 and 12-residue loops, there is a small improvement in the average accuracy from 100 to 500


optimizations.   This  indicates  that  conformational  sampling  is  essentially  complete.   Thus,  it  is


likely that the failure of loop prediction is due to the energy function, not the optimizer.


     The  most  direct  evidence  that  loop  prediction  is  limited  primarily  by  the  energy  function  is


provided by the data on how close to the native structure are the closest encountered conformations,


irrespective of their energy (Figure 8).  These data were obtained for the three different loop lengths


by calculating the average over 40 loops of the minimal RMSD     local  error among the independently


optimized  conformations  of  the  same  loop.   This  average  minimal  RMSD    local   error  is  of  course


always lower than the average RMSD     local  error for the lowest energy predictions.  Moreover, the


average minimal RMSD     local  error is very low indeed.  For example, it is less than 1.0 and 2.0A   for


8 and 12-residue loops, respectively, even when only 50 independent optimizations are performed.


The corresponding average RMSD     local  errors are 1.40 and 2.92 A .  If only were the energy function


able to recognize the geometrically best sampled conformation, the loop modeling problem would


be essentially solved for loops up to 12 residues long.


                                                                 25


     The weakness of the potential function is also illustrated by the fact that the energies of the


native  loop  and  a  relaxed  native  loop,  within  0.2A   RMSD     local  of  the  native  conformation,  are


always higher than that of the lowest energy model (data not shown).


     The  final  indications  that  the  optimizer  performs  reasonably  well,  are  provided  by  the  high


average  accuracy  of  the  loop  predictions  based  on  an  artificially  improved  energy  function  (see


below) and the lack of frequent trapping into the same bad local minimum (Figure 12).


     Although  we  have  emphasized  inadequacy  of  our  energy  function,  the  current  optimizer  is


not perfect.  There is some improvement in the average prediction accuracy for loops longer than


approximately 6 residues even when 500 independent optimizations are performed (Figure 8).  Thus,


there is a need for a faster optimizer with a greater radius of convergence.  This would open new


applications for loop modeling, such as loop modeling on a genomic scale (Sanchez & Sali, 1998).


A better optimizer might also be needed to optimize new, improved, and possibly more complex


energy functions.


Conclusion


We described a completely automated loop modeling algorithm that consists of an optimization of


a defined segment of protein structure in a fixed environment, guided by a pseudo energy function.


The method was tested on a statistically meaningful number of loops of known structure, both in the


native and near-native environments.  The evaluation indicated that loops of 8 residues predicted


in the native environment have 90% chance to be modeled with useful accuracy (RMSD     local  less


than 2A  ).  Even 12-residue loops are modeled with useful accuracy in 30% of the cases.  When the


RMSD      distortion of the mainchain stem atoms is 2.5A  , the average loop prediction error increased


by  180,  25  and  3%  for  4,  8  and  12-residue  loops,  respectively.  It  is  not  anymore  too  optimistic


to expect useful models for loops as long as 12 residues, if the environment of the loop is at least


approximately correct.  It is possible to estimate whether or not a given loop prediction is correct,


based on the structural variability of the independently derived lowest energy loop conformations.


     The method is flexible and can model any subset of atoms.  It is technically applicable to mod-


eling of loops with bound ligands and several interacting loops, although it has not been evaluated


in such contexts yet.  Moreover, the method can incorporate additional structural information, such


as restraints implied by disulfide bonds and metal binding sites.


                                                                 26


     The CASP3 meeting revealed that practical applications of loop modeling are severely limited


by the need to model loops in an approximately correct environment.  This is necessary because the


environment has to be fixed during loop modeling; if it were not, the number of optimized degrees


of freedom would increase several fold,  making the modeling problem generally too hard for the


current methods.  If there are alignment errors at the stem residues or at the other environment


residues, loop modeling is not likely to result in an accurate model.  This means that loop modeling


is most useful for target sequences that share more than 30% sequence identity with the template


structures.  Another limitation exposed by our experience at CASP3 is the need for a method that


determines the regions in the target sequence to be modeled as loops.  While the insertions have to


fall into this category, there are generally additional regions that are aligned with templates but


could benefit from an accurate "loop" modeling.  These regions cannot be reliably identified at the


present.


     It  was  shown  that  the  accuracy  of  the  predictions  is  limited  primarily  by  the  accuracy  of


the  energy  function,  not  the  thoroughness  of  the  optimizer.   The  improvements  of  the  energy


function  that  are  most  likely  to  result  in  better  predictions  include  explicit  modeling  of  the  cis-


peptide states,  a more accurate representation of the ;  angles,  as well as more accurate non-


bonded terms.  Better ;  restraints might be obtained by expressing them with 2D cubic splines


instead of bivariate Gaussian functions,  and by explicitly taking into account the dependence of


the mainchain conformation of a residue on the preceding and subsequent residues in the sequence.


The non-bonded terms would probably benefit from a description of hydrogen bonds and solvent


that  is  more  physical  than  the  current  statistical  potential.  For  example,  an  all-hydrogen  atom


and solvent representations, such as the generalized Born model (Rapp & Friesner, 1999; Dominy


& Brooks III, 1999), may be necessary.  To test some of these suggestions, we plan to evaluate the


latest generation of molecular mechanics force fields (MacKerell et al., 1998; Cornell et al., 1995) for


loop modeling.  We will begin by a detailed analysis of the correlations of the prediction accuracy


with the properties and interactions of the predicted and native loops.


     The differences in length and conformation of loop regions in a family of related proteins are


frequently  responsible  for  the  specificity  of  ligand  binding.   Thus,  accurate  modeling  of  loops  is


essential  for  structure-based  prediction  of  function  from  sequence.   For  example,  a  comparative


model  can  sometimes  be  used  with  computational  ligand  docking  to  find  a  putative  ligand  or


                                                                 27


resolve preferences within a limited set of ligands (Ring et  al.,  1993;  Xu et  al.,  1996).  A serious


complication is that the ligand may induce conformational changes in loops with which it interacts.


Thus, it is not always sufficient to be able to model loops on their own, as addressed in this paper.


Instead, the protein and the ligand should ideally be modeled simultaneously.  Nevertheless, induced


fit is frequently small.  As a result, modeling of ligand binding loops in the apo state can still be


useful for studying functional differences within a family of proteins, as well as for the "ab initio  "


prediction  of  protein  function  by  recognition  of  ligand  binding  sites  (Jones  &  Thornton,  1997;


Fetrow  et  al.,  1998;  Russell  et  al.,  1998;  Kleywegt,  1999;  Wei  et  al.,  1999;  Kasuya  &  Thornton,


1999).  The current method presents a significant improvement in the modeling of loops in protein


structures.


Acknowledgment


     We are grateful to Jeff Tsai for help in the initial stages of this project, and to Drs.  Azat Badret-


dinov, Ram Samudrala, and especially Francisco Melo for several non-bonded energy functions.  AF


is a Burroughs Wellcome Fellow.  AS is a Sinsheimer Scholar and an Alfred P. Sloan Research Fellow.


This work has also been aided by grants from NIH (GM 54762) and NSF (BIR-9601845).


                                                                 28


  References


Abagyan R, Batalov S, Cardozo T, Totrov M, Webber J, Zhou Y. 1997.  Homology modeling with


      internal  coordinate  mechanics:  deformation  zone  mapping  and  improvements  of  models  via


      conformational search.  Proteins Suppl. 1:29-37.


Abagyan R, Totrov M. 1994.  Biased probability Monte Carlo conformational searches and electro-


      static calculations for peptides and proteins.  J. Mol. Biol. 235:983-1002.


Abola EE, Bernstein FC, Bryant SH, Koetzle T, Weng J. 1987.  Protein data bank.  In Crystal-


      lographic  databases  _  Information,  content,  software  systems,  scientific  applications,  (Allen


      FH, Bergerhoff G, Sievers R, eds), pp. 107-132. Data Commission of the International Union


      of Crystallography Bonn/Cambridge/Chester.


Bajorath J, Sheriff S. 1996. Comparison of an antibody model with an X-ray structure; the variable


      fragment of BR96.  Proteins 24:152-157.


Benner  SA,  Gonnet  GH,  Cohen  MA.  1993.   Empirical  and  structural  models  for  insertions  and


      deletions in the divergent evolution of proteins.  J. Mol. Biol. 229:1065-1082.


Blundell  TL,  Sternberg  MJE,  Sibanda  BL,  Thornton  JM.  1987.   Knowledge-based  prediction  of


      protein structures and the design of novel molecules.  Nature 326:347-352.


Brooks BR, Bruccoleri RE, Olafson BD, States DJ, Swaminathan S, Karplus M. 1983. CHARMM: a


      program for macromolecular energy minimization and dynamics calculations. J. Comp. Chem.


      4:187-217.


Brower  RC,  Vasmatzis  G,  Silverman  M,  DeLisi  C.  1993.  Exhaustive  conformational  search  and


      simulated annealing for models of lattice peptides.  Biopolymers 33:329-334.


Browne WJ, North ACT, Phillips DC, Brew K, Vanaman TC, Hill RC. 1969.  A possible three-


      dimensional structure of bovine ff-lactalbumin based on that of hen's egg-white lysozyme.  J.


      Mol. Biol. 42:65-86.


                                                                 29


Bruccoleri  BR,  Karplus  M.  1990.   Conformational  sampling  using  high  temperature  molecular


      dynamics.  Biopolymers 29:1847-1862.


Bruccoleri RE. 1993.  Application of systematic conformational search to protein modeling.  Molec-


      ular Simulation 10:151-174.


Bruccoleri RE, Haber E, Novotny J. 1988.  Structure of antibody hypervariable loops reproduced


      by a conformational search algorithm.  Nature 335:564-568.


Bruccoleri RE, Karplus M. 1987. Prediction of the folding of short polypeptide segments by uniform


      conformational sampling.  Biopolymers 26:137-168.


Carlacci L, Englander SW. 1993.  The loop problem in proteins:  Monte Carlo simulated annealing


      approach.  Biopolymers 33:1271-1286.


Carlacci  L,  Englander  SW.  1996.  Loop  problem  in  proteins:  developments  on  the  Monte  Carlo


      simulated annealing approach.  Comp. Chem. 17:1002-1012.


Chan HS, Dill KA. 1989. Intrachain loops in polymers:  effects of excluded volume. J. Chem. Phys.


      90:492-509.


Cheng B, Nayeem A, Scheraga HA. 1996. From secondary structure to three-dimensional structure:


      improved dihedral angle probability distribution function for use with energy searches for native


      structures of polypeptides and proteins.  J. Comp. Chem. 17:1453-1480.


Chothia C, Lesk AM. 1987. Canonical structures for the hypervariable regions of immunoglobulins.


      J. Mol. Biol. 196:901-917.


Chothia  C,  Lesk  AM,  Levitt  M,  Amit  AG,  Mariuzza  RA,  Phillips  SEV,  Poljak  RJ.  1986.   The


      predicted  structure  of  immunoglobulin  d1.3  and  its  comparison  with  the  crystal  structure.


      Science 233:755-758.


Chothia  C,  Lesk  AM,  Tramontano  A,  Levitt  M,  Smith  Gill  SJ,  Air  G,  Sheriff  S,  Padlan  EA,


      Davies D, Tulip WR, Colman PM, Spinelli S, Alzari PM, Poljak RJ. 1989.  Conformation of


      immunoglobulin hypervariable regions.  Nature 342:877-883.


Chung  SY,  Subbiah  S.  1996.  A  structural  explanation  for  the  twilight  zone  of  protein  sequence


      homology.  Structure 4:1123-1127.


                                                                 30


Claessens  M,  Cutsem  EV,  Lasters  I,  Wodak  S.  1989.  Modelling  the  polypeptide  backbone  with


      `spare parts' from known protein structures.  Protein Eng. 4:335-345.


Cohen BI, Presnell SR, Cohen FE. 1993. Origins of structural diversity within sequentially identical


      hexapeptides.  Protein Sci. 2:2134-2145.


Cohen  FE,  Abarbanel  RM,  Kuntz  ID,  Fletterick  RJ.  1986.  Turn  prediction  in  proteins  using  a


      pattern-matching approach.  Biochemistry 25:266-275.


Collura V, Higo J, Garnier J. 1993. Modeling of protein loops by simulated annealing. Protein Sci.


      2:1502-1510.


Cornell WD, Cieplak P, Bayly CI, Gould IR, Merz KMJ, Fergusson DM, Spellmeyer DC, Fox DC,


      Caldwell JW, Kollman PA. 1995. A second generation force field for the simulation of proteins


      and nucleic acids.  J. Am. Chem. Soc. 117:5179-5197.


Deane CM, Blundell TL. 2000. A novel exhaustive search algorithm for predicting the conformation


      of polypeptide segments in proteins.  Proteins 40:135-144.


Debnath AK. 1997. A `fragment fitting approach' to model disulfide loops by utilizing homologous


      peptide fragments from unrelated proteins of known structures:  Application to the V3 loop of


      the HIV-1 envelope glycoprotein gp120.  J. Mol. Model. 3:31-47.


Dominy BN, Brooks III CL. 1999.  Development of a generalized Born model parametrization for


      proteins and nucleic acids.  J. Phys. Chem. B 103:3765-3773.


Donate LE, Rufino SD, Canard LHJ, Blundell TL. 1996. Conformational analysis and clustering of


      short and medium size loops connecting regular secondary structures:  a database for modeling


      and prediction.  Protein Science 5:2600-2616.


Dudek MJ, Ramnarayan K, Ponder JW. 1998.  Protein structure prediction using a combination


      of sequence homology and global energy minimization:  II, Energy functions.  J. Comp. Chem.


      19:548-573.


Dudek  MJ,  Scheraga  HA.  1990.   Protein  structure  prediction  using  a  combination  of  sequence


      homology and global energy minimization. I. Global energy minimization of surface loops.  J.


      Comp. Chem. 11:121-151.


                                                                 31


Evans JS, Mathiowetz AM, Chan SI, Goddard III WA. 1995.  De novo prediction of polypeptide


      conformations using dihedral probability grid Monte Carlo methodology.  Protein Sci. 4:1203-


      1216.


Fechteler T, Dengler U, Schomburg D. 1995.  Prediction of protein three-dimensional structures in


      insertion and deletion regions:  a procedure for searching data bases of representative protein


      fragments using geometric scoring criteria.  J. Mol. Biol. 253:114-131.


Fetrow  JS,  Godzik  A,  Skolnick  J.  1998.   Functional  analysis  of  the  escherichia  coli  genome  us-


      ing  the  sequence-to-structure-to-function  paradigm:  identification  of  proteins  exhibiting  the


      glutaredoxin/thioredoxin disulfide oxidoreductase activity.  J. Mol. Biol. 282:703-711.


Fidelis  K,  Stern  PS,  Bacon  D,  Moult  J.  1994.   Comparison  of  systematic  search  and  database


      methods for constructing segments of protein structure.  Protein Eng. 7:953-960.


Fine RM, Wang H, Shenkin PS, Yarmush DL, Levinthal C. 1986. Predicting antibody hypervariable


      loop conformations. II: Minimization and molecular dynamics studies of MCP603 from many


      randomly generated loop conformations.  Proteins 1:342-362.


Finkelstein AV, Reva BA. 1992.  Search for the stable state of a short chain in a molecular field.


      Prot. Eng. 5:617-624.


Flores TP, Orengo CA, Moss DS, Thornton JM. 1993. Comparison of conformational characteristics


      in structurally similar protein pairs.  Protein Sci. 2:1811-1826.


Fushman D, Cahill S, ,  Cowburn D. 1997.  The main chain dynamics of the dynamin pleckstrin


      homology (PH) domain in solution:  Analysis of 15N relaxation with monomer/dimer equili-


      bration.  J. Mol. Biol. 266:173-194.


Go N, Abe H. 1984.  The consistency principle in protein structure and pathways of folding.  Adv.


      Biophysics 18:149-164.


Greer J. 1980.  Model for haptoglobin heavy chain based upon structural homology.  Proc.  Natl.


      Acad. Sci. USA 77:3393-3397.


Guenther B, Onrust R, Sali A, ODonnell M, Kuriyan J. 1997.  Crystal structure of the ffi' subunit


      of the clamp-loader complex of E. coli DNA polymerase III.  Cell 91:335-345.


                                                                 32


Higo J, Collura V, Garnier J. 1992. Development of an extended simulated annealing method: appli-


      cation to the modeling of complementary determining regions of immunoglobulins. Biopolymers


      32:33-43.


Huynen M, Doerks T, Eisenhaber F, Orengo C, Sunyaev S, Yuan Y, Bork P. 1998. Homology-based


      fold predictions for Mycoplasma genitalium proteins.  J. Mol. Biol. 280:323-326.


Janin J, Rodier F. 1995.  Protein-protein interaction at crystal contacts.  Proteins 23:580-587.


Jones DT. 1999. Genthreader:  an efficient and reliable protein fold recognition method for genomic


      sequences.  J. Mol. Biol. 287:797-815.


Jones S, Thornton JM. 1997.  Prediction of protein-protein interaction sites using patch analysis.


      J. Mol. Biol. 272:133-143.


Jones S, van Heyningen P, Berman HM, Thornton JM. 1999. Protein-DNA interactions: a structural


      analysis.  J. Mol. Biol. 287:877-896.


Jones TA, Kleywegt GJ. 1999.  Casp3 comparative modeling evaluation.  Proteins Suppl. 3:30-46.


Jones TH, Thirup S. 1986.  Using known substructures in protein model building and crystallogra-


      phy.  EMBO J. 5:819-822.


Kabsch  W,  Sander  C.  1983.   Dictionary  of  protein  secondary  structure:  pattern  recognition  of


      hydrogen-bonded and geometrical features.  Biopolymers 22:2577-2637.


Kasuya  A,  Thornton  JM.  1999.  Three-dimensional  structure  analysis  of  PROSITE  patterns.  J.


      Mol. Biol. 286:1673-1691.


Kick EK, Roe DC, Skillman AG, Liu G, Ewing TJ, Sun Y, Kuntz ID, Ellman JA. 1997. Structure-


      based design and combinatorial chemistry yield low nanomolar inhibitors of cathepsin D. Chem.


      Biol. 4:297-307.


Kidera A. 1995. Enhanced conformational sampling in Monte Carlo simulations of proteins:  appli-


      cations to a constrained peptide.  Proc. Natl. Acad. Sci. USA 92:9886-9889.


                                                                 33


Kinoshita  K,  KSadanami,  Kidera  A,  Go  N.  1999.   Structural  motif  of  phosphate-binding  site


      common  to  various  protein  superfamilies:   all-againt-all  structural  comparison  of  protein-


      mononucleotide complexes.  Prot. Eng. 12:11-14.


Kleywegt GJ. 1999. Recognition of spatial motifs in protein structures. J. Mol. Biol. 285:1887-1897.


Koehl P, Delarue M. 1995.  A self consistent mean field approach to simultaneous gap closure and


      side-chain positioning in protein homology modelling.  Nature Struct. Biol. 2:163-170.


Kwasigroch JM, Chomilier J, Mornon JP. 1996.  A global taxonomy of loops in globular proteins.


      J. Mol. Biol. 259:855-872.


Lambert MH, Scheraga HA. 1989a.  Pattern recognition in the prediction of protein structure I.


      Tripeptide conformational probabilities calculated from the amino acid sequence.  J.  Comp.


      Chem. 10:770-797.


Lambert MH, Scheraga HA. 1989b.  Pattern recognition in the prediction of protein structure II.


      Chain conformation from a probability-directed search procedure. J. Comp. Chem. 10:798-816.


Lambert MH, Scheraga HA. 1989c.  Pattern recognition in the prediction of protein structure III.


      An importance-sampling minimization procedure.  J. Comp. Chem. 10:817-831.


Lessel U, Schomburg D. 1994. Similarities between protein 3D structures. Protein Eng. 7:1175-1187.


Levitt M. 1983.  Molecular dynamics of native protein. II. Analysis and nature of motion.  J. Mol.


      Biol. 168:621-657.


Levitt M. 1992.  Accurate modeling of protein conformation by automatic segment matching.  J.


      Mol. Biol. 226:507-533.


Lu Y, Valentine JS. 1997.  Engineering metal-binding sites in proteins.  Curr. Opin. Struct. Biol.


      7:495-500.


Luthy  R,  Bowie  JU,  Eisenberg  D.  1992.   Assessment  of  protein  models  with  three-dimensional


      profiles.  Nature 356:83-85.


MacKerell Jr AD, Bashford D, Bellott M, Dunbrack Jr R, Evanseck J, Field M, Fischer S, Gao


      J,  Guo  H,  Ha  S,  Joseph  McCarthy  D,  Kuchnir  L,  Kuczera  K,  Lau  F,  Mattos  C,  Michnick


                                                                 34


      S, Ngo T, Nguyen D, Prodhom B, Reiher III W, Roux Band Schlenkrich M, Smith J, Stote


      Rand  Straub  J,  Watanabe  M,  Wiorkiewicz  Kuczera  J,  Yin  D,  Karplus  M.  1998.   All-atom


      empirical potential for molecular modeling and dynamics studies of proteins.  J. Phys. Chem.


      B 102:3586-3616.


Mandal C, Kingery BD, Anchin JM, Subramaniam S, Linthicum DS. 1996. ABGEN: a knowledge-


      based automated approach for antibody structure modeling. Nature Biotechnology 14:323-328.


Marti  Renom  MA,  Stuart  A,  Fiser  A,  Sanchez  R,  Melo  F,  Sali  A.  2000.   Comparative  protein


      structure modeling of genes and genomes.  Ann. Rev. Biophys. Biomolec. Struct. 29:291-325.


Martin  A,  MacArthur  M,  Thornton  J.  1997.   Assessment  of  comparative  modeling  in  CASP2.


      Proteins Suppl. 1:14-28.


Martin ACR, Cheetham JC, Rees AR. 1989.  Modeling antibody hypervariable loops:  a combined


      algorithm.  Proc. Natl. Acad. Sci. USA 86:9268-9272.


Martin ACR, MacArthur MW, Thornton JM. 1997. Assessment of comparative modeling in casp2.


      Proteins Suppl 1:14-28.


Martin ACR, Thornton JM. 1996.  Structural families in loops of homologous proteins:  automatic


      classification, modelling and application to antibodies.  J. Mol. Biol. 263:800-815.


Mas MT, Smith KC, Yarmush DL, Aisaka K, Fine RM. 1992.  Modeling the anti-CEA antibody


      combining site by homology and conformational search.  Proteins 14:483-498.


Mattos C, Petsko GA, Karplus M. 1994.  Analysis of two-residue turns in proteins.  J. Mol. Biol.


      238:733-747.


McGarrah DB, Judson RS. 1993.  Analysis of the genetic algorithm method of molecular confor-


      mation determination.  J. Comp. Chem. 14:1385-1395.


Melo F, Feytmans E. 1997.  Novel knowledge-based mean force potential at atomic level.  J. Mol.


      Biol. 267:207-222.


Mezei M. 1998.  Chameleon sequences in the PDB.  Prot. Eng. 11:411-414.


                                                                 35


Mosimann  S,  Meleshko  R,  James  MNG.  1995.   A  critical  assessment  of  comparative  molecular


      modeling of tertiary structures of proteins.  Proteins 23:301-317.


Moult  J,  Hubbard  T,  Fidelis  K,  Pedersen  JT.  1999.   Critical  assessment  of  methods  of  protein


      structure prediction (CASP): round III.  Proteins Suppl. 3:2-6.


Moult J, James MNG. 1986.  An algorithm for determining the conformation of polypeptide seg-


      ments in proteins by systematic search.  Proteins 1:146-163.


Nakajima N, Higo J, Kidera A. 2000.  Free energy landscapes of peptides by enchanced conforma-


      tional sampling.  J. Mol. Biol. 296:197-216.


Oliva B, Bates PA, Querol E, Aviles FX, Sternberg MJE. 1997. An automated classification of the


      structure of protein loops.  J. Mol. Biol. 266:814-830.


Pascarella S, Argos P. 1992.  Analysis of insertions/deletions in protein structures.  J.  Mol.  Biol.


      224:461-471.


Pellequer  J,  Chen  S.  1997.   Does  conformational  free  energy  distinguish  loop  conformations  in


      proteins?  Biophys. J. 73:2359-2375.


Perona JJ, Craik CS. 1995.  Structural basis of substrate specificity.  Protein Science 4:337-360.


Ponder  JW,  Richards  FM.  1987.  Tertiary  templates  for  proteins:  use  of  packing  criteria  in  the


      enumeration of allowed sequences for different structural classes.  J. Mol. Biol. 193:775-791.


Press  WH,  Teukolsky  SA,  Vetterling  WT,  Flannery  BP.  1992.   Numerical  Recipes,  2nd  edition.


      Cambridge:  Cambridge University Press.


Ramachandran GE, Ramakrisnan C, Sasisekharan V. 1963.  Stereochemistry of polypeptide chain


      conformations.  J. Mol. Biol. 7:95-99.


Rao U, Teeter MM. 1993. Improvement of turn structure prediction by molecular dynamics:  a case


      study of ff-purithionin.  Protein Eng. 6:837-847.


Rapp  CS,  Friesner  RA.  1999.   Prediction  of  loop  geometries  using  a  generalized  Born  model  of


      solvation effect.  Proteins 35:173-183.


                                                                 36


Reczko M, Martin ACR, Bohr H, Suhai S. 1995. Prediction of hypervariable CDR-H3 loop structures


      in antibodies.  Protein Engineering 8:389-395.


Ring CS, Cohen FE. 1994.  Conformational sampling of loop structures using genetic algorithm.


      Isr. J. Chem. 34:245-252.


Ring CS, Kneller DG, Langridge R, Cohen FE. 1992.  Taxonomy and conformational analysis of


      loops in proteins.  J. Mol. Biol. 224:685-699.


Ring CS, Sun E, McKerrow JH, Lee GK, Rosenthal PJ, Kuntz ID, Cohen FE. 1993.  Structure-


      based  inhibitor  design  by  using  protein  models  for  the  development  of  antiparasitic  agents.


      Proc. Natl. Acad. Sci. USA 90:3583-3587.


Ringe D, Petsko GA. 1985.  Mapping protein dynamics by x-ray diffraction.  Prog. Biophys. Molec.


      Biol. 45:197-235.


Rosenbach  D,  Rosenfeld  R.  1995.  Simultaneous  modeling  of  multiple  loops  in  proteins.  Protein


      Sci. 4:496-505.


Rosenfeld  R,  Zheng  Q,  Vajda  S,  DeLisi  C.  1993.   Computing  the  structure  of  bound  peptides:


      application to antigen recognition by class I MHCs.  J. Mol. Biol. 234:515-521.


Rufino SD, Donate LE, Canard LHJ, Blundell TL. 1997. Predicting the conformational class of short


      and medium size loops connecting regular secondary structures:  application to comparative


      modeling.  J. Mol. Biol. 267:352-367.


Russell RB, Sasieni PD, Sternberg MJE. 1998. Supersites within superfolds. Binding site similarity


      in the absence of homology.  J. Mol. Biol. 282:903-918.


Rychlewski L, Zhang B, Godzik A. 1998. Fold and function predictions for Mycoplasma genitalium


      proteins.  Fold. Des. 3:229-238.


Sali A. 1998.  100,000 protein structures for the biologist.  Nat. Struct. Biol. 5:1029-1032.


Sali A, Blundell TL. 1993.  Comparative protein modelling by satisfaction of spatial restraints.  J.


      Mol. Biol. 234:779-815.


                                                                 37


Sali  A,  Sanchez  R,  Badretdinov  AY,  Fiser  A,  Melo  F,  Overington  JP,  Feyfant  E,  Marti


      Renom  MA.  1999.   Modeller,  A  Protein  Structure  Modeling  Program,  Release  5.   URL


      http://guitar.rockefeller.edu/.


Samudrala  R,  Moult  J.  1998.   A  graph-theoretic  algorithm  for  comparative  modeling  of  protein


      structure.  J. Mol. Biol. 279:287-302.


Sanchez R, Sali A. 1997a. Evaluation of comparative protein structure modeling by MODELLER-3.


      Proteins Suppl. 1:50-58.


Sanchez R, Sali A. 1997b. Advances in comparative protein-structure modeling. Curr. Opin. Struct.


      Biol. 7:206-214.


Sanchez R, Sali A. 1999.  ModBase :  a database of comparative protein structure models.  Bioin-


      formatics 15:1060-1061.


Sander C, Schneider R. 1991.  Database of homology-derived protein structures and the structural


      meaning of sequence alignment.  Proteins 9:56-68.


Shenkin PS, Yarmush DL, Fine RM, Wang H, Levinthal C. 1987. Predicting antibody hypervariable


      loop conformation:  i. Ensembles of random conformations for ring-like structures. Biopolymers


      26:2053-2085.


Shepherd  AJ,  Gorse  D,  Thornton  JM.  1999.   Prediction  of  the  location  and  type  of  fi-turns  in


      proteins using neural networks.  Protein Sci. 8:1045-1055.


Sibanda BL, Blundell TL, Thornton JM. 1989.  Conformation of fi-hairpins in protein structures:


      a systematic classification with applications to modelling by homology, electron deosity fitting


      and protein engineering.  J. Mol. Biol. 206:759-777.


Sippl MJ. 1990.  Calculation of conformational ensembles from potentials of mean force. An ap-


      proach to the knowledge-based prediction of local structures in globular proteins. J. Mol. Biol.


      213:859-883.


Sippl MJ. 1993. Recognition of errors in three-dimensional structures of proteins. Proteins 17:355-


      362.


                                                                 38


Smith  KC,  Honig  B.  1994.   Evaluation  of  the  conformational  free  energies  of  loops  in  proteins.


      Proteins 18:119-132.


Sanchez R, Sali A. 1998.  Large-scale protein structure modeling of the Saccharomyces cerevisiae


      genome.  Proc. Natl. Acad. Sci. USA 95:13597-13602.


Sudarsanam S, DuBose RF, March CJ, Srinivasan S. 1995.  Modeling protein loops using a OEi+1 ,


       i dimer database.  Protein Sci. 4:1412-1420.


Summers NL, Karplus M. 1990.  Modeling of globular proteins:  a distance-based search procedure


      for the construction of insertion/deletion regions and pro ! non-pro mutations.  J. Mol. Biol.


      216:991-1016.


Tanner JJ, Nell LJ, McCammon JA. 1992.  Anti-insulin antibody structure and conformation. II.


      Molecular dynamics with explicit solvent.  Biopolymers 32:23-32.


Thanki  N,  Zeelen  JP,  Mathieu  M,  Jaenicke  R,  Abagyan  RA,  Wierenga  RK,  Schliebs  W.  1997.


      Protein  engineering  with  monomeric  triosephosphate  isomerase  (monoTIM):  the  modelling


      and structure verification of a seven-residue loop.  Protein Eng. 10:159-167.


Topham CM, McLeod A, Eisenmenger F, Overington JP, Johnson MS, Blundell TL. 1993. Fragment


      ranking in modelling of protein structure. Conformationally constrained environmental amino


      acid substitution tables.  J. Mol. Biol. 229:194-220.


Tramontano A, Chothia C, Lesk AM. 1989. Structural determinants of the conformations of medium


      sized loops in proteins.  Proteins 6:382-394.


Tramontano A, Lesk AM. 1992.  Common features of the conformations of antigen-binding loops


      in immunoglobulins and application to modeling loop conformations.  Proteins 13:231-245.


Vajda S, DeLisi C. 1990. Determining minimum energy conformations of polypeptides by dynamic


      programming.  Biopolymers 29:1755-1772.


van Vlijmen HWT, Karplus M. 1997. PDB-based protein loop prediction:  parameters for selection


      and methods for optimization.  J. Mol. Biol. 267:975-1001.


Vasmatzis  G,  Brower  RC,  DeLisi  C.  1994.   Predicting  immunoglobulin-like  hypervariable  loops.


      Biopolymers 34:1669-1680.


                                                                 39


Wang  J,  Kollman  PA,  Kuntz  ID.  1999.  Flexible  ligand  docking:  a  multistep  strategy  approach.


      Proteins 36:1-19.


Wei L, , Huang ES, Altman RB. 1999. Are predicted structures good enough to preserve functional


      sites?  Structure 7:643-650.


Wintjens  RT,  Rooman  MJ,  Wodak  SJ.  1996.  Automatic  classification  and  analysis  of  ff-ff  turn


      motifs in proteins.  J. Mol. Biol. 255:235-253.


Wojcik  J,  Mornon  JP,  Chomilier  J.  1999.  New  efficient  statistical  sequence-dependent  structure


      prediction of short to medium-sized protein loops based on an exhaustive loop classification.


      J. Mol. Biol. 289:1469-1490.


Wu SJ, Dean DH. 1996. Functional significance of loops in the receptor binding domain of Bacillus


      thuringiensis CryIIIA ffi-endotoxin.  J. Mol. Biol. 255:628-640.


Xu  LZ,  Sanchez  R,  Sali  A,  Heintz  N.  1996.   Ligand  specificity  of  brain  lipid  binding  protein.


      J.Biol.Chem. 271:24711-24719.


Zheng Q, Kyle DJ. 1994. Multiple copy sampling: rigid versus flexible protein. Proteins 19:324-329.


Zheng Q, Kyle DJ. 1996. Accuracy and reliability of the scaling-relaxation method for loop closure:


      an evaluation based on extensive and multiple copy conformational samplings. Proteins 24:209-


      217.


Zheng Q, Rosenfeld R, DeLisi C, Kyle DJ. 1994. Multiple copy sampling in protein loop modeling:


      computational efficiency and sensitivity to dihedral angle perturbations.  Protein Sci. 3:493-


      506.


Zheng Q, Rosenfeld R, Vajda S, DeLisi C. 1993a.  Determining protein loop conformation using


      scaling-relaxation techniques.  Protein Sci. 2:1242-1248.


Zheng Q, Rosenfeld R, Vajda S, DeLisi C. 1993b.  Loop closure via bond scaling and relaxation.  J.


      Comp. Chem. 14:556-565.


                                                                 40


Tables  and  Figures


                                            ______________________________________________
                                               Loop     _______Accuracy_class_____________
                                            __length_____Good_______Medium________Bad_____

                                                     1    100.0            0.0      0.0
                                                     2     97.5            2.5      0.0
                                                     3    100.0            0.0      0.0
                                                     4     82.5           17.5      0.0
                                                           85.0           15.0      0.0
                                                           42.8           57.2      0.0
                                                     5     70.0           27.5      2.5
                                                     6     57.5           27.5    15.0
                                                     7     52.5           25.0    22.5
                                                     8     30.0           47.5    22.5
                                                           50.0           40.0    10.0
                                                           14.1           59.4    26.5
                                                     9     20.0           35.0    45.0
                                                   10        7.5          40.0    52.5
                                                   11        2.5          27.5    72.5
                                                   12        2.5          17.5    80.0
                                                             7.5          22.5    70.0
                                                             0.0           6.9    93.1
                                                   13        5.0          10.0    80.0
                                            _______14________2.4___________2.4____95.1____


Table 1:  The fraction of loop predictions in the three accuracy classes, as a function of the number
of residues in the loop. The accuracy classes are defined in Figure 6.  The standard algorithm with
50  random  initial  conformations  (Theory  and  Algorithms)  was  applied  to  40  test  loops  of  each
length.  For 4, 8, and 12-residue loops, the numbers in the 2nd line give the results for 500 random
initial conformations in the native environment; the numbers in the 3rd line give the results for 50
random initial conformations in the environment distorted from 1.75 to 2.75A   (Figure 10).


                                                                 41


___________________________________________________________________________________________________________________________________*
 *___
  Loop                        Reference    ____________________________________Present_method______________________________________*
 *___
                            __________________________lowest_energy_prediction___________|_________lowest_RMSD_____prediction______*
 *____
_________________________________global*_____global*_____global____local______score____rank__|__global_____local_____score_____rank*
 *____

  2apr_76-83 (8)                    5.16        1.31       1.35     0.73     -48.39        5  | 0.94        0.50    -44.23         2
  8abp_203-208 (6)                  0.28        0.38       0.37     0.24     -35.34       75  | 0.24        0.17    -29.25       32
  2act_198-205 (8)                  1.58        2.04       2.21     1.60     -10.47       62  | 1.93        1.09     13.95      392
  8tln_E32-E38 (7)                  3.70        2.03       2.26     1.82     -49.45      435  | 0.93        0.65    -13.30      314
  3grs_83-89 (7)                    4.55        0.42       0.58     0.47       2.77        6  | 0.43        0.31      8.71         6
  5cpa_231-237 (7)                  2.14        0.95       1.23     1.06      -5.32       18  | 1.00        0.89     11.83      244
  2fb4_H26-H32 (7)                  1.62        4.20       4.25     2.06       3.54      282  | 0.52*       0.41     15.98      148
  2fbj_H100-H106 (7)                0.49        0.84       1.31     1.08       3.27       48  | 1.03*       0.83     47.17      532
  8tln_E248-E255 (8)                1.83        0.87       0.98     0.84     -26.04      182  | 0.70        0.61    -10.80      246
  3sgb_E199-E211 (9)                1.79        0.28       0.36     0.28     -53.36        4  | 0.29*       0.24    -49.14         3
  3dfr_20-23 (4)                    2.64        1.15       1.59     1.51     29.70       676  | 0.35        0.20     47.88      844
  3dfr_89-93 (5)                    1.62        1.02       1.14     0.85      -0.06       21  | 0.87        0.78     16.98      690
  3dfr_120-124 (5)                  0.47        0.26       0.28     0.20       0.71      217  | 0.23        0.15      9.96      484
__3blm_131-135_(5)__________________0.82________0.16_______0.22_____0.14_____-32.55______103__|_0.16________0.11____-26.57_______75*
 *___


Table 2:  Comparison of predictions for 14 test loops by a recent powerful method (Tables 8 and 9 in
van Vlijmen & Karplus, 1997) and the present method. The modeled loop segment is indicated after
the four character PDB code and its length is given in parentheses.  Since the reference study was
performed, structure 3tln was replaced by 8tln in PDB; thus, 8tln had to be used here.  Predictions
by the present method were obtained from 1000 independent optimizations, using the non-bonded
distance cutoff of 7A  .  The global and local column headers stand for RMSD     global   and RMSD     local ,

respectively.  The  global*  column  lists  RMSD    stem;NCffC  (loop; NCffC    ).  The  ranks  of  the  predictions
sorted by the energy and RMSD      are shown for the RMSD      and energy columns, respectively.  If the
lowest local and global RMSD      conformations are the same, the global RMSD      value is indicated by
a star.  The energy and rank of the conformation with the lowest local RMSD      are given.  The CPU
times for predicting one loop by the reference and present methods are approximately the same,
30 hours on an R10000-190 SGI workstation.


                                                                 42


Figure  legends


     Figures  1   Weights  for  the  mainchain  conformation  classes.  (a)  The  five  possible  mainchain


conformation classes are defined as areas A, B, E, P, and L in the Ramachandran plot spanned by


the  and  mainchain dihedral angles.  The plot shows the distribution for all 217,807 residues


in the 1,000 representative protein structures (Theory and Algorithms).  Slightly different borders


between B and P classes are used for the different residue types; the borders shown are approximate.


(b)  The  probability  that  a  given  residue  type  occurs  in  any  one  of  the  five  possible  mainchain


conformation classes, ! (Eq. 3).


     Figures  2   Actual  Ramachandran  plots  (left  side)  and  fitted  model  surfaces  (right  side)  for


four  representative  residue  types.  The  actual  Ramachandran  plots  are  the  natural  logarithm  of


the  observed  frequency  of  a  given  pair  of  the  mainchain  dihedral  angles    and    in  the  1,000


representative  protein  structures  (Theory  and  Algorithms).   The  modeled  Ramachandran  plots


correspond  to  the  restraints  used  in  modeling,  ln pm  (; =R)  (Eq.  3).  R=Asn  (9,767  residues);


R=Pro (9,755); R=Gly (16,650); R=Ala (17,454).


     Figures 3  Modeling of a loop. (a) Sample model.  The backbone trace of the native structure of


N-carbomoylsarcosine amidohydrolase (PDB code 1nba) is light blue, the native loop of 8 residues


(99-106) is dark blue, the distorted environment in which the loop is generally modeled is red, and


the initial loop conformation for optimization, before atomic coordinates are shifted randomly for


up to 5A  , is yellow.  (b) Sample optimization.  The energy (thick line) and temperature (thin line)


during conjugate gradients and molecular dynamics with simulated annealing are shown (Theory


and Algorithms).  The initial value of the energy is very high ( 106 ).


     Figures 4  Distribution of loop length. Secondary structure segments in the 1,000 representative


structures (Theory and Algorithms) were defined by the program DSSP (Kabsch & Sander, 1983)


(white bars).  All 13,444 segments that span helices and fi-strands were defined as loops.  Helices


were defined as contiguous segments of at least 4 residues in the H, G, or I conformation.  Strands


                                                                 43


were defined as contiguous segments of at least 3 residues in the E or B conformation.  The bars are


divided into two parts to indicate the fraction of predictions expected to be in the good and medium


classes (gray) and in the bad class (white), according to the evaluation based on 50 independent


optimizations in the correct environment (Table 1).  The results are similar for 500 independent


optimizations in the environments with the RMSD     core;NCffCO  (stem; NCffCO      ) error between 1.75 and


2.75 A  .


     Figures 5  Correlation of RMSD     local  with other measures of accuracy of a loop model. The three


lines are shown to facilitate interpreting the correlation; they have slopes of 0.5, 1.0, and 2.0.


     Figures 6  Definition of three levels of accuracy in loop modeling. Sample mainchains of a good


(a), medium (b), and bad (c) loop model are shown.  The defining RMSD     local  ranges are < 1A  , 1-


2A  , and > 2A  , respectively.  Models for residues 28-35 in NAD-dependent formate dehydrogenase


(PDB code 2nac) are shown.  The RMSD     core;NCffCO  (stem; NCffCO      ) for 3-residue stems on each side


of the loop is 1.2A  .  The left and right figures in each panel correspond to the RMSD     global   and


RMSD     local  superposition, respectively.


     Figures 7  Two sample loop predictions. On the left, the energy of the final conformations from


500 independent optimizations is plotted as a function of the RMSD     local  error.  On the right, the


C ff traces  of  the  17  lowest  energy  conformations  are  shown.  (a)  Successful  modeling  of  residues


45-52 in 5p21; the RMSD      error of the conformation with the lowest energy is 0.25A  .  (b) Incorrect


modeling of residues 34-41 in 1alc; the RMSD      error of the conformation with the lowest energy is


3.17A  .


     Figures 8  Accuracy of loop modeling as a function of thoroughness of optimization. For each


of the 40 test loops of 4 residues (a), 8 residues (b), and 12 residues (c), 500 independent optimiza-


tions were performed.  Each independent optimization started with a different random initial loop


conformation in the native environment (Theory and Algorithms).  Four different average RMSD


measures for the mainchain loop atoms, calculated over the 40 test loops, are plotted as a function


                                                                 44


of the number of independent optimizations.  Average RMSD     global   error for the models with the


lowest energy, open circles and thick line.  Average RMSD     local  error for the models with the lowest


energy, filled circles and thin line.  Average minimal RMSD     global   error, open diamonds and thick


dashed line.  Average minimal RMSD     local  error, filled diamonds and thin dashed line.  The standard


error of the mean is generally less than 0.1A   (not shown).  The curves are the least-squares fits of


the average RMSD      to the number of independent optimizations, no .  For example, for the 8-residue


loops:  <  RMSD     local  > = -6:28 + exp  (2:00 + 0:2593n-0:8045o )      .


     Figures 9  Accuracy of loop modeling in the correct environment as a function of loop length.


Models were calculated for 40 loops at each length from 1 to 14 residues, as described in Theory and


Algorithms.  50 independent optimizations were used to make each prediction.  Average accuracy


and the standard deviation of the accuracy are shown for each length for local (upper plot) and


global superposition.


     Figures 10  Accuracy of loop modeling as a function of environment error. The average RMSD


error and the standard error of the mean are plotted as a function of the error in the loop envi-


ronment.  Although many atoms around the loop were distorted (Results and Discussion), errors


in the environment are measured only by the mainchain RMSD      of the three stem residues on each


side of the loop, upon superposition of the non-distorted atoms of the native and model structures


(Figure 3).  Each of the 40 test loops of 4 residues (open diamonds),  8 residues (circles),  and 12


residues (filled diamonds) have been distorted 5 times, resulting in 200 loop predictions in a dis-


torted  environment.  Each  individual  loop  prediction  consisted  of  50  independent  optimizations.


Upper plot shows the local, while the one below the evaluation by global superposition.


     Figures 11  Accuracy of loop modeling as a function of loop properties. The RMSD      of the lowest


energy model from 500 independent optimizations is plotted against several loop properties.  (a)


Normalized average atomic isotropic temperature factor of the loop (< Biso  >loop  - < Biso  >protein          .


Normalization was performed to improve direct comparison of isotropic temperature factors from


independent structure determinations (Ringe & Petsko, 1985).  (b) Number of non-bonded atomic


                                                                 45


contacts within the loop.  (c) Number of intra-molecular atomic loop-environment contacts.  (d)


Number of inter-molecular atomic loop-environment contacts.  The symmetry related protein struc-


tures were calculated by program CRYSPACK (Janin & Rodier, 1995), relying on the information


in the PDB atom files.  An atomic, non-bonded contact occurs when two atoms are at a distance


of less than 4.5A   and are separated by more than three covalent bonds.  The data are shown for 40


loops of 8 residues each.  The Pearson correlation coefficient is also given in each panel.


     Figures 12  Prediction of the accuracy of the lowest energy loop model. The RMSD     local  error of


the best loop model correlates well with the structural variability of the 20 lowest energy models, out


of the 500 independently calculated models.  The variability of the lowest energy models is measured


by the average of the 20 x 19=2 pairwise RMSD     local  differences.  The correlation is shown for 40


loops of 8 residues each.  For reference, a line is drawn that corresponds to RMSD      = 2 < RMSD      >.


     Figures 13  Examples of loop and distortion modeling for CASP3. The native conformations are


in blue, the models are in red.  All the mainchain atoms are shown for the loops, only the backbone


trace is plotted for the 3 stem residues on each side of the loops.  (a) Loop prediction for residues


46-53 in target T0076.  No template loop of the same length was available from the structurally


defined members of the family.  The RMSD     local  error is 1.37A   while the stem residues superpose


on the native structure with RMSD     stem;NCffCO  (stem; NCffCO      ) of 1.52A  .  (b) Distortion modeling of


residues 80-85 in T0058.  The closest template structure (PDB code 1akz, shown in yellow) had a


segment aligned with the modeled segment without gaps, but their sequences were so different that


the target segment was modeled as a loop.  The template and target sequences are RPGIAI and


QRPVPP, respectively.  The RMSD     global  , RMSD     local , and RMSD     stem;NCffCO   (stem; NCffCO      ) errors


for the model are 1.57, 1.09, and 0.29 A  , respectively.  The corresponding deviations between the


template and the actual target structure are 0.99, 0.75, and 0.27A  .


                                                                 46


Figure 1:


      47


Figure 2:  Continued on next page.


                      48


     Figure 2 continued:


Figure 2.


                                                                 49


(a)


   (b)


                                                           Figure 3:


                                                                 50


Figure 4:


      51


Figure 5:

      52


Figure 6:


      53


Figure 7:


      54


Figure 8:


      55


Figure 9:


      56


Figure 10:


       57


Figure 11:


       58


Figure 12:


       59


Figure 13:


       60