Andrej Sali
Laboratories of Molecular Biophysics
Pels Family Center for Biochemistry and Structural Biology
The Rockefeller University, 1230 York Avenue, New York, NY 10021, USA
tel: (212) 327 7550; fax: (212) 327 7540; e-mail: sali@rockefeller.edu
Structural genomics is a comprehensive effort toward structural
characterization of all proteins
[1,2,3,4,5,6,7,8,9,10,11].
The first essential step in structural genomics is the selection of target protein
sequences for experimental structure determination such that all the remaining
proteins are related to at least one known structure at a useful level
of similarity (Figure 1). On pages xx-xx of this issue, Vitkup
et al. describe the scope of structural genomics [12]. The number
of targets is estimated from similarities among the sequences within
the 2,000 domain families in the Pfam database
[13]. To relate 90% of the domain sequences in Pfam to a
known structure with at least 30% sequence identity, two
structures per Pfam family are needed. The Pfam domain
families cover only a quarter of the domains in several representative
genomes. In practice, inefficiencies in target selection are estimated to
increase the number of targets by approximately a factor of three relative to the
optimal target selection. Thus, the scope of structural genomics
corresponds to approximately 50,000 targets, which is well within
reach of the nascent global structural genomics effort
(Nat. Str. Biol. 7 Suppl., 2000).
A priori, two qualifications of structural genomics targets can
be made. First, the targets are likely to be individual domains rather
than multi-domain proteins. The reason is that the structure of a
single domain is usually easier to determine by X-ray crystallography
or NMR spectroscopy than that of a more flexible multi-domain protein.
Second, domains that are not amenable to structure determination are
excluded from consideration. Such domains may include membrane
spanning domains, domains with unusual amino acid residue composition
(e.g., low-complexity regions), large flexible domains, domains that
require ligands for stability, and variants resulting from
post-translational modifications and alternative splicing.
Target selection is tied intimately to the chosen aim of structural
genomics. For example, if the aim is to map distant evolutionary
relationships between all related domains [14],
only a relatively low-density sampling of the protein space is
required. In contrast, the inability of protein structure modeling
to reliably predict functional differences between homologs has led
others to include close homologs on the target list (e.g., 70%
sequence identity), while limiting the scope to a single genome so
that the project remains feasible [9]. Many additional
target selection strategies of the individual groups involved in
structural genomics are reviewed
comprehensively in ref. Brenner2000. For example, target
lists may correspond to the representatives of all fold families
[15,16], functional families [7],
all proteins from a genome [3], or all unusual
uncharacterized soluble proteins in a small genome [17].
Domain families and domain sequences may be prioritized by relevance
and feasibility criteria, such as currently perceived medical
importance and the number of methionine residues.
The target lists of the individual research groups are usually limited
to a certain type of protein (e.g., cancer-related proteins)
or to a subset of all protein sequences (e.g., a genome) to keep
the size of the individual projects reasonable. In contrast to individual
groups, who can afford to focus on relatively small parts of the
protein space, the target selection of the global structural genomics
effort must cover all protein sequences that are amenable to structure
determination.
It is convenient to take a model-centric view of target selection:
Structural genomics aims to produce useful comparative models for
most protein sequences [18,12].
This view is justified because the first step
in many structure-based annotations can be the calculation of a
comparative model [19], although there are
trivial cases where modeling is not needed and difficult cases where
modeling cannot yet be helpful. To obtain a reasonable level of
accuracy, the models must be based on alignments with few errors. Such
alignments can usually be obtained when the sequence identity between
the modeled sequence and at least one known structure is higher than
30% [19].
Thus, structural genomics should determine protein structures
so that most sequences in the genome databases match at least one
structure with an overall sequence identity of more than 30%
[18,12].
Vitkup et al. first estimate the number of structural genomics
targets for a well defined set of 2,000 protein domain families in the
Pfam 4.4 database. The targets are selected by a "greedy" coverage
algorithm. This simple algorithm picks a target iteratively by
maximizing the number of domain sequences that can be modeled based on
at least 30% sequence identity to the selected target structure. The
number of targets required to cover all of the 260,000 domain
sequences in Pfam is 17,000 (13,000 if the membrane spanning domains
are excluded). Above 30% sequence identity, the number of targets
increases by 10,000 per 10 percentage points of sequence identity.
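The greedy coverage idea described above can be illustrated with a minimal sketch. This is not the implementation used by Vitkup et al.; the function name `greedy_targets`, the `neighbors` dictionary, and the toy sequences are hypothetical, and the sketch simply assumes that the set of sequences sharing at least 30% identity with each candidate has already been computed.

```python
def greedy_targets(neighbors, coverage=1.0):
    """Greedily pick targets until `coverage` of all sequences can be modeled.

    neighbors[s] is the set of sequences assumed to share at least 30%
    sequence identity with s (including s itself), i.e. the sequences that
    could be modeled if the structure of s were determined.
    """
    uncovered = set(neighbors)
    # Number of sequences that may remain unmodeled at the requested coverage.
    allowed_uncovered = len(neighbors) * (1.0 - coverage)
    targets = []
    while len(uncovered) > allowed_uncovered:
        # Pick the sequence whose structure would allow modeling of the
        # largest number of still-uncovered sequences (ties: alphabetical).
        best = max(sorted(neighbors), key=lambda s: len(neighbors[s] & uncovered))
        gained = neighbors[best] & uncovered
        if not gained:
            break  # no candidate covers anything that is still uncovered
        targets.append(best)
        uncovered -= gained
    return targets


# Hypothetical toy data: three small "families" of related sequences.
toy_neighbors = {
    "a": {"a", "b", "c"},
    "b": {"a", "b"},
    "c": {"a", "c"},
    "d": {"d", "e"},
    "e": {"d", "e"},
    "f": {"f"},
}

full = greedy_targets(toy_neighbors)                   # cover every sequence
partial = greedy_targets(toy_neighbors, coverage=0.8)  # skip the smallest family
print(full)     # ['a', 'd', 'f']
print(partial)  # ['a', 'd']
```

In this toy case, relaxing the coverage requirement drops the singleton family from the target list, which mirrors on a small scale why ignoring small families reduces the number of targets.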
As described below, Vitkup et al. quantify substantial reductions
in the number of targets that result from improving modeling techniques and
from relaxing the completeness requirement. They also address the negative
impacts of failure in structure determination and deviations from the
optimal target selection strategy.
The number of required targets would be reduced by a factor of two if the
modeling techniques were improved so that the accuracy of comparative
models based on 20% sequence identity equaled the current accuracy at 30%
sequence identity [12]. To achieve this aim, improvements in
all aspects of comparative modeling are required, including fold assignment,
sequence-structure alignment, and modeling of insertions, core segments,
and sidechains [19].
A substantial reduction in the number of targets can also be achieved if
the small families are initially ignored. For example, when the
coverage requirement is relaxed from 100% to 90% of all sequences in
Pfam, only 4,000 targets (2 per family) instead of 17,000 targets (8 per
family) are required [12].
On the downside, it might be expected that the efficiency of
structural genomics is decreased significantly by the low
success rate of structure determination (e.g., 10-20% for randomly
picked protein sequences [9]). However, the corresponding
decrease in the coverage of domain sequences by structural genomics is
only 10% [12]. The reason is that large families provide
many alternative targets, most of which are satisfactory
because they allow modeling of many of the remaining family members.
This result supports the class-directed approach to structure determination
[1].
The efficiency of structural genomics is also reduced when the individual
research groups apply different target
selection criteria [12]. They may not all use the 30% sequence
identity cutoff rigorously and may impose additional filters,
such as the genome of origin and the biological significance of
the target. As a consequence, the "selection" of targets for
the global structural genomics effort does not minimize the
number of targets required for structural characterization of
most protein sequences. In practice, the efficiency of target
selection is expected to correspond to that of picking targets at
random from among the sequences that have less than 30% sequence
identity to any already determined structure. In such a case,
three times as many targets as with the optimal greedy algorithm would
be required. This result provides a strong incentive for global
coordination of target lists. Steps in this direction include the web
sites of the individual research groups mandated by NIH in North
America (Nat. Str. Biol. 7 Suppl., 2000), web sites with
comprehensive target lists (http://presage.berkeley.edu,
http://www.structuralgenomics.org), and tools such as PartsList,
a web based system for dynamically ranking domain folds based
on more than 180 attributes [20].
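The uncoordinated selection scenario discussed above, in which targets are picked at random but only among sequences not yet within 30% identity of a chosen structure, can be sketched as follows. The function name `random_filtered_targets`, the `neighbors` dictionary, and the toy data are hypothetical, and the sketch assumes a symmetric, precomputed 30% identity relation. It illustrates the procedure only and does not reproduce the factor of three reported in ref. [12].

```python
import random


def random_filtered_targets(neighbors, coverage=1.0, seed=0):
    """Pick targets in random order, skipping already-covered sequences.

    neighbors[s] is the set of sequences assumed to share at least 30%
    sequence identity with s (including s itself). The relation is assumed
    to be symmetric, so "s is covered" means that s is within 30% identity
    of a previously chosen target.
    """
    rng = random.Random(seed)
    order = sorted(neighbors)
    rng.shuffle(order)  # uncoordinated groups: no global prioritization
    needed = len(order) * coverage
    covered = set()
    targets = []
    for s in order:
        if len(covered) >= needed:
            break
        if s in covered:
            continue  # filter: a related structure has already been chosen
        targets.append(s)
        covered |= neighbors[s]
    return targets


# Hypothetical toy data: three small "families" of related sequences.
toy_neighbors = {
    "a": {"a", "b", "c"},
    "b": {"a", "b"},
    "c": {"a", "c"},
    "d": {"d", "e"},
    "e": {"d", "e"},
    "f": {"f"},
}

for seed in range(3):
    print(seed, random_filtered_targets(toy_neighbors, seed=seed))
```

For this toy data, three targets suffice when the first family is entered through "a", whereas a random order that starts with "b" or "c" needs four; the random procedure can therefore never beat the optimal selection and sometimes does worse.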
The final step in estimating the scope of structural genomics is to
extrapolate cautiously from the number of targets needed for the current Pfam
domain families to the number of targets needed for all domain families
[12].
It is necessary to assume that the modeling density in Pfam
applies to all domain families, including the currently unknown
ones. Since only about a quarter of all residues in the coding regions
of several representative genomes match one of the 2,000 Pfam
families, the total number of protein domain families is estimated to be
approximately 8,000, which is consistent with some other estimates
[21]. Because 12,000 targets are required to cover 90%
of sequences in the current Pfam database when using a realistic
target selection algorithm, the scope of a comprehensive structural
genomics effort is approximately 50,000 targets (including the
membrane spanning domains). In other words, if the structures of 50,000 target domains
are determined by experiment, it should be possible to model
approximately 90% of all sequences based on at least 30% sequence
identity. In comparison, the fraction of domains that can currently be modeled
based on at least 30% sequence identity to a known structure is only
approximately 10% [12]. Thus, the currently known structures do
not significantly reduce the scope of structural genomics if at
least 30% sequence identity is required for modeling.
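The extrapolation above amounts to a short chain of multiplications, restated here as a sketch using only the rounded figures quoted in the text. The variable names are illustrative, and the result is an order-of-magnitude estimate rather than an exact count.

```python
# Rounded figures quoted in the text (Pfam 4.4, 30% identity cutoff).
pfam_families = 2_000          # domain families in Pfam
targets_per_family = 2         # optimal greedy selection, 90% coverage
inefficiency_factor = 3        # realistic versus optimal target selection
genome_to_pfam_ratio = 4       # Pfam covers about a quarter of genome residues

optimal_pfam_targets = pfam_families * targets_per_family            # 4,000
realistic_pfam_targets = optimal_pfam_targets * inefficiency_factor  # 12,000
total_families = pfam_families * genome_to_pfam_ratio                # 8,000
total_targets = realistic_pfam_targets * genome_to_pfam_ratio        # 48,000

print(f"optimal targets for Pfam:   {optimal_pfam_targets:,}")
print(f"realistic targets for Pfam: {realistic_pfam_targets:,}")
print(f"estimated domain families:  {total_families:,}")
print(f"estimated total targets:    {total_targets:,}")
```

The product, 48,000, is what the text rounds to approximately 50,000 targets.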
At present, structural biologists are producing approximately 500
protein structures qualifying as structural genomics targets per
year. In a few years, the global structural genomics effort is likely
to exceed this number several-fold. Thus, it is conceivable that
70% of all protein domains within the boundaries of
structural genomics will be structurally characterized in less than
5 years. As a result, application of the powerful principles of structural
biology to most biological problems is imminent.
Acknowledgments
AS is grateful to Stephen K. Burley, John Kuriyan, Terry Gaasterland and other members of the New York Structural Genomics Research Consortium, for many discussions about structural genomics, and to Heidi M. Moss and Narayanan Eswar for comments on the manuscript. AS is an Irma T. Hirschl Trust Career Scientist. Support by The Merck Genome Research Institute, Mathers Foundation, and NIH is also acknowledged.