

Target practice
Right on target (1st alternative title)
Target selection for structural genomics (2nd alternative title)

Andrej Sali


Laboratories of Molecular Biophysics
Pels Family Center for Biochemistry and Structural Biology
The Rockefeller University, 1230 York Avenue, New York, NY 10021, USA
tel: (212) 327 7550; fax: (212) 327 7540; e-mail: sali@rockefeller.edu


















Abstract:

The scope of structural genomics has recently been estimated through comparison of the currently known protein sequences. Useful characterization of most protein sequences will be possible by protein structure modeling, once structures of approximately 50,000 carefully selected protein domains are determined experimentally.

Structural genomics is a comprehensive effort toward structural characterization of all proteins [1,2,3,4,5,6,7,8,9,10,11]. The first essential step in structural genomics is the selection of target protein sequences for experimental structure determination such that all the remaining proteins are related to at least one known structure at a useful level of similarity (Figure 1). On pages xx-xx of this issue, Vitkup et al. describe the scope of structural genomics [12]. The number of targets is estimated from similarities among the sequences within the 2,000 domain families in the Pfam database [13]. To relate 90% of the domain sequences in Pfam to a known structure with more than 30% sequence identity, two structures per Pfam family are needed. The Pfam domain families cover only a quarter of the domains in several representative genomes. In practice, inefficiencies in target selection are estimated to increase the number of targets by approximately a factor of three relative to the optimal target selection. Thus, the scope of structural genomics corresponds to approximately 50,000 targets, which is well within reach of the nascent global structural genomics effort (Nat. Str. Biol. 7 Suppl., 2000).

A priori, two qualifications of structural genomics targets can be made. First, the targets are likely to be individual domains rather than multi-domain proteins. The reason is that the structure of a single domain is usually easier to determine by X-ray crystallography or NMR spectroscopy than that of a more flexible multi-domain protein. Second, domains that are not amenable to structure determination are excluded from consideration. Such domains may include membrane spanning domains, domains with unusual amino acid residue composition (e.g., low-complexity regions), large flexible domains, domains that require ligands for stability, and variants resulting from post-translational modifications and alternative splicing.
Target selection is tied intimately to the chosen aim of structural genomics. For example, if the aim is to map distant evolutionary relationships between all related domains [14], only a relatively low-density sampling of the protein space is required. In contrast, the inability of protein structure modeling to reliably predict functional differences between homologs led others to include close homologs on the target list (e.g., 70% sequence identity), with the scope limited to a single genome so that the project remains feasible [9]. Many additional target selection strategies of the individual groups involved in structural genomics are reviewed comprehensively in ref. [14]. For example, target lists may correspond to the representatives of all fold families [15,16], functional families [7], all proteins from a genome [3], or all unusual uncharacterized soluble proteins in a small genome [17]. Domain families and domain sequences may be prioritized by relevance and feasibility criteria, such as currently perceived medical importance and the number of methionine residues. The target lists of the individual research groups are usually limited to a certain type of protein (e.g., cancer-related proteins) or to a subset of all protein sequences (e.g., a genome) to keep the size of the individual projects reasonable.

In contrast to individual groups, who can afford to focus on relatively small parts of the protein space, the target selection of the global structural genomics effort must cover all protein sequences that are amenable to structure determination. It is convenient to take a model-centric view of target selection: Structural genomics aims to produce useful comparative models for most protein sequences [18,12]. This view is justified because the first step in many structure-based annotations can be calculation of a comparative model [19], although there are trivial cases where modeling is not needed and difficult cases where modeling cannot yet be helpful.
To obtain a reasonable level of accuracy, the models must be based on alignments with few errors. Such alignments can usually be obtained when the sequence identity between the modeled sequence and at least one known structure is higher than 30% [19]. Thus, structural genomics should determine protein structures so that most sequences in the genome databases match at least one structure with an overall sequence identity of more than 30% [18,12].

Vitkup et al. first estimate the number of structural genomics targets for a well defined set of 2,000 protein domain families in the Pfam 4.4 database. The targets are selected by a "greedy" coverage algorithm. This simple algorithm picks targets iteratively, each time choosing the target that maximizes the number of domain sequences that can be modeled based on at least 30% sequence identity to the selected target structure. The number of targets required to cover all of the 260,000 domain sequences in Pfam is 17,000 (13,000 if the membrane spanning domains are excluded). When the required sequence identity is raised above 30%, the number of targets increases by 10,000 per 10 percentage points of sequence identity.

As described below, Vitkup et al. quantify substantial reductions in the number of targets that result from improving modeling techniques and from relaxing the completeness requirement. They also address the negative impacts of failure in structure determination and deviations from the optimal target selection strategy. The number of required targets would be reduced by a factor of two if the modeling techniques were improved so that the accuracy of comparative models based on 20% sequence identity equaled the current accuracy at 30% sequence identity [12]. To achieve this aim, improvements in all aspects of comparative modeling are required, including fold assignment, sequence-structure alignment, and modeling of insertions, core segments, and sidechains [19].
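The greedy coverage idea can be illustrated with a minimal sketch. It is not the implementation used by Vitkup et al.; the function name `select_targets`, the `neighbors` input (for each sequence, the other sequences sharing at least 30% identity with it), and the toy data are all assumptions made for illustration only.

```python
from typing import Dict, List, Set


def select_targets(neighbors: Dict[str, Set[str]],
                   coverage_goal: float = 1.0) -> List[str]:
    """Greedy coverage sketch (hypothetical helper, not the published code).

    neighbors[s] is assumed to hold the sequences that could be modeled
    from a structure of s, i.e. those with >= 30% identity to s.
    """
    # A solved target covers itself plus its neighbors.
    covers = {seq: {seq} | near for seq, near in neighbors.items()}
    universe: Set[str] = set().union(*covers.values())
    covered: Set[str] = set()
    chosen: List[str] = []
    while len(covered) < coverage_goal * len(universe):
        # Pick the candidate that adds the most not-yet-covered sequences;
        # sorting makes tie-breaking deterministic.
        best = max(sorted(covers), key=lambda s: len(covers[s] - covered))
        gain = covers[best] - covered
        if not gain:  # nothing left to gain; the goal is unreachable
            break
        chosen.append(best)
        covered |= gain
    return chosen


# Toy data (invented): one family of three, one of two, one singleton.
toy = {
    "a": {"b", "c"}, "b": {"a"}, "c": {"a"},
    "d": {"e"}, "e": {"d"},
    "f": set(),
}
print(select_targets(toy))        # ['a', 'd', 'f']
print(select_targets(toy, 0.8))   # ['a', 'd']
```

In this toy case, relaxing the coverage goal from 100% to 80% drops the singleton family from the target list, which mirrors the qualitative effect of initially ignoring small families.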
A substantial reduction in the number of targets can also be achieved if the small families are initially ignored. For example, when the coverage requirement is relaxed from 100% to 90% of all sequences in Pfam, only 4,000 targets (2 per family) instead of 17,000 targets (8 per family) are required [12].

On the downside, it might be expected that the efficiency of structural genomics is decreased significantly by the low success rate of structure determination (e.g., 10-20% for randomly picked protein sequences [9]). However, the corresponding decrease in the coverage of domain sequences by structural genomics is only 10% [12]. The reason is that large families provide many alternative targets, most of which are satisfactory because they allow modeling of many of the remaining family members. This result supports the class-directed approach to structure determination [1].

The efficiency of structural genomics is also reduced when the individual research groups apply different target selection criteria [12]. They may not all use the 30% sequence identity cutoff rigorously and may impose additional filters, such as the genome of origin and the biological significance of the target. As a consequence, the "selection" of targets for the global structural genomics effort does not minimize the number of targets required for structural characterization of most protein sequences. The target selection efficiency in practice is expected to correspond to that of selecting targets randomly, but only if they have less than 30% sequence identity to an already determined structure. In such a case, three times as many targets as with the optimal greedy algorithm would be required. This result provides a strong incentive for global coordination of target lists. Steps in this direction include the web sites of the individual research groups mandated by NIH in North America (Nat. Str. Biol. 7 Suppl., 2000), web sites with comprehensive target lists (http://presage.berkeley.edu, http://www.structuralgenomics.org), and tools such as PartsList, a web based system for dynamically ranking domain folds based on more than 180 attributes [20].

The final step in estimating the scope of structural genomics is to extrapolate cautiously from the number of targets needed for the current Pfam domain families to the number of targets needed for all domain families [12]. It is necessary to assume that the modeling density in Pfam applies to all domain families, including the currently unknown ones. Since only about a quarter of all residues in the coding regions of several representative genomes match one of the 2,000 Pfam families, the total number of protein domain families is estimated to be approximately 8,000, which is consistent with some other estimates [21]. Because 12,000 targets are required to cover 90% of sequences in the current Pfam database when using a realistic target selection algorithm, the scope of a comprehensive structural genomics effort is approximately 50,000 targets (including the membrane spanning domains). In other words, if the structures of 50,000 target domains are determined by experiment, it should be possible to model approximately 90% of all sequences based on at least 30% sequence identity. In comparison, the fraction of domains that can currently be modeled based on at least 30% sequence identity to a known structure is only approximately 10% [12]. Thus, the currently known structures do not significantly reduce the scope of structural genomics if at least 30% sequence identity is required for modeling.

At present, structural biologists are producing approximately 500 protein structures qualifying as structural genomics targets per year. In a few years, the global structural genomics effort is likely to exceed this number severalfold.
Thus, it is conceivable that 70% of all protein domains within the boundaries of structural genomics will be structurally characterized in less than 5 years. As a result, application of the powerful principles of structural biology to most biological problems is imminent.
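
The extrapolation from the Pfam families to all domain families can be retraced with a few lines of arithmetic. The sketch below only restates the rounded figures quoted in the text; the variable names are illustrative, and the result is an order-of-magnitude estimate rather than a precise count.

```python
# Rounded figures quoted in the text (variable names are illustrative).
pfam_families = 2_000          # domain families in Pfam
targets_per_family = 2         # optimal greedy selection, 90% coverage
inefficiency_factor = 3        # realistic versus optimal target selection
pfam_fraction = 0.25           # share of genome residues matching Pfam

optimal_targets = pfam_families * targets_per_family
realistic_targets = optimal_targets * inefficiency_factor
all_families = round(pfam_families / pfam_fraction)
scope = round(realistic_targets / pfam_fraction)

print(optimal_targets)    # 4000
print(realistic_targets)  # 12000
print(all_families)       # 8000
print(scope)              # 48000, i.e. approximately 50,000 targets
```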




Acknowledgments



AS is grateful to Stephen K. Burley, John Kuriyan, Terry Gaasterland, and other members of the New York Structural Genomics Research Consortium for many discussions about structural genomics, and to Heidi M. Moss and Narayanan Eswar for comments on the manuscript. AS is an Irma T. Hirschl Trust Career Scientist. Support by The Merck Genome Research Institute, Mathers Foundation, and NIH is also acknowledged.



Bibliography

1
T. C. Terwilliger, G. Waldo, T. S. Peat, J. M. Newman, K. Chu, and J. Berendzen.
Class-directed structure determination: Foundation for a Protein Structure Initiative.
Protein Sci., 7:1851-1856, 1998.

2
A. Šali.
100,000 protein structures for the biologist.
Nat. Str. Biol., 5:1029-1032, 1998.

3
T. I. Zarembinski, L. W. Hung, H. J. Mueller-Dieckmann, K. K. Kim, H. Yokota, R. Kim, and S. H. Kim.
Structure-based assignment of the biochemical function of a hypothetical protein: a test case of structural genomics.
Proc. Natl. Acad. Sci. USA, 95:15189-15193, 1998.

4
G. T. Montelione and S. Anderson.
Structural genomics: keystone for a human proteome project.
Nat. Str. Biol., 6:11-12, 1999.

5
S.A. Teichmann, C. Chothia, and M. Gerstein.
Advances in structural genomics.
Curr. Opin. Struct. Biol., 9:390-399, 1999.

6
S. K. Burley, S. C. Almo, J. B. Bonanno, M. Capel, M. R. Chance, T. Gaasterland, D. Lin, A. Šali, F. W. Studier, and S. Swaminathan.
Structural genomics: beyond the Human Genome Project.
Nat. Genet., 23:151-157, 1999.

7
J. R. Cort, E. V. Koonin, P. A. Bash, and M. A. Kennedy.
A phylogenetic approach to target selection for structural genomics: solution structure of YciH.
Nucl. Acids Res., 27:4018-4027, 1999.

8
S.E. Brenner and M. Levitt.
Expectations from structural genomics.
Protein Sci., 9:197-200, 2000.

9
D. Christendat, A. Yee, A. Dharamsi, Y. Kluger, A. Savchenko, J. R. Cort, V. Booth, C. D. MacKereth, V. Saridikis, I. Ekiel, G. Kozlov, K. L. Maxwell, N. Wu, L. P. McIntosh, K. Gehring, M. A. Kennedy, A. R. Davidson, E. F. Pai, M. Gerstein, A. M. Edwards, and C. H. Arrowsmith.
Structural proteomics of an archaeon.
Nat. Str. Biol., 7:903-909, 2000.

10
U. Heinemann.
Structural genomics in Europe: slow start, strong finish?
Nat. Str. Biol., 7 Suppl.:940-942, 2000.

11
S. Yokoyama, H. Hirota, T. Kigawa, T. Yabuki, M. Shirouzu, T. Terada, Y. Ito, Y. Matsuo, Y. Kuroda, Y. Nishimura, Y. Kyogoku, K. Miki, R. Masui, and S. Kuramitsu.
Structural genomics projects in Japan.
Nat. Str. Biol., 7 Suppl.:943-945, 2000.

12
D. Vitkup, E. Malamud, J. Moult, and C. Sander.
Completeness in structural genomics.
Nat. Str. Biol., in press, 2001.

13
A. Bateman, E. Birney, R. Durbin, S. R. Eddy, K. L. Howe, and E. L. Sonnhammer.
The Pfam protein families database.
Nucl. Acids Res., 28:263-266, 2000.

14
S. E. Brenner.
Target selection for structural genomics.
Nat. Str. Biol., 7 Suppl.:967-969, 2000.

15
P. Mallick, K. E. Goodwill, S. Fitz-Gibbons, J. H. Miller, and D. Eisenberg.
Selecting protein targets for structural genomics of Pyrobaculum aerophilum: validating automated fold assignment methods by using binary hypothesis testing.
Proc. Natl. Acad. Sci. USA, 97:2450-2455, 2000.

16
E. Portugaly and M. Linial.
Estimating the probability for a protein to have a new fold: A statistical computational model.
Proc. Natl. Acad. Sci. USA, 97:5161-5166, 2000.

17
S. Balasubramanian, T. Schneider, M. Gerstein, and L. Regan.
Proteomics of Mycoplasma genitalium: identification and characterization of unannotated and atypical proteins in a small model genome.
Nucl. Acids Res., 28:3075-3082, 2000.

18
R. Sánchez, U. Pieper, F. Melo, N. Eswar, M.A. Martí-Renom, M.S. Madhusudhan, N. Mirkovic, and A. Šali.
Protein structure modeling for structural genomics.
Nat. Str. Biol., 7:986-990, 2000.

19
M. A. Martí-Renom, A. Stuart, A. Fiser, R. Sánchez, F. Melo, and A. Šali.
Comparative protein structure modeling of genes and genomes.
Ann. Rev. Biophys. Biomolec. Struct., 29:291-325, 2000.

20
J. Qian, B. Stenger, C. A. Wilson, J. Lin, R. Jansen, S. A. Teichmann, J. Park, W. G. Krebs, H. Yu, V. Alexandrov, N. Echols, and M. Gerstein.
PartsList: a web based system for dynamically ranking protein folds based on disparate attributes, including whole genome expression and interaction information.
Nucl. Acids Res., 29:1750-1764, 2001.

21
Y.I. Wolf, N.V. Grishin, and E.V. Koonin.
Estimating the number of protein folds and families from complete genome data.
J. Mol. Biol., 299:897-905, 2000.

Figure 1 (image not available; caption truncated in the source): ...ll as 3D models for all of the remaining sequences (red dots).

Andrej Sali 2001-05-03