Once a list of potential templates has been obtained by searching, it is necessary to select one or more templates that are appropriate for the particular modeling problem. Several factors need to be taken into account when selecting a template.
The quality of a template increases with its overall sequence similarity to the target and decreases with the number and length of gaps in the alignment. The simplest template selection rule is to select the structure with the highest sequence similarity to the modeled sequence.
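The simplest rule can be illustrated with a minimal Python sketch. It assumes that each candidate template has already been aligned to the target and that gaps are written as `-`. The function names, the template labels, and the toy sequences are hypothetical. Measuring identity as a fraction of the target residues is only one of several possible conventions, chosen here so that gaps lower the score.

```python
def alignment_stats(target_aln, template_aln):
    """Return (percent identity, gap count, total gap length) for one pairwise alignment.

    Identity is counted over all target residues (an assumption of this sketch),
    so target residues aligned to a gap in the template count as non-identical.
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    identical = target_len = gap_count = gap_length = 0
    gap_row = None  # which row the current gap run is in, if any
    for a, b in zip(target_aln, template_aln):
        if a == "-" and b == "-":
            continue  # column empty in both rows carries no information
        if a != "-":
            target_len += 1
        if a == "-" or b == "-":
            row = "target" if a == "-" else "template"
            gap_length += 1
            if row != gap_row:
                gap_count += 1  # a new gap run starts here
            gap_row = row
        else:
            gap_row = None
            if a == b:
                identical += 1
    identity = 100.0 * identical / target_len if target_len else 0.0
    return identity, gap_count, gap_length


def rank_templates(alignments):
    """Order template names by decreasing identity, then by fewer and shorter gaps."""
    def key(name):
        identity, gap_count, gap_length = alignment_stats(*alignments[name])
        return (-identity, gap_length, gap_count)
    return sorted(alignments, key=key)


# Hypothetical toy alignments: name -> (aligned target, aligned template).
alignments = {
    "tmplA": ("ACDEFGHIK", "ACDEFGHLK"),
    "tmplB": ("ACDEFGHIK", "ACD--GHIK"),
}
print(rank_templates(alignments))
```

In this toy case the gapped template ranks second even though every residue it does align is identical, which reflects the penalty for gaps described above.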
The family of proteins that includes the target and the templates can frequently be organized into sub-families. The construction of a multiple alignment and a phylogenetic tree [42] can help in selecting the template from the subfamily that is closest to the target sequence.
The similarity between the ``environment'' of the template and the environment in which the target needs to be modeled should also be considered. The term ``environment'' is used here in a broad sense, including everything that is not the protein itself (e.g., solvent, pH, ligands, quaternary interactions). If possible, a template bound to the same or similar ligands as the modeled sequence should generally be used.
The quality of the experimentally determined structure is another important factor in template selection. The resolution and R-factor of a crystallographic structure and the number of restraints per residue for an NMR structure are indicative of the accuracy of the structure. This information can generally be obtained from the template PDB files or from the articles describing the structure determination. For instance, if two templates have comparable sequence similarity to the target, the one determined at the higher resolution should generally be used.
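The resolution tie-break can be sketched in Python as follows. The `Template` record, the 5-point tolerance that defines "comparable" similarity, and all numerical values are assumptions made for illustration. Note that a higher resolution corresponds to a smaller value in angstroms.

```python
from dataclasses import dataclass


@dataclass
class Template:
    name: str          # hypothetical label
    identity: float    # percent sequence identity to the target
    resolution: float  # in angstroms; smaller means higher resolution
    r_factor: float    # crystallographic R-factor; smaller is better


def pick_template(candidates, identity_tolerance=5.0):
    """Among templates with comparable identity, prefer the best-resolved one.

    "Comparable" is taken to mean within `identity_tolerance` percentage points
    of the most similar template (an arbitrary threshold for this sketch).
    """
    best_identity = max(t.identity for t in candidates)
    comparable = [t for t in candidates
                  if best_identity - t.identity <= identity_tolerance]
    return min(comparable, key=lambda t: (t.resolution, t.r_factor))


# Made-up candidates for illustration only.
candidates = [
    Template("tmplA", identity=42.0, resolution=2.8, r_factor=0.24),
    Template("tmplB", identity=40.0, resolution=1.6, r_factor=0.19),
    Template("tmplC", identity=25.0, resolution=1.2, r_factor=0.18),
]
print(pick_template(candidates).name)
```

Here the third candidate is the best resolved but is excluded because its sequence identity is not comparable to that of the other two.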
The criteria for selecting templates also depend on the purpose of a comparative model. For example, if a protein-ligand model is to be constructed, the choice of the template that contains a similar ligand is probably more important than the resolution of the template. On the other hand, if the model is to be used to analyze the geometry of the active site of an enzyme, it may be preferable to use a high-resolution template structure.
It is not necessary to select only one template. In fact, the use of several templates generally increases the model accuracy. One strength of MODELLER is that it can combine information from multiple template structures, in two ways. First, multiple template structures may be aligned with different domains of the target, with little overlap between them, in which case the modeling procedure can construct a homology-based model of the whole target sequence. Second, the template structures may be aligned with the same part of the target, in which case the modeling procedure is likely to automatically build the model on the locally best template [43,44]. In general, it is frequently beneficial to include in the modeling process all the templates that differ substantially from each other, if they share approximately the same overall similarity to the target sequence.
An elaborate way to select suitable templates is to generate and evaluate models for each candidate template structure and/or their combinations. The optimized all-atom models are evaluated by an energy or scoring function, such as the Z-score of PROSAII [45]. The PROSAII Z-score of a model is a measure of compatibility between its sequence and structure. Ideally, the Z-score of the model should be comparable to the Z-score of the template. The PROSAII Z-score is frequently accurate enough to allow picking one of the most accurate of the generated models [46]. This trial-and-error approach can be viewed as limited threading (i.e., the target sequence is threaded through similar template structures). For additional comments on model assessment, see Section 2.5.
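The final selection step of this trial-and-error procedure can be sketched in Python. The sketch does not call PROSAII; it assumes that the Z-scores of each trial model and of its template have already been computed elsewhere, and that a more negative Z-score indicates better compatibility between sequence and structure. The function name, the trial labels, and the scores are hypothetical.

```python
def select_by_zscore(trials):
    """Pick the trial whose model has the most negative Z-score.

    `trials` maps a label to (model Z-score, template Z-score). The second
    returned value is the model Z-score minus the template Z-score; a value
    near zero means the model scores about as well as its template.
    """
    best = min(trials, key=lambda label: trials[label][0])
    model_z, template_z = trials[best]
    return best, model_z - template_z


# Made-up scores for illustration only.
trials = {
    "tmplA": (-7.5, -9.0),
    "tmplB": (-8.5, -9.5),
    "tmplC": (-9.0, -9.5),
}
best, gap_to_template = select_by_zscore(trials)
print(best, gap_to_template)
```

A large positive difference between the model and template scores would suggest that the model is considerably less native-like than the structure it was built from, even if it is the best of the trials.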