Before any modeling can begin, the sequences and segments with known 3D structures that are related to the sequence being modeled must be found. This can be done with the MODELLER SEQUENCE_SEARCH command. The search relies on a database of structures that are representative [Šali et al., 1995] of the whole Protein Data Bank (PDB) [Abola et al., 1987; Berman et al., 2000]. The PDB codes of these representative structures (about 3,000 codes) are listed in the file modlib/CHAINS_3.0_40_XN.cod. Their sequences are stored in the file modlib/CHAINS_all.seq, which contains approximately 16,000 sequences covering all the unique non-model chains in the PDB that are longer than 25 amino acid residues. Any two representative structures are likely to share less than 40% sequence identity and to differ in length by at least 30% of the shorter chain or 30 amino acid residues, whichever is smaller. The codes of the other known PDB structures that are related to the representative structures at 40% sequence identity are listed in the file modlib/CHAINS_3.0_40_XN.grp. A sample TOP script for searching with SEQUENCE_SEARCH is in examples/all-steps/search.top. Sequences related to the target are identified by Z-scores larger than 4 or 5, which are reported in the SIGNIF column of the log file.
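The Z-score criterion described above can be illustrated with a short Python sketch. It is not part of MODELLER and does not read the SEQUENCE_SEARCH log file, whose exact layout is not described here. It assumes that the chain codes and SIGNIF values have already been extracted into a list. The names Hit and select_related, and all the values shown, are hypothetical and serve only to show how a cutoff of 4 or 5 separates likely templates from unrelated sequences.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One database sequence scored against the target (hypothetical container)."""
    code: str      # chain code of the database sequence
    signif: float  # Z-score, as reported in the SIGNIF column


def select_related(hits: List[Hit], cutoff: float = 4.0) -> List[Hit]:
    """Return the hits whose Z-score exceeds the cutoff, most significant first."""
    related = [h for h in hits if h.signif > cutoff]
    return sorted(related, key=lambda h: h.signif, reverse=True)


# Made-up values for illustration only; this is not real search output.
example_hits = [
    Hit("1abcA", 12.3),
    Hit("2xyzB", 3.1),
    Hit("3defA", 5.4),
    Hit("4ghiC", 4.0),
]

for hit in select_related(example_hits, cutoff=4.0):
    print(hit.code, hit.signif)
```

Raising the cutoff from 4 to 5 trades sensitivity for fewer false positives, so it can be useful to inspect the hits that fall between the two values by hand.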
For more difficult modeling problems, when SEQUENCE_SEARCH does not find any homologs, template matching or threading methods can be used. Widely used programs for threading include PROFIT [Flockner et al., 1995], THREADER [Jones et al., 1992], and the Web server of the David Eisenberg group at UCLA (http://www.mbi.ucla.edu/people/frsvr/frsvr).
At this stage, it may also be beneficial to identify related sequences that do not have known 3D structures. This is most conveniently achieved with PSI-BLAST [Altschul et al., 1997]. Using as many sequences as possible may improve the quality of the alignment prepared in the next two stages.