Before any modeling can begin, the sequences and segments with known
3D structures that are related to the sequence being modeled must be
found. This can be achieved by the MODELLER SEQUENCE_SEARCH
command. The search relies on a database of structures that are
representative [Šali et al., 1995]
of the whole Protein Data Bank (PDB) [Abola et al., 1987, Berman et al., 2000].
The PDB codes of these representative structures (about 3,000 codes) are
listed in file modlib/CHAINS_3.0_40_XN.cod and their sequences are stored in file
modlib/CHAINS_all.seq, which includes approximately 16,000 sequences for
all the unique non-model chains in PDB longer than 25 amino acid
residues. Any two representative structures are likely to have less than 40%
sequence identity to each other and a length difference of
at least 30% of the shorter chain or 30 amino acid residues,
whichever is smaller. The codes of other known PDB
structures related to the representative structures at 40%
sequence identity are listed in file modlib/CHAINS_3.0_40_XN.grp.
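The criteria for the representative set described above can be illustrated with a minimal Python sketch. It follows one literal reading of the text (identity below 40% together with a sufficiently large length difference) and is not the procedure actually used to build the modlib files; the function names and the default cutoff argument are hypothetical and introduced only for illustration.

```python
def min_length_difference(len_a: int, len_b: int) -> float:
    """Length-difference threshold as worded in the text:
    30% of the shorter chain or 30 residues, whichever is smaller."""
    shorter = min(len_a, len_b)
    return min(0.30 * shorter, 30.0)


def may_both_be_representatives(identity_percent: float,
                                len_a: int,
                                len_b: int,
                                max_identity: float = 40.0) -> bool:
    """Hypothetical predicate: True if two chains satisfy both conditions
    stated in the text (assumed reading: identity below the cutoff AND
    a length difference at or above the threshold)."""
    length_gap = abs(len_a - len_b)
    return (identity_percent < max_identity
            and length_gap >= min_length_difference(len_a, len_b))
```

For example, under this reading two chains of 50 and 200 residues with 35% identity would both qualify, because the length difference of 150 residues exceeds the threshold of 15 residues (30% of the shorter chain).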
A sample TOP script for searching with SEQUENCE_SEARCH is in examples/all-steps/search.top. Sequences related to the target are
identified by Z-scores larger than 4 or 5
(log file column SIGNIF).
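Selecting hits by Z-score can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hits are supplied as in-memory (code, Z-score) pairs, since the layout of the log file is not described here, and the function name, the example codes, and the example values are all made up.

```python
from typing import List, Tuple

# (PDB chain code, Z-score as reported in the SIGNIF column)
Hit = Tuple[str, float]


def select_related(hits: List[Hit], z_cutoff: float = 4.0) -> List[Hit]:
    """Keep hits whose Z-score is larger than the cutoff, best first."""
    related = [hit for hit in hits if hit[1] > z_cutoff]
    return sorted(related, key=lambda hit: hit[1], reverse=True)


# Made-up example values, not taken from a real search log.
example_hits = [("2xyz", 4.6), ("1abcA", 12.3), ("3foo", 2.1)]

lenient = select_related(example_hits, z_cutoff=4.0)  # cutoff of 4
strict = select_related(example_hits, z_cutoff=5.0)   # cutoff of 5
```

Running the selection with both cutoffs shows which hits are borderline: a sequence kept at 4 but dropped at 5 deserves closer inspection before it is used as a template.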
For more difficult modeling problems, when SEQUENCE_SEARCH does not find any homologs, template matching or threading methods can be used. Widely used threading programs include PROFIT [Flockner et al., 1995], THREADER [Jones et al., 1992], and the Web server of the David Eisenberg group at UCLA (http://www.mbi.ucla.edu/people/frsvr/frsvr).
It may be beneficial to identify related sequences without known 3D structures at this stage. This is most conveniently achieved by PSI-BLAST [Altschul et al., 1997]. Using as many sequences as possible may improve the quality of the alignment prepared in the next two stages.