Dear Ben (Ben Webb, Modeller Caretaker) and Joel (Subach),

Thanks a lot for your tips!

I tinkered with the alignment and Python script files and got Modeller to model the missing residues.

I found two possible solutions to the problem:

1) Put one dash at the beginning of the structure-derived sequence in the alignment file for each residue that is missing relative to the full-length protein sequence from NCBI's RefSeq (Reference Sequence):

For this, I used the following alignment file, in which I explicitly specified the starting and ending residue positions of the model segment that has coordinates. The only part of that segment without coordinates is the short 6-residue stretch S(431)ATDIG(436). (I added formatting to the relevant portions here for emphasis, but I used the plain-text version for modelling.)

>P1;5bs8_B
structure:5bs8_B.pdb:425:B:675:B:DNA Gyrase:::
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ALVRRK------GLPGKLADCRSTDPRKSELYVVEGDSAGGSAKSGRDSMFQAILPLRGKIINVEKARIDRVLKNTEVQAIITALGTGIHDEFDIGKLRYHKIVLMADADVDGQHISTLLLTLLFRFMRPLIENGHVFLAQPPLYKLKWQRSDPEFAYSDRERDGLLEAGLKAGKKINKEDGIQRYKGLGEMDAKELWETTMDPSVRVLRQVTLDDAAAADELFSILMGEDVDARRSFITRNAKDVRFLDV*
>P1;5bs8B_fill
sequence:::::::::
MAAQKKKAQDEYGAASITILEGLEAVRKRPGMYIGSTGERGLHHLIWEVVDNAVDEAMAGYATTVNVVLLEDGGVEVADDGRGIPVATHASGIPTVDVVMTQLHAGGKFDSDAYAISGGLHGVGVSVVNALSTRLEVEIKRDGYEWSQVYEKSEPLGLKQGAPTKKTGSTVRFWADPAVFETTEYDFETVARRLQEMAFLNKGLTINLTDERVTQDEVVDEVVSDVAEAPKSASERAAESTAPHKVKSRTFHYPGGLVDFVKHINRTKNAIHSSIVDFSGKGTGHEVEIAMQWNAGYSESVHTFANTINTHEGGTHEEGFRSALTSVVNKYAKDRKLLKDKDPNLTGDDIREGLAAVISVKVSEPQFEGQTKTKLGNTEVKSFVQKVCNEQLTHWFEANPTDAKVVVNKAVSSAQARIAARKARELVRRKSATDIGGLPGKLADCRSTDPRKSELYVVEGDSAGGSAKSGRDSMFQAILPLRGKIINVEKARIDRVLKNTEVQAIITALGTGIHDEFDIGKLRYHKIVLMADADVDGQHISTLLLTLLFRFMRPLIENGHVFLAQPPLYKLKWQRSDPEFAYSDRERDGLLEAGLKAGKKINKEDGIQRYKGLGEMDAKELWETTMDPSVRVLRQVTLDDAAAADELFSILMGEDVDARRSFITRNAKDVRFLDV*

To get a string of 424 dashes ("-") that I could copy and paste at the start of the structure-derived sequence in the above alignment file, without typing and counting them by hand, I used the following short Python script. (It may need to be modified for the version of Python used, since older versions use a different syntax for print statements, e.g. print "hello" vs print("hello").)

"""This script generates dashes. You need to enter the number of dashes to print, when prompted to do so."""
n = int(input("Please enter the number of dashes that you want to print as a contiguous stretch of dashes. Enter a non-zero, positive integer: "))
# Build the whole run in one step instead of appending in a loop.
dashes = "-" * n

print(dashes)
print("\n")
print(f"The number of dashes stored in the variable 'dashes' is {len(dashes)}.")

This modelled the long stretch of 424 missing residues at the start of the structure-derived sequence portion of the alignment file (the first of the two sequences in the file) as a long loop region, without secondary structures. I then simply deleted the unnecessary residues at the N-terminal part of each Modeller-generated model, in UCSF Chimera (i.e., I deleted residues 1-422) and saved the modified PDB file.
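The padding step can also be scripted directly against the residue numbering, rather than counting dashes separately. The following is only a minimal sketch: the function name pad_structure_row and the 439-residue toy target are hypothetical, and it assumes the internal missing residues are already marked with "-" in the segment.

```python
def pad_structure_row(segment: str, first_resnum: int, full_length: int) -> str:
    """Pad a structure-derived sequence with '-' to the full target length.

    segment      : residues that have coordinates, with '-' already in place
                   for internal missing residues
    first_resnum : position of the first residue of `segment` in the
                   full-length sequence (1-based)
    full_length  : number of residues in the full-length target sequence
    """
    leading = first_resnum - 1
    trailing = full_length - leading - len(segment)
    if leading < 0 or trailing < 0:
        raise ValueError("segment does not fit inside the full-length sequence")
    return "-" * leading + segment + "-" * trailing


# Toy example: the first 15 alignment columns of the segment, starting at
# residue 425, against a hypothetical 439-residue target.
row = pad_structure_row("ALVRRK------GLP", first_resnum=425, full_length=439)
print(len(row), row.count("-"))
```

The returned row has the same length as the target sequence, which is what the two rows of the alignment need in order to line up.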

2) Use only a portion of the full-length protein sequence from NCBI (NCBI RefSeq) as the target sequence, i.e. the second sequence listed in the alignment file. I used the residues corresponding to the region 425-675, which match exactly the span of residues present in the atom/structure file, except for the 6 missing residues inside this chain (S(431)ATDIG(436)). The atom/structure file was a PDB file generated from the original PDB 5BS8 by selecting chain B and saving only the selected atoms as a separate PDB file.

For this, in the alignment file, I specified the segment bearing atom records (coordinates) as 425:B:675:B, as shown below:

>P1;5bs8_B
structure:5bs8_B.pdb:425:B:675:B:DNA Gyrase:::
ALVRRK------GLPGKLADCRSTDPRKSELYVVEGDSAGGSAKSGRDSMFQAILPLRGKIINVEKARIDRVLKNTEVQAIITALGTGIHDEFDIGKLRYHKIVLMADADVDGQHISTLLLTLLFRFMRPLIENGHVFLAQPPLYKLKWQRSDPEFAYSDRERDGLLEAGLKAGKKINKEDGIQRYKGLGEMDAKELWETTMDPSVRVLRQVTLDDAAAADELFSILMGEDVDARRSFITRNAKDVRFLDV*
>P1;5bs8B_fill
sequence:::::::::
ELVRRKSATDIGGLPGKLADCRSTDPRKSELYVVEGDSAGGSAKSGRDSMFQAILPLRGKIINVEKARIDRVLKNTEVQAIITALGTGIHDEFDIGKLRYHKIVLMADADVDGQHISTLLLTLLFRFMRPLIENGHVFLAQPPLYKLKWQRSDPEFAYSDRERDGLLEAGLKAGKKINKEDGIQRYKGLGEMDAKELWETTMDPSVRVLRQVTLDDAAAADELFSILMGEDVDARRSFITRNAKDVRFLDV*
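Cutting the 425-675 region out of the full-length sequence can be done with a 1-based, inclusive slice. This is a minimal sketch: the helper names and the 12-residue toy sequence are hypothetical, and the header line simply copies the "sequence:::::::::" form used in the alignment file above.

```python
def subsequence(full_sequence: str, first: int, last: int) -> str:
    """Return residues first..last (1-based, inclusive) of full_sequence."""
    if not 1 <= first <= last <= len(full_sequence):
        raise ValueError("residue range lies outside the sequence")
    return full_sequence[first - 1:last]


def pir_target_entry(code: str, sequence: str) -> str:
    """Format a target entry with the same header form as the file above."""
    return f">P1;{code}\nsequence:::::::::\n{sequence}*"


# Toy stand-in for the full-length sequence (hypothetical, 12 residues).
toy_full = "MAAQKKKAQDEY"
region = subsequence(toy_full, 5, 9)
entry = pir_target_entry("toy_fill", region)
print(entry)
```

For the real case the call would be subsequence(full_sequence, 425, 675), with full_sequence holding the RefSeq sequence.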

In the Python script file, I only replaced the following line in the definition of the select_atoms function [def select_atoms(self):]

return Selection(self.residue_range('431:B', '436:B'))

with

return Selection(self.residue_range('7:A', '12:A'))

This specified the portion allowed to move during model generation/refinement, while the rest of the atoms stay fixed. The residue ranges '431:B', '436:B' and '7:A', '12:A' both refer to S(431)ATDIG(436). The former uses the numbering of the full-length sequence (NCBI RefSeq), whereas the latter uses the numbering that Modeller gives to each newly generated model, which starts with residue number 1.
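The renumbering is a simple offset. A minimal sketch, assuming the target sequence starts at residue 425 of the full-length numbering and contains every residue from there on (including the 6 missing ones), and that the model's chain ID is A as in the line above. The function names are hypothetical.

```python
def to_model_resnum(full_length_resnum: int, first_resnum: int) -> int:
    """Convert full-length numbering to model numbering, which starts at 1."""
    model_resnum = full_length_resnum - first_resnum + 1
    if model_resnum < 1:
        raise ValueError("residue lies before the start of the model")
    return model_resnum


def model_range(start: int, end: int, first_resnum: int, chain: str = "A"):
    """Return the two 'number:chain' strings for a residue range in the model."""
    return (f"{to_model_resnum(start, first_resnum)}:{chain}",
            f"{to_model_resnum(end, first_resnum)}:{chain}")


# S(431)ATDIG(436) in a model whose first residue is residue 425.
loop = model_range(431, 436, first_resnum=425)
print(loop)
```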

The two residues at positions 423 and 424 (as per the full-length sequence) could then be modelled as a dipeptide using UCSF Chimera's Build Structure tool and saved as a PDB file. This dipeptide could then be opened in UCSF Chimera along with the Modeller-generated model, and the two chains (Chimera-generated dipeptide and Modeller-generated model) could be joined into a single model by forming a peptide bond between them using the Join Model function/tool in UCSF Chimera.

Note that the start of the sequence in chain B of PDB 5BS8 that has atom records/coordinates (ALVRRK...) differs from the corresponding sequence in the NCBI RefSeq (ELVRRK...) by a single residue. Modeller puts E rather than A at the start of this sequence, because the model is built from the target sequence, which is the second sequence in the alignment file. So, if I wanted "A" in the model, as in the structure file's sequence, I would need to make this change in the second sequence (target sequence) listed in the alignment file.
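Differences like this A/E one can be listed by comparing the two alignment rows column by column. This is a minimal sketch with a hypothetical function name; it assumes both rows have equal length, with the trailing "*" removed, and it numbers positions by the target sequence.

```python
def aligned_mismatches(structure_row: str, target_row: str, first_resnum: int = 1):
    """List (residue number, structure residue, target residue) where the
    two aligned rows disagree, ignoring columns that contain a gap."""
    if len(structure_row) != len(target_row):
        raise ValueError("alignment rows must have the same length")
    mismatches = []
    resnum = first_resnum - 1
    for s, t in zip(structure_row, target_row):
        if t != "-":
            resnum += 1  # count only real target residues
        if s == "-" or t == "-":
            continue     # missing residues are not mismatches
        if s != t:
            mismatches.append((resnum, s, t))
    return mismatches


# First 17 columns of the second alignment above, numbered from 425.
diffs = aligned_mismatches("ALVRRK------GLPGK", "ELVRRKSATDIGGLPGK", first_resnum=425)
print(diffs)
```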

Thanks, and regards,

Siddhartha

On Wed, May 29, 2024 at 12:27 PM Joel Subach <mjsubach@alumni.ncsu.edu> wrote:

Hi Siddhartha, I hope you're well :).

I only skimmed your inquiry, but I have successfully modelled missing residues with Modeller by following the link below,
so if you follow it step by step you should be able to build these missing residues. (If you try it
again and it does not work, feel free to ask further and I will assist you :).)

Best,
Joel 🚀

On Wed, May 29, 2024 at 8:49 AM Siddhartha Barua via modeller_usage <modeller_usage@salilab.org> wrote:
Dear Modeller Discussion Forum Members,

I am trying to repair Chain B in the RCSB PDB 5BS8. 5BS8's structure is that of DNA gyrase (from Mycobacterium tuberculosis). For guidance, I used the example scripts for filling in missing residues with Modeller given at the URL https://salilab.org/modeller/wiki/Missing_residues (in the Modeller Wiki), the "basic-example" tutorial at the main Modeller website, and a YouTube tutorial video. Chain B contains 2 missing residues at the start of the sequence associated with the chain in the PDB file: S423 and N424. Thereafter, it contains the sequence "A(425)LVRRK(430)" (with atom records/coordinates) and then a stretch of 6 missing residues, "S(431)ATDIG(436)". I used the first script given at the abovementioned URL to generate a sequence file extracted from the PDB. I then used the following as my alignment file, using the NCBI RefSeq (NP_214519.2) for Mycobacterium tuberculosis gyrB (DNA Gyrase subunit B):

>P1;5bs8
structure:5bs8.pdb:FIRST:B:LAST:B:DNA Gyrase:::
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ALVRRK------GLPGKLADCRSTDPRKSELYVVEGDSAGGSAKSGRDSMFQAILPLRGKIINVEKARIDRVLKNTEVQAIITALGTGIHDEFDIGKLRYHKIVLMADADVDGQHISTLLLTLLFRFMRPLIENGHVFLAQPPLYKLKWQRSDPEFAYSDRERDGLLEAGLKAGKKINKEDGIQRYKGLGEMDAKELWETTMDPSVRVLRQVTLDDAAAADELFSILMGEDVDARRSFITRNAKDVRFLDV*

>P1;5bs8B_fill
sequence:::::::::
MAAQKKKAQDEYGAASITILEGLEAVRKRPGMYIGSTGERGLHHLIWEVVDNAVDEAMAGYATTVNVVLLEDGGVEVADDGRGIPVATHASGIPTVDVVMTQLHAGGKFDSDAYAISGGLHGVGVSVVNALSTRLEVEIKRDGYEWSQVYEKSEPLGLKQGAPTKKTGSTVRFWADPAVFETTEYDFETVARRLQEMAFLNKGLTINLTDERVTQDEVVDEVVSDVAEAPKSASERAAESTAPHKVKSRTFHYPGGLVDFVKHINRTKNAIHSSIVDFSGKGTGHEVEIAMQWNAGYSESVHTFANTINTHEGGTHEEGFRSALTSVVNKYAKDRKLLKDKDPNLTGDDIREGLAAVISVKVSEPQFEGQTKTKLGNTEVKSFVQKVCNEQLTHWFEANPTDAKVVVNKAVSSAQARIAARKARELVRRKSATDIGGLPGKLADCRSTDPRKSELYVVEGDSAGGSAKSGRDSMFQAILPLRGKIINVEKARIDRVLKNTEVQAIITALGTGIHDEFDIGKLRYHKIVLMADADVDGQHISTLLLTLLFRFMRPLIENGHVFLAQPPLYKLKWQRSDPEFAYSDRERDGLLEAGLKAGKKINKEDGIQRYKGLGEMDAKELWETTMDPSVRVLRQVTLDDAAAADELFSILMGEDVDARRSFITRNAKDVRFLDV*
I used the following as the script to run AutoModel to model only the selected residues:

from modeller import *
from modeller.automodel import * # Load the AutoModel class

log.verbose()
env = Environ()

# directories for input atom files
env.io.atom_files_directory = ['.', '../atom_files']

class MyModel(AutoModel):
    def select_atoms(self):
        return Selection(self.residue_range('431:B', '436:B'))

a = MyModel(env, alnfile = '5bs8_B-alignment.ali',
            knowns = '5bs8', sequence = '5bs8B_fill')
a.starting_model = 1
a.ending_model = 1

a.make()

This then raised the following error:

return Selection(self.residue_range('431:B', '436:B'))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files (x86)\Modeller10.5\modlib\modeller\coordinates.py", line 385, in residue_range
start = self.residues[start]._num
~~~~~~~~~~~~~^^^^^^^
File "C:\Program Files (x86)\Modeller10.5\modlib\modeller\coordinates.py", line 302, in __getitem__
ret = modutil.handle_seq_indx(self, indx, self.mdl._indxres,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files (x86)\Modeller10.5\modlib\modeller\util\modutil.py", line 24, in handle_seq_indx
int_indx = lookup_func(*args)
^^^^^^^^^^^^^^^^^^
File "C:\Program Files (x86)\Modeller10.5\modlib\modeller\coordinates.py", line 379, in _indxres
self._report_bad_index(indx, suffix, "residue", 0)
File "C:\Program Files (x86)\Modeller10.5\modlib\modeller\coordinates.py", line 372, in _report_bad_index
raise KeyError("No such %s: %s" % (indxtyp, indx))
KeyError: 'No such residue: 431:B'

Next, I tried to run it again after deleting the 424 "-"s that preceded the sequence in the structure entry of the alignment file
(>P1;5bs8
structure:5bs8.pdb:FIRST:B:LAST:B:DNA Gyrase:::) and replacing them with 2 "-"s for S423 and N424, and then again without these 2 preceding "-"s. Both times, I got the same error:
(...... KeyError: 'No such residue: 431:B')

Please advise me on how to fill in missing residues for a chain (e.g. chain B of RCSB PDB 5BS8) that (a) has coordinates only for a middle portion/domain of the full-length protein sequence (say, because only that portion/domain was crystallised and subjected to X-ray crystallography) and (b) has missing residues at the start of the chain (say, due to high B-factors) relative to the sequence associated with the solved structure, as can be seen in PDB viewer software such as UCSF Chimera.

Thanks, and regards,
Siddhartha A. Barua, Ph.D.