On Thu, May 30, 2024 at 9:10 AM Siddhartha Barua via modeller_usage <modeller_usage@salilab.org> wrote:

Dear Ben (Ben Webb, Modeller Caretaker) and Joel (Subach),

Thanks a lot for your tips!

I tinkered with the alignment and Python script files and got Modeller to model the missing residues.

I found two possible solutions to the problem:

1) Put one dash at the start of the structure-derived sequence in the alignment file for each residue that is missing relative to the full-length protein sequence from NCBI RefSeq (Reference Sequence):

For this, I used the following alignment file, in which I explicitly specified the first and last residue positions of the segment that has coordinates (425 to 675, apart from the short 6-residue stretch S(431)ATDIG(436), which has no coordinates). I had added formatting to the relevant portions for emphasis, but I used the plain-text version for modelling:

>P1;5bs8_B
structure:5bs8_B.pdb:425:B:675:B:DNA Gyrase:::
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ALVRRK------GLPGKLADCRSTDPRKSELYVVEGDSAGGSAKSGRDSMFQAILPLRGKIINVEKARIDRVLKNTEVQAIITALGTGIHDEFDIGKLRYHKIVLMADADVDGQHISTLLLTLLFRFMRPLIENGHVFLAQPPLYKLKWQRSDPEFAYSDRERDGLLEAGLKAGKKINKEDGIQRYKGLGEMDAKELWETTMDPSVRVLRQVTLDDAAAADELFSILMGEDVDARRSFITRNAKDVRFLDV*
>P1;5bs8B_fill
sequence:::::::::
MAAQKKKAQDEYGAASITILEGLEAVRKRPGMYIGSTGERGLHHLIWEVVDNAVDEAMAGYATTVNVVLLEDGGVEVADDGRGIPVATHASGIPTVDVVMTQLHAGGKFDSDAYAISGGLHGVGVSVVNALSTRLEVEIKRDGYEWSQVYEKSEPLGLKQGAPTKKTGSTVRFWADPAVFETTEYDFETVARRLQEMAFLNKGLTINLTDERVTQDEVVDEVVSDVAEAPKSASERAAESTAPHKVKSRTFHYPGGLVDFVKHINRTKNAIHSSIVDFSGKGTGHEVEIAMQWNAGYSESVHTFANTINTHEGGTHEEGFRSALTSVVNKYAKDRKLLKDKDPNLTGDDIREGLAAVISVKVSEPQFEGQTKTKLGNTEVKSFVQKVCNEQLTHWFEANPTDAKVVVNKAVSSAQARIAARKARELVRRKSATDIGGLPGKLADCRSTDPRKSELYVVEGDSAGGSAKSGRDSMFQAILPLRGKIINVEKARIDRVLKNTEVQAIITALGTGIHDEFDIGKLRYHKIVLMADADVDGQHISTLLLTLLFRFMRPLIENGHVFLAQPPLYKLKWQRSDPEFAYSDRERDGLLEAGLKAGKKINKEDGIQRYKGLGEMDAKELWETTMDPSVRVLRQVTLDDAAAADELFSILMGEDVDARRSFITRNAKDVRFLDV*

To obtain a string of 424 dashes ("-") without having to type and count them manually, I used the following short Python script and pasted its output at the start of the structure-derived sequence in the alignment file. The script uses Python 3 syntax; it would need modifying for older versions of Python, which use a different syntax for print statements [e.g.: print "hello" vs print("hello")] and do not support f-strings:

"""This script generates dashes. You need to enter the number of dashes to print, when prompted to do so."""
n = int(input("Please enter the number of dashes that you want to print as a contiguous stretch of dashes. Enter a non-zero, positive integer: "))

# String repetition builds the whole run of dashes at once, so no loop is needed.
dashes = "-" * n

print(dashes)
print("\n")
print(f"The number of dashes stored in the variable 'dashes' is {len(dashes)}.")

With this alignment, Modeller built the 424 residues missing from the start of the structure-derived sequence (the first of the two sequences in the file) as one long loop region, without secondary structure. I then simply deleted the unnecessary N-terminal residues (residues 1-422) from each Modeller-generated model in UCSF Chimera and saved the modified PDB file.
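The padding step can also be done directly on the sequence string, so that the number of dashes follows from the first residue number in the structure line (425 here). This is only a minimal sketch: the function name pad_template is made up for illustration, and the example sequence is a shortened copy of the chain B sequence from the alignment above.

```python
def pad_template(structure_seq: str, first_resnum: int) -> str:
    """Prefix the structure-derived sequence with one gap per missing
    N-terminal residue (hypothetical helper, not part of Modeller)."""
    if first_resnum < 1:
        raise ValueError("first_resnum must be 1 or greater")
    # Residues 1 .. first_resnum-1 have no coordinates, so each gets a dash.
    return "-" * (first_resnum - 1) + structure_seq


# Shortened start of the chain B sequence; the first residue with coordinates is 425.
padded = pad_template("ALVRRK------GLPGKLADCRST", 425)

# Count the leading dashes to confirm the padding length.
leading_gaps = len(padded) - len(padded.lstrip("-"))
print(leading_gaps)  # 424
```

The padded string can then be pasted into the alignment file in place of the structure-derived sequence.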

2) Use only the portion of the full-length protein sequence from NCBI (NCBI RefSeq) that corresponds to residues 425-675 as the target sequence (the second sequence listed in the alignment file). This region matches exactly the residues present in the atom/structure file used (a PDB file generated from the original PDB 5BS8 by selecting chain B and saving only the selected atoms as a separate PDB file), except for the 6 residues missing inside this chain (S(431)ATDIG(436)):

For this, I specified the segment that has atom records (coordinates) as 425:B:675:B in the alignment file, as shown below:

>P1;5bs8_B
structure:5bs8_B.pdb:425:B:675:B:DNA Gyrase:::
ALVRRK------GLPGKLADCRSTDPRKSELYVVEGDSAGGSAKSGRDSMFQAILPLRGKIINVEKARIDRVLKNTEVQAIITALGTGIHDEFDIGKLRYHKIVLMADADVDGQHISTLLLTLLFRFMRPLIENGHVFLAQPPLYKLKWQRSDPEFAYSDRERDGLLEAGLKAGKKINKEDGIQRYKGLGEMDAKELWETTMDPSVRVLRQVTLDDAAAADELFSILMGEDVDARRSFITRNAKDVRFLDV*
>P1;5bs8B_fill
sequence:::::::::
ELVRRKSATDIGGLPGKLADCRSTDPRKSELYVVEGDSAGGSAKSGRDSMFQAILPLRGKIINVEKARIDRVLKNTEVQAIITALGTGIHDEFDIGKLRYHKIVLMADADVDGQHISTLLLTLLFRFMRPLIENGHVFLAQPPLYKLKWQRSDPEFAYSDRERDGLLEAGLKAGKKINKEDGIQRYKGLGEMDAKELWETTMDPSVRVLRQVTLDDAAAADELFSILMGEDVDARRSFITRNAKDVRFLDV*

In the Python script file, I only replaced the following line in the definition of the select_atoms function [def select_atoms(self):]
return Selection(self.residue_range('431:B', '436:B'))
with
return Selection(self.residue_range('7:A', '12:A'))

This selects the residues that are allowed to move during model generation/refinement; the rest of the atoms are not allowed to move. The residue ranges '431:B', '436:B' and '7:A', '12:A' both refer to S(431)ATDIG(436). The former uses the residue numbering of the full-length sequence (NCBI RefSeq) and the chain ID of the template, whereas the latter uses the numbering and chain ID that Modeller gives to each newly generated model, which by default start at residue number 1 and chain A.
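The conversion between the two numbering schemes is a simple offset, sketched below. The function names are made up for illustration and are not part of Modeller. The sketch assumes that the target sequence starts at the first template residue with coordinates (425 here), that the template numbering has no insertion codes, and that the model is a single chain with the default label A.

```python
def template_to_model_resnum(template_resnum: int, first_template_resnum: int) -> int:
    """Convert a template (PDB) residue number to the default model numbering,
    which starts at 1 (hypothetical helper)."""
    offset = template_resnum - first_template_resnum
    if offset < 0:
        raise ValueError("residue lies before the start of the modelled segment")
    return offset + 1


def model_residue_range(start: int, end: int, first_template_resnum: int,
                        chain: str = "A") -> tuple:
    """Return the two 'number:chain' strings for the model, in the form
    expected by residue_range()."""
    return (f"{template_to_model_resnum(start, first_template_resnum)}:{chain}",
            f"{template_to_model_resnum(end, first_template_resnum)}:{chain}")


# S(431)ATDIG(436) in a model whose first residue corresponds to template residue 425.
print(model_residue_range(431, 436, 425))  # ('7:A', '12:A')
```

Under the same assumptions, a model built from the full-length target sequence (solution 1) starts at residue 1 of the full-length numbering, so the same stretch would be expected at '431:A' to '436:A' there.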

The two residues at positions 423 and 424 (as per the full-length sequence) could then be modelled as a dipeptide using UCSF Chimera's Build Structure tool and saved as a PDB file. This dipeptide could then be opened in UCSF Chimera along with the Modeller-generated model, and the two chains (Chimera-generated dipeptide and Modeller-generated model) could be joined into a single model by forming a peptide bond between them using the Join Model function/tool in UCSF Chimera.

Note that the start of the sequence in chain B of PDB 5BS8 that has atom records/coordinates (sequence ALVRRK...) differs from the corresponding sequence in the NCBI RefSeq (where it is ELVRRK...) at a single residue. Modeller puts E rather than A at this position, because the model is built with the target sequence, which is the second sequence in the alignment file. So, if I wanted it to be "A" in the model, as in the structure file's sequence, I would need to make this change in the second sequence (the target sequence) listed in the alignment file.
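Differences of this kind can be spotted before running Modeller by comparing the two aligned sequences column by column. The following is only a sketch: find_mismatches is a made-up name, not a Modeller function, and the two strings are shortened copies of the sequences in the alignment above.

```python
def find_mismatches(template_aln: str, target_aln: str) -> list:
    """List columns where both aligned sequences have a residue but the
    residues differ. Positions are 1-based in the target sequence."""
    # Drop the '*' that terminates a sequence in a PIR alignment file.
    template_aln = template_aln.rstrip("*")
    target_aln = target_aln.rstrip("*")
    if len(template_aln) != len(target_aln):
        raise ValueError("aligned sequences must have the same length")

    mismatches = []
    target_pos = 0
    for template_res, target_res in zip(template_aln, target_aln):
        if target_res != "-":
            target_pos += 1
        # Gaps (missing residues) are expected here and are not reported.
        if "-" not in (template_res, target_res) and template_res != target_res:
            mismatches.append((target_pos, template_res, target_res))
    return mismatches


# Shortened start of the two sequences from the second alignment file.
template = "ALVRRK------GLPGKLADCRST*"
target = "ELVRRKSATDIGGLPGKLADCRST*"
print(find_mismatches(template, target))  # [(1, 'A', 'E')]
```

Each tuple gives the position in the target sequence, the residue in the structure-derived sequence, and the residue in the target sequence.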

Thanks, and regards,
Siddhartha

On Wed, May 29, 2024 at 12:30 PM Modeller Caretaker <modeller-care@salilab.org> wrote:
On 5/28/24 11:48 PM, Siddhartha Barua via modeller_usage wrote:
> *KeyError: 'No such residue: 431:B'*

Residues in the model are by default numbered starting at 1 and the
chains labeled alphabetically starting at A. Since you only have a
single chain, it will be labeled A, not B.
See https://salilab.org/modeller/10.5/manual/node23.html
If you want to number the residues differently, see
https://salilab.org/modeller/10.5/manual/node30.html

It looks like you are mistakenly using the template residue numbering here.

Ben Webb, Modeller Caretaker
--
modeller-care@salilab.org https://salilab.org/modeller/
Modeller mail list: https://salilab.org/mailman/listinfo/modeller_usage

--
Siddhartha A. Barua, Ph.D.
Mb.: +91 7777093994