Dear Modeller Users,

I am trying to use Modeller to model a comparative protein structure for the type I collagen.
I came to this trying to reproduce a model from this paper: doi 10.1021/nl103943u ( Nano Lett. 2011, 11, 2, 757–766 )
It also used Modeller to create such a model. I contacted the 1st author for instructions on how he created the model, but unfortunately he said not to remember anymore (it has been 10 years since).
I also want to mention that before posting this, I read the Modeller manual and several questions and answers posted in the mail listing. They helped me get to this point.

In the following I will describe in some details what I have done, with what files and how:
All necessary files for reproduction are attached in this email (I tried to attach all files - pdb, target fasta, outputs, etc. - , but it exceeded the limit of 100 KB).

1 - Target: P02452 (CO1A1_HUMAN) and P08123 (CO1A2_HUMAN)
from https://www.uniprot.org/uniprot/P02452 and https://www.uniprot.org/uniprot/P08123

The Targets are in the Fasta format. I converted them to .ali files in the PIR format, as suggested by Ben Webb in https://salilab.org/archives/modeller_usage/2010/msg00072.html .
(see FASTAtoPIR.py in the attached files).
I manually edited the protein code and field2 of the .ali files (based on what I read in the manual - File formats B.1 Alignment file (PIR))
That is how it looks like now:

COL1A1.ali
___________________________________________________________________
>P1;COL1A1
sequence:COL1A1:     : :     : :::-1.00:-1.00
MFSFVDLRLLLLLAATALLTHGQEEGQVEGQDEDIPPITCVQNGLRYHDRDVWKPEPCRICVCDNGKVLCDDVIC
...*
___________________________________________________________________


2 - Template: 3hr2.pdb
from https://www.rcsb.org/structure/3HR2
It is very important to me to get a model as close to this structure as possible, since it comprises the collagen fibrillar structure.


3 - Alignment: I decided to align each chain and sequence individually, as many answers in the mail listing suggested.

First of all I tried to align the COL1A1.ali with the chain A of the 3hr2.pdb structure.
I aligned the sequences using salign (align2d was taking too long).

The 3hr2.pdb contains 2 non standard amino acids:
HYP hydroxyproline
LYZ Hydroxylysine

I added - env.io.hetatm = True - in order to read them.

1A_salign.py
___________________________________________________________________
env = environ()
aln = alignment(env)
env.io.hetatm = True

mdl = model(env, file='3hr2', model_segment=('FIRST:A','LAST:A'))

aln.append_model(mdl, align_codes='3hr2A', atom_files='3hr2.pdb')
aln.append(file='COL1A1.ali', align_codes='COL1A1')
aln.salign()
aln.write(file='COL1A1-3hr2A.ali', alignment_format='PIR')
aln.write(file='COL1A1-3hr2A.pap', alignment_format='PAP')
___________________________________________________________________

I added to restyp.lib:
HETATM | HYP                 | O |   | HYP  | hydroxyproline
HETATM | LYZ                  | K |   | LYS  | lysine, +1

I have HYP topology and parameters, but since I don't have them for LYZ I chose to treat LYZ as LYS (Lysine).
I do not know if I can do it. Is it better to do so or to simply let Modeller continue with the following warning?

read_pd_459W> Residue type  LYZ not recognized. 'automodel' model building will treat this residue as a rigid body.


4 - Automodel: I created 10 models and chose the one with lowest DOPE score (all scored 1.0 in the GA341 score), in my case, COL1A1.B99990010.pdb.

I included HYP topology and parameters information to the top_heav.lib and par.lib files from the Modeller Library. Furthermore I added the HB1 (new type in HYP top) atom type information to solv.lib, radii.lib and radii14.lib.
The HYP information was taken from top_all36_prot_modify_res.str and par_all36_prot_modify_res.str in the CHARMm FF.

2A_model-single.py
___________________________________________________________________
from modeller.automodel import *
env = environ()
env.io.hetatm = True
env.libs.topology.read('${LIB}/top_heav.lib')
env.libs.parameters.read('$(LIB)/par.lib') # from manual

a = automodel(env, alnfile='COL1A1-3hr2A.ali',
              knowns='3hr2A', sequence='COL1A1',
              assess_methods=(assess.DOPE,
                              assess.GA341))
                             
a.starting_model = 1
a.ending_model = 10
a.make()
___________________________________________________________________


5 - Final model: I also ran the 3A_evaluate_model and 4A_plot_profiles python scripts and after that opened the 3hr2 A chain and the final model in VMD.

The model looks good, however, it is not as close to the 3hr2 structured as I wished. As a mechanical engineer I am no used to those models nor to homology modeling and have been learning all here by reading the manual, instructions and mail listing. Did I make many/any mistakes? Where and what? Can the model become better (more close to the 3hr2 template structure)? If yes, how? How to know if a model is "good enough"? Can I proceed like that and create the models for the chains B and C and complete the collagen structure?

I am afraid to lose the fibrillar collagen structure (which in this case is similar to a crystalline periodic structure) from 3hr2.pdb if the model is not very similar to the the 3hr2 structure.

Thank you in advance for reading up to here. I am looking forward to reading your observations, suggestions and corrections.
They are all very welcome.

Amadeus