Stage 4 - Analysis Part 2 {#rnapolii_4_2}
=========================

## Post-clustering analysis
In this stage we perform post-clustering analysis. Here, we will perform calculations for:

* **Cluster Precision**: Determining the within-group precision and between-group similarity via RMSD
* **Cluster Accuracy**: Fit of the calculated clusters to the true (known) solution
* **Sampling Exhaustiveness**: Qualitative and quantitative measurement of sampling completeness

### Cluster Precision (precision_rmsf.py)

The `precision_rmsf.py` script can be used to determine the within- and between-cluster RMSD (i.e., precision). To run, use:

\code{.sh}
python precision_rmsf.py
\endcode

It generates `precision.*.*.out` files in the `kmeans*` subdirectory containing precision information in text format, and in each cluster directory it generates `.pdf` files showing the within-cluster per-residue root-mean-square fluctuation (RMSF). As in earlier scripts, subsets of the structure can be selected for the calculation; in this case, we select the Rpb4 and Rpb7 subunits.

\code{.py}
# choose components for the precision calculation
# key is the named precision item
# value is a list of selection tuples [either "domain_name" or (start,stop,"domain_name") ]
selections={"Rpb4":["Rpb4"],
            "Rpb4_Rpb7":["Rpb4","Rpb7"]}
\endcode

The script then sets up a model and a [Precision](@ref IMP::pmi::analysis::Precision) object for the given `selections` at the desired resolution for computation of the precision (`resolution=1` specifies the residue level).

\code{.py}
# setup Precision calculator
model=IMP.Model()
pr = IMP.pmi.analysis.Precision(model,resolution=1,selection_dictionary=selections)
pr.set_precision_style('pairwise_rmsd')
\endcode

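
To make the `pairwise_rmsd` style concrete, here is a minimal pure-Python sketch of the quantity involved: the mean RMSD over all pairs of structures in a set. It assumes the structures are already superposed and are given as equal-length lists of `(x, y, z)` coordinates. The function names `rmsd` and `mean_pairwise_rmsd` are illustrative and are not part of IMP, whose implementation may differ in detail.

```python
import math
from itertools import combinations


def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) tuples (no fitting)."""
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of points")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def mean_pairwise_rmsd(structures):
    """Average RMSD over all distinct pairs of structures in one set."""
    pairs = list(combinations(structures, 2))
    if not pairs:
        return 0.0
    return sum(rmsd(a, b) for a, b in pairs) / len(pairs)


# Three toy "structures" of two points each
s1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
s2 = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]  # s1 shifted by 1 along z
s3 = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]  # s1 shifted by 2 along z
print(mean_pairwise_rmsd([s1, s2, s3]))
```

A small mean pairwise RMSD indicates a tight (precise) cluster, while a large value between two different sets indicates that the clusters are structurally distinct.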
Next, lists of structures are created that will be passed to the `Precision` object. `rmf_list` holds the `.rmf` files for each cluster, and the matching entry in `frame_list` gives the frame to read from each file (in this case 0, the only frame in each rmf).

\code{.py}
# gather the RMF filenames for each cluster
rmf_list=[]
frame_list=[]
cluster_dirs=glob.glob(root_cluster_directory+'/cluster.*/')
# test_mode is assumed to be set earlier in the script
if test_mode:
    # use only every 10th structure to test that the script runs smoothly
    for d in cluster_dirs:
        rmf_list.append(glob.glob(d+'/*.rmf3')[0::10])
        frame_list.append([0]*len(rmf_list[-1]))
else:
    for d in cluster_dirs:
        rmf_list.append(glob.glob(d+'/*.rmf3'))
        frame_list.append([0]*len(rmf_list[-1]))
\endcode

The lists of rmfs and frames are then added to the `Precision` object, one structure set per cluster directory:

\code{.py}
# add them to the Precision object
for rmfs,frames,cdir in zip(rmf_list,frame_list,cluster_dirs):
    # list() materializes the (rmf, frame) pairs; zip is lazy in Python 3
    pr.add_structures(list(zip(rmfs,frames)),cdir)
\endcode

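
The bookkeeping in the two steps above can be tried out without any RMF files. The sketch below uses made-up file names (`cluster.0/0.rmf3` and so on) to show how the `[0::10]` slice keeps every tenth file, how the matching frame list is built, and what pairs `zip` produces.

```python
# Made-up file names standing in for the result of glob.glob(d + '/*.rmf3')
fake_rmfs = ["cluster.0/%d.rmf3" % i for i in range(25)]

# Sub-sampling: keep every 10th file (indices 0, 10, 20)
subset = fake_rmfs[0::10]

# One frame index per file; each file holds a single frame, numbered 0
frames = [0] * len(subset)

# Pair each file with its frame, as done before calling add_structures
pairs = list(zip(subset, frames))
print(pairs)
```

Note that `glob.glob` does not guarantee any particular ordering, so which files end up in the sub-sample can vary between systems unless the list is sorted first.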
Intra-cluster (self) precision and inter-cluster precision are then calculated for every pair of clusters in `rmf_list`, and the output is placed in `root_cluster_directory`.

\code{.py}
# calculate intra-cluster and inter-cluster precision
# (combinations_with_replacement comes from itertools)
print("calculating precision")
for clus1,clus2 in combinations_with_replacement(range(len(rmf_list)),2):
    pr.get_precision(cluster_dirs[clus1],
                     cluster_dirs[clus2],
                     root_cluster_directory+"/precision."+str(clus1)+"."+str(clus2)+".out")
\endcode

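
`combinations_with_replacement` visits every unordered pair of cluster indices, including each cluster paired with itself, which is why a single loop yields both intra- and inter-cluster precision. The sketch below lists the pairs and the output file names for three clusters, following the naming pattern used in the script. The cluster count, the directory name `kmeans_dir`, and the helper `precision_outputs` are illustrative assumptions.

```python
from itertools import combinations_with_replacement


def precision_outputs(n_clusters, root="kmeans_dir"):
    """List (i, j, filename) for every intra- and inter-cluster comparison.

    `root` is a placeholder directory name, not one from the tutorial.
    """
    out = []
    for i, j in combinations_with_replacement(range(n_clusters), 2):
        out.append((i, j, root + "/precision." + str(i) + "." + str(j) + ".out"))
    return out


for i, j, name in precision_outputs(3):
    kind = "intra" if i == j else "inter"
    print(kind, name)
```

For `n` clusters this produces `n*(n+1)/2` comparisons, so three clusters give six output files.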
Finally, the RMSF of each residue in the analyzed components is calculated and stored in `rmsf.COMPONENT_NAME.dat` files, written to the directory passed as `outdir` (here, each cluster directory).

\code{.py}
# compute per-residue root-mean-square fluctuation (RMSF)
print("calculating RMSF")
for d in cluster_dirs:
    pr.get_rmsf(structure_set_name=d,outdir=d)
\endcode

<img src="rnapolii_rmsf.Rpb7.png" width="500px" alt="residue root mean square fluctuations calculated on a cluster of structures" />

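
As a rough illustration of what an RMSF profile measures, the sketch below computes, for each residue, the root-mean-square deviation of its position from its mean position across a set of structures. It assumes pre-aligned structures given as lists of `(x, y, z)` tuples, one per residue. `per_residue_rmsf` is a hypothetical helper, and IMP's `get_rmsf` may differ in detail.

```python
import math


def per_residue_rmsf(structures):
    """RMSF per residue over a list of structures.

    Each structure is a list of (x, y, z) tuples, one per residue, and all
    structures are assumed to be superposed already.
    """
    n_struct = len(structures)
    n_res = len(structures[0])
    rmsf = []
    for r in range(n_res):
        # mean position of residue r over all structures
        mean = [sum(s[r][k] for s in structures) / n_struct for k in range(3)]
        # mean squared displacement from that mean position
        msd = sum(
            sum((s[r][k] - mean[k]) ** 2 for k in range(3)) for s in structures
        ) / n_struct
        rmsf.append(math.sqrt(msd))
    return rmsf


# Residue 0 never moves; residue 1 sits at x = -1 or x = +1
toy = [
    [(0.0, 0.0, 0.0), (-1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
]
print(per_residue_rmsf(toy))
```

Residues with a high RMSF are the ones whose positions vary most within the cluster, which is what peaks in a plot like the one above indicate.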
### Accuracy evaluation (accuracy.py)
We have provided a script to evaluate the accuracy of a model against a native configuration. When run, it enumerates the structures in the first cluster and prints the average and minimum distance between those structures and a given reference model. This is useful for benchmarking (but obviously is of no use when we don't know the 'real' structure).

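
The two numbers the script reports can be sketched in a few lines. The example assumes the distances (for example, RMSDs) between each cluster member and the reference have already been computed. The file names and values are made up, and `summarize_accuracy` is a hypothetical helper rather than a function from `accuracy.py`.

```python
def summarize_accuracy(distances):
    """Return (average, minimum, best_name) from a {name: distance} mapping."""
    if not distances:
        raise ValueError("no structures to evaluate")
    average = sum(distances.values()) / len(distances)
    best_name = min(distances, key=distances.get)
    return average, distances[best_name], best_name


# Made-up distances (in angstroms) between cluster members and the reference
toy_distances = {"0.rmf3": 12.0, "1.rmf3": 9.0, "2.rmf3": 15.0}
average, minimum, best = summarize_accuracy(toy_distances)
print("average: %.1f  minimum: %.1f (%s)" % (average, minimum, best))
```

The average describes how close the cluster as a whole is to the reference, while the minimum shows how close the single best model gets.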
First, identify the reference structure and the list of `.rmf` structures to use in the calculation:

reference_rmf = "../data/native.rmf3"
test_mode = False                      # if True, run on every 10th rmf file only
rmfs = glob.glob('kmeans_*_1/cluster.0