IMP Manual  for IMP version 2.6.0
rnapolii_4_2.md
Stage 4 - Analysis Part 2 {#rnapolii_4_2}
=========================

## Post-clustering analysis
In this stage we perform post-clustering analysis, calculating:

* **Cluster Precision**: the within-cluster precision and between-cluster similarity, measured by RMSD
* **Cluster Accuracy**: the fit of the calculated clusters to the true (known) solution
* **Sampling Exhaustiveness**: qualitative and quantitative measures of sampling completeness

### Cluster Precision (precision_rmsf.py)

The `precision_rmsf.py` script determines the within- and between-cluster RMSD (i.e., the precision). To run it, use:

\code{.sh}
python precision_rmsf.py
\endcode

It generates `precision.*.*.out` files in the `kmeans*` subdirectory, containing the precision information in text format, and `.pdf` files in each cluster directory, showing the within-cluster residue mean square fluctuation. As in earlier scripts, subsets of the structure can be selected for the calculation; in this case, we select the Rpb4 and Rpb7 subunits.

\code{.py}
# choose components for the precision calculation
# key is the named precision item
# value is a list of selection tuples [either "domain_name" or (start,stop,"domain_name") ]
selections={"Rpb4":["Rpb4"],
            "Rpb7":["Rpb7"],
            "Rpb4_Rpb7":["Rpb4","Rpb7"]}
\endcode
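
The comment in the block above mentions a second form of selection entry, `(start,stop,"domain_name")`, that the tutorial's dictionary does not use. The following is a minimal sketch of what such a dictionary could look like. The residue range and the `describe_selection` helper are hypothetical and serve only to show the two entry forms side by side; they are not part of the tutorial scripts or of the %IMP API.

```python
# Hypothetical example: the residue range below is made up for illustration
# and is not taken from the tutorial's scripts.
selections_with_ranges = {
    "Rpb4": ["Rpb4"],                   # whole component, selected by name
    "Rpb7_Nterm": [(1, 80, "Rpb7")],    # residues 1-80 of Rpb7 (assumed range)
    "Rpb4_Rpb7": ["Rpb4", "Rpb7"],      # two whole components together
}


def describe_selection(entry):
    """Return a readable description of one selection entry (hypothetical helper)."""
    if isinstance(entry, str):
        return "all of " + entry
    start, stop, name = entry
    return "residues {}-{} of {}".format(start, stop, name)


for key, entries in selections_with_ranges.items():
    print(key, "->", ", ".join(describe_selection(e) for e in entries))
```

Each key becomes the label under which the precision for that selection is reported.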

The script then sets up a model and a [Precision](@ref IMP::pmi::analysis::Precision) object for the given `selections`, at the resolution desired for the precision calculation (`resolution=1` works at the residue level).

\code{.py}
import IMP
import IMP.pmi.analysis

# setup Precision calculator
model = IMP.Model()
pr = IMP.pmi.analysis.Precision(model,resolution=1,selection_dictionary=selections)
pr.set_precision_style('pairwise_rmsd')
\endcode

Next, the lists of structures that will be passed to the `Precision` object are created. `rmf_list` holds, for each cluster, the RMF files to read, and `frame_list` holds the frame to use from each of those files (here 0, the only frame in each file).

\code{.py}
import glob

# gather the RMF filenames for each cluster
# (root_cluster_directory and test_mode are set earlier in the script)
rmf_list=[]
frame_list=[]
cluster_dirs=glob.glob(root_cluster_directory+'/cluster.*/')
if test_mode:
    # runs on every 10th structure to test if it runs smoothly
    for d in cluster_dirs:
        rmf_list.append(glob.glob(d+'/*.rmf3')[0::10])
        frame_list.append([0]*len(rmf_list[-1]))
else:
    for d in cluster_dirs:
        rmf_list.append(glob.glob(d+'/*.rmf3'))
        frame_list.append([0]*len(rmf_list[-1]))
\endcode
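
To make the test-mode subsampling concrete, here is a small sketch of what the `[0::10]` slice and the matching frame list produce. The file names are made up and stand in for the result of `glob.glob`; nothing here touches real RMF files.

```python
# Made-up file names standing in for the output of glob.glob(d + '/*.rmf3')
cluster_files = ["cluster.0/{}.rmf3".format(i) for i in range(25)]

every_tenth = cluster_files[0::10]   # items 0, 10 and 20 of the list
frames = [0] * len(every_tenth)      # one frame index (0) per selected file

# (file, frame) pairs, in the form later passed to the Precision object
pairs = list(zip(every_tenth, frames))
print(pairs)
```

Note that `glob.glob` does not guarantee any particular ordering, so wrapping it in `sorted()` may be worth considering if a reproducible subset is needed.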

The RMF files and frames of each cluster are then added to the `Precision` object as a structure set named after the cluster directory.

\code{.py}
# add them to the Precision object
for rmfs,frames,cdir in zip(rmf_list,frame_list,cluster_dirs):
    pr.add_structures(zip(rmfs,frames),cdir)
\endcode

The self-precision of each cluster and the precision between each pair of clusters are then calculated, and the output is written to `root_cluster_directory`.

\code{.py}
from itertools import combinations_with_replacement

# calculate intra-cluster and inter-cluster precision
print("calculating precision")
for clus1,clus2 in combinations_with_replacement(range(len(rmf_list)),2):
    pr.get_precision(cluster_dirs[clus1],
                     cluster_dirs[clus2],
                     root_cluster_directory+"/precision."+str(clus1)+"."+str(clus2)+".out")
\endcode
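
The loop above relies on `combinations_with_replacement` to visit every unordered pair of clusters exactly once, including each cluster paired with itself. The sketch below lists the pairs and output file names for an assumed three clusters; the directory name is a placeholder, not a value from the tutorial.

```python
from itertools import combinations_with_replacement

n_clusters = 3                              # assumed number of clusters
root_cluster_directory = "kmeans_example"   # placeholder directory name

pairs = list(combinations_with_replacement(range(n_clusters), 2))
outputs = ["{}/precision.{}.{}.out".format(root_cluster_directory, a, b)
           for a, b in pairs]

for (a, b), path in zip(pairs, outputs):
    # equal indices give the within-cluster (self) precision
    kind = "intra-cluster" if a == b else "inter-cluster"
    print(kind, path)
```

For `n` clusters this yields `n*(n+1)/2` output files: `n` self-precision files plus one file per distinct pair.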

Finally, the RMSF of each residue in the analyzed components is calculated and stored in the corresponding cluster directory (`outdir=d`) as `rmsf.COMPONENT_NAME.dat`.

\code{.py}
# compute residue mean-square fluctuation (RMSF)
print("calculating RMSF")
for d in cluster_dirs:
    pr.get_rmsf(structure_set_name=d,outdir=d)
\endcode

<img src="rnapolii_rmsf.Rpb7.png" width="500px" alt="residue root mean square fluctuations calculated on a cluster of structures" />
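
As a rough illustration of the quantity being plotted, the sketch below computes a per-residue root mean square fluctuation in plain Python. It assumes the common textbook definition (the root of the mean squared deviation of each residue from its average position) and assumes the models are already in a common frame. How `get_rmsf` handles alignment and normalization may differ, so this is not a description of the %IMP implementation.

```python
import math


def rmsf_per_residue(structures):
    """Per-residue RMSF for a list of models.

    Each model is a list of (x, y, z) tuples, one per residue, and all
    models are assumed to be superposed already.
    """
    n_models = len(structures)
    n_residues = len(structures[0])
    result = []
    for i in range(n_residues):
        # mean position of residue i over all models
        mean = [sum(s[i][k] for s in structures) / n_models for k in range(3)]
        # mean squared deviation of residue i from that mean position
        msd = sum(
            sum((s[i][k] - mean[k]) ** 2 for k in range(3)) for s in structures
        ) / n_models
        result.append(math.sqrt(msd))
    return result


# Toy data: residue 0 never moves, residue 1 moves along x.
models = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
print(rmsf_per_residue(models))
```

Residues that occupy nearly the same position in every model of a cluster get a value close to zero, while flexible or poorly determined regions get larger values.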

### Accuracy evaluation (accuracy.py)
We have provided a script to evaluate the accuracy of a model against a native configuration. When run, it enumerates the structures in the first cluster and prints the average and minimum distance between those structures and a given reference model. This is useful for benchmarking, but is of no use when the 'real' structure is not known.

To run it, use:

\code{.sh}
python accuracy.py
\endcode

First, identify the reference structure and the list of RMF structures to use in the calculation:

\code{.py}
import glob

# common settings
reference_rmf = "../data/native.rmf3"
test_mode = False                             # run on every 10 rmf files
rmfs = glob.glob('kmeans_*_1/cluster.0/*.rmf3') # list of the RMFS to calculate on
\endcode

The components that will be compared to the reference must be explicitly enumerated in `selections`:

\code{.py}
selections = {"Rpb4":["Rpb4"],
              "Rpb7":["Rpb7"],
              "Rpb4_Rpb7":["Rpb4","Rpb7"]}
\endcode

Initialize an %IMP `Model` and a `Precision` object with the `selections`, then add the list of RMF structures:

\code{.py}
import IMP
import IMP.pmi.analysis

# setup Precision calculator
model=IMP.Model()
frames=[0]*len(rmfs)
pr=IMP.pmi.analysis.Precision(model,selection_dictionary=selections)
pr.set_precision_style('pairwise_rmsd')
pr.add_structures(zip(rmfs,frames),"ALL")
\endcode

The average distance to the reference structure, in angstroms, is then calculated for each component in `selections` and printed to the screen.

\code{.py}
# calculate average distance to the reference file
pr.set_reference_structure(reference_rmf,0)
print(pr.get_average_distance_wrt_reference_structure("ALL"))
\endcode

The output printed in the terminal looks something like:

    Rpb4 average distance 20.7402052384 minimum distance 11.9324734377
    All average distance 5.05387877292 minimum distance 3.4664144466
    Rpb7 average distance 10.5032807663 minimum distance 5.06599370365
    Rpb4_Rpb7 average distance 16.0757238511 minimum distance 9.63785403195

The average distance is the RMSD with respect to the reference structure, averaged over all models in the cluster; the minimum distance is the smallest of these RMSDs. The program prints values for all selections (`Rpb4`, `Rpb7` and `Rpb4_Rpb7`) and, automatically, for the entire complex (`All`).
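
The two reported numbers can be sketched in a few lines of plain Python. This is only an illustration of "average and minimum RMSD to a reference" on toy coordinates, with no superposition step; the function names are hypothetical, and the way `Precision` actually aligns and compares structures is not shown here.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z), without superposition."""
    squared = sum(
        (a[k] - b[k]) ** 2 for a, b in zip(coords_a, coords_b) for k in range(3)
    )
    return math.sqrt(squared / len(coords_a))


def average_and_minimum_distance(models, reference):
    """Average and minimum RMSD of a set of models to one reference."""
    distances = [rmsd(m, reference) for m in models]
    return sum(distances) / len(distances), min(distances)


# Toy coordinates: two "residues" per model
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
models = [
    [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)],   # shifted by 1 along z
    [(0.0, 3.0, 0.0), (1.0, 3.0, 0.0)],   # shifted by 3 along y
]
average, minimum = average_and_minimum_distance(models, reference)
print("average distance", average, "minimum distance", minimum)
```

A large gap between the average and the minimum suggests that the cluster contains some models close to the reference alongside others that are much further away.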

### Sampling Exhaustiveness
We can also determine sampling exhaustiveness by dividing the models into multiple sets, performing clustering on each set separately, and comparing the clusters. This step is left as an exercise to the reader. To aid with splitting the data, we have added the optional keyword `first_and_last_frames` to the IMP::pmi::macros::AnalysisReplicaExchange0::clustering() method.
If you set this keyword to a tuple (values are fractions of the data between 0 and 1, e.g. [0,0.5] for the first half), only that fraction of the data is analyzed. Some things you can try:
* cluster two subsets of the data
* qualitative analysis: look at the localization densities; they should be similar for the two subsets
* quantitative analysis: combine the cluster results into one folder (rename as needed) and call `precision_rmsf.py`, which will automatically compute cross-precision for the clusters.
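
To see how a pair of fractions splits the data, the sketch below maps them onto frame indices. The `frame_slice` helper is hypothetical, and the rounding rule (truncation with `int`) is an assumption rather than a description of how the clustering macro treats `first_and_last_frames` internally.

```python
def frame_slice(n_frames, first_and_last):
    """Map a (first, last) pair of fractions onto a range of frame indices.

    Hypothetical helper; truncating with int() is an assumed rounding rule.
    """
    first, last = first_and_last
    if not 0.0 <= first < last <= 1.0:
        raise ValueError("fractions must satisfy 0 <= first < last <= 1")
    return range(int(first * n_frames), int(last * n_frames))


n_frames = 1000   # assumed total number of frames
first_half = frame_slice(n_frames, (0.0, 0.5))
second_half = frame_slice(n_frames, (0.5, 1.0))
print(len(first_half), len(second_half))
```

Choosing `(0, 0.5)` and `(0.5, 1)` gives two subsets that do not overlap, which is what the comparison between independently clustered sets requires.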

If the sampling is exhaustive, then similar clusters should be obtained from each independent set, and the inter-cluster precision (an RMSD) between two equivalent clusters should be very low. That is, there should be a 1:1 correspondence between the two sets of clusters, though the ordering may be different.
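
One way to check for such a correspondence, sketched here under stated assumptions, is to collect the cross-precision values into a matrix and pair each cluster of the first set with its closest cluster in the second. The `match_clusters` helper and the numbers are made up for illustration; reading the values from the `precision.*.*.out` files is left out because their exact layout is not described here.

```python
def match_clusters(cross_rmsd):
    """Pair each cluster of set A with its closest cluster of set B.

    cross_rmsd[i][j] is the RMSD between cluster i of set A and cluster j
    of set B. Returns the list of (i, j, rmsd) pairs and a flag telling
    whether the pairing is one-to-one.
    """
    pairs = []
    for i, row in enumerate(cross_rmsd):
        j = min(range(len(row)), key=row.__getitem__)
        pairs.append((i, j, row[j]))
    matched = [j for _, j, _ in pairs]
    one_to_one = (len(cross_rmsd) == len(cross_rmsd[0])
                  and len(set(matched)) == len(matched))
    return pairs, one_to_one


# Made-up cross-precision values, in angstroms
cross_rmsd = [
    [12.0, 2.5, 15.0],
    [3.0, 14.0, 11.0],
    [13.0, 16.0, 2.0],
]
pairs, one_to_one = match_clusters(cross_rmsd)
print(pairs, one_to_one)
```

In this toy case every cluster has exactly one close partner in the other set, in a different order, which is the pattern expected from exhaustive sampling.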