Daniel, 

I think we should separate the discussion into two cases:
- fine coarsening (up to 5 residues)
- coarse coarsening (more than 5 residues)

For fine coarsening I think the helper function is fine and most restraints would work well with it.
For coarse coarsening I agree that a covering sphere is not the best solution and that clustering-based techniques (such as the GMM used in MultiFit) are better. I tested a few cases today, and GMM is indeed able to overcome outliers (such as helices) and generate better spheres. Once I migrate my code into IMP, we can use GMM as a helper function to generate low-resolution representations of proteins.

Coarse coarsening, on the other hand, is needed mostly for flexible chain representation, for which we cannot apply either of these methods since we do not have a structure, so I think that for now relaxing the restraints is sufficient.
Anyway, representation is a work in progress, so let's first finish testing the various alternatives before imposing solutions in IMP.
As for moving the residue-based one from helper to em, I do not think it is necessary, as it will be used by others for fine coarsening (from a discussion I had today with Hao and Jeremy). But as long as the function is there, I care less about its exact location :)

Keren.
On Nov 1, 2009, at 11:16 AM, Daniel Russel wrote:


On Nov 1, 2009, at 9:47 AM, Keren Lasker wrote:





- helper.create_simplified_by_residue needs to be thought about, since its current method of assigning radii doesn't make sense for anything other than density-based restraints (so it may make sense to move it to em).
And, I should add, when you know residue-residue proximities.



How else do you propose to define radius other than the particles' sphere cover?
Radius of gyration, or a radius such that the sphere volume matches that of the k residues, or something else that doesn't go all crazy when (depending on the scale) you have beta strands or alpha helices.
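
To make the three candidates concrete, here is a minimal sketch in plain Python (not IMP calls). The function names, the straight-line "strand" and the per-residue volume of 130 cubic angstroms are all assumptions made up for illustration, and the cover is taken around the centroid rather than as a true minimal enclosing sphere.

```python
import math

def centroid(points):
    """Mean position of a list of (x, y, z) tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(3))

def cover_radius(points):
    """Radius of a sphere centered at the centroid that contains every point."""
    c = centroid(points)
    return max(math.dist(c, p) for p in points)

def radius_of_gyration(points):
    """Root-mean-square distance of the points from their centroid."""
    c = centroid(points)
    return math.sqrt(sum(math.dist(c, p) ** 2 for p in points) / len(points))

def volume_matched_radius(total_volume):
    """Radius of the sphere whose volume equals total_volume."""
    return (3.0 * total_volume / (4.0 * math.pi)) ** (1.0 / 3.0)

# Hypothetical input: 10 residues on a straight line, 3.8 A apart.
RESIDUE_VOLUME = 130.0  # assumed average residue volume in A^3, illustration only
strand = [(3.8 * i, 0.0, 0.0) for i in range(10)]

print("cover:", cover_radius(strand))
print("gyration:", radius_of_gyration(strand))
print("volume-matched:", volume_matched_radius(len(strand) * RESIDUE_VOLUME))
```

For an extended segment like this one the cover is the largest of the three and the volume-matched radius the smallest, which is the behavior being discussed for strands and helices.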

Given a molecule, interesting geometric aspects include
1) residue locations
2) regions of space occupied by the molecule
3) regions of space free from the protein
4) centers of mass
5) total volume

Each of these is required for different sorts of restraints. For example, EM fitting requires that 4 have bounded error, residue proximities require that 1 have bounded error, and packing a bunch of molecules to form a complex requires that the bounds on 5 be accurate, and preferably the bounds on 2 and 3 as well.

If we are generating a rigid model to approximate a given pdb, we should be able to get all of them (helper.create_simplified() can be trivially modified to do so, but is slow). Given your experience with clustering for finding centers for em-fitting, a faster approach might be to cluster the density and then put spheres at these cluster centers. We can then measure the error for all of the above and increase the number of spheres as needed until the error matches the tolerance passed as a parameter to the function.
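
A rough sketch of that loop, under stated assumptions: it uses plain k-means with farthest-point seeding as a stand-in for GMM or density clustering, it clusters residue coordinates rather than a density map, and it measures only one of the errors listed above (how far a residue is from its sphere center). The names kmeans and coarsen_to_tolerance are hypothetical, not IMP functions.

```python
import math

def mean(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(3))

def assign(points, centers):
    """Group points by nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        j = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        clusters[j].append(p)
    return clusters

def kmeans(points, k, iterations=50):
    """Plain k-means with deterministic farthest-point seeding."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    for _ in range(iterations):
        clusters = assign(points, centers)
        updated = [mean(cl) if cl else c for cl, c in zip(clusters, centers)]
        if updated == centers:
            break
        centers = updated
    return centers, assign(points, centers)

def coarsen_to_tolerance(points, tolerance):
    """Add spheres until every point is within tolerance of its sphere center."""
    for k in range(1, len(points) + 1):
        centers, clusters = kmeans(points, k)
        error = max(math.dist(p, c) for c, cl in zip(centers, clusters) for p in cl)
        if error <= tolerance:
            break
    # Each sphere covers the points of its own cluster.
    spheres = [(c, max(math.dist(p, c) for p in cl))
               for c, cl in zip(centers, clusters) if cl]
    return spheres, error

# Hypothetical input: two small, well-separated groups of residues.
blobs = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
         (20.0, 0.0, 0.0), (21.0, 0.0, 0.0), (20.0, 1.0, 0.0)]
spheres, error = coarsen_to_tolerance(blobs, tolerance=2.0)
print(len(spheres), "spheres, max error", error)
```

Errors in volume, center of mass or occupied space could be checked in the same loop by replacing the single error line with the maximum over several measures.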

I don't see that doing it along the backbone makes any sense after 4ish residues as the set of shapes that those residues occupy can vary too much to be represented by a sphere. And if you are holding the structure rigid (or only letting it change a bit), you don't gain anything from having particles represent consecutive residues (and if it is non-rigid, we will have some serious issues with preventing it from blowing up). Is there something?

One issue that this raises again is that we use radius for several different purposes.
- for proximity detection, we want to know the maximum extents of an object: that is, the size of the space a residue could possibly be in
- for packing we want the core region of space that it occupies, which will always be smaller than the maximum extent

We could separate the two, but that would be a reasonably significant amount of work making sure various classes use the right one. It might be worth it, though. If we do that, then
- restraints that force things to be close together (residue-residue proximities for example) could use the extents
- restraints that force things apart (excluded volume) could use the core radius

Then, a simple simplification procedure which
- uses the cover of the residues to produce an extents radius
- uses the volume of the residues to produce a core radius
would be pretty OK for almost any way one splits the residues. Clustering them would still be better than chopping along the backbone when coarsening a lot.
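
As a sketch of what the two-radius idea could look like (plain Python, not existing IMP classes; Bead, make_bead, the two check functions and the residue volume are all hypothetical names and numbers):

```python
import math
from dataclasses import dataclass

@dataclass
class Bead:
    center: tuple
    extents_radius: float  # cover of the residues: where they could possibly be
    core_radius: float     # sphere with the same volume as the residues

def make_bead(points, total_volume):
    """Build one bead from residue coordinates and their summed volume."""
    n = len(points)
    center = tuple(sum(p[d] for p in points) / n for d in range(3))
    extents = max(math.dist(center, p) for p in points)
    core = (3.0 * total_volume / (4.0 * math.pi)) ** (1.0 / 3.0)
    return Bead(center, extents, core)

def proximity_satisfied(a, b, max_gap=0.0):
    """Restraints that pull things together use the extents radius."""
    gap = math.dist(a.center, b.center) - a.extents_radius - b.extents_radius
    return gap <= max_gap

def excluded_volume_violation(a, b):
    """Restraints that push things apart use the core radius."""
    overlap = a.core_radius + b.core_radius - math.dist(a.center, b.center)
    return max(0.0, overlap)

# Hypothetical input: two parallel 10-residue strands, 20 A apart.
RESIDUE_VOLUME = 130.0  # assumed average residue volume in A^3, illustration only
strand_a = [(3.8 * i, 0.0, 0.0) for i in range(10)]
strand_b = [(3.8 * i, 20.0, 0.0) for i in range(10)]
a = make_bead(strand_a, len(strand_a) * RESIDUE_VOLUME)
b = make_bead(strand_b, len(strand_b) * RESIDUE_VOLUME)

print(proximity_satisfied(a, b), excluded_volume_violation(a, b))
```

With these made-up numbers the two beads count as possibly in contact for a proximity restraint, yet do not clash for excluded volume, which a single covering radius could not express.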

Does this make sense?
_______________________________________________
IMP-dev mailing list
IMP-dev@salilab.org
https://salilab.org/mailman/listinfo/imp-dev