Forwarded to the list for reference.
-------- Original Message -------- Subject: PDB lib Date: Tue, 13 Nov 2007 16:06:24 -0800 From: Daniel Russel drussel@gmail.com Reply-To: drussel@salilab.org To: Ben Webb ben@salilab.org References: 3682E173-8139-44FE-8BB0-C2739807D71E@salilab.org 473A0947.9030501@salilab.org 473A09AD.2000603@gmail.com 473A31BC.1060000@salilab.org
> > At any rate, this PDB reader stuff needs to be discussed on imp-dev > before we proceed. For example, what's wrong with the BALL stuff you > were playing with before? BALL is dead. No activity on email list. No response to bugs. No move to actually document their newest version even though it was released a year ago. I don't think we want to tie ourselves to it. Sure we can take it to IMP dev. No one else seems to care much :-)
I have looked around and asked around and couldn't find any decent PDB readers (in C or C++) which are not buried in some huge project. > Why can't we link against this PDB library, rather than > cut-and-pasting thousands of lines of code? The nice thing about it is that it is small and simple and mine so we can just ship it along with IMP and not worry about dependencies, name collisions etc. I don't want people to have to get another library from somewhere else, hence my desire to put a copy into imp svn. Soon enough the lib will make it to fedora extras (whenever the next CGAL release is) so we could potentially just use that. > Should the kernel include a PDB reader anyway? It doesn't matter if it is in the kernel or a separate lib that is brought along, especially since we are using dynamic linking (so the library can be pulled in my libimp). Everyone seems to want a PDB reader, including me so we need one somewhere. > My concerns are 1. it'll be a maintenance nightmare to fork this PDB > library, Not at all. It doesn't actually need forking other than changing #include <CGAL/PDB to #include <IMP/PDB to avoid stomping on things. I really with there were some way to parameterize such things.
It was written to be self-contained, I just got sloppy with the last few changes so I had to fix those this morning).
> 2. it's LGPL so we can't include it in IMP without being forced to > adopt LGPL (only link against it) Why would we want to do more than link against it? Anyway, changing the license is easy. It was initially free, I just changed the license recently because much of the rest of CGAL is LGPL. > and 3. from a brief reading, it looks like a not-very-good PDB library > anyway (hard-coded atom names - what's with that?) Well, it is either that or use strings which pushes the checks to runtime rather than compile time. Adding to an enum and recompiling is trivial (and adding a constant externally works just as well for must purposes). Checking everywhere than an object falls in a small set of allowed strings is hard (especially if you can't specify that set of strings anywhere). BALL has hardcoded atoms for that matter (just a lot more of them :-)
Anyway, I am a fan of it (as I should be :-) And since our backups are rather less than I thought, the ball code got lost with the constant rechecking out of IMP.
Ben Webb wrote: > Forwarded to the list for reference. > > -------- Original Message -------- > Subject: PDB lib > Date: Tue, 13 Nov 2007 16:06:24 -0800 > From: Daniel Russel drussel@gmail.com > Reply-To: drussel@salilab.org > To: Ben Webb ben@salilab.org > References: 3682E173-8139-44FE-8BB0-C2739807D71E@salilab.org > 473A0947.9030501@salilab.org 473A09AD.2000603@gmail.com > 473A31BC.1060000@salilab.org > > >> At any rate, this PDB reader stuff needs to be discussed on imp-dev >> before we proceed. For example, what's wrong with the BALL stuff you >> were playing with before? >> > BALL is dead. No activity on email list. No response to bugs. No move to > actually document their newest version even though it was released a > year ago. I don't think we want to tie ourselves to it. Sure we can take > it to IMP dev. No one else seems to care much :-) > what about that one? seems to be the open source default in crystallography these days. they have good funding from electron microscopy as well.
can also be accessed from python:
http://www.phenix-online.org/documentation/what-is-phenix.htm > I have looked around and asked around and couldn't find any decent PDB > readers (in C or C++) which are not buried in some huge project. > >> Why can't we link against this PDB library, rather than >> cut-and-pasting thousands of lines of code? >> > The nice thing about it is that it is small and simple and mine so we > can just ship it along with IMP and not worry about dependencies, name > collisions etc. I don't want people to have to get another library from > somewhere else, hence my desire to put a copy into imp svn. Soon enough > the lib will make it to fedora extras (whenever the next CGAL release > is) so we could potentially just use that. > >> Should the kernel include a PDB reader anyway? >> > It doesn't matter if it is in the kernel or a separate lib that is > brought along, especially since we are using dynamic linking (so the > library can be pulled in my libimp). Everyone seems to want a PDB > reader, including me so we need one somewhere. > >> My concerns are 1. it'll be a maintenance nightmare to fork this PDB >> library, >> > Not at all. It doesn't actually need forking other than changing > #include <CGAL/PDB to #include <IMP/PDB to avoid stomping on things. I > really with there were some way to parameterize such things. > > It was written to be self-contained, I just got sloppy with the last few > changes so I had to fix those this morning). > > >> 2. it's LGPL so we can't include it in IMP without being forced to >> adopt LGPL (only link against it) >> > Why would we want to do more than link against it? Anyway, changing the > license is easy. It was initially free, I just changed the license > recently because much of the rest of CGAL is LGPL. > >> and 3. from a brief reading, it looks like a not-very-good PDB library >> anyway (hard-coded atom names - what's with that?) >> > Well, it is either that or use strings which pushes the checks to > runtime rather than compile time. Adding to an enum and recompiling is > trivial (and adding a constant externally works just as well for must > purposes). Checking everywhere than an object falls in a small set of > allowed strings is hard (especially if you can't specify that set of > strings anywhere). BALL has hardcoded atoms for that matter (just a lot > more of them :-) > > Anyway, I am a fan of it (as I should be :-) And since our backups are > rather less than I thought, the ball code got lost with the constant > rechecking out of IMP. > _______________________________________________ > IMP-dev mailing list > IMP-dev@salilab.org > https://salilab.org/mailman/listinfo/imp-dev >
>> At any rate, this PDB reader stuff needs to be discussed on imp-dev >> before we proceed. For example, what's wrong with the BALL stuff you >> were playing with before? > BALL is dead. No activity on email list. No response to bugs. No move to > actually document their newest version even though it was released a > year ago. I don't think we want to tie ourselves to it. Sure we can take > it to IMP dev. No one else seems to care much :-)
People certainly care (they keep coming to talk to me, anyway). But I guess they don't like writing emails.
If that's really the case for BALL, then we should probably explore other possibilities, as per Frido's email. I know that BALL's Python interface is rather lacking, certainly.
> I have looked around and asked around and couldn't find any decent PDB > readers (in C or C++) which are not buried in some huge project. >> Why can't we link against this PDB library, rather than >> cut-and-pasting thousands of lines of code? > The nice thing about it is that it is small and simple and mine so we > can just ship it along with IMP and not worry about dependencies, name > collisions etc. I don't want people to have to get another library from > somewhere else, hence my desire to put a copy into imp svn. Soon enough > the lib will make it to fedora extras (whenever the next CGAL release > is) so we could potentially just use that.
If it's an external library, it should be a dependency, not part of IMP. Otherwise, regardless of whether you describe it as a "fork", it'll fork as versions of it elsewhere change. CGAL source control sounds like the best place for it if it's going to be part of CGAL. Embedded copies of other projects are a great way to ensure that bugs never get fixed (think of all the projects that bundle zlib).
>> and 3. from a brief reading, it looks like a not-very-good PDB library >> anyway (hard-coded atom names - what's with that?) > Well, it is either that or use strings which pushes the checks to > runtime rather than compile time. Adding to an enum and recompiling is > trivial (and adding a constant externally works just as well for must > purposes). Checking everywhere than an object falls in a small set of > allowed strings is hard (especially if you can't specify that set of > strings anywhere). BALL has hardcoded atoms for that matter (just a lot > more of them :-)
A PDB reader which needs to be recompiled for every new HETATM type is simply not going to work. See http://www.bmrb.wisc.edu/elec_dep/pdb_het_library/pdbhetn.htm for example. Hao's project absolutely requires HETATMs, for example. And I don't share your concern for runtime checks, since PDB reading is not performance-critical.
Any PDB reader that we adopt needs to be extendable at runtime. Even Modeller can do that. PyMol, for example, has a library of HETATM fragments (stored as Python pickles, I believe). It also needs to be extensible to be able to read PDBML or possibly MMCIF.
Everybody and his dog has written a PDB reader. Andrej wrote one. Maya wrote one. Javi wrote one. Keren wrote one. There's one in biopython, one in BALL, one in PyMol, one in Chimera, and one in Biskit, all free and widely available software. I can't believe we have to burden the world with another one.
Ben
I think Ben and I have a disagreement about the significance of picking a PDB reader at this point. As I see it, any PDB reading should be hidden behind some very general interface (I have a proposal using the MolecularHierarchyDecorator). As a result, the code doing the actually reading doesn't matter too much and so we shouldn't worry now about whether we have the perfect solution. We should pick something easy which does what we want and minimizes the amount of effort people have to make.
When someone needs something more, we can either modify our current solution or chuck it and write an adaptor to someone else's code. Writing these adaptors is trivial given a backend which supports all we want.
I picked my pdb reader since it is small and so can be stuck in the with rest of imp so that no one has to worry about installing external libraries and it does what I want, namely give me a hierarchy for proteins and a bond information.
The alternatives I have looked at so far are
BALL- probably dead although it took a breath this morning
the one Frido sent- I don't see how to get bonds out of it, but otherwise fine. The documentation really sucks, so I might be missing something.
a few random other C and C++ ones all of which are either embedded in a large library, suck, don't give me bonds etc.
One area I haven't looked is the in python only readers. There may be some good ones there.
> > If it's an external library, it should be a dependency, not part of > IMP. It depends. Having obscure external dependencies is a good way to piss off people who have to install your software. It is a tradeoff. I would say that we shouldn't depend on anything not in fedora or fedora-extras for something basic like this. >
> Hao's project absolutely requires HETATMs, for example. And I > don't share your concern for runtime checks, since PDB reading is not > performance-critical. My concern on checks was not for efficiency, it was for correctness. Depending on strings is poor as capitalization or abbreviation errors don't easily get caught.
"it's just a reader" - basically read particles to be stored in imp - can be writtern in python,c++ does not matter much - it only initializes imp particles. But, we should think weather we need a molecule_utils lib ( like superpose, selection, surface generator ...) - I guess we do, and then the question is should we use some c++ lib for that or write it ourselfs ( we certainly should not use python lib for that ... :) )
As for just a pdb reader, since it only reads atoms and put then in IMP we can have in the meantime even a few readers and converge to something as time passes. I would recommend that we have a few test cases that check main functionalities ( multiple positions, HETATMs, chains, symmetry ..... ) - so that we can have some objective criteria for a final decision. I can also offer considering c++ code written in my lab in TA which was used for structural alignment and docking procedures - so it was debugged well.
On Nov 14, 2007, at 6:53 PM, Daniel Russel wrote:
> I think Ben and I have a disagreement about the significance of > picking a PDB reader at this point. As I see it, any PDB reading > should be hidden behind some very general interface (I have a > proposal using the MolecularHierarchyDecorator). As a result, the > code doing the actually reading doesn't matter too much and so we > shouldn't worry now about whether we have the perfect solution. We > should pick something easy which does what we want and minimizes the > amount of effort people have to make.
> > When someone needs something more, we can either modify our current > solution or chuck it and write an adaptor to someone else's code. > Writing these adaptors is trivial given a backend which supports all > we want. > > I picked my pdb reader since it is small and so can be stuck in the > with rest of imp so that no one has to worry about installing > external libraries and it does what I want, namely give me a > hierarchy for proteins and a bond information. > > The alternatives I have looked at so far are > > BALL- probably dead although it took a breath this morning > > the one Frido sent- I don't see how to get bonds out of it, but > otherwise fine. The documentation really sucks, so I might be missing > something. > > a few random other C and C++ ones all of which are either embedded in > a large library, suck, don't give me bonds etc. > > One area I haven't looked is the in python only readers. There may be > some good ones there. > >> >> If it's an external library, it should be a dependency, not part of >> IMP. > It depends. Having obscure external dependencies is a good way to > piss off people who have to install your software. It is a tradeoff. > I would say that we shouldn't depend on anything not in fedora or > fedora-extras for something basic like this. >> > >> Hao's project absolutely requires HETATMs, for example. And I >> don't share your concern for runtime checks, since PDB reading is not >> performance-critical. > My concern on checks was not for efficiency, it was for correctness. > Depending on strings is poor as capitalization or abbreviation errors > don't easily get caught. > > _______________________________________________ > IMP-dev mailing list > IMP-dev@salilab.org > https://salilab.org/mailman/listinfo/imp-dev
> As for just a pdb reader, since it only reads atoms and put then in > IMP we can have in the meantime even a few readers and converge to > something as time passes. > I would recommend that we have a few test cases that check main > functionalities ( multiple positions, HETATMs, chains, > symmetry ..... ) - so that we can have some objective criteria for > a final decision. Agreed. As we want more from the PDB reader we write more involved test cases and evolve the backend to support what we need. No reason to jump to a full feature set if that is not easy.
> I can also offer considering c++ code written in my lab in TA > which was used for structural alignment and docking procedures - so > it was debugged well. Sounds good. The more the merrier :-)
> > On Nov 14, 2007, at 6:53 PM, Daniel Russel wrote: > >> I think Ben and I have a disagreement about the significance of >> picking a PDB reader at this point. As I see it, any PDB reading >> should be hidden behind some very general interface (I have a >> proposal using the MolecularHierarchyDecorator). As a result, the >> code doing the actually reading doesn't matter too much and so we >> shouldn't worry now about whether we have the perfect solution. We >> should pick something easy which does what we want and minimizes the >> amount of effort people have to make. > >> >> When someone needs something more, we can either modify our current >> solution or chuck it and write an adaptor to someone else's code. >> Writing these adaptors is trivial given a backend which supports all >> we want. >> >> I picked my pdb reader since it is small and so can be stuck in the >> with rest of imp so that no one has to worry about installing >> external libraries and it does what I want, namely give me a >> hierarchy for proteins and a bond information. >> >> The alternatives I have looked at so far are >> >> BALL- probably dead although it took a breath this morning >> >> the one Frido sent- I don't see how to get bonds out of it, but >> otherwise fine. The documentation really sucks, so I might be missing >> something. >> >> a few random other C and C++ ones all of which are either embedded in >> a large library, suck, don't give me bonds etc. >> >> One area I haven't looked is the in python only readers. There may be >> some good ones there. >> >>> >>> If it's an external library, it should be a dependency, not part of >>> IMP. >> It depends. Having obscure external dependencies is a good way to >> piss off people who have to install your software. It is a tradeoff. >> I would say that we shouldn't depend on anything not in fedora or >> fedora-extras for something basic like this. >>> >> >>> Hao's project absolutely requires HETATMs, for example. And I >>> don't share your concern for runtime checks, since PDB reading is >>> not >>> performance-critical. >> My concern on checks was not for efficiency, it was for correctness. >> Depending on strings is poor as capitalization or abbreviation errors >> don't easily get caught. >> >> _______________________________________________ >> IMP-dev mailing list >> IMP-dev@salilab.org >> https://salilab.org/mailman/listinfo/imp-dev
cool! Also - it might be a good time to start discussing other molecule features. For example, Frido and I plan to use a surface generator in the emlib/IMP for the 26S project, which can also be used for generating the non-bonded list.
On Nov 14, 2007, at 11:31 PM, Daniel Russel wrote:
>> As for just a pdb reader, since it only reads atoms and put then in >> IMP we can have in the meantime even a few readers and converge to >> something as time passes. >> I would recommend that we have a few test cases that check main >> functionalities ( multiple positions, HETATMs, chains, >> symmetry ..... ) - so that we can have some objective criteria for >> a final decision. > Agreed. As we want more from the PDB reader we write more involved > test cases and evolve the backend to support what we need. No reason > to jump to a full feature set if that is not easy. > >> I can also offer considering c++ code written in my lab in TA >> which was used for structural alignment and docking procedures - so >> it was debugged well. > Sounds good. The more the merrier :-) > >> >> On Nov 14, 2007, at 6:53 PM, Daniel Russel wrote: >> >>> I think Ben and I have a disagreement about the significance of >>> picking a PDB reader at this point. As I see it, any PDB reading >>> should be hidden behind some very general interface (I have a >>> proposal using the MolecularHierarchyDecorator). As a result, the >>> code doing the actually reading doesn't matter too much and so we >>> shouldn't worry now about whether we have the perfect solution. We >>> should pick something easy which does what we want and minimizes the >>> amount of effort people have to make. >> >>> >>> When someone needs something more, we can either modify our current >>> solution or chuck it and write an adaptor to someone else's code. >>> Writing these adaptors is trivial given a backend which supports all >>> we want. >>> >>> I picked my pdb reader since it is small and so can be stuck in the >>> with rest of imp so that no one has to worry about installing >>> external libraries and it does what I want, namely give me a >>> hierarchy for proteins and a bond information. >>> >>> The alternatives I have looked at so far are >>> >>> BALL- probably dead although it took a breath this morning >>> >>> the one Frido sent- I don't see how to get bonds out of it, but >>> otherwise fine. The documentation really sucks, so I might be >>> missing >>> something. >>> >>> a few random other C and C++ ones all of which are either >>> embedded in >>> a large library, suck, don't give me bonds etc. >>> >>> One area I haven't looked is the in python only readers. There >>> may be >>> some good ones there. >>> >>>> >>>> If it's an external library, it should be a dependency, not part of >>>> IMP. >>> It depends. Having obscure external dependencies is a good way to >>> piss off people who have to install your software. It is a >>> tradeoff. >>> I would say that we shouldn't depend on anything not in fedora or >>> fedora-extras for something basic like this. >>>> >>> >>>> Hao's project absolutely requires HETATMs, for example. And I >>>> don't share your concern for runtime checks, since PDB reading is >>>> not >>>> performance-critical. >>> My concern on checks was not for efficiency, it was for correctness. >>> Depending on strings is poor as capitalization or abbreviation >>> errors >>> don't easily get caught. >>> >>> _______________________________________________ >>> IMP-dev mailing list >>> IMP-dev@salilab.org >>> https://salilab.org/mailman/listinfo/imp-dev > > _______________________________________________ > IMP-dev mailing list > IMP-dev@salilab.org > https://salilab.org/mailman/listinfo/imp-dev
Daniel Russel wrote: > I think Ben and I have a disagreement about the significance of picking > a PDB reader at this point. As I see it, any PDB reading should be > hidden behind some very general interface (I have a proposal using the > MolecularHierarchyDecorator).
Agreed. We can certainly use whichever PDB reader you're happy with in the short term. But I have a lot of experience with reading dodgy PDB files (although Eashwar is probably the lab expert in this regard) so know there's a lot of corner cases to worry about, and I don't relish fixing all of these again with a new PDB reader. And that's before we start worrying about heterogens.
> I picked my pdb reader since it is small and so can be stuck in the with > rest of imp so that no one has to worry about installing external > libraries and it does what I want, namely give me a hierarchy for > proteins and a bond information.
We're talking about two different things here. You want to distribute your PDB reader with IMP. I don't want to include the code in IMP SVN. The two issues are orthogonal; if you really want to bundle the code, we can do it at makedist time (for tarball releases) or we can use an SVN externals definition if you want it for SVN users. For the latter, just point me to the URL of your SVN repository (presumably this is a path within the CGAL SVN, or if you prefer I can make a repository for your PDB reader at svn.salilab.org).
> the one Frido sent- I don't see how to get bonds out of it, but > otherwise fine. The documentation really sucks, so I might be missing > something.
How are you "getting bonds" out of a PDB file? PDB files don't provide that information. (The most you can get is the CONECT and SSBOND records.) For that you generally need a description of the topology, which is generally covered by the topology file portion of most MM forcefields. I really hope this stuff isn't hard-coded, because that would really have trouble with patches and other residue modifications (think covalently-bonded ligands, acetylated termini, disulfides, MSE's, cyclic proteins, nucleic acids, etc.).
>> Hao's project absolutely requires HETATMs, for example. And I >> don't share your concern for runtime checks, since PDB reading is not >> performance-critical. > My concern on checks was not for efficiency, it was for correctness. > Depending on strings is poor as capitalization or abbreviation errors > don't easily get caught.
I didn't mean you wouldn't have actual atom type objects, just that they shouldn't be hardcoded. For example, Modeller reads a set of residue types from its parameter files at runtime, and after that maps every residue type in the PDB file from the string to an integer residue type. Unknown residue types result in a warning, and the generation of a new integer residue type at runtime. You could of course use Residue objects rather than integer types.
Ben
Useless technical discussion follows :-)
>> I picked my pdb reader since it is small and so can be stuck in >> the with >> rest of imp so that no one has to worry about installing external >> libraries and it does what I want, namely give me a hierarchy for >> proteins and a bond information. > > We're talking about two different things here. You want to distribute > your PDB reader with IMP. I don't want to include the code in IMP SVN. > The two issues are orthogonal; They eventually yes. At the moment they are not as people just get IMP from svn and build in SVN. If we can put a link in SVN to another repository, that would be very cool (and I wouldn't put it beyond SVN) or I guess scons could just fetch files from another repository at build time. Again, I don't see that which way it is done is very important as long as it does not require people to take extra steps (over svn update and scons in the IMP or new_imp tree) to keep it up to date.
> > How are you "getting bonds" out of a PDB file? PDB files don't provide > that information. It doesn't make sense to talk about a molecule without it bonds, so a molecule loader better handle it on one level or another. BALL handles standard bonds as a cleanup pass after reading which is a good solution too, but makes it a bit funny for connect records as these bonds are created on reading while other, more standard bonds need to wait for the cleanup.
If someone wants to put tables into IMP that build the bonds from atom names, that would be better than in the reader. Just requires more work right now. .
> For example, Modeller reads a set of residue > types from its parameter files at runtime, and after that maps every > residue type in the PDB file from the string to an integer residue > type. > Unknown residue types result in a warning, and the generation of a new > integer residue type at runtime. You could of course use Residue > objects > rather than integer types. Sure. New types are pretty trivial to add on top of any system of predefining the common atom types (just have a function which takes a string and associates it to the next free int in the internal maps). But we might as well give the standard ones standard integer codes and nice C++ names (the enum) which reduces the chance of typos. Where this data happens to reside isn't all that important at the moment.
Daniel Russel wrote: >> We're talking about two different things here. You want to distribute >> your PDB reader with IMP. I don't want to include the code in IMP SVN. >> The two issues are orthogonal; > They eventually yes. At the moment they are not as people just get IMP > from svn and build in SVN.
I think you misunderstood my post - I was talking about SVN externals: http://svnbook.red-bean.com/en/1.4/svn.advanced.externals.html
That should work fine with the current "get IMP from SVN" approach.
>> How are you "getting bonds" out of a PDB file? PDB files don't provide >> that information. > It doesn't make sense to talk about a molecule without it bonds, so a > molecule loader better handle it on one level or another. BALL handles > standard bonds as a cleanup pass after reading which is a good solution > too, but makes it a bit funny for connect records as these bonds are > created on reading while other, more standard bonds need to wait for the > cleanup.
Well, sure, but to do that you need to know the topology - i.e. which atoms are connected in which residues, and at what distances and angles. That's what a topology file (half of an MM force field) does. It's not good to hardcode this because, for example, you may want to fill in missing atoms in a heme residue.
Ben
participants (4)
-
Ben Webb
-
Daniel Russel
-
Friedrich Foerster
-
Keren Lasker