Introduction
MDT prepares a raw frequency table, given information from MODELLER alignments and/or PDB files. It can also process the raw frequency table in several ways (e.g., normalization with Table.normalize(), smoothing with Table.smooth()), perform entropy calculations with Table.entropy_full(), and write out the data in various formats, including for plotting by ASGL (Table.write_asgl()) and for use as restraints by MODELLER.
More precisely, MDT uses a sample of sequences, structures, and/or alignments to construct a table N(a,b,c,…,d) for features a, b, c, …, d. The sample for generating the frequencies N is obtained depending on the type of features a, b, c, …, d. The sample can contain individual proteins, pairs of proteins, pairs of residues in proteins, pairs of aligned residues, pairs of aligned pairs of residues, chemical bonds, angles, dihedral angles, and pairs of tuples of atoms. Some features work with triple alignments, too. All the needed features a, b, c, …, d are calculated automatically from the sequences, alignments, and/or PDB files. The feature bins are defined by the user when each feature is created.
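As a rough illustration of what a table of counts N(a,b) is, the sketch below tallies a tiny made-up sample with Python's standard `collections.Counter`. It does not use MDT itself; the sample values and bin symbols are hypothetical, and in practice MDT calculates the features from alignments and PDB files.

```python
from collections import Counter

# Hypothetical sample: one (residue type, X-ray resolution bin) pair per
# residue. These values are made up purely for illustration.
sample = [
    ("ALA", "<1.4"),
    ("ALA", "<1.4"),
    ("GLY", "1.4-1.6"),
    ("ALA", "1.6-1.8"),
    ("GLY", "1.4-1.6"),
]

# N[(a, b)] is the number of sample points falling into bins a and b.
N = Counter(sample)

print(N[("ALA", "<1.4")])
print(N[("GLY", "1.4-1.6")])
print(N[("GLY", "<1.4")])  # combination never observed, so the count is 0
```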
MDT features
A ‘feature’ in MDT is simply some binnable property of your input alignment. Example features include the residue type, chi1 and Phi dihedral angles, sequence identity between two sequences, X-ray resolution, atom-atom distances, atom type, and bond length.
MDT understands that different features act on different sets of proteins, or parts of proteins, and will automatically scan over the correct range to collect necessary statistics (e.g. when you call Table.add_alignment()).
For example, to collect statistics for the residue type feature, it is
necessary to scan all residues in all proteins in the alignment. The
X-ray resolution feature, on the other hand, only requires each protein
in the alignment to be scanned, not each residue.
The atom-atom distance feature requires scanning over all pairs of atoms in
all proteins in the alignment, while the sequence identity feature requires
scanning all pairs of proteins in the alignment. If you construct a table of
multiple features, the most fine-grained of the features determines the scan -
for example, a table of X-ray resolution against Φ dihedral would require
a scan of all residues.
See the scan types table for all of the scan types.
When choosing which proteins to scan, MDT also considers the features. It will scan each protein individually, all pairs of proteins, or all triples of proteins. The latter two scans only happen if you have features in your table that require multiple proteins (e.g. protein pair or aligned residue features) or you have single-protein features such as protein or residue features but you have asked to evaluate them on the second or third protein (by setting the protein argument to 1 or 2 rather than the default 0).
MDT also knows that some residue pair or atom pair features are symmetric, and will perform a non-redundant scan in this case. If, however, any feature in the table is asymmetric, a full scan is performed. If in doubt, you can query Table.symmetric to see whether a symmetric scan will be performed for the current set of features. (Currently, any tuple pair feature in your table forces a full scan.)
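The difference between a non-redundant and a full pair scan can be pictured with the standard `itertools` module. This is only a sketch of the idea, not MDT's scanning code: the residue labels are hypothetical, and which pairs MDT actually includes or excludes in a given scan is not specified here.

```python
from itertools import combinations, permutations

residues = ["R1", "R2", "R3", "R4"]  # hypothetical residue labels

# Non-redundant scan: each unordered pair is visited once, which is
# enough when every pair feature in the table is symmetric.
symmetric_pairs = list(combinations(residues, 2))

# Full scan: both (i, j) and (j, i) are visited, as needed when any
# feature in the table is asymmetric.
full_pairs = list(permutations(residues, 2))

print(len(symmetric_pairs), len(full_pairs))
```

For n items the non-redundant scan visits n(n-1)/2 pairs and the full scan n(n-1), so the symmetric case does half the work.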
The feature bins determine how to convert a feature value into a frequency table. For most feature types, you can specify how many bins to use, and their value ranges - see Specification of bins for more information. The last bin is always reserved as an ‘undefined’ bin, for values that don’t fall into any other bin [1].
(Some features are predetermined by the setup of the system - for
example, the residue type
feature always has 22 bins - 20 for the standard amino acids, 1 for gaps in
the alignment, and 1 for undefined.)
Type | Example feature |
---|---|
Protein | |
Residue [2] | |
Atom | |
Atom pair [3] | |
Atom tuple | |
Atom tuple pair | |
Chemical bond | |
Chemical angle | |
Chemical dihedral angle | |
Dependent and independent features
An MDT Table object is simply a table of counts N(a,b,c,…,d) for features a, b, c, …, d. However, it is often used to generate a conditional PDF, p(x,y,…,z | a,b,…,c), for independent features a, b, …, c and dependent features x, y, …, z. By convention in MDT the dependent features are the last (rightmost) features in the table, so methods designed to deal with PDFs, such as Table.smooth(), Table.super_smooth(), Table.normalize(), Table.offset_min(), and Table.close(), expect the dependent features to be the last features. If necessary you can reorder the features using Table.reshape() or Table.integrate().
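To make the convention concrete, the following sketch turns a small two-feature table of counts N(a, x) into a conditional PDF p(x | a) by normalizing over the last (dependent) feature. It is plain Python, not Table.normalize() itself; the counts are made up, and the handling of empty rows is a choice of this sketch.

```python
# Hypothetical counts N(a, x): rows index the independent feature a,
# columns index the dependent (last) feature x.
counts = [
    [8.0, 2.0, 0.0],
    [1.0, 1.0, 2.0],
]


def normalize_rows(table):
    """Return p(x | a) by dividing each row by its total count."""
    pdf = []
    for row in table:
        total = sum(row)
        # Rows with no counts are left as zeros (an assumption of this sketch).
        pdf.append([n / total if total > 0 else 0.0 for n in row])
    return pdf


pdf = normalize_rows(counts)
print(pdf[0])
print(pdf[1])
```

Each row of the result sums to 1, which is why the dependent feature has to sit in the last position for this kind of operation.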
Specification of bins
Most features take a bins argument when they are created, which specifies the bin ranges. This is simply a list of (start, end, symbol) triples, which specify the feature range for each bin, and the symbol to refer to it by. For example, the following creates an X-ray resolution feature with 4 bins, the first for 0.51-1.4 Å, the second for 1.4-1.6 Å, and so on. Anything below 0.51 Å, or 2.0 Å or above (or an undefined value), will be placed into a fifth ‘undefined’ bin.
xray = mdt.features.XRayResolution(mlib, bins=[(0.51, 1.4, "<1.4"),
                                               (1.4, 1.6, "1.4-1.6"),
                                               (1.6, 1.8, "1.6-1.8"),
                                               (1.8, 2.0, "1.8-2.0")])
Note
Bin ranges in MDT are half-closed, i.e. a feature value must be greater than or equal to the lower value of the range, and less than the upper value, to be counted in the bin. For example, in the case above, 1.0 Å would be placed into the first bin, and 1.4 Å into the second. (If you define bins with overlapping ranges, values will be placed into the first bin that matches.)
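The half-closed rule and the first-match behaviour can be sketched in a few lines of plain Python. The helper `find_bin` is hypothetical and is not part of MDT; it only mimics the behaviour described in the note, using the X-ray resolution bins from the example above.

```python
def find_bin(value, bins):
    """Return the index of the first bin whose half-closed range
    [start, end) contains value, or len(bins) for the 'undefined' bin."""
    if value is not None:
        for index, (start, end, _symbol) in enumerate(bins):
            if start <= value < end:
                return index
    return len(bins)


bins = [(0.51, 1.4, "<1.4"), (1.4, 1.6, "1.4-1.6"),
        (1.6, 1.8, "1.6-1.8"), (1.8, 2.0, "1.8-2.0")]

print(find_bin(1.0, bins))   # falls in the first bin
print(find_bin(1.4, bins))   # lower bound is inclusive: second bin
print(find_bin(2.0, bins))   # upper bound is exclusive: undefined bin
print(find_bin(None, bins))  # undefined value: undefined bin
```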
In most cases, a set of bins of equal width is desired, and it is tedious to specify these by hand. A utility function, uniform_bins(), is provided, which takes three arguments (the number of bins, the lower bound of the first bin, and the width of each bin) and creates a set of bins; all bins are the same size and follow on from the first bin. For example, the following bins the atom-atom distance feature into 60 bins, each 0.5 Å wide, with the first bin starting at 0 Å. The first bin is thus 0-0.5 Å, the second 0.5-1.0 Å, and so on, up to bin 60, which is 29.5-30.0 Å. The additional ‘undefined’ bin thus counts anything below 0 Å, greater than or equal to 30.0 Å, or which could not be calculated for some reason.
atdist = mdt.features.AtomDistance(mlib, bins=mdt.uniform_bins(60, 0, 0.5))
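For readers who want to see the arithmetic, here is a stand-alone sketch that builds the same kind of (start, end, symbol) triples. The function `uniform_bins_sketch` is hypothetical and is not MDT's implementation; in particular, the symbol format it generates is an assumption.

```python
def uniform_bins_sketch(num, start, width):
    """Build num equal-width (start, end, symbol) triples.

    The symbol (here the lower bound, to one decimal place) is an
    assumption of this sketch, not necessarily what MDT generates.
    """
    bins = []
    for i in range(num):
        lower = start + i * width
        upper = start + (i + 1) * width
        bins.append((lower, upper, f"{lower:.1f}"))
    return bins


bins = uniform_bins_sketch(60, 0, 0.5)
print(len(bins))
print(bins[0][:2], bins[-1][:2])
```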
Storage for bin data
By default, when a table is created in MDT it uses double precision floating point to store the counts. This allows large counts to be stored accurately, and also allows floating point data such as PDFs to be stored. However, for very large tables, this may use a prohibitive amount of memory. It is therefore possible to change the data type used to store bin data, by specifying the bin_type parameter when creating a Table object. The same parameter can be given to Table.copy(), to make a copy of the table using a different data type for its storage. Note that other data types use less storage, but can also store a smaller range of counts. For example, the UnsignedInt8 data type uses only a single byte for each bin, but can only store integer counts between 0 and 255 (floating point values, or values outside of this range, will be truncated). MDT uses double precision floating point for all internal operations, but any storage of bin values uses the user-selected bin type. Thus you should be careful not to use an inappropriate bin type: for example, don’t use an integer bin type if you are planning to store PDFs or perform normalization, smoothing, etc.
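A back-of-the-envelope estimate shows why the bin type matters for large tables. The sketch below uses Python's standard `array` module to look up element sizes; the table shape is hypothetical (two 22-bin residue type features and a 61-bin distance feature), and the estimate ignores any overhead MDT itself may add.

```python
from array import array
from math import prod


def table_bytes(shape, typecode):
    """Bytes needed to hold one value per bin for an array typecode."""
    return prod(shape) * array(typecode).itemsize


shape = (22, 22, 61)  # hypothetical table: 22 x 22 x 61 bins

print(table_bytes(shape, "d"))  # double precision: 8 bytes per bin
print(table_bytes(shape, "B"))  # unsigned 8-bit integer: 1 byte per bin
print(2 ** 8 - 1)               # largest count an unsigned byte can hold
```

The single-byte table is eight times smaller, at the cost of being limited to integer counts no larger than 255.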
Footnotes