File formats¶
Hits file¶
The hits file (usually has a .hit
extension, or .sel
for
selected hits after filtering) is in standard YAML format, and contains a
header indicating which version of ModPipe generated the file, plus a list of
hits, each starting with the line ‘- !<Hit>’. Each hit contains the following
fields:
sequence: contains a unique identifier for the target sequence (id) and its length.
alignment: contains a unique identifier for the target-template alignment, chi squared and KS statistics for the agreement between the observed/expected score distribution, the e-value of the alignment, and the percentage of gaps in the alignment.
region: the actual beginning and end of the region modeled.
fold_assignment_method: the method used to detect the template and create the alignment (see Fold assignment methods for a fuller description).
highest_sequence_identity, as a percentage, of all templates.
templates: a list of the templates used. For each template the PDB code and chain, the region used in the alignment, and the sequence identity between this template and the target is listed.
It is permitted for a hits file to contain multiple YAML documents. This may be useful to break up very large files, when using YAML parsers that read in a whole document at a time.
Models file¶
The models file (usually has a .mod
extension) is in YAML format,
and contains a header indicating which version of ModPipe generated the file,
plus a list of models, each starting with the line ‘- !<Model>’. Each model
contains the same fields as in the hits file, plus
the following:
id: a unique identifier for the model.
hetatms: number of HETATM residues in the model.
waters: number of waters in the model.
score: a set of assessment scores. These include the MODELLER objective function; the
DOPE
andDOPE-HR
scores; theGA341
score, compactness score, and its individual components (pairwise distance energy, surface area energy, combined energy) and the Z scores of each; the TSVMod scores (if requested), thenormalized DOPE Z score
; and the ModPipe quality score. The quality score is defined as (model_len / target_len) + (sequence_identity / 100) + (GA341_score / 10) - (percentage_gaps / 100) - alignment_ks_stat - (normalized_DOPE_score / 10); higher scores are better.rating: a string of 9 digits to rate the quality of the alignment and the model using various measures (1 for a good measure, 0 for a poor one). These measures are, in order:
Coverage
chi squared alignment score
Alignment E value
Gap ratio
Sequence identity
GA341 score
GA341 compactness score
DOPE Z score
TSVMod Predicted Native Overlap (3.5)
As for hits files, models files may contain multiple YAML documents.
Note
The Python module modpipe.serialize
contains functions and classes
that can be used in Python scripts to handle the
Models file and Hits file YAML format data.