Analysis of results from ABFE calculations#
In this notebook we show how to analyze results obtained with the OpenFE ABFE protocol. This notebook shows you how to extract
The overall binding free energy of each ligand in the dataset (DG)
The contribution from the different legs (complex and solvent) of a transformation.
Downloading the example dataset#
First let’s download some example ABFE results. Please skip this section if you have already done this!
[1]:
!curl -O https://zenodo.org/records/19498687/files/abfe_results.zip
!unzip -o abfe_results.zip
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3092k 100 3092k 0 0 432k 0 0:00:07 0:00:07 --:--:-- 501k
Archive: abfe_results.zip
inflating: abfe_results/README.md
inflating: abfe_results/abfe_results_single_unit/results_2/1.json
inflating: abfe_results/abfe_results_single_unit/toluene_results.json
inflating: abfe_results/abfe_results_single_unit/results_1/1.json
inflating: abfe_results/abfe_results_single_unit/results_0/1.json
inflating: abfe_results/abfe_results_multiple_units/results_2/ligand1_transformation.json
inflating: abfe_results/abfe_results_multiple_units/results_1/ligand1_transformation.json
inflating: abfe_results/abfe_results_multiple_units/results_0/ligand1_transformation.json
Imports#
Here are a bunch of imports we will need later in the notebook.
[2]:
import numpy as np
import os
import pathlib
from gufe.tokenization import JSON_HANDLER
import pandas as pd
from openff.units import unit
from openfecli.commands.gather import (
format_estimate_uncertainty,
_collect_result_jsons,
load_json,
)
Some helper methods to load and format the ABFE results#
Over the next few cells, we define some helper methods that we will use to load and format the ABFE results.
Note: you do not need to directly interact with any of these, unless you are looking to change the behaviour of how data is being processed
[3]:
def _get_name(result: dict) -> str:
"""Get the ligand name from a unit's results data.
Parameters
----------
result : dict
A results dict.
Returns
-------
str
Ligand name corresponding to the results.
"""
solvent_data = list(result["protocol_result"]["data"]["solvent"].values())[0][0]
try:
name = solvent_data["inputs"]["setup_results"]["inputs"]["alchemical_components"]["stateA"][
0
]["molprops"]["ofe-name"]
except KeyError:
name = solvent_data["inputs"]["alchemical_components"]["stateA"][0]["molprops"]["ofe-name"]
return str(name)
[4]:
def _load_valid_result_json(
fpath: os.PathLike | str,
) -> tuple[tuple | None, dict | None]:
"""Load the data from a results JSON into a dict.
Parameters
----------
fpath : os.PathLike | str
The path to deserialized results.
Returns
-------
dict | None
A dict containing data from the results JSON,
or None if the JSON file is invalid or missing.
Raises
------
ValueError
If the JSON file contains an ``estimate`` or ``uncertainty`` key with the
value ``None``.
If
"""
# TODO: only load this once during collection, then pass namedtuple(fname, dict) into this function
# for now though, it's not the bottleneck on performance
result = load_json(fpath)
try:
name = _get_name(result)
except (ValueError, IndexError):
print(f"{fpath}: Missing ligand names. Skipping.")
return None, None
if result["estimate"] is None:
errormsg = f"{fpath}: No 'estimate' found, assuming to be a failed simulation."
raise ValueError(errormsg)
return name, result
[5]:
def _get_legs_from_result_jsons(
result_fns: list[pathlib.Path],
) -> dict[str, dict[str, list]]:
"""
Iterate over a list of result JSONs and populate a dict of dicts with all data needed
for results processing.
Parameters
----------
result_fns : list[pathlib.Path]
List of filepaths containing results formatted as JSON.
report : Literal["dg", "raw"]
Type of report to generate.
Returns
-------
legs: dict[str, dict[str, list]]
Data extracted from the given result JSONs, organized by the leg's ligand name and simulation type.
"""
from collections import defaultdict
dgs = defaultdict(lambda: defaultdict(list))
for result_fn in result_fns:
name, result = _load_valid_result_json(result_fn)
if name is None: # this means it couldn't find name and/or simtype
continue
dgs[name]["overall"].append([result["estimate"], result["uncertainty"]])
proto_key = [k for k in result["unit_results"].keys() if k.startswith("ProtocolUnitResult")]
for p in proto_key:
# In openfe v1.9+, we only want to pick up results from
# the Analysis Unit. To ensure backwards compatibility with
# prior releases of openfe v1.x, we exclude Setup and Simulation
if (
"Setup" in result["unit_results"][p]["source_key"]
or "Simulation" in result["unit_results"][p]["source_key"]
):
continue
if "unit_estimate" in result["unit_results"][p]["outputs"]:
simtype = result["unit_results"][p]["outputs"]["simtype"]
dg = result["unit_results"][p]["outputs"]["unit_estimate"]
dg_error = result["unit_results"][p]["outputs"]["unit_estimate_error"]
dgs[name][simtype].append([dg, dg_error])
if "standard_state_correction" in result["unit_results"][p]["outputs"]:
corr = result["unit_results"][p]["outputs"]["standard_state_correction"]
# In openfe v1.9+, standard state corrections are set to 0 kcal/mol
# when no correction is being applied (e.g. no restraints).
# To make raw outputs similar to pre-v1.9, we exclude corrections
# if they are close to 0.
if not np.isclose(corr.m, 0):
dgs[name]["standard_state_correction"].append(
[corr, 0 * unit.kilocalorie_per_mole]
)
else:
continue
return dgs
[6]:
def _error_std(r):
"""
Calculate the error of the estimate as the std of the repeats
"""
return np.std([v[0].m for v in r["overall"]])
def _error_mbar(r):
"""
Calculate the error of the estimate using the reported MBAR errors.
This also takes into account that repeats may have been run for this edge by using the average MBAR error
"""
complex_errors = np.array([x[1].m for x in r["complex"]])
solvent_errors = np.array([x[1].m for x in r["solvent"]])
return np.sqrt(np.mean(complex_errors**2) + np.mean(solvent_errors**2))
Methods to extract and manipulate the ABFE results#
The next three methods allow you to extract ABFE results (extract_results_dict) and then manipulate them to get different types of results.
These manipulation methods include:
generate_dg: to get the dG values.generate_dg_raw: to get the raw dG values for each individual legs in the ABFE transformation cycles.
[7]:
def extract_results_dict(
results_files: list[os.PathLike | str],
) -> dict[str, dict[str, list]]:
"""
Get a dictionary of ABFE results from a list of directories.
Parameters
----------
results_files : list[os.PathLike]
A list of directors with ABFE result files to process.
Returns
-------
sim_results : dict[str, dict[str, list]]
Simulation results, organized by the leg's ligand names and simulation type.
"""
# find and filter result jsons
result_fns = _collect_result_jsons(results_files)
# pair legs of simulations together into dict of dicts
sim_results = _get_legs_from_result_jsons(result_fns)
return sim_results
[8]:
def generate_dg(results_dict: dict[str, dict[str, list]]) -> pd.DataFrame:
"""Compute and write out DG values for the given results.
Parameters
----------
results_dict : dict[str, dict[str, list]]
Dictionary of results created by ``extract_results_dict``.
Returns
-------
pd.DataFrame
A pandas DataFrame with the dG results for each ligand.
"""
data = []
# check the type of error which should be used based on the number of repeats
repeats = {len(v["overall"]) for v in results_dict.values()}
error_func = _error_mbar if 1 in repeats else _error_std
for lig, results in sorted(results_dict.items()):
dg = np.mean([v[0].m for v in results["overall"]])
error = error_func(results)
m, u = format_estimate_uncertainty(dg, error, unc_prec=2)
data.append((lig, m, u))
df = pd.DataFrame(
data,
columns=[
"ligand",
"DG (kcal/mol)",
"uncertainty (kcal/mol)",
],
)
return df
[9]:
def generate_dg_raw(results_dict: dict[str, dict[str, list]]) -> pd.DataFrame:
"""
Get all the transformation cycle legs found and their DG values.
Parameters
----------
results_dict : dict[str, dict[str, list]]
Dictionary of results created by ``extract_results_dict``.
Returns
-------
pd.DataFrame
A pandas DataFrame with the individual cycle leg dG results.
"""
data = []
for lig, results in sorted(results_dict.items()):
for simtype, repeats in sorted(results.items()):
if simtype != "overall":
for repeat in repeats:
m, u = format_estimate_uncertainty(
repeat[0].m, repeat[1].m, unc_prec=2
)
data.append((simtype, lig, m, u))
df = pd.DataFrame(
data,
columns=[
"leg",
"ligand",
"DG (kcal/mol)",
"uncertainty (kcal/mol)",
],
)
return df
Analyzing your results#
Now that we have defined a set of methods to help us extract results, let’s analyze the results!
Specify result directories and gather results#
Let’s start by gathering all our simulation results. First we define all the directories where our ABFE results exist. Here we assume that our simulation repeats sit in three different results directories under abfe_results, named from results_0 to results_2.
[10]:
# Specify paths to result directories
results_dir = [pathlib.Path("abfe_results/abfe_results_multiple_units")]
dgs = extract_results_dict(results_dir)
Obtain the absolute binding free energy for all nodes in the network#
With these extracted results, we can now get the dG prediction between each ligand.
Note: if only a single repeat was run, the MBAR error is used as uncertainty estimate, while the standard deviation is used when results from more than one repeat are provided.
[11]:
df_dg = generate_dg(dgs)
df_dg.to_csv('dg.tsv', sep="\t", lineterminator="\n", index=False)
[12]:
df_dg
[12]:
| ligand | DG (kcal/mol) | uncertainty (kcal/mol) | |
|---|---|---|---|
| 0 | 1 | -20.87 | 0.47 |
Obtain the raw DGs of every leg in the thermodynamic cycle#
If needed, you can also get the individual dG results for each leg of the ABFE transformation cycles. This can be useful in various different situation, such as when trying to diagnose which part of your simulation has the highest uncertainty. These include the DG of turning ligand interactions off in the complex and solvent, as well as the standard_state_correction that accounts for turning off the Boresch restraints between the non-interacting ligand and the protein and
transferring the non-interacting ligand into the standard state.
[13]:
df_raw = generate_dg_raw(dgs)
df_raw.to_csv('dg_raw.tsv', sep="\t", lineterminator="\n", index=False)
[14]:
df_raw
[14]:
| leg | ligand | DG (kcal/mol) | uncertainty (kcal/mol) | |
|---|---|---|---|---|
| 0 | complex | 1 | 36.34 | 0.91 |
| 1 | complex | 1 | 36.17 | 0.80 |
| 2 | complex | 1 | 34.77 | 0.81 |
| 3 | solvent | 1 | 6.5 | 1.4 |
| 4 | solvent | 1 | 5.4 | 1.4 |
| 5 | solvent | 1 | 5.5 | 1.4 |
| 6 | standard_state_correction | 1 | -8.9 | 0.0 |
| 7 | standard_state_correction | 1 | -9.3 | 0.0 |
| 8 | standard_state_correction | 1 | -9.0 | 0.0 |
[ ]: