Analysis of results from ABFE calculations#

In this notebook we show how to analyze results obtained with the OpenFE ABFE protocol. This notebook shows you how to extract

The overall binding free energy of each ligand in the dataset (DG)
The contribution from the different legs (complex and solvent) of a transformation.

Downloading the example dataset#

First let’s download some example ABFE results. Please skip this section if you have already done this!

[1]:

!curl -O https://zenodo.org/records/19498687/files/abfe_results.zip
!unzip -o abfe_results.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3092k  100 3092k    0     0   432k      0  0:00:07  0:00:07 --:--:--  501k
Archive:  abfe_results.zip
  inflating: abfe_results/README.md
  inflating: abfe_results/abfe_results_single_unit/results_2/1.json
  inflating: abfe_results/abfe_results_single_unit/toluene_results.json
  inflating: abfe_results/abfe_results_single_unit/results_1/1.json
  inflating: abfe_results/abfe_results_single_unit/results_0/1.json
  inflating: abfe_results/abfe_results_multiple_units/results_2/ligand1_transformation.json
  inflating: abfe_results/abfe_results_multiple_units/results_1/ligand1_transformation.json
  inflating: abfe_results/abfe_results_multiple_units/results_0/ligand1_transformation.json

Imports#

Here are a bunch of imports we will need later in the notebook.

[2]:

import numpy as np
import os
import pathlib
from gufe.tokenization import JSON_HANDLER
import pandas as pd
from openff.units import unit
from openfecli.commands.gather import (
    format_estimate_uncertainty,
    _collect_result_jsons,
    load_json,
)

Some helper methods to load and format the ABFE results#

Over the next few cells, we define some helper methods that we will use to load and format the ABFE results.

Note: you do not need to directly interact with any of these, unless you are looking to change the behaviour of how data is being processed

[3]:

def _get_name(result: dict) -> str:
    """Get the ligand name from a unit's results data.

    Parameters
    ----------
    result : dict
        A results dict.

    Returns
    -------
    str
        Ligand name corresponding to the results.
    """

    solvent_data = list(result["protocol_result"]["data"]["solvent"].values())[0][0]
    try:
        name = solvent_data["inputs"]["setup_results"]["inputs"]["alchemical_components"]["stateA"][
            0
        ]["molprops"]["ofe-name"]
    except KeyError:
        name = solvent_data["inputs"]["alchemical_components"]["stateA"][0]["molprops"]["ofe-name"]

    return str(name)

[4]:

def _load_valid_result_json(
    fpath: os.PathLike | str,
) -> tuple[tuple | None, dict | None]:
    """Load the data from a results JSON into a dict.

    Parameters
    ----------
    fpath : os.PathLike | str
        The path to deserialized results.

    Returns
    -------
    dict | None
        A dict containing data from the results JSON,
        or None if the JSON file is invalid or missing.

    Raises
    ------
    ValueError
      If the JSON file contains an ``estimate`` or ``uncertainty`` key with the
      value ``None``.
      If
    """

    # TODO: only load this once during collection, then pass namedtuple(fname, dict) into this function
    # for now though, it's not the bottleneck on performance
    result = load_json(fpath)
    try:
        name = _get_name(result)
    except (ValueError, IndexError):
        print(f"{fpath}: Missing ligand names. Skipping.")
        return None, None
    if result["estimate"] is None:
        errormsg = f"{fpath}: No 'estimate' found, assuming to be a failed simulation."
        raise ValueError(errormsg)

    return name, result

[5]:

def _get_legs_from_result_jsons(
    result_fns: list[pathlib.Path],
) -> dict[str, dict[str, list]]:
    """
    Iterate over a list of result JSONs and populate a dict of dicts with all data needed
    for results processing.


    Parameters
    ----------
    result_fns : list[pathlib.Path]
        List of filepaths containing results formatted as JSON.
    report : Literal["dg", "raw"]
        Type of report to generate.

    Returns
    -------
    legs: dict[str, dict[str, list]]
        Data extracted from the given result JSONs, organized by the leg's ligand name and simulation type.
    """
    from collections import defaultdict

    dgs = defaultdict(lambda: defaultdict(list))

    for result_fn in result_fns:
        name, result = _load_valid_result_json(result_fn)
        if name is None:  # this means it couldn't find name and/or simtype
            continue

        dgs[name]["overall"].append([result["estimate"], result["uncertainty"]])
        proto_key = [k for k in result["unit_results"].keys() if k.startswith("ProtocolUnitResult")]
        for p in proto_key:
            # In openfe v1.9+, we only want to pick up results from
            # the Analysis Unit. To ensure backwards compatibility with
            # prior releases of openfe v1.x, we exclude Setup and Simulation
            if (
                "Setup" in result["unit_results"][p]["source_key"]
                or "Simulation" in result["unit_results"][p]["source_key"]
            ):
                continue
            if "unit_estimate" in result["unit_results"][p]["outputs"]:
                simtype = result["unit_results"][p]["outputs"]["simtype"]
                dg = result["unit_results"][p]["outputs"]["unit_estimate"]
                dg_error = result["unit_results"][p]["outputs"]["unit_estimate_error"]

                dgs[name][simtype].append([dg, dg_error])
            if "standard_state_correction" in result["unit_results"][p]["outputs"]:
                corr = result["unit_results"][p]["outputs"]["standard_state_correction"]
                # In openfe v1.9+, standard state corrections are set to 0 kcal/mol
                # when no correction is being applied (e.g. no restraints).
                # To make raw outputs similar to pre-v1.9, we exclude corrections
                # if they are close to 0.
                if not np.isclose(corr.m, 0):
                    dgs[name]["standard_state_correction"].append(
                        [corr, 0 * unit.kilocalorie_per_mole]
                    )
            else:
                continue
    return dgs

[6]:

def _error_std(r):
    """
    Calculate the error of the estimate as the std of the repeats
    """
    return np.std([v[0].m for v in r["overall"]])

def _error_mbar(r):
    """
    Calculate the error of the estimate using the reported MBAR errors.

    This also takes into account that repeats may have been run for this edge by using the average MBAR error
    """
    complex_errors = np.array([x[1].m for x in r["complex"]])
    solvent_errors = np.array([x[1].m for x in r["solvent"]])
    return np.sqrt(np.mean(complex_errors**2) + np.mean(solvent_errors**2))

Methods to extract and manipulate the ABFE results#

The next three methods allow you to extract ABFE results (extract_results_dict) and then manipulate them to get different types of results.

These manipulation methods include:

generate_dg: to get the dG values.
generate_dg_raw: to get the raw dG values for each individual legs in the ABFE transformation cycles.

[7]:

def extract_results_dict(
    results_files: list[os.PathLike | str],
) -> dict[str, dict[str, list]]:
    """
    Get a dictionary of ABFE results from a list of directories.

    Parameters
    ----------
    results_files : list[os.PathLike]
        A list of directors with ABFE result files to process.

    Returns
    -------
    sim_results : dict[str, dict[str, list]]
        Simulation results, organized by the leg's ligand names and simulation type.
    """
    # find and filter result jsons
    result_fns = _collect_result_jsons(results_files)

    # pair legs of simulations together into dict of dicts
    sim_results = _get_legs_from_result_jsons(result_fns)

    return sim_results

[8]:

def generate_dg(results_dict: dict[str, dict[str, list]]) -> pd.DataFrame:
    """Compute and write out DG values for the given results.

    Parameters
    ----------
    results_dict : dict[str, dict[str, list]]
        Dictionary of results created by ``extract_results_dict``.

    Returns
    -------
    pd.DataFrame
        A pandas DataFrame with the dG results for each ligand.
    """
    data = []
    # check the type of error which should be used based on the number of repeats
    repeats = {len(v["overall"]) for v in results_dict.values()}
    error_func = _error_mbar if 1 in repeats else _error_std
    for lig, results in sorted(results_dict.items()):
        dg = np.mean([v[0].m for v in results["overall"]])
        error = error_func(results)
        m, u = format_estimate_uncertainty(dg, error, unc_prec=2)
        data.append((lig, m, u))

    df = pd.DataFrame(
        data,
        columns=[
            "ligand",
            "DG (kcal/mol)",
            "uncertainty (kcal/mol)",
        ],
    )
    return df

[9]:

def generate_dg_raw(results_dict: dict[str, dict[str, list]]) -> pd.DataFrame:
    """
    Get all the transformation cycle legs found and their DG values.

    Parameters
    ----------
    results_dict : dict[str, dict[str, list]]
        Dictionary of results created by ``extract_results_dict``.

    Returns
    -------
    pd.DataFrame
        A pandas DataFrame with the individual cycle leg dG results.
    """
    data = []
    for lig, results in sorted(results_dict.items()):
        for simtype, repeats in sorted(results.items()):
            if simtype != "overall":
                for repeat in repeats:
                    m, u = format_estimate_uncertainty(
                        repeat[0].m, repeat[1].m, unc_prec=2
                    )
                    data.append((simtype, lig, m, u))

    df = pd.DataFrame(
        data,
        columns=[
            "leg",
            "ligand",
            "DG (kcal/mol)",
            "uncertainty (kcal/mol)",
        ],
    )
    return df

Analyzing your results#

Now that we have defined a set of methods to help us extract results, let’s analyze the results!

Specify result directories and gather results#

Let’s start by gathering all our simulation results. First we define all the directories where our ABFE results exist. Here we assume that our simulation repeats sit in three different results directories under abfe_results, named from results_0 to results_2.

[10]:

# Specify paths to result directories
results_dir = [pathlib.Path("abfe_results/abfe_results_multiple_units")]
dgs = extract_results_dict(results_dir)

Obtain the absolute binding free energy for all nodes in the network#

With these extracted results, we can now get the dG prediction between each ligand.

Note: if only a single repeat was run, the MBAR error is used as uncertainty estimate, while the standard deviation is used when results from more than one repeat are provided.

[11]:

df_dg = generate_dg(dgs)
df_dg.to_csv('dg.tsv', sep="\t", lineterminator="\n", index=False)

[12]:

df_dg

[12]:

	ligand	DG (kcal/mol)	uncertainty (kcal/mol)
0	1	-20.87	0.47

Obtain the raw DGs of every leg in the thermodynamic cycle#

If needed, you can also get the individual dG results for each leg of the ABFE transformation cycles. This can be useful in various different situation, such as when trying to diagnose which part of your simulation has the highest uncertainty. These include the DG of turning ligand interactions off in the complex and solvent, as well as the standard_state_correction that accounts for turning off the Boresch restraints between the non-interacting ligand and the protein and transferring the non-interacting ligand into the standard state.

[13]:

df_raw = generate_dg_raw(dgs)
df_raw.to_csv('dg_raw.tsv', sep="\t", lineterminator="\n", index=False)

[14]:

df_raw

[14]:

	leg	ligand	DG (kcal/mol)	uncertainty (kcal/mol)
0	complex	1	36.34	0.91
1	complex	1	36.17	0.80
2	complex	1	34.77	0.81
3	solvent	1	6.5	1.4
4	solvent	1	5.4	1.4
5	solvent	1	5.5	1.4
6	standard_state_correction	1	-8.9	0.0
7	standard_state_correction	1	-9.3	0.0
8	standard_state_correction	1	-9.0	0.0

[ ]: