Free energy perturbation¶

Background

The free energy perturbation (FEP) is a general method to calcualte the free energy difference between two specific states (such as A and B states), which follows in theory [1],

$F_{AB} = -log<e^{-f_{AB}(x)}>_B$, or $F_{BA} = -log<e^{-f_{BA}(x)}>_A$,

where the $f_{AB}(x)$ is the unitless relative reduced energy. For example, if we assume their kinetic energies are the same, then the $f_{AB}(x) = \beta E_A(x) - \beta E_B(x) = f_A(x) - f_B(x)$. $<>_B$ means an average of all samples that follow the ensemble generated from the $E_B(x)$ potential energy. Obviously, we expect that both A and B ensembles can be used to calcualte the free energy difference, $F_{AB}$ or $F_{BA}$, and they can provide the same values in principle. However, they are usually different, due to the potential different ensembles of these two states, so that the we usually find that $F_{AB} + F_{BA} \neq 0$, particularlly when they have little conformational overlap.

To my knowledge, there are two ways to reduce this potential difference, one common way is to create mulitple intermediate states in between states $A$ and $B$ by ensuring any two neighbor states have sufficient conformational overlap. Another way is to use a reweighting strategy that combines these two states, so that we can use both ensembles to estimate the $F_{AB}$ and other properties related. Importantly, the second way is normally combined with the first way. In this tutorial, I will introduce how we can use both ways to calculate the free energy difference between two states by an example of modified alanine dipeptide dimer, and also provide some Python scripts to run these FEP calculations. Here, we used an OpenMM package that supports Python APIs to run all simulations.

Reference

[1]. Free energy perturbation. Link: https://en.wikipedia.org/wiki/Free_energy_perturbation

Tuturial: free energies of ALAM-ALAD dimer¶

Model system¶

Free energies of a dimer: modified alanine dipeptide (ALAM) - alanine dipeptide (ALAD)

This tuturial used a modified alanine dipeptide dimer showed in the Fig. 1 in the reference [1], which was trying to mimic a backbone hydrogen bonding environment. It can be found that this dimer has a hydrogen bond formed by ALAM:HL(8) and ALAD:OR(30).Our goal is to calcualte their free energies or potential mean force (PMF) profile along the separated distances (0.1 to 1.2 nm) between them, where these two dipeptides are fixed when moving along their distance. Its gromacs-supported structure was provided here, which was solvated by the TIP3P water molecules, and the current distance between atom 8 and 30 is 1.2 nm.

Reference

[1]. Im, Wonpil, Jianhan Chen, and Charles L. Brooks III. "Peptide and protein folding and conformational equilibria: theoretical treatment of electrostatics and hydrogen bonding with implicit solvent models." Advances in protein chemistry 72 (2005): 173-198. Link: https://doi.org/10.1016/S0065-3233(05)72007-6

General steps¶

To calcualte the PMF profile of this dimer along its distance, we have to design multiple states that have different distances, and then use all ensembles to calculate their free energies. One choice of its distances can be from 1.2 to 0.1 nm with a -0.01 nm step, which means that we have 111 states in total. Normally, it is better to generate these 111 initial structures, where the dimer has different distances. Then, we can fix the dimer and run 111 standard MD simulations, to collect all conformational ensembles. In this way, we can submit 111 simulation jobs at the same time to save time, if each simulation takes a long time to sample sufficient coformations. The following describes several general steps to run the FEP calculations.

Design all states and generate their initial conformations: One way to generate those initial conformations is that we can use the current structure to generate the initial strcuture of next state by a MD equilibrium.
Run standard MD simulations for all conformations by using the given potential energies
Collect all samples of two neighbor states, and then calculate their properties of interest

Prepare for an initial structure and topology by using Gromacs¶

The force field to use is the charmm36-feb2021.ff. Although this force field include the topoly and parameters of alanine dipeptide, it does not have the topology of modified alanine dipeptide. Here, we have to add its topology showed below into the "charmm36-feb2021.ff/merged.rtp". For example, we can just put the following text into the end of that rtp file. It is noted that we do not need to change the parameter files, because these parameteres of these atom types, bonds, and angles can be found in the current force field.

; additional
[ ALAM ]; modified alanine dipeptide
  [ atoms ]
           CL   CT3   -0.270  0
          HL1   HA3    0.090  1
          HL2   HA3    0.090  2
          HL3   HA3    0.090  3
          CLP     C    0.510  4
           OL     O   -0.510  5
           NL   NH1   -0.470  6
           HL     H    0.310  7
           CA   CT1    0.070  8
           HA   HB1    0.090  9
           CB   CT3   -0.270 10
          HB1   HA3    0.090 11
          HB2   HA3    0.090 12
          HB3   HA3    0.090 13
  [ bonds ]
           CL   CLP
          CLP    NL
           NL    CA
           NL    HL
           CA    HA
           CA    CB
           CL   HL1
           CL   HL2
           CL   HL3
           CB   HB1
           CB   HB2
           CB   HB3
          CLP    OL
  [ impropers ]
          CLP    CL    NL    OL
           NL   CLP    CA    HL

Then, we can use the Gromacs to generate the topology file of this model system, typing

gmx pdb2gmx -f ala2.solv.gro -p ala2.solv.top -o ala2.solv.pdb

Selections: 1: CHARMM36 all-atom force field (July 2020) ; from the current directory, and 1: TIP3P ; recommended by default

Obtain all initial equilibrium structures using OpenMM¶

Here, we first create a data folder, and then run a Python script "genstr.py". This script will generate a system xml file, dcd file used to visuallize, and state xml files that saved all equilibrum states.

python genstr.py # this will generate the system, state, and dcd files
vmd ala2.solv.gro ala2.solv.eq.dcd # visualize the dcd files

The following can be saved to the "genstr.py".

# OpenMM
import simtk.openmm as omm # contains functions for MD
import simtk.openmm.app as app # contains functions for i/o
from simtk import unit # controls unique object types for physical units
import sys
import numpy as np
import mdtraj as md
import time

start = time.time()

# setting
temperature = 300e0*unit.kelvin
pdbid = 'ala2.solv'

# Platforms
platform = omm.Platform.getPlatformByName('CUDA')
properties = {'CudaPrecision': 'mixed', 'DeviceIndex': '0'}

# create a modeller
gro = app.GromacsGroFile(pdbid+'.gro')
top = app.GromacsTopFile(pdbid+'.top', periodicBoxVectors=gro.getPeriodicBoxVectors(),
        includeDir='./charmm36-feb2021.ff')
top.setPeriodicBoxVectors = gro.getPeriodicBoxVectors()
modeller = app.Modeller(top.topology, gro.positions)

# create a system
system = top.createSystem(nonbondedMethod=app.PME,
                          nonbondedCutoff=1.2e0*unit.nanometers,
                          rigidWater=True,)
# fix ala2 using zero mass
pdb = md.load_pdb(pdbid+'.pdb')
inxALAM = pdb.topology.select('resname ALAM')
inxHL =   pdb.topology.select('resname ALAM and name HL')
inxALAD = pdb.topology.select('resname ALAD')
inxOR =   pdb.topology.select('resname ALAD and name OR')
resid = pdb.topology.select('not waters')
for k in resid:
    system.setParticleMass(int(k), 0.e0*unit.dalton)
# save a system
with open(pdbid+'.system.xml', 'w') as f:
    f.write(omm.XmlSerializer.serialize(system))

# generate trajectories
pos = modeller.getPositions()
dd = -0.01e0
distArr = np.linspace(1.2e0,0.1e0,111,endpoint=True)
nwindows = len(distArr)
print("# #windows to generate: ", nwindows)
print("----------------------------")
for k in range(0,nwindows):
    print("#generating the window >> ", k)

    # create an integrator and simulation
    dt = 1.e0*unit.femtoseconds
    nstep = 50*1000 # 50 ps
    prnfrq = nstep
    integrator = omm.LangevinIntegrator(temperature, 1.e0/unit.picosecond, dt)
    simulation = app.Simulation(modeller.topology, system, integrator, platform=platform, platformProperties=properties)
    rep = app.StateDataReporter(sys.stdout, prnfrq, separator=' ', step=True, time=True, potentialEnergy=True, kineticEnergy=True, temperature=True, totalEnergy=True)
    simulation.reporters.append(rep)
    if k==0: simulation.reporters.append(app.DCDReporter(pdbid+'.eq.dcd', nstep))
    if k>0: simulation.reporters.append(app.DCDReporter(pdbid+'.eq.dcd', nstep, append=True))

    # update positions
    pos = modeller.getPositions() #traj.openmm_positions(frame=j)
    simulation.context.setPositions(pos)

    # energy calculations
    # >>
    state = simulation.context.getState(getPositions=True,getEnergy=True)
    pos = state.getPositions()
    ener = state.getPotentialEnergy()
    dist = unit.norm(pos[inxHL[0]] - pos[inxOR[0]])
    print('#e0: Distance(HL7,OR29)=', dist, 'ener=', ener)

    ## energy minimization
    #simulation.minimizeEnergy()
    # >>
    #state = simulation.context.getState(getPositions=True,getEnergy=True)
    #pos = state.getPositions()
    #ener = state.getPotentialEnergy()
    #dist = unit.norm(pos[7] - pos[29])
    #print('em: Distance(HL7,OR29)=', dist, 'ener=', ener)

    # md equilibration
    simulation.step(nstep)
    # >>
    state = simulation.context.getState(getPositions=True,getVelocities=True,getEnergy=True)
    pos = state.getPositions()
    ener = state.getPotentialEnergy()
    dist = unit.norm(pos[inxHL[0]] - pos[inxOR[0]])
    print('#eq: Distance(HL7,OR29)=', dist, 'ener=', ener)

    # save state
    output = './data/'+pdbid+'.d'+"{0:.1f}".format(dist/unit.angstroms)
    #app.PDBFile.writeFile(modeller.topology, modeller.positions, open(output+".pdb", 'w'), keepIds=True)
    simulation.saveState(output+'.state.xml')

    # move to next window
    xyz = np.array(pos/unit.nanometer)
    xyz[inxALAM,0] -= dd*0.5e0
    xyz[inxALAD,0] += dd*0.5e0
    gro.positions = xyz*unit.nanometer
    modeller = app.Modeller(top.topology, gro.positions)

    print('')

end = time.time()
print("# Elapsed wall-clock time (s): ", end - start)

Obtain the trajectories of all production simulations using OpenMM¶

Once we saved all initial equilibrium states, we can use them for the MD production simulations. Besides each state, we also need to save the system xml file that includes all forces. Then, we can use the following Python script to run simulations, which is "gendcd.py". After runing these simulations, we will obtain all trajectories of each simulations, which is located in the "./data/" folder.

# OpenMM
import simtk.openmm as omm # contains functions for MD
import simtk.openmm.app as app # contains functions for i/o
from simtk import unit # controls unique object types for physical units
import sys
import numpy as np
import mdtraj as md
import time
import os

start = time.time()

pdbid = 'ala2.solv'

# Platforms
platform = omm.Platform.getPlatformByName('CUDA')
properties = {'CudaPrecision': 'mixed', 'DeviceIndex': '0'}

# load topology
gro = app.GromacsGroFile(pdbid+'.gro')
top = app.GromacsTopFile(pdbid+'.top', periodicBoxVectors=gro.getPeriodicBoxVectors(),
        includeDir='./charmm36-feb2021.ff')
top.setPeriodicBoxVectors = gro.getPeriodicBoxVectors()
# load system
with open(pdbid+'.system.xml','r') as f:
    system = omm.XmlSerializer.deserialize(f.read())

# set up simulations
pdb = md.load_pdb(pdbid+'.pdb')
inxHL = pdb.topology.select('resname ALAM and name HL')
inxOR = pdb.topology.select('resname ALAD and name OR')
distArr = np.linspace(1.2e0,0.1e0,111,endpoint=True)
nwindows = len(distArr)
print("# Number of windows to generate: ", nwindows)
print("----------------------------")
for k in range(0,nwindows):
    print("# Generating the window >> ", k)

    output = './data/'+pdbid+'.d'+"{0:.1f}".format(distArr[k]*10e0)
    # load state
    with open(output+'.state.xml','r') as f:
        state = omm.XmlSerializer.deserialize(f.read())
    # create an integrator and simulation
    nstep = 1000*500 # 1 ns
    prnfrq = 500
    dt = 2.e0*unit.femtoseconds
    temperature = 300e0*unit.kelvin
    integrator = omm.LangevinIntegrator(temperature, 1.e0/unit.picosecond, dt)
    simulation = app.Simulation(top.topology, system, integrator, platform=platform, platformProperties=properties)
    dcdfile = output+'.dcd'
    simulation.reporters.append(app.DCDReporter(output+'.dcd', prnfrq))
   #if os.path.isfile(dcdfile):
   #    print('# appending the dcd file')
   #    simulation.reporters.append(app.DCDReporter(output+'.dcd', prnfrq, append=True))
   #else:
   #    simulation.reporters.append(app.DCDReporter(output+'.dcd', prnfrq))
    #rep = app.StateDataReporter(sys.stdout, prnfrq, separator=' ', step=True, time=True, potentialEnergy=True, kineticEnergy=True, temperature=True, totalEnergy=True)
    #simulation.reporters.append(rep)

    # update the state
    simulation.context.setPeriodicBoxVectors(*state.getPeriodicBoxVectors())
    simulation.context.setPositions(state.getPositions())
    simulation.context.setVelocities(state.getVelocities())
    simulation.context.setTime(state.getTime())

    # energy calculations
    # >>
    state = simulation.context.getState(getPositions=True,getEnergy=True)
    pos = state.getPositions()
    ener = state.getPotentialEnergy()
    dist = unit.norm(pos[inxHL[0]] - pos[inxOR[0]])
    print('#e0: Distance(HL7,OR29)=', dist, 'ener=', ener)

    # md simulations
    simulation.step(nstep)
    # >>
    state = simulation.context.getState(getPositions=True,getVelocities=True,getEnergy=True)
    pos = state.getPositions()
    ener = state.getPotentialEnergy()
    dist = unit.norm(pos[inxHL[0]] - pos[inxOR[0]])
    print('#md: Distance(HL7,OR29)=', dist, 'ener=', ener)
    print('')

    # save state
    statefile = './data/'+pdbid+'.d'+"{0:.1f}".format(dist/unit.angstroms)
    simulation.saveState(statefile+'.state.xml')

end = time.time()
print("# Elapsed wall-clock time (s): ", end - start)

Free energy perturbation¶

Methology¶

We have finished all MD simulations and obtained these trajectories in the data folder. The next thing is that we can use them to calculate any properties of interest. Before calculating the properties, we have to generate the enough samples that follow specific probability distribution.

As we know, any property can be calculated from the following equation,

$<A>_P = \int A(x)P(x)dx$, (1)

where the $A(x)$ is the property to calculate, and $P(x)$ is our target probability distribution. So, the next question is how we can generate the samples that follow the target $P(x)$. One important strategy is using a reweighting strategy that combines these two neighbor states. For example, we can collect the trajectory samples from two dcd files (state 1: ala2.solv.d12.0.dcd and state 2: ala2.solv.d11.9.dcd), which were sampled in two simulations in which the distances are 1.20 and 1.19 nm, respectively. Then, we can construct a mixture distribution distribution that they follow,

$P_{mix}(x) = \sum_{k=1}^K{c_k * e^{F_k - f_k(x)}}$, (2)

$c_k = \frac{n_k}{\sum_k{n_k}} = \frac{n_k}{N}$,

where $K$ is 2, $F_k$ is the relative free energy at $k^{th}$ state, and $f_k(x) = \beta_k E_k(x)$ this mixture probability can be used to describe the chosen samples. In this way, given all samples, we should know how to calcualte this mixture probability, and all properties can be calculated from this mixture probability.

Then the Eq. (1) can be further derived as follows,

$<A>_P = \int{A(x)P(x)}dx = \int{A(x)\frac{P(x)}{P_{mix}(x)} P_{mix}(x)}dx$

$= \sum_{n=1}^{N} A(x_n) w(x_n), x_n \in P_{mix}(x)$, (2)

where $ w(x_n) = \frac{1}{N} \sum_{n=1}^{N} \frac{P(x_n)}{P_{mix}(x_n)} $ and it subjects to $\sum_{n} w(x_n) = 1$.

In this way, we can make the following equation alway be correct by combining all samples of both states,

$F_{AB} + F_{BA} = 0$. (4)

Calculating energies and their perturbations¶

First, we need to caclualte the relative free energies, $f_k(x_n)$, k = 1,2,...,111. $x_n$ samples loop all trajectories of any two neighbor states. The Python script was posted below.

# OpenMM
import simtk.openmm as omm # contains functions for MD
import simtk.openmm.app as app # contains functions for i/o
from simtk import unit # controls unique object types for physical units
import sys
import numpy as np
import mdtraj as md
import time
import os

start = time.time()

kB = 0.593e0/298.e0 # kcal/mol
T = 300.e0
beta = 1.e0/(kB*T)
pdbid = 'ala2.solv'

# Platforms
platform = omm.Platform.getPlatformByName('CUDA')
properties = {'CudaPrecision': 'mixed', 'DeviceIndex': '0'}

# load topology
gro = app.GromacsGroFile(pdbid+'.gro')
top = app.GromacsTopFile(pdbid+'.top', periodicBoxVectors=gro.getPeriodicBoxVectors(),
        includeDir='./charmm36-feb2021.ff')
top.setPeriodicBoxVectors = gro.getPeriodicBoxVectors()
# load system
with open(pdbid+'.system.xml','r') as f:
    system = omm.XmlSerializer.deserialize(f.read())

# set up simulations
dd = -0.01e0
pdb = md.load_pdb(pdbid+'.pdb')
inxALAM = pdb.topology.select('resname ALAM')
inxHL =   pdb.topology.select('resname ALAM and name HL')
inxALAD = pdb.topology.select('resname ALAD')
inxOR =   pdb.topology.select('resname ALAD and name OR')
distArr = np.linspace(1.2e0,0.1e0,111,endpoint=True)
nwindows = len(distArr)
print("# Number of windows to generate: ", nwindows)
print("----------------------------")
for k in range(0,nwindows):
    print("# Generating the window >> ", k)

    output = './data/'+pdbid+'.d'+"{0:.1f}".format(distArr[k]*10e0)

    # create an integrator and simulation
    dt = 2.e0*unit.femtoseconds
    temperature = 300e0*unit.kelvin
    integrator = omm.LangevinIntegrator(temperature, 1.e0/unit.picosecond, dt)
    simulation = app.Simulation(top.topology, system, integrator, platform=platform, platformProperties=properties)

    # load trajectories
    traj = md.load(output+'.dcd', top=pdbid+'.gro')
    nframe = len(traj)
    save_fep = np.zeros((nframe,6))
    print('# nframe = ', nframe)
    for k in range(0,nframe):
        pos = traj.openmm_positions(frame=k)
        j = 0
        for dmove in [-dd*0.5e0, 0.e0, dd*0.5e0]:
            # update positions
            xyz = np.array(pos/unit.nanometer)
            xyz[inxALAM,0] -= dmove
            xyz[inxALAD,0] += dmove
            simulation.context.setPositions(xyz*unit.nanometer)
            state = simulation.context.getState(getPositions=True,getEnergy=True)
            pos_fep = state.getPositions()
            dist_fep = unit.norm(pos_fep[inxHL[0]] - pos_fep[inxOR[0]])/unit.angstroms
            ener_fep = state.getPotentialEnergy()/unit.kilocalorie_per_mole
            save_fep[k,j] = dist_fep
            save_fep[k,j+1] = ener_fep*beta
            j += 2
    np.savetxt(output+'.fener', save_fep, header="# dm1, u(dm1), d0, u(d0), d1, u(d1)")

end = time.time()
print("# Elapsed wall-clock time (s): ", end - start)

Free energies by MBAR reweighting¶

The relative free energies, $F_k$, can be further written as follows,

${F_k} = -log \int e^{-f_k(x)}dx$ $=-log \sum_n \frac{e^{-f_k(x_n)}}{\sum_{k=1}^K{n_k * e^{F_k - f_k(x_n)}}}, k=1,2$. (1)

Here, the pymbar and FastMBAR packages can be used to calculate the relative $F_k$, given two neighbor states. In this way, we can do a cumulative sum of their relative free energy difference between any two neighbor states, then, the free energy of the $n^{th}$ state can be written as below, which are based on the first state.

$F_{n0} = \sum_{k=1}^n (F_{k} - F_{k-1})$. (2)

Reference

[1]. pymbar installation: https://github.com/choderalab/pymbar

[2]. FastMBAR installation: https://github.com/xqding/FastMBAR

import numpy as np
from pymbar import MBAR
import glob
import os
import time

start = time.time()

# analysis
files = glob.glob('./data/*.fener')
files.sort(key=os.path.getmtime)
#print(files)
nfiles = len(files)
fener = []
nframe = 1000
ukn = np.zeros((2,2*nframe))
for k in range(0,nfiles-1):
    dat1 = np.loadtxt(files[k])
    d1 = np.mean(dat1[:,2])
    dat2 = np.loadtxt(files[k+1])
    d2 = np.mean(dat2[:,2])
    ukn[0,0:nframe] = dat1[:,3]
    ukn[0,nframe:nframe*2] = dat2[:,1]
    ukn[1,0:nframe] = dat1[:,5]
    ukn[1,nframe:2*nframe] = dat2[:,3]
    mbar = MBAR(ukn, np.ones(2)*nframe) # assume the number of their samples is the same
    #print(mbar.f_k)
    fener.append([d1, d2, mbar.f_k[0], mbar.f_k[1]])
np.savetxt('./data/fener.dat', fener)

end = time.time()
print("# elapsed wall-clock time (s): ", end - start)

Plot PMF profile¶

%cd '/home/ping/tutorial/fep'

import numpy as np
from matplotlib import pyplot as plt
plt.rcParams['figure.figsize'] = [8, 6]
plt.rcParams.update({'font.size': 16})
plt.rcParams.update({'lines.linewidth': 3})

kB = 0.593/298 # kcal/mol
T = 300
beta = 1.0/(kB*T)
dcut = 8.0 # cutoff distance (Ang)

# CHARMM
dat = np.loadtxt('ref.pmf')
dist = dat[:,0]
fk = dat[:,1]*beta
plt.plot(dist, fk-np.mean(fk[dist>dcut]), 'k-', label='FEP-100ps (Chen, 2004)')

# OpenMM
dat = np.loadtxt('./data/fener.dat')
dist = dat[:,0]
fk = np.cumsum(dat[:,1])
plt.plot(dist, fk-np.mean(fk[dist>dcut]), 'g-', label='FEP-1ns (Gong, 2022)')

plt.ylim([-3, 3])
x=plt.xticks(np.arange(1,13,1))
plt.legend()
plt.xlabel('Dist ($\AA$)')
plt.ylabel('PMF (kT)')
plt.title('ALAM-ALAD dimer')

/home/ping/tutorial/fep

Text(0.5, 1.0, 'ALAM-ALAD dimer')

Appendix¶

A simple approximation of free energy differenece without iterations¶

Here, we derived a simple way to calculate the free energy difference, $F_{AB}$,

We first make a free energy decomposition by the cumulant expansion.

$F_{AB} = -log<e^{-f_{AB}(x)}>_B \approx <f_{AB}(x)>_B - \frac{1}{2}<\delta^2 f_{AB}(x)>_B$, (1)

$F_{AB} = log<e^{-f_{BA}(x)}>_A \approx <f_{AB}(x)>_A + \frac{1}{2}<\delta^2 f_{BA}(x)>_A$, (2)

then, we can add up the above two equations with a given coefficient, we can get the following approximation, if the second order is small,

$F_{AB} \approx c_B <f_{AB}(x)>_B + c_A <f_{AB}(x)>_A$

$\approx \sum_n f_{AB}(x_n), x_n \in P_{mix}(x)$. (3)