# Free energy perturbation

Background

Free energy perturbation (FEP) is a general method to calculate the free energy difference between two states (say, A and B). In theory, it follows

$F_{AB} = -\log\langle e^{-f_{AB}(x)}\rangle_B$, or $F_{BA} = -\log\langle e^{-f_{BA}(x)}\rangle_A$,

where $f_{AB}(x)$ is the unitless relative reduced energy. For example, if we assume the kinetic energies of the two states are the same, then $f_{AB}(x) = \beta E_A(x) - \beta E_B(x) = f_A(x) - f_B(x)$. The bracket $\langle\cdot\rangle_B$ denotes an average over samples drawn from the ensemble generated by the potential energy $E_B(x)$. In principle, either ensemble can be used to calculate the free energy difference, and the two estimates should agree, i.e. $F_{AB} = -F_{BA}$. In practice, with finite sampling they usually differ because the two states populate different ensembles, so we often find $F_{AB} + F_{BA} \neq 0$, particularly when the states have little conformational overlap.
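The overlap problem can be seen on a toy model. The sketch below is not part of this tutorial's workflow: it assumes two one-dimensional harmonic states of equal width, $f_A(x) = x^2/2$ and $f_B(x) = (x-d)^2/2$, for which the exact $F_{AB}$ is zero. The helper names `fep` and `forward_and_reverse` are illustrative only.

```python
import numpy as np

def fep(delta_f):
    """Estimate -log< exp(-delta_f) > with a max-shift for numerical stability."""
    a = -np.asarray(delta_f)
    return -(a.max() + np.log(np.mean(np.exp(a - a.max()))))

def forward_and_reverse(d, n=20000, seed=0):
    """Return (F_AB, F_BA) for two unit-width harmonic states separated by d."""
    rng = np.random.default_rng(seed)
    f_A = lambda x: 0.5 * x**2
    f_B = lambda x: 0.5 * (x - d)**2
    x_A = rng.normal(0.0, 1.0, n)   # samples from the ensemble of state A
    x_B = rng.normal(d, 1.0, n)     # samples from the ensemble of state B
    F_AB = fep(f_A(x_B) - f_B(x_B))  # average over B
    F_BA = fep(f_B(x_A) - f_A(x_A))  # average over A
    return F_AB, F_BA

for d in (0.5, 8.0):
    F_AB, F_BA = forward_and_reverse(d)
    print("d =", d, " F_AB + F_BA =", F_AB + F_BA)
```

With a small separation the two estimates nearly cancel, whereas with a large separation both are biased in the same direction and their sum is far from zero.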

To my knowledge, there are two ways to reduce this discrepancy. One common way is to create multiple intermediate states between $A$ and $B$ such that any two neighboring states have sufficient conformational overlap. The other is a reweighting strategy that combines the two states, so that both ensembles are used to estimate $F_{AB}$ and related properties. Importantly, the second way is normally combined with the first. In this tutorial, I introduce how to use both ways to calculate the free energy difference between two states, using a modified alanine dipeptide dimer as an example, and provide some Python scripts to run these FEP calculations. All simulations were run with OpenMM, which provides a Python API.
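As a rough illustration of the first way, consider a toy model that is not this tutorial's system: one-dimensional harmonic states $f_\lambda(x) = (x-\lambda)^2/2$, for which every free energy difference is exactly zero. The sketch compares a single perturbation between $\lambda = 0$ and $\lambda = 8$ with a chain of 16 neighboring perturbations. The function names are hypothetical.

```python
import numpy as np

def fep(delta_f):
    """Estimate -log< exp(-delta_f) > with a max-shift for numerical stability."""
    a = -np.asarray(delta_f)
    return -(a.max() + np.log(np.mean(np.exp(a - a.max()))))

def chained_fep(centers, n=20000, seed=0):
    """Sum the neighbor estimates F_{i,i+1} = -log< exp(-(f_i - f_{i+1})) >_{i+1}."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for c_i, c_j in zip(centers[:-1], centers[1:]):
        x = rng.normal(c_j, 1.0, n)  # samples from the ensemble of state i+1
        total += fep(0.5 * (x - c_i)**2 - 0.5 * (x - c_j)**2)
    return total

direct = chained_fep(np.array([0.0, 8.0]))          # one big step, poor overlap
staged = chained_fep(np.linspace(0.0, 8.0, 17))     # 16 small steps, good overlap
print("direct:", direct, " staged:", staged, " exact: 0.0")
```

The staged estimate stays close to the exact value because each neighboring pair overlaps well, which is the same reason the tutorial uses closely spaced distance windows.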

Reference

Free energy perturbation. Link: https://en.wikipedia.org/wiki/Free_energy_perturbation

# Tutorial: free energies of ALAM-ALAD dimer

## Model system

Free energies of a dimer: modified alanine dipeptide (ALAM) - alanine dipeptide (ALAD)

This tutorial uses the modified alanine dipeptide dimer shown in Fig. 1 of the reference below, which was designed to mimic a backbone hydrogen-bonding environment. The dimer has a hydrogen bond formed by ALAM:HL(8) and ALAD:OR(30). Our goal is to calculate the free energy, or potential of mean force (PMF), profile along the separation distance (0.1 to 1.2 nm) between the two dipeptides, each of which is held fixed while the distance is varied. A Gromacs-format structure is provided here; it is solvated with TIP3P water molecules, and the current distance between atoms 8 and 30 is 1.2 nm.

Reference

Im, Wonpil, Jianhan Chen, and Charles L. Brooks III. "Peptide and protein folding and conformational equilibria: theoretical treatment of electrostatics and hydrogen bonding with implicit solvent models." Advances in protein chemistry 72 (2005): 173-198. Link: https://doi.org/10.1016/S0065-3233(05)72007-6

## General steps

To calculate the PMF profile of this dimer along its distance, we have to design multiple states with different distances, and then use all ensembles to calculate their free energies. One choice is to vary the distance from 1.2 to 0.1 nm in steps of -0.01 nm, which gives 111 states in total. Normally, it is better to first generate the 111 initial structures, in which the dimer has different distances. Then, we can fix the dimer and run 111 standard MD simulations to collect all conformational ensembles. In this way, the 111 simulation jobs can be submitted at the same time, which saves wall-clock time if each simulation needs a long run to sample sufficient conformations. The following describes several general steps to run the FEP calculations.

• Design all states and generate their initial conformations: one way is to use the current structure to generate the initial structure of the next state, followed by an MD equilibration.

• Run standard MD simulations for all conformations by using the given potential energies

• Collect all samples of two neighbor states, and then calculate their properties of interest
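The state design in the first step can be sketched in a few lines. The variable names below are illustrative; the label format mirrors the file names used later in this tutorial (for example "d12.0" for 1.20 nm).

```python
import numpy as np

d_start, d_end, step = 1.2, 0.1, 0.01   # nm

# Number of windows, including both end points
n_windows = int(round((d_start - d_end) / step)) + 1
distances = np.linspace(d_start, d_end, n_windows)   # nm

# One label per window, written in Angstrom with one decimal place
labels = ["d{0:.1f}".format(10.0 * d) for d in distances]

print(n_windows, labels[:3], labels[-1])
```

Rounding to one decimal place in Angstrom is enough to give every window a unique label at this spacing.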

### Prepare an initial structure and topology by using Gromacs

The force field used here is charmm36-feb2021.ff. Although this force field includes the topology and parameters of alanine dipeptide, it does not have the topology of the modified alanine dipeptide. We therefore have to add the topology shown below to "charmm36-feb2021.ff/merged.rtp", for example by appending the following text to the end of that rtp file. Note that the parameter files do not need to change, because the parameters of these atom types, bonds, and angles are already present in the current force field.

; additional
[ ALAM ]; modified alanine dipeptide
[ atoms ]
CL   CT3   -0.270  0
HL1   HA3    0.090  1
HL2   HA3    0.090  2
HL3   HA3    0.090  3
CLP     C    0.510  4
OL     O   -0.510  5
NL   NH1   -0.470  6
HL     H    0.310  7
CA   CT1    0.070  8
HA   HB1    0.090  9
CB   CT3   -0.270 10
HB1   HA3    0.090 11
HB2   HA3    0.090 12
HB3   HA3    0.090 13
[ bonds ]
CL   CLP
CLP    NL
NL    CA
NL    HL
CA    HA
CA    CB
CL   HL1
CL   HL2
CL   HL3
CB   HB1
CB   HB2
CB   HB3
CLP    OL
[ impropers ]
CLP    CL    NL    OL
NL   CLP    CA    HL
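A quick sanity check before running pdb2gmx is to confirm that the new residue is neutral and that every bond refers to a defined atom. The sketch below parses a copy of the entry above; the helper `parse_rtp_entry` is hypothetical and handles only this simple layout, not the full rtp format.

```python
ALAM_RTP = """
[ ALAM ]; modified alanine dipeptide
[ atoms ]
CL   CT3   -0.270  0
HL1  HA3    0.090  1
HL2  HA3    0.090  2
HL3  HA3    0.090  3
CLP  C      0.510  4
OL   O     -0.510  5
NL   NH1   -0.470  6
HL   H      0.310  7
CA   CT1    0.070  8
HA   HB1    0.090  9
CB   CT3   -0.270 10
HB1  HA3    0.090 11
HB2  HA3    0.090 12
HB3  HA3    0.090 13
[ bonds ]
CL CLP
CLP NL
NL CA
NL HL
CA HA
CA CB
CL HL1
CL HL2
CL HL3
CB HB1
CB HB2
CB HB3
CLP OL
[ impropers ]
CLP CL NL OL
NL CLP CA HL
"""

def parse_rtp_entry(text):
    """Split one rtp entry into {section name: list of token lists}."""
    sections, current = {}, None
    for raw in text.splitlines():
        line = raw.split(';')[0].strip()      # drop comments
        if not line:
            continue
        if line.startswith('['):
            current = line.strip('[] ')
            sections[current] = []
        elif current is not None:
            sections[current].append(line.split())
    return sections

entry = parse_rtp_entry(ALAM_RTP)
charges = {atom[0]: float(atom[2]) for atom in entry['atoms']}
net_charge = sum(charges.values())
undefined = [name for bond in entry['bonds'] for name in bond if name not in charges]

print("atoms:", len(charges), " net charge:", round(net_charge, 6), " undefined:", undefined)
```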



Then, we can use Gromacs to generate the topology file of this model system by typing

gmx pdb2gmx -f ala2.solv.gro -p ala2.solv.top -o ala2.solv.pdb

Selections: 1 for the force field (CHARMM36 all-atom force field (July 2020), read from the current directory) and 1 for the water model (TIP3P, recommended by default).

### Obtain all initial equilibrium structures using OpenMM

Here, we first create a data folder, and then run the Python script "genstr.py". This script generates a system xml file, a dcd file for visualization, and state xml files that save all equilibrated states.

python genstr.py # this will generate the system, state, and dcd files
vmd ala2.solv.gro ala2.solv.eq.dcd # visualize the dcd files



The following can be saved as "genstr.py".

# OpenMM
import simtk.openmm as omm # contains functions for MD
import simtk.openmm.app as app # contains functions for i/o
from simtk import unit # controls unique object types for physical units
import sys
import numpy as np
import mdtraj as md
import time

start = time.time()

# setting
temperature = 300e0*unit.kelvin
pdbid = 'ala2.solv'

# Platforms
platform = omm.Platform.getPlatformByName('CUDA')
properties = {'CudaPrecision': 'mixed', 'DeviceIndex': '0'}

# create a modeller (the box vectors are passed to the topology reader)
gro = app.GromacsGroFile(pdbid+'.gro')
top = app.GromacsTopFile(pdbid+'.top', periodicBoxVectors=gro.getPeriodicBoxVectors(),
                         includeDir='./charmm36-feb2021.ff')
modeller = app.Modeller(top.topology, gro.positions)

# create a system
system = top.createSystem(nonbondedMethod=app.PME,
                          nonbondedCutoff=1.2e0*unit.nanometers,
                          rigidWater=True)

# atom selections with mdtraj
# NOTE: the line defining `pdb` is missing from the original listing;
# loading the same .gro file with mdtraj is an assumption.
pdb = md.load(pdbid+'.gro')
inxALAM = pdb.topology.select('resname ALAM')
inxHL = int(pdb.topology.select('resname ALAM and name HL')[0])
inxOR = int(pdb.topology.select('resname ALAD and name OR')[0])

# fix ala2 using zero mass
resid = pdb.topology.select('not waters')
for k in resid:
    system.setParticleMass(int(k), 0.e0*unit.dalton)

# save a system
with open(pdbid+'.system.xml', 'w') as f:
    f.write(omm.XmlSerializer.serialize(system))

# generate trajectories
pos = modeller.getPositions()
dd = -0.01e0
distArr = np.linspace(1.2e0,0.1e0,111,endpoint=True)
nwindows = len(distArr)
print("# #windows to generate: ", nwindows)
print("----------------------------")
for k in range(0,nwindows):
    print("#generating the window >> ", k)

    # create an integrator and simulation
    dt = 1.e0*unit.femtoseconds
    nstep = 50*1000 # 50 ps
    prnfrq = nstep
    integrator = omm.LangevinIntegrator(temperature, 1.e0/unit.picosecond, dt)
    simulation = app.Simulation(modeller.topology, system, integrator, platform=platform, platformProperties=properties)
    rep = app.StateDataReporter(sys.stdout, prnfrq, separator=' ', step=True, time=True, potentialEnergy=True, kineticEnergy=True, temperature=True, totalEnergy=True)
    simulation.reporters.append(rep)
    # write one frame per window; append after the first window
    if k==0: simulation.reporters.append(app.DCDReporter(pdbid+'.eq.dcd', nstep))
    if k>0: simulation.reporters.append(app.DCDReporter(pdbid+'.eq.dcd', nstep, append=True))

    # update positions
    pos = modeller.getPositions() #traj.openmm_positions(frame=j)
    simulation.context.setPositions(pos)

    # energy calculations
    # >>
    state = simulation.context.getState(getPositions=True,getEnergy=True)
    pos = state.getPositions()
    ener = state.getPotentialEnergy()
    dist = unit.norm(pos[inxHL] - pos[inxOR])
    print('#e0: Distance(HL7,OR29)=', dist, 'ener=', ener)

    ## energy minimization
    #simulation.minimizeEnergy()
    # >>
    #state = simulation.context.getState(getPositions=True,getEnergy=True)
    #pos = state.getPositions()
    #ener = state.getPotentialEnergy()
    #dist = unit.norm(pos[inxHL] - pos[inxOR])
    #print('em: Distance(HL7,OR29)=', dist, 'ener=', ener)

    # md equilibration
    simulation.step(nstep)
    # >>
    state = simulation.context.getState(getPositions=True,getVelocities=True,getEnergy=True)
    pos = state.getPositions()
    ener = state.getPotentialEnergy()
    dist = unit.norm(pos[inxHL] - pos[inxOR])
    print('#eq: Distance(HL7,OR29)=', dist, 'ener=', ener)

    # save state (file name carries the distance in Angstrom)
    output = './data/'+pdbid+'.d'+"{0:.1f}".format(dist/unit.angstroms)
    #app.PDBFile.writeFile(modeller.topology, modeller.positions, open(output+".pdb", 'w'), keepIds=True)
    simulation.saveState(output+'.state.xml')

    # move to next window: shift ALAM along x
    xyz = np.array(pos/unit.nanometer)
    xyz[inxALAM,0] -= dd*0.5e0
    gro.positions = xyz*unit.nanometer
    modeller = app.Modeller(top.topology, gro.positions)

    print('')

end = time.time()
print("# Elapsed wall-clock time (s): ", end - start)


### Obtain the trajectories of all production simulations using OpenMM

Once all initial equilibrated states are saved, we can use them for the MD production simulations. Besides each state, we also need the system xml file that includes all forces. Then, we can use the following Python script, "gendcd.py", to run the simulations. After running them, we obtain the trajectory of each simulation in the "./data/" folder.

# OpenMM
import simtk.openmm as omm # contains functions for MD
import simtk.openmm.app as app # contains functions for i/o
from simtk import unit # controls unique object types for physical units
import sys
import numpy as np
import mdtraj as md
import time
import os

start = time.time()

pdbid = 'ala2.solv'

# Platforms
platform = omm.Platform.getPlatformByName('CUDA')
properties = {'CudaPrecision': 'mixed', 'DeviceIndex': '0'}

gro = app.GromacsGroFile(pdbid+'.gro')
top = app.GromacsTopFile(pdbid+'.top', periodicBoxVectors=gro.getPeriodicBoxVectors(),
                         includeDir='./charmm36-feb2021.ff')
# load the system (all forces, zero-mass dimer) saved by genstr.py
with open(pdbid+'.system.xml','r') as f:
    system = omm.XmlSerializer.deserialize(f.read())

# set up simulations
# NOTE: the line defining `pdb` is missing from the original listing;
# loading the same .gro file with mdtraj is an assumption.
pdb = md.load(pdbid+'.gro')
inxHL = int(pdb.topology.select('resname ALAM and name HL')[0])
inxOR = int(pdb.topology.select('resname ALAD and name OR')[0])
distArr = np.linspace(1.2e0,0.1e0,111,endpoint=True)
nwindows = len(distArr)
print("# Number of windows to generate: ", nwindows)
print("----------------------------")
for k in range(0,nwindows):
    print("# Generating the window >> ", k)

    # load the equilibrated state of this window
    output = './data/'+pdbid+'.d'+"{0:.1f}".format(distArr[k]*10e0)
    with open(output+'.state.xml','r') as f:
        state = omm.XmlSerializer.deserialize(f.read())

    # create an integrator and simulation
    nstep = 1000*500 # 1 ns
    prnfrq = 500     # one frame per ps, 1000 frames in total
    dt = 2.e0*unit.femtoseconds
    temperature = 300e0*unit.kelvin
    integrator = omm.LangevinIntegrator(temperature, 1.e0/unit.picosecond, dt)
    simulation = app.Simulation(top.topology, system, integrator, platform=platform, platformProperties=properties)
    dcdfile = output+'.dcd'
    simulation.reporters.append(app.DCDReporter(dcdfile, prnfrq))
    #if os.path.isfile(dcdfile):
    #    print('# appending the dcd file')
    #    simulation.reporters.append(app.DCDReporter(output+'.dcd', prnfrq, append=True))
    #else:
    #    simulation.reporters.append(app.DCDReporter(output+'.dcd', prnfrq))
    #rep = app.StateDataReporter(sys.stdout, prnfrq, separator=' ', step=True, time=True, potentialEnergy=True, kineticEnergy=True, temperature=True, totalEnergy=True)
    #simulation.reporters.append(rep)

    # update the state
    simulation.context.setPeriodicBoxVectors(*state.getPeriodicBoxVectors())
    simulation.context.setPositions(state.getPositions())
    simulation.context.setVelocities(state.getVelocities())
    simulation.context.setTime(state.getTime())

    # energy calculations
    # >>
    state = simulation.context.getState(getPositions=True,getEnergy=True)
    pos = state.getPositions()
    ener = state.getPotentialEnergy()
    dist = unit.norm(pos[inxHL] - pos[inxOR])
    print('#e0: Distance(HL7,OR29)=', dist, 'ener=', ener)

    # md simulations
    simulation.step(nstep)
    # >>
    state = simulation.context.getState(getPositions=True,getVelocities=True,getEnergy=True)
    pos = state.getPositions()
    ener = state.getPotentialEnergy()
    dist = unit.norm(pos[inxHL] - pos[inxOR])
    print('#md: Distance(HL7,OR29)=', dist, 'ener=', ener)
    print('')

    # save state
    statefile = './data/'+pdbid+'.d'+"{0:.1f}".format(dist/unit.angstroms)
    simulation.saveState(statefile+'.state.xml')

end = time.time()
print("# Elapsed wall-clock time (s): ", end - start)


### Free energy perturbation

#### Methodology

We have finished all MD simulations and obtained the trajectories in the data folder. Next, we can use them to calculate any property of interest. Before calculating properties, we need enough samples that follow a specific probability distribution.

As we know, any property can be calculated from the following equation,

$\langle A\rangle_P = \int A(x)P(x)dx$, (1)

where $A(x)$ is the property to calculate, and $P(x)$ is our target probability distribution. So, the next question is how we can generate samples that follow the target $P(x)$. One important approach is a reweighting strategy that combines two neighboring states. For example, we can collect the trajectory samples from two dcd files (state 1: ala2.solv.d12.0.dcd and state 2: ala2.solv.d11.9.dcd), sampled in two simulations in which the distances are 1.20 and 1.19 nm, respectively. Then, we can construct the mixture distribution that the pooled samples follow,

$P_{mix}(x) = \sum_{k=1}^K c_k e^{F_k - f_k(x)}$, (2)

$c_k = \frac{n_k}{\sum_{k'} n_{k'}} = \frac{n_k}{N}$,

where $K$ is 2, $F_k$ is the relative free energy of the $k^{th}$ state, $n_k$ is the number of samples drawn from that state, and $f_k(x) = \beta_k E_k(x)$. This mixture probability describes the pooled samples. In this way, given all samples, we know how to calculate the mixture probability, and all properties can be calculated from it.

Then the Eq. (1) can be further derived as follows,

$\langle A\rangle_P = \int A(x)P(x)dx = \int A(x)\frac{P(x)}{P_{mix}(x)} P_{mix}(x)dx$

$\approx \sum_{n=1}^{N} A(x_n) w(x_n), x_n \sim P_{mix}(x)$, (3)

where $w(x_n) = \frac{1}{N} \frac{P(x_n)}{P_{mix}(x_n)}$ is the weight of sample $x_n$, subject to $\sum_{n} w(x_n) = 1$.
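A minimal numerical sketch of this reweighting, under stated assumptions: two toy one-dimensional harmonic states with known $F_k$ stand in for the two neighboring windows, and the target $P(x)$ is chosen to be the first state. None of the variable names come from the tutorial scripts.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.0, 1.0])                 # centers of the two toy states
n_k = np.array([10000, 10000])            # samples drawn from each state
N = n_k.sum()

def f(k, x):
    """Reduced energy of toy state k."""
    return 0.5 * (x - mu[k])**2

# For these states Z_k = sqrt(2*pi), so F_k = -log Z_k is known exactly
F = np.full(2, -0.5 * np.log(2.0 * np.pi))

# Pool the samples of both states
x = np.concatenate([rng.normal(mu[0], 1.0, n_k[0]),
                    rng.normal(mu[1], 1.0, n_k[1])])

# Mixture probability P_mix(x_n) = sum_k c_k exp(F_k - f_k(x_n))
c = n_k / N
p_k = np.array([np.exp(F[k] - f(k, x)) for k in range(2)])   # shape (2, N)
p_mix = c @ p_k

# Weights for the target P = state 0 and two reweighted averages
w = p_k[0] / p_mix / N
mean_x = np.sum(w * x)          # exact value: 0
mean_x2 = np.sum(w * x**2)      # exact value: 1

print("sum of weights:", w.sum(), " <x>:", mean_x, " <x^2>:", mean_x2)
```

In a real calculation the $F_k$ are not known in advance, which is why they have to be solved for self-consistently from the same pooled samples.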

In this way, combining the samples of both states ensures that the following equation always holds,

$F_{AB} + F_{BA} = 0$. (4)

#### Calculating energies and their perturbations

First, we need to calculate the reduced energies, $f_k(x_n)$, k = 1,2,...,111, where the samples $x_n$ run over the trajectories of any two neighboring states. The Python script is given below.

# OpenMM
import simtk.openmm as omm # contains functions for MD
import simtk.openmm.app as app # contains functions for i/o
from simtk import unit # controls unique object types for physical units
import sys
import numpy as np
import mdtraj as md
import time
import os

start = time.time()

kB = 0.593e0/298.e0 # kcal/(mol K)
T = 300.e0
beta = 1.e0/(kB*T)
pdbid = 'ala2.solv'

# Platforms
platform = omm.Platform.getPlatformByName('CUDA')
properties = {'CudaPrecision': 'mixed', 'DeviceIndex': '0'}

gro = app.GromacsGroFile(pdbid+'.gro')
top = app.GromacsTopFile(pdbid+'.top', periodicBoxVectors=gro.getPeriodicBoxVectors(),
                         includeDir='./charmm36-feb2021.ff')
# load the system saved by genstr.py
with open(pdbid+'.system.xml','r') as f:
    system = omm.XmlSerializer.deserialize(f.read())

# set up simulations
dd = -0.01e0
# NOTE: the line defining `pdb` is missing from the original listing;
# loading the same .gro file with mdtraj is an assumption.
pdb = md.load(pdbid+'.gro')
inxALAM = pdb.topology.select('resname ALAM')
inxHL = int(pdb.topology.select('resname ALAM and name HL')[0])
inxOR = int(pdb.topology.select('resname ALAD and name OR')[0])
distArr = np.linspace(1.2e0,0.1e0,111,endpoint=True)
nwindows = len(distArr)
print("# Number of windows to generate: ", nwindows)
print("----------------------------")

# create an integrator and simulation (used only to evaluate energies)
dt = 2.e0*unit.femtoseconds
temperature = 300e0*unit.kelvin
integrator = omm.LangevinIntegrator(temperature, 1.e0/unit.picosecond, dt)
simulation = app.Simulation(top.topology, system, integrator, platform=platform, platformProperties=properties)

for k in range(0,nwindows):
    print("# Generating the window >> ", k)

    output = './data/'+pdbid+'.d'+"{0:.1f}".format(distArr[k]*10e0)

    # NOTE: the line loading `traj` is missing from the original listing;
    # reading the production dcd of this window is an assumption.
    traj = md.load(output+'.dcd', top=pdb.topology)
    nframe = len(traj)
    save_fep = np.zeros((nframe,6))
    print('# nframe = ', nframe)
    for n in range(0,nframe):
        pos = traj.openmm_positions(frame=n)
        j = 0
        # previous window, this window, next window
        for dmove in [-dd*0.5e0, 0.e0, dd*0.5e0]:
            # update positions
            xyz = np.array(pos/unit.nanometer)
            xyz[inxALAM,0] -= dmove
            simulation.context.setPositions(xyz*unit.nanometer)
            state = simulation.context.getState(getPositions=True,getEnergy=True)
            pos_fep = state.getPositions()
            dist_fep = unit.norm(pos_fep[inxHL] - pos_fep[inxOR])/unit.angstroms
            ener_fep = state.getPotentialEnergy()/unit.kilocalorie_per_mole
            save_fep[n,j] = dist_fep          # distance (Ang)
            save_fep[n,j+1] = ener_fep*beta   # reduced energy (unitless)
            j += 2
    np.savetxt(output+'.fener', save_fep, header="# dm1, u(dm1), d0, u(d0), d1, u(d1)")

end = time.time()
print("# Elapsed wall-clock time (s): ", end - start)


#### Free energies by MBAR reweighting

The relative free energies, $F_k$, can be further written as follows,

$F_k = -\log \int e^{-f_k(x)}dx \approx -\log \sum_{n=1}^{N} \frac{e^{-f_k(x_n)}}{\sum_{k'=1}^K n_{k'} e^{F_{k'} - f_{k'}(x_n)}}, k=1,2$. (1)

Here, the pymbar and FastMBAR packages can be used to solve for the relative $F_k$ of two neighboring states. We can then take a cumulative sum of the relative free energy differences between neighboring states, so that the free energy of the $n^{th}$ state, relative to the first state, can be written as
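To make the self-consistent character of Eq. (1) concrete, here is a plain NumPy sketch of a fixed-point iteration. It is not how pymbar or FastMBAR are implemented; `solve_mbar` and `logsumexp` are names introduced here, and the test data are two toy harmonic states whose exact free energy difference is $\log 2$.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    s = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return np.squeeze(s, axis=axis)

def solve_mbar(u_kn, n_k, n_iter=2000, tol=1e-10):
    """Iterate F_k = -log sum_n exp(-u_kn) / sum_k' n_k' exp(F_k' - u_k'n)."""
    log_n = np.log(np.asarray(n_k, dtype=float))[:, None]
    F = np.zeros(u_kn.shape[0])
    for _ in range(n_iter):
        log_den = logsumexp(log_n + F[:, None] - u_kn, axis=0)   # one value per sample
        F_new = -logsumexp(-u_kn - log_den[None, :], axis=1)     # one value per state
        F_new -= F_new[0]                                        # fix the arbitrary offset
        converged = np.max(np.abs(F_new - F)) < tol
        F = F_new
        if converged:
            break
    return F

# Toy data: state 0 is N(0, 1); state 1 is N(1, 0.5), i.e. f_1 = 2 (x - 1)^2
rng = np.random.default_rng(2)
n_k = np.array([5000, 5000])
x = np.concatenate([rng.normal(0.0, 1.0, n_k[0]), rng.normal(1.0, 0.5, n_k[1])])
u_kn = np.vstack([0.5 * x**2, 2.0 * (x - 1.0)**2])   # reduced energies, shape (K, N)

F = solve_mbar(u_kn, n_k)
exact = 0.5 * np.log(4.0)     # F_1 - F_0 = -log(Z_1 / Z_0)
print("F_1 - F_0 =", F[1] - F[0], " exact =", exact)
```

The array `u_kn` has the same layout as the `ukn` array in the analysis script of this section: one row per state, one column per pooled sample.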

$F_{n0} = \sum_{k=1}^n (F_{k} - F_{k-1})$. (2)
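The cumulative sum in Eq. (2) can be sketched as follows. The profile used here is made up purely to check the bookkeeping; it is not data from the simulations, and the variable names are illustrative.

```python
import numpy as np

dist = np.linspace(12.0, 1.0, 111)                  # window distances (Ang)
true_profile = -2.0 * np.exp(-(dist - 2.0)**2)      # made-up free energy profile (kT)

# Neighbor differences F_k - F_{k-1}, as an estimator would deliver them
dF = np.diff(true_profile)

# Free energy of every window relative to the first one
F = np.concatenate(([0.0], np.cumsum(dF)))

# Shift the profile so that its long-distance plateau averages to zero
dcut = 8.0
pmf = F - np.mean(F[dist > dcut])

print(pmf.min(), dist[np.argmin(pmf)])
```

Because each difference carries its own statistical error, the uncertainty of the accumulated profile grows with the number of windows summed.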

Reference

pymbar installation: https://github.com/choderalab/pymbar

FastMBAR installation: https://github.com/xqding/FastMBAR

import numpy as np
from pymbar import MBAR
import glob
import os
import time

start = time.time()

# analysis
files = glob.glob('./data/*.fener')
files.sort(key=os.path.getmtime) # windows in the order they were written
#print(files)
nfiles = len(files)
fener = []
nframe = 1000
ukn = np.zeros((2,2*nframe))
for k in range(0,nfiles-1):
    # reduced energies of two neighbor windows
    dat1 = np.loadtxt(files[k])
    dat2 = np.loadtxt(files[k+1])
    d1 = np.mean(dat1[:,2])
    d2 = np.mean(dat2[:,2])
    # row 0: energies in state 1; row 1: energies in state 2
    ukn[0,0:nframe] = dat1[:,3]
    ukn[0,nframe:nframe*2] = dat2[:,1]
    ukn[1,0:nframe] = dat1[:,5]
    ukn[1,nframe:2*nframe] = dat2[:,3]
    mbar = MBAR(ukn, np.array([nframe, nframe])) # assume the number of their samples is the same
    #print(mbar.f_k)
    fener.append([d1, d2, mbar.f_k[0], mbar.f_k[1]])
np.savetxt('./data/fener.dat', fener)

end = time.time()
print("# elapsed wall-clock time (s): ", end - start)


### Plot PMF profile

%cd '/home/ping/tutorial/fep'

import numpy as np
from matplotlib import pyplot as plt
plt.rcParams['figure.figsize'] = [8, 6]
plt.rcParams.update({'font.size': 16})
plt.rcParams.update({'lines.linewidth': 3})

kB = 0.593/298 # kcal/(mol K)
T = 300
beta = 1.0/(kB*T)
dcut = 8.0 # cutoff distance (Ang)

# CHARMM
dist = dat[:,0]
fk = dat[:,1]*beta
plt.plot(dist, fk-np.mean(fk[dist>dcut]), 'k-', label='FEP-100ps (Chen, 2004)')

# OpenMM
dist = dat[:,0]
fk = np.cumsum(dat[:,1])
plt.plot(dist, fk-np.mean(fk[dist>dcut]), 'g-', label='FEP-1ns (Gong, 2022)')

plt.ylim([-3, 3])
x=plt.xticks(np.arange(1,13,1))
plt.legend()
plt.xlabel(r'Dist ($\AA$)')
plt.ylabel('PMF (kT)')
plt.title('ALAM-ALAD dimer')

[Figure: PMF (kT) of the ALAM-ALAD dimer versus distance (Å), comparing FEP-100ps (Chen, 2004) and FEP-1ns (Gong, 2022).]

# Appendix

## A simple approximation of free energy difference without iterations

Here, we derive a simple way to calculate the free energy difference, $F_{AB}$.

We first decompose the free energy using the cumulant expansion,

$F_{AB} = -\log\langle e^{-f_{AB}(x)}\rangle_B \approx \langle f_{AB}(x)\rangle_B - \frac{1}{2}\langle\delta^2 f_{AB}(x)\rangle_B$, (1)

$F_{AB} = \log\langle e^{-f_{BA}(x)}\rangle_A \approx \langle f_{AB}(x)\rangle_A + \frac{1}{2}\langle\delta^2 f_{BA}(x)\rangle_A$, (2)

Adding the two equations above, weighted by the coefficients $c_B$ and $c_A$ (with $c_A + c_B = 1$), gives the following approximation, provided the second-order terms are small or cancel,

$F_{AB} \approx c_B \langle f_{AB}(x)\rangle_B + c_A \langle f_{AB}(x)\rangle_A$

$\approx \frac{1}{N}\sum_{n=1}^{N} f_{AB}(x_n), x_n \sim P_{mix}(x)$. (3)
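A small numerical check of this approximation, on an assumed toy model rather than the dimer: two harmonic states of equal width and equal sample counts, for which the exact $F_{AB}$ is zero and the two second-order terms cancel. The one-sided first-order averages are shown for comparison.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 1.0, 20000
x_A = rng.normal(0.0, 1.0, n)        # samples from state A, f_A = x^2 / 2
x_B = rng.normal(d, 1.0, n)          # samples from state B, f_B = (x - d)^2 / 2

def f_AB(x):
    """Relative reduced energy f_A(x) - f_B(x)."""
    return 0.5 * x**2 - 0.5 * (x - d)**2

first_order_B = f_AB(x_B).mean()     # <f_AB>_B, expected +d^2/2
first_order_A = f_AB(x_A).mean()     # <f_AB>_A, expected -d^2/2

# Pooled average over all samples, i.e. c_A = c_B = 1/2
pooled = f_AB(np.concatenate([x_A, x_B])).mean()

print("<f_AB>_B =", first_order_B, " <f_AB>_A =", first_order_A, " pooled =", pooled)
```

The pooled average recovers the exact value here only because the variances of $f_{AB}$ in the two ensembles are equal; when they differ, the neglected second-order terms leave a residual error.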
