A Jupyter Notebook to aid Docking to MurD protein

This notebook implements a typical protocol for docking ligands to a target protein. It uses RDKit (http://www.rdkit.org) to generate a number of reasonable conformations for each ligand and then uses SMINA (https://sourceforge.net/projects/smina/) to do the docking. Two methods of docking are implemented, the first docks into a rigid receptor, the second sets the protein side-chains around the active site to be flexible. Bear in mind flexible docking will be much, much slower. In the optional final step the resulting docked poses are rescored using a random forest model described in https://www.nature.com/articles/srep46710.

In [1]:
import sys
from collections import defaultdict
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import PandasTools
import pandas as pd
IPythonConsole.ipython_3d=True
%pylab inline
Populating the interactive namespace from numpy and matplotlib

File location of structures for docking and file format

First we need get the location of the input file of structures you want to dock, replace "asinexSelection.sdf" with your file. You may want to rename the output file for conformations, and the output file containing the docked structures.

The sdf file needs to have the name included in the first line of each molecule record.

AEM 10028511 MOE2019 2D

22 24 0 0 0 0 0 0 0 0999 V2000 7.2040 -6.7290 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 6.3790 -6.7290 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0</code>

In [2]:
# File locations
sdfFilePath = 'asinexSelection.sdf' # The input file of structures to genrate conformations from
ConfoutputFilePath = 'asinexSelectionForDocking.sdf' # Output file containing conformations for docking

inputMols = [x for x in Chem.SDMolSupplier(sdfFilePath,removeHs=False)]
# Assign atomic chirality based on the structures:
len(inputMols) # Check how many strucures
Out[2]:
115
In [3]:
#Check that all molecules have a name
for i, mol in enumerate(inputMols):
    if mol is None:
        print('Warning: Failed to read molecule %s in %s' % (i, sdfFilePath))
    if not mol.GetProp('_Name'):

        print('Warning: No name for molecule %s in %s' % (i, sdfFilePath))

Conformation generation

We next generate conformations, this uses paralellisation code from http://www.rdkit.org/docs/Cookbook.html contributed by Andrew Dalke. We don't use all cores on a desktop machine or it might be unresponsive. If running on a cluster you should modify this.

In [4]:
import multiprocessing

# Download this from http://pypi.python.org/pypi/futures
from concurrent import futures

# conda install progressbar
import progressbar

#Find number cores available, leave two or system might be unresponsive
numcores = multiprocessing.cpu_count()
max_workers = numcores -2

#Knowledge based torsion generator http://pubs.acs.org/doi/abs/10.1021/acs.jcim.5b00654
# This function is called in the subprocess.
# The parameters (molecule and number of conformers) are passed via a Python

ps = AllChem.ETKDG()
ps.pruneRmsThresh=0.5
ps.numThreads=0
#Edit for number of confs desired eg n = 5
n=5
def generateconformations(m, n, name):
    m = Chem.AddHs(m)
    ids=AllChem.EmbedMultipleConfs(m, n, ps)
    for id in ids:
        AllChem.UFFOptimizeMolecule(m, confId=id)
    # EmbedMultipleConfs returns a Boost-wrapped type which
    # cannot be pickled. Convert it to a Python list, which can.
    return m, list(ids), name

smi_input_file, sdf_output_file = sys.argv[1:3]


writer = Chem.SDWriter(ConfoutputFilePath)
# suppl = [x for x in Chem.SDMolSupplier(sdfFilePath,removeHs=False)]
#suppl = Chem.SmilesMolSupplier(smi_input_file, titleLine=False)

# for mol in suppl:
#     print(mol.GetPropsAsDict(includePrivate=True).get('_Name'))

with futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
    # Submit a set of asynchronous jobs
    jobs = []
    for mol in inputMols:
        if mol:
            name = mol.GetProp('_Name')
            job = executor.submit(generateconformations, mol, n, name)
            jobs.append(job)

    widgets = ["Generating conformations; ", progressbar.Percentage(), " ",
               progressbar.ETA(), " ", progressbar.Bar()]
    pbar = progressbar.ProgressBar(widgets=widgets, maxval=len(jobs))
    for job in pbar(futures.as_completed(jobs)):
        mol, ids, name = job.result()
        mol.SetProp('_Name', name)
        for id in ids:
            writer.write(mol, confId=id)
writer.close()
Generating conformations; 100% Time: 0:00:03 |################################|
In [5]:
ms = [x for x in Chem.SDMolSupplier(ConfoutputFilePath,removeHs=False)]
# Assign atomic chirality based on the structures:
for m in ms: Chem.AssignAtomChiralTagsFromStructure(m)
len(ms) # check how many conformations
Out[5]:
520

Docking to Protein

After generating the conformations we can now do the docking. In this example we use smina which can be downloaded from https://sourceforge.net/projects/smina/ you will need to know where smina has been installed. The protein and ligand examples provided are taken from https://fragalysis.diamond.ac.uk/viewer/react/preview/target/MURD MURD-x0373.

Docking using smina
Need protein minus the ligand in pdb format,
the ligand extracted from binding site in pdb format,
Conformations to be docked as sdf from conformation generation above
DockedFilePath = 'All_Docked.sdf.gz' is the File for the Docked structures

In [9]:
ProteinForDocking = 'protein_minus_ligand.pdb'
LigandFromProtein = '373ligand_only.pdb'
DockedFilePath = 'All_Docked.sdf.gz'
FlexibleDockedFilePath = 'FlexDocked.sdf.gz'
In [7]:
!'/usr/local/bin/smina.osx' --cpu 10 --seed 0 --autobox_ligand '{LigandFromProtein}' -r '{ProteinForDocking}' -l '{ConfoutputFilePath}' -o '{DockedFilePath}'
   _______  _______ _________ _        _______ 
  (  ____ \(       )\__   __/( (    /|(  ___  )
  | (    \/| () () |   ) (   |  \  ( || (   ) |
  | (_____ | || || |   | |   |   \ | || (___) |
  (_____  )| |(_)| |   | |   | (\ \) ||  ___  |
        ) || |   | |   | |   | | \   || (   ) |
  /\____) || )   ( |___) (___| )  \  || )   ( |
  \_______)|/     \|\_______/|/    )_)|/     \|


smina is based off AutoDock Vina. Please cite appropriately.

Weights      Terms
-0.035579    gauss(o=0,_w=0.5,_c=8)
-0.005156    gauss(o=3,_w=2,_c=8)
0.840245     repulsion(o=0,_c=8)
-0.035069    hydrophobic(g=0.5,_b=1.5,_c=8)
-0.587439    non_dir_h_bond(g=-0.7,_b=0,_c=8)
1.923        num_tors_div

Using random seed: 0

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************

mode |   affinity | dist from best mode
     | (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
1       -5.3       0.000      0.000    
2       -5.1       2.016      3.142    
3       -4.8       1.423      1.851    
4       -4.7       3.069      4.407    
5       -4.7       3.339      4.565    
6       -4.6       1.612      2.046    
7       -4.4       1.960      2.913    
8       -4.1       3.758      6.191    
9       -4.0       3.359      5.149    
Refine time 1.448
Using random seed: 0

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************

mode |   affinity | dist from best mode
     | (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
1       -5.9       0.000      0.000    
2       -5.8       0.089      1.086    
3       -5.5       2.216      3.693    
4       -5.4       2.887      8.064    
5       -4.8       2.610      3.946    
6       -4.5       5.285      7.199    
7       -4.5       3.338      4.282    
8       -4.5       3.085      4.689    
9       -4.2       4.123      7.408    
Refine time 5.565
Using random seed: 0

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************

mode |   affinity | dist from best mode
     | (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
1       -5.6       0.000      0.000    
2       -5.5       2.266      3.747    
3       -5.2       2.040      3.152    
4       -4.8       2.595      3.944    
5       -4.8       4.747      5.513    
6       -4.7       1.222      1.401    
7       -4.6       0.767      1.328    
8       -4.6       1.943      2.996    
9       -4.6       3.083      4.577    
Refine time 5.425
Using random seed: 0

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************

mode |   affinity | dist from best mode
     | (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
1       -6.0       0.000      0.000    
2       -5.6       2.155      3.294    
3       -5.2       4.130      8.181    
4       -5.2       2.110      2.961    
5       -5.1       1.528      2.107    
6       -5.0       2.232      3.451    
7       -4.9       3.235      4.182    
8       -4.8       3.945      7.402    
9       -4.7       2.038      2.510    
Refine time 5.057
Using random seed: 0

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************

mode |   affinity | dist from best mode
     | (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
1       -5.8       0.000      0.000    
2       -5.5       2.218      3.536    
3       -5.4       2.838      8.036    
4       -4.7       1.125      1.185    
5       -4.7       4.835      5.557    
6       -4.7       2.587      3.921    
7       -4.6       3.366      4.265    
8       -4.5       5.322      6.170    
9       -4.5       3.061      4.534    
Refine time 6.167
Using random seed: 0

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************

mode |   affinity | dist from best mode
     | (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
1       -6.8       0.000      0.000    
2       -6.3       1.855      6.646    
3       -5.0       3.599      6.130    
4       -4.9       3.717      6.335    
5       -4.7       2.018      6.917    
6       -4.7       3.807      6.723    
7       -4.6       3.860      6.860    
8       -4.5       4.585      8.652    
9       -4.3       4.594      8.490    
Refine time 5.345
Using random seed: 0

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************

mode |   affinity | dist from best mode
     | (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
1       -5.5       0.000      0.000    
2       -5.2       3.829      6.237    
3       -4.9       3.755      6.272    
4       -4.8       2.030      6.219    
5       -4.7       3.609      6.320    
6       -4.7       3.691      7.025    
7       -4.4       4.213      6.779    
8       -4.3       4.441      7.514    
9       -4.2       5.020      7.691    
Refine time 4.510
Using random seed: 0

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************

mode |   affinity | dist from best mode
     | (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
1       -5.3       0.000      0.000    
2       -5.1       2.076      3.058    
3       -4.7       3.473      6.215    
4       -4.7       2.586      4.514    
5       -4.6       2.587      4.702    
6       -4.4       2.961      5.098    
7       -4.3       2.106      3.206    
8       -4.3       3.652      5.868    
9       -4.3       2.889      4.224    
Refine time 4.677
Using random seed: 0

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************

mode |   affinity | dist from best mode
     | (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
1       -6.6       0.000      0.000    
2       -6.2       1.655      6.528    
3       -5.1       3.742      6.675    
4       -4.9       3.559      6.232    
5       -4.8       3.717      6.443    
6       -4.6       5.836      8.638    
7       -4.4       4.421      9.008    
8       -4.4       4.010      7.120    
9       -4.1       6.044      10.137   
Refine time 5.058
Using random seed: 0

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************

mode |   affinity | dist from best mode
     | (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
1       -14.1      0.000      0.000    
2       -13.5      1.554      2.163    
3       -12.2      2.490      5.786    
4       -12.1      1.790      2.272    
5       -11.6      2.913      5.077    
6       -11.6      2.851      4.187    
Refine time 191.160
Loop time 78951.081

Rescoring using Random Forest Model

Optional, Rescore using a random forest model described in https://www.nature.com/articles/srep46710

Download from https://github.com/oddt/rfscorevs You will need the path to the binary

Path to protein containing ligand in pdb format
protein_plus_373ligand from Diamond
File to store rescored resukts

In [11]:
TargetProtein = 'protein_plus_373ligand.pdb'
scoreResults = 'DockedRescored.csv'
In [12]:
!/usr/local/bin/rf-score-vs --receptor '{TargetProtein}' '{DockedFilePath}' -o csv -O '{scoreResults}' --field name --field RFScoreVS_v2
In [ ]:
 
In [13]:
docked_df = PandasTools.LoadSDF(DockedFilePath,molColName='Molecule')
In [14]:
#docked_df.head(n=5)
In [15]:
scores_df = pd.read_csv(scoreResults)
In [16]:
#scores_df.head(n=5)
In [17]:
results_df = pd.concat([docked_df, scores_df], axis=1)
In [18]:
results_df.head(5)
Out[18]:
ID Molecule minimizedAffinity name RFScoreVS_v2
0 ART 15342099 Mol -5.26767 ART 15342099 6.127652
1 ART 15342099 Mol -5.05578 ART 15342099 6.112676
2 ART 15342099 Mol -4.83187 ART 15342099 6.085660
3 ART 15342099 Mol -4.74720 ART 15342099 6.063921
4 ART 15342099 Mol -4.72279 ART 15342099 6.109961

Saving the results

Now combine rescored file with docked structure file and export to "Alldata.sdf.gz" this is a big file so export compressed

In [19]:
combinedResults = 'Alldata.sdf.gz'
In [20]:
PandasTools.WriteSDF(results_df, combinedResults, molColName="Molecule", idName="ID", properties=list(results_df.columns))
In [ ]: