This notebook demonstates how to get the structures and data from the master worksheet, then convert the SMILES to molecule objects that allow some simple manipulisations and visualisations. SMILES (Simplified Molecular Input Line Entry System) is a line notation (a typographical method using printable characters) for entering and representing molecules and reactions. https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
#First get data from Google doc
!wget -O example.tsv https://docs.google.com/spreadsheets/d/1fAxwae9W--0BLCLU1KIGdXcVGvxiLiO7VxjKoEO2XHE/export?format=tsv '--no-check-certificate'
#The data is downloaded to a file called example.tsv in the same folder as the notebook, in tab separated format
Import the required python modules and then import the example.tsv file into a Pandas dataframe called datafile
from rdkit.Chem import AllChem as Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import PandasTools
from rdkit.Chem import Draw
import pandas as pd
#datafile = pd.read_table('./export?format=tsv')
datafile = pd.read_csv('example.tsv', sep = '\t')
#Allow inline images
%matplotlib inline
#View first five rows
datafile.head(5)
#Find how many rows
len(datafile.index)
At the moment the molecule structures are represented by a SMILES string, we can convert the SMILES string to an RDKit molecular object and then display
#Remember the first row is row zero.
smiles = datafile['SMILES_parent'].loc[2]
#convert SMILES string to a RDKit molecular object
mol = Chem.MolFromSmiles(smiles)
mol
We can see the different datatypes in the dataframe
datafile.dtypes
We can now convert the SMILES string to a RDKit molecular object for every row in the dataframe
PandasTools.AddMoleculeColumnToFrame(datafile,'SMILES_parent','Molecule',includeFingerprints=True)
>>> print([str(x) for x in datafile.columns])
datafile.dtypes
If we view the dataframe the molecule object has been added to the last column. It would be better if the structure was more readily visible. So we change the column order.
datafile.head(3)
#display the current order
cols = list(datafile.columns.values)
cols
#change the column order
datafile = datafile[['OSA_ID',
'Molecule',
'Target',
'SMILES_parent',
'Name',
'InChiKey',
'PubChemCID',
'VendorID',
'MW_parent',
'Fragalysis_ref',
'Source',
'Status',
'Wiki link']]
datafile.head(3)
If we want to view all structures we can diaplay them as a grid
PandasTools.FrameToGridImage(datafile,column= 'Molecule', molsPerRow=4,subImgSize=(150,150),legendsCol="OSA_ID")
Now calculate a variety of properties using RDKit, adding them to the end of the dataframe. you can choose which properties to add here.
# Some of the availble descriptors are described here http://rdkit.org/docs/source/rdkit.Chem.rdMolDescriptors.html
from rdkit.Chem import rdMolDescriptors
hbdlist = [] #hydrogen bond donors
hbalist = [] #hydrogen bond acceptors
tpsalist = [] #Total polar surface area
mwtlist = [] #Exact molecular weight
logPlist = [] #Crippen LogP
mrlist = [] #Crippen MR
for mol in datafile['Molecule']:
hbd = rdMolDescriptors.CalcNumHBD(mol)
hbdlist.append(hbd)
hba = rdMolDescriptors.CalcNumHBA(mol)
hbalist.append(hba)
TPSA = rdMolDescriptors.CalcTPSA(mol)
tpsalist.append(TPSA)
mwt = rdMolDescriptors.CalcExactMolWt(mol)
mwtlist.append(mwt)
crippen = rdMolDescriptors.CalcCrippenDescriptors(mol) #returns a 2-tuple with the Wildman-Crippen logp,mr values
logPlist.append(crippen[0])#first is logP
mrlist.append(crippen[1])#second is mr
#mrlist
We now add each of the properties to the dataframe
datafile['HBD']=hbdlist
datafile['HBA']=hbalist
datafile['TPSA']=tpsalist
datafile['MWt']=mwtlist
datafile['LogP']=logPlist
datafile['MR']=mrlist
datafile.head(3)
We can also add a molecular properties
datafile['NumHeavyAtoms']=datafile.apply(lambda x: x['Molecule'].GetNumHeavyAtoms(), axis=1)
datafile.head(3)
We can using seaborn (http://seaborn.pydata.org/index.html) a Python visualization library based on matplotlib to generate a variety of plots.
import seaborn as sns
myTPSA = datafile['TPSA']
myMWt = datafile['MWt']
#Scatter plot
sns.regplot(myMWt, myTPSA, fit_reg=False)
#bar chart
sns.distplot(myMWt, kde=False, color='red', bins =10)