RDKit: Ability to generate a list of possible SMILES representations for a given molecule

Created on 15 Sep 2018 · 3 Comments · Source: rdkit/rdkit

Description:

Can we add a parameter that bypasses the rule-based SMILES generator, so that we can get a random SMILES for a given starting atom number?
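As a point of reference for this request, here is a minimal sketch of how a random SMILES rooted at a chosen atom might be produced. It assumes an RDKit build whose `Chem.MolToSmiles` accepts the `rootedAtAtom` and `doRandom` keyword arguments; the molecule and the variable names `start_idx` and `rooted` are only illustrative.

```python
from rdkit import Chem

# Example molecule; atom indices follow the order of atoms in this input string.
mol = Chem.MolFromSmiles("CNOPc1ccccc1")
start_idx = 2  # the oxygen atom (C=0, N=1, O=2, P=3, ...)

# Assumption: this RDKit version supports doRandom and rootedAtAtom.
# rootedAtAtom fixes the first atom written; doRandom randomizes the
# rest of the graph traversal.
rooted = {
    Chem.MolToSmiles(mol, canonical=False, doRandom=True, rootedAtAtom=start_idx)
    for _ in range(100)
}

for smi in sorted(rooted):
    print(smi)
```

Every string in `rooted` should begin with the oxygen atom and parse back to the same molecule; only the branch order after the root varies.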


Hackathon idea enhancement

All 3 comments

+1
Collecting a few (maybe) useful links:

  • Your mailing list post @thegodone 😄:

    • https://sourceforge.net/p/rdkit/mailman/message/36382511/

  • Esben's work presented last year:

    • Presentation - https://github.com/rdkit/UGM_2017/blob/master/Presentations/Bjerrum_RDKitUGM_Smiles_Enumeration_for_RNN.pdf

    • Publication: https://arxiv.org/pdf/1703.07076.pdf

    • Code: https://github.com/EBjerrum/SMILES-enumeration/blob/master/SmilesEnumerator.py

I added this to #2059, but here is a simple Python function that randomizes SMILES:

```python
from rdkit import Chem
import random


def randomSmiles(m1):
    # Mark the molecule as carrying per-atom ranking numbers (set below).
    m1.SetProp("_canonicalRankingNumbers", "True")
    idxs = list(range(m1.GetNumAtoms()))
    random.shuffle(idxs)
    # Give every atom a random, unique ranking number.
    for i, v in enumerate(idxs):
        m1.GetAtomWithIdx(i).SetProp("_canonicalRankingNumber", str(v))
    return Chem.MolToSmiles(m1)


m1 = Chem.MolFromSmiles("CNOPc1ccccc1")
# Collect the distinct SMILES produced by 1000 random rankings.
s = set()
for i in range(1000):
    s.add(randomSmiles(m1))

print(s)
```

Generating ALL possible SMILES is much, much harder to do efficiently than it seems at first blush.
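For comparison, a hedged sketch of the same sampling idea using the writer's own randomization rather than hand-set ranks. It assumes an RDKit version in which `Chem.MolToSmiles` accepts `doRandom=True`; the helper name `sample_random_smiles` is hypothetical. As with the function above, this only samples variants and makes no claim to enumerate all of them.

```python
from rdkit import Chem


def sample_random_smiles(mol, n_samples=200):
    """Return the distinct SMILES seen in n_samples random traversals.

    Assumption: the installed RDKit supports the doRandom keyword.
    """
    return {
        Chem.MolToSmiles(mol, canonical=False, doRandom=True)
        for _ in range(n_samples)
    }


mol = Chem.MolFromSmiles("CNOPc1ccccc1")
variants = sample_random_smiles(mol)

# Sanity check: every variant must parse back to the same canonical SMILES.
reference = Chem.MolToSmiles(mol)
same_molecule = all(
    Chem.MolToSmiles(Chem.MolFromSmiles(smi)) == reference for smi in variants
)

print(len(variants), "distinct SMILES sampled; all equivalent:", same_molecule)
```

The number of distinct strings depends on the sample size and on RDKit's internal random generator, so it will differ from run to run.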

@bp-kelley: Randomizing the ranks certainly helps, but it doesn't solve the problem that the traversal algorithm still prefers non-ring bonds to ring bonds; that preference takes precedence over the ranks. Here's an example using your randomSmiles() function:

In [13]: m2 = Chem.MolFromSmiles('CC1C(CC=1)O')

In [14]: set(randomSmiles(m2) for x in range(1000))
Out[14]:
{'C1(C)=CCC1O',
 'C1(O)C(C)=CC1',
 'C1(O)CC=C1C',
 'C1=C(C)C(O)C1',
 'C1C(O)C(C)=C1',
 'C1C=C(C)C1O',
 'CC1=CCC1O',
 'OC1C(C)=CC1',
 'OC1CC=C1C'}
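To make the gap concrete, here is a small check built on the nine strings in the output above. The `candidate` string is a hand-written illustration (not taken from the thread) that visits a ring bond before the methyl substituent; the helper `canon` is likewise just for this sketch.

```python
from rdkit import Chem

# The nine SMILES reported in the output above.
observed = {
    "C1(C)=CCC1O", "C1(O)C(C)=CC1", "C1(O)CC=C1C",
    "C1=C(C)C(O)C1", "C1C(O)C(C)=C1", "C1C=C(C)C1O",
    "CC1=CCC1O", "OC1C(C)=CC1", "OC1CC=C1C",
}

# Hand-written alternative: the ring is traversed inside the branch,
# and the non-ring methyl group comes last.
candidate = "C1(=CCC1O)C"


def canon(smi):
    """Canonical SMILES of a SMILES string, used to compare molecules."""
    return Chem.MolToSmiles(Chem.MolFromSmiles(smi))


reference = canon("CC1C(CC=1)O")

# All observed strings describe the input molecule...
all_equivalent = all(canon(smi) == reference for smi in observed)
# ...and so does the candidate, yet it is not among them.
candidate_is_valid = canon(candidate) == reference
candidate_was_seen = candidate in observed

print(all_equivalent, candidate_is_valid, candidate_was_seen)
```

A valid SMILES for the same molecule that never shows up in 1000 draws is consistent with the point that rank shuffling alone does not reach every traversal order.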