RDKit: GetSubstructMatches gets stuck

Created on 12 Apr 2021 · 4 Comments · Source: rdkit/rdkit

Describe the bug
mol.GetSubstructMatches hangs when the query pattern contains many disconnected fragments.

To Reproduce

from rdkit.Chem import MolFromSmiles, AllChem
smiles = 'S=C(SSC(=S)N(C[34CH:34]([35CH2:35][36CH3:36])[37CH2:37][38CH2:38][39CH2:39][40CH3:40])C[38CH:38]([37CH2:37][34CH2:34][35CH2:35][36CH3:36])[39CH2:39][40CH3:40])N(C[34CH:34]([35CH2:35][36CH3:36])[37CH2:37][38CH2:38][39CH2:39][40CH3:40])C[34CH:34]([35CH2:35][36CH3:36])[37CH2:37][38CH2:38][39CH2:39][40CH3:40]'
smarts = '[CH2;+0:34].[CH2;+0:38].[CH2;+0:38].[CH2;+0:38].[CH3;+0:36].[CH3;+0:36].[CH3;+0:36].[CH3;+0:36].[CH3;+0:40].[CH3;+0:40].[CH3;+0:40].[CH3;+0:40].[CH;+0:34]-[C;H2;D2;+0]-[N;H0;D3;+0](-[C;H2;D2;+0]-[CH;+0:34])-[C;H0;D3;+0](=[S;H0;D1;+0])-[S;H0;D2;+0]-[S;H0;D2;+0]-[C;H0;D3;+0](=[S;H0;D1;+0])-[N;H0;D3;+0](-[C;H2;D2;+0]-[CH;+0:34])-[C;H2;D2;+0]-[CH;+0:38]'
mol = MolFromSmiles(smiles)
fragment = AllChem.MolFromSmarts(smarts)
mol.GetSubstructMatches(fragment, useChirality=True, maxMatches=5) # stuck here

Expected behavior
The function should return at some point; failing that, a timeout option would be very useful here.

Configuration (please complete the following information):

  • RDKit version: 2021.03.1
  • OS: Ubuntu 20.10
  • Python version (if relevant): 3.7.9
  • Are you using conda? Yes
  • If you are using conda, which channel did you install the rdkit from? conda-forge
bug

All 4 comments

@RobinFrcd See here for a similar problem and a suggested workaround:
https://gist.github.com/ptosco/863cb55ace485c6664c21c244b2ca10a
Yes, I think a timeout would be good.

Hi,

We had the same problem and we opted to use a solution based on
https://github.com/pnpnpn/timeout-decorator.

Kind regards,

Christos

Christos Kannas


Yes, I saw your answer on sourceforge, thanks for the workaround!

Leaving the issue open because it would be great to have the timeout in RDKit itself!
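Until something like that exists, one way to get a hard timeout is to run the call in a child process and kill it if it takes too long. The sketch below is only an illustration, not the code from the gist or from timeout-decorator: `run_with_timeout` and `_worker` are hypothetical names, and it assumes a POSIX system where the "fork" start method is available. A separate process is used because a signal-based timeout may not fire while execution stays inside compiled code.

```python
import multiprocessing as mp
import queue


def _worker(result_queue, func, args, kwargs):
    # Runs in the child process: send back either the result or the error text.
    try:
        result_queue.put(("ok", func(*args, **kwargs)))
    except Exception as exc:
        result_queue.put(("error", repr(exc)))


def run_with_timeout(func, *args, timeout=5.0, **kwargs):
    """Call func(*args, **kwargs) in a forked child; raise TimeoutError if too slow."""
    # Assumption: "fork" is available (Linux/macOS). The child inherits func and
    # its arguments, so only the return value has to be picklable.
    ctx = mp.get_context("fork")
    result_queue = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(result_queue, func, args, kwargs))
    proc.start()
    try:
        # Read the result before joining so a large result cannot block the child.
        status, payload = result_queue.get(timeout=timeout)
    except queue.Empty:
        proc.terminate()  # hard-kill the stuck search
        proc.join()
        raise TimeoutError(f"call did not finish within {timeout} seconds")
    proc.join()
    if status == "error":
        raise RuntimeError(payload)
    return payload


# Hypothetical usage with mol and fragment from the reproduction snippet:
# matches = run_with_timeout(mol.GetSubstructMatches, fragment,
#                            useChirality=True, maxMatches=5, timeout=10.0)
```

The price is one process start per call, so this suits occasional expensive queries better than tight loops over many small ones.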

Your SMARTS pattern is far from optimal: it lists the disconnected single atoms first and the large connected pattern last, so the search space is enormous.

If you rearrange it so that the largest fragments come first, the search returns almost instantly, at least for this pattern:

import time
# Put the larger fragments first (string length is used as a rough proxy for size).
# Reuses smarts, mol and AllChem from the reproduction snippet in the issue.
smarts2 = ".".join(sorted(smarts.split("."), key=len, reverse=True))
fragment = AllChem.MolFromSmarts(smarts2)
t1 = time.time()
mol.GetSubstructMatches(fragment, useChirality=False, maxMatches=5)
t2 = time.time()
print("Found matches in", t2 - t1, "seconds")

This doesn't rule out adding a timeout, and the reordering could certainly become an optimization inside the search engine itself. Still, when running SMARTS patterns you always want the largest fragments, or the ones most likely to fail, to come first in the string.
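String length is only a rough proxy for fragment size, since a single atom with many constraints can be longer than a short chain. A minimal sketch of the same reordering with a pluggable size measure follows. `largest_first` and `count_bracket_atoms` are hypothetical helpers, and the default measure assumes every atom is written in square brackets with no recursive SMARTS, which holds for the pattern in this issue but not in general. Splitting on "." also assumes there is no component-level grouping.

```python
def count_bracket_atoms(fragment: str) -> int:
    # Assumption: one "[" per atom (fully bracketed atoms, no recursive SMARTS).
    return fragment.count("[")


def largest_first(smarts: str, size=count_bracket_atoms) -> str:
    """Reorder dot-separated SMARTS fragments so the largest come first."""
    fragments = smarts.split(".")
    # sorted() is stable, so fragments of equal size keep their original order.
    return ".".join(sorted(fragments, key=size, reverse=True))


# With RDKit available, the real atom count could be used as the measure instead:
# from rdkit import Chem
# largest_first(smarts, size=lambda s: Chem.MolFromSmarts(s).GetNumAtoms())

print(largest_first("[CH3].[CH2]-[CH2]-[CH3].[CH2]"))
```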
