Configuration:
Description:
When creating a mol from a smiles of a molecule that has charged atoms, rdkit is not able to add the charges and gives an error of Explicit valence being greater than permitted. I coded a function to explicit add charges in this case, but then I also got another problem. When the molecule has fused aromatic rings kekulize function fails to appropriate assign double and single bonds order to the fused rings.
As and example including both cases I'll use the isomeric smiles of the PDB ligand ZAS.
The problem doesn't occur if the smiles explicit indicates the charged atoms and the aromatic bonds order: 'N=[N+]=NCC1OC(N2C=NC3=C2N=CN=C3N)C(O)C1O'. But that's not how the smiles are provided in PDB (isomeric smiles).
I understand that not assigning a charge to atoms that violates their valence can be a choice for not creating wrong mols, but I think the kekulize error of fused aromatic rings is a bug. And both errors persists if the smiles explicit indicates the charge but not the aromatic bonds order: 'N=[N+]=NCC1OC(N2:C:N:C3:C(N):N:C:N:C:3:2)C(O)C1O'. In this last case the charge of the nitrogen should prevent the valence error for this atom, but it does not. When fixing the valence of this atom by adding a positive charge another error of valence occur for a carbon atom (probably due to wrong aromatic bonds order).
import rdkit.Chem as Chem
smiles = "Nc1ncnc2n(cnc12)[C@@H]3O[C@H](CN=[N]=N)[C@@H](O)[C@H]3O"
Chem.MolFromSmiles(smiles, sanitize=True)
# > ERROR: Explicit valence for atom # 15 N, 4, is greater than permitted
# use the following function to assign charges to the not sanitized molecule and then try to sanitize it again
def fix_valence_charge(mol):
for atom in mol.GetAtoms():
explicitValence = 0
for bond in atom.GetBonds():
explicitValence = explicitValence + bond.GetBondTypeAsDouble()
# print(atom.GetSymbol(), atom.GetDegree(), atom.GetFormalCharge(), explicitValence)
atom.SetFormalCharge(0)
if atom.GetSymbol() == 'N' and explicitValence > 3:
atom.SetFormalCharge(int(explicitValence - 3))
elif atom.GetSymbol() == 'O' and explicitValence > 2:
atom.SetFormalCharge(int(explicitValence - 2))
elif atom.GetSymbol() == 'C' and explicitValence > 4:
atom.SetFormalCharge(int(explicitValence - 4))
return Chem.SanitizeMol(mol)
fix_valence_charge(Chem.MolFromSmiles(smiles, sanitize=False))
# > ERROR: Can't kekulize mol. Unkekulized atoms: 1 2 3 4 5 6 7 8 9
Thanks
@crisfbazz let's start with the core problem: The SMILES provided by the PDB for that ligand is incorrect. This is not the RDKit being picky, it's the fact that the SMILES does not correspond to a molecule which can exist as a stable entity, does not match some of the additional information provided on that page (which says that the molecule should have a net positive charge), and does not correspond to what is actually in the publication of 3V7W (https://doi.org/10.1016/j.str.2012.03.024, take a look at Figure 2). Even if you end up "fixing" the molecule by adding a positive charge on the four-valent N to get you to this Nc1ncnc2n(cnc12)[C@@H]3O[C@H](CN=[N+]=N)[C@@H](O)[C@H]3O, you still are not going to have the molecule which the authors of that paper actually claim they crystallized, which is Nc1ncnc2n(cnc12)[C@@H]3O[C@H](CN=[N+]=[N-])[C@@H](O)[C@H]3O.
Having said all of that, if you still want to try to automatically "correct" structures like this, you could try a method like the following:
def add_nitrogen_charges(smiles):
m = Chem.MolFromSmiles(smiles,sanitize=False)
m.UpdatePropertyCache(strict=False)
ps = Chem.DetectChemistryProblems(m)
if not ps:
Chem.SanitizeMol(m)
return m
for p in ps:
if p.GetType()=='AtomValenceException':
at = m.GetAtomWithIdx(p.GetAtomIdx())
if at.GetAtomicNum()==7 and at.GetFormalCharge()==0 and at.GetExplicitValence()==4:
at.SetFormalCharge(1)
Chem.SanitizeMol(m)
return m
Applying that to the SMILES from ZAS gives:
In [15]: m = add_nitrogen_charges('Nc1ncnc2n(cnc12)[C@@H]3O[C@H](CN=[N]=N)[C@@H](O)[C@H]3O')
[08:02:30] Explicit valence for atom # 15 N, 4, is greater than permitted
In [16]: Chem.MolToSmiles(m)
Out[16]: 'N=[N+]=NC[C@H]1O[C@@H](n2cnc3c(N)ncnc32)[C@H](O)[C@@H]1O'
If there are other atom types you want to "correct" hopefully it's reasonably obvious how to do that from the example code.
@greglandrum thanks for investigating the problem and for this more robust code!
I'm having problems with the smiles deposited at PDB, some are really wrong like this one that is missing the positive charge. The link that I provided is for the ligand summary, it also appears in other structure (6FRP) so its not just related to this publication. I think the molecule reported by the publication of structure 3V7W (Nc1ncnc2n(cnc12)[C@@H]3O[C@H](CN=[N+]=[N-])[C@@H](O)[C@H]3O) will receive an hydrogen in the negative charged nitrogen and what's really in the crystal is the molecule of the following smiles Nc1ncnc2n(cnc12)[C@@H]3O[C@H](CN=[N+]=N)[C@@H](O)[C@H]3O, which should be the smiles for the ligand ZAS. Well I'm not that good in chemistry but this makes more sense to me.
The problem kekulezing fused aromatic rings happens with other smiles as well, I'll investigate the cases to see if the problem isn't in the smiles from PDB like in this one and if not I'll share with you other examples. Thanks again!
Just complementing, in the ligand interaction page of structure 3V7W the negative charged nitrogen is making an ionic bond with the protein residue. So the hydrogen in this smiles is implicit, but maybe could be present in other structures. In structure 6FRP the nitrogen is ionic bonding with a water molecule.