Gensim: SECURITY: bad regex patterns in 'gensim/corpora/wikicorpus.py' may cause a 'ReDoS' security problem.

Created on 19 Jan 2021 · 6 comments · Source: RaRe-Technologies/gensim

Problem description

I found two bad regex patterns in 'gensim/corpora/wikicorpus.py':

RE_P7 = re.compile(r'\n\[\[[iI]mage(.*?)(\|.*?)*\|(.*?)\]\]', re.UNICODE)
"""Keep description of images."""
RE_P8 = re.compile(r'\n\[\[[fF]ile(.*?)(\|.*?)*\|(.*?)\]\]', re.UNICODE)
"""Keep description of files."""

These patterns can cause a 'ReDoS' security problem; proof-of-concept code is below:

import re
# The nested quantifier (\|.*?)* backtracks catastrophically on this input.
RE_P8 = re.compile(r'\n\[\[[fF]ile(.*?)(\|.*?)*\|(.*?)\]\]', re.UNICODE)
re.findall(RE_P8, "\n[[file" + "|a" * 1000 + "|]")

Running the above code pins CPU utilization at 100% for a very long time.

For more detail about 'ReDoS', please see OWASP.
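For what it's worth, one possible rewrite is sketched below. It is not gensim's actual fix, and `RE_P8_SAFE` is an illustrative name; it assumes the leading `|`-separated parameters never contain `|` or `]`, which makes the repetition unambiguous and removes the exponential backtracking.

```python
import re

# Sketch of a safer rewrite (an assumption, not gensim's actual fix):
# anchor each repetition on a literal '|' and exclude '|' and ']' from
# the repeated body, so the engine cannot partition the input into
# exponentially many candidate matches. Group 3 still captures the
# trailing description, as in the original RE_P8.
RE_P8_SAFE = re.compile(
    r'\n\[\[[fF]ile([^|\]]*)((?:\|[^|\]]*)*)\|(.*?)\]\]', re.UNICODE)

# The adversarial input from the report now finishes almost instantly.
print(re.findall(RE_P8_SAFE, "\n[[file" + "|a" * 1000 + "|]"))  # -> []

# Normal markup still yields the description in the last group.
print(re.findall(RE_P8_SAFE, "\n[[File:X.png|thumb|200px|A caption]]"))
# -> [(':X.png', '|thumb|200px', 'A caption')]
```

One behavioral difference worth noting: the second group here captures all middle parameters at once, whereas the original `(\|.*?)*` only retained the last repetition.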

Effect of this security problem

Because I did not see the 'RE_P7' and 'RE_P8' patterns used anywhere, and I am not familiar with the gensim API, I cannot judge the impact of this security problem.

Labels: difficulty easy, impact LOW, performance, reach LOW

All 6 comments

Thanks for reporting! The DoS pattern is unlikely to occur in Wikipedia (or "deny" any service if it does occur), but I'll leave this ticket open in case someone wants to investigate and improve the regex.

@leveryd Did you find these by running the code on a wikidump, or via some other method?

Good question. I assumed it was some automated code checker / linter, because the user says "i did not see anywhere use 'RE_P7' and 'RE_P8' pattern".

If that linter is easy to use, I'd like to plug it into my editor too. Maybe even into CI. Sounds useful.

I use this tool mainly to check whether a pattern is bad or not.

I wrote some wrappers around it, such as a job to "find all constant strings". But my code is ugly right now; after I clean it up, I will share it.

If a linter finds code that's totally not used, pruning such code is usually good.

But finding some code (like regexes) that might perform poorly on crafted worst-case inputs, never yet seen in the real world, strikes me as such a low priority – for wikicorpus & most other parts of Gensim – that such reports should be discouraged as moot issues. Why?

  • the backlog of issues people are actually hitting is already plentiful, and any effort devoted to triaging hypothetical problems steals effort from tangible concerns
  • devoting any attention, or new coding effort, to purely theoretical issues creates a risk of new bugs for potentially no benefit to any real user ever.
  • specifically here: WikiText is already a steaming mess for Wikipedia, with lots of inefficient regex parsing & special casing in the PHP/etc processing that others do. Any text that winds up in a public dump is fairly likely to have already been cleaned of worst-case inputs. And even if it isn't, it's not that hard to quickly detect/correct the exact problem inputs, via examination of a spinning process or incremental testing. And, batch processing of wikitext dumps is highly unlikely to be in public-service loops vulnerable to malicious outside input.

So, unless/until a problem like this is triggered by real data "in the wild", I'd assign this concern so low priority that this issue could be closed as "no fix planned".

More generally on wikitext processing options: as I may have mentioned before, offline Wikipedia projects like 'Kiwix' use a 'ZIM' file format that's pretty simple and already includes the HTML versions of Wikipedia articles, rather than Wikitext. As those dumps have already expanded the Wikitext mess (with all its keywords/symbols) into HTML, and stripping/transforming HTML tags is more tenable than parsing Wikitext, I'm surprised more projects don't prefer those dumps as their Wikipedia input. Adding a corpus-iterator for that format could give Gensim+Wikipedia projects better training input (while sidestepping nasty Wikitext regex issues entirely).
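For illustration, the HTML-stripping half of such a corpus iterator can be sketched with only the stdlib `html.parser`; an actual ZIM iterator would additionally pull each article's HTML from the archive via a reader such as python-libzim (assumed here, not shown), and `html_to_tokens` is an illustrative name, not an existing gensim function.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>/<style> bodies."""
    def __init__(self):
        super().__init__()
        self._skip = 0      # depth inside script/style elements
        self.chunks = []    # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def html_to_tokens(html):
    """Turn one article's HTML into tokens, with no Wikitext parsing at all."""
    parser = _TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks).split()

# A ZIM reader (e.g. python-libzim, not shown) would supply this HTML.
print(html_to_tokens("<p>Hello <b>world</b></p><script>var x = 1;</script>"))
# -> ['Hello', 'world']
```

A real corpus iterator would wrap this in a generator yielding one token list per article, mirroring how WikiCorpus streams documents.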

I see this report as near-zero priority too. "I'll leave this ticket open in case someone wants to investigate" was my attempt at a more diplomatic phrasing :) I didn't even label this as a "bug".

But I also wouldn't be opposed to someone cleaning up Wikicorpus, regexps and all. It's a valid and useful task. This is the first time I've heard of "Kiwix".
