I found two bad regex patterns in 'gensim/corpora/wikicorpus.py':
RE_P7 = re.compile(r'\n\[\[[iI]mage(.*?)(\|.*?)*\|(.*?)\]\]', re.UNICODE)
"""Keep description of images."""
RE_P8 = re.compile(r'\n\[\[[fF]ile(.*?)(\|.*?)*\|(.*?)\]\]', re.UNICODE)
"""Keep description of files."""
These patterns can cause a 'ReDoS' (regular expression denial of service) security problem. Proof-of-concept code below:
import re
# Vulnerable pattern copied from wikicorpus.py
RE_P8 = re.compile(r'\n\[\[[fF]ile(.*?)(\|.*?)*\|(.*?)\]\]', re.UNICODE)
# Crafted worst case: many '|a' segments with no closing ']]' forces catastrophic backtracking
re.findall(RE_P8, "\n[[file" + "|a" * 1000 + "|]")
Running the above code keeps CPU utilization at 100% for a very long time.
For more detail about 'ReDoS', please see the OWASP documentation.
Because I did not see the 'RE_P7' and 'RE_P8' patterns used anywhere, and I am not familiar with the gensim API, I cannot judge the impact of this security problem.
Thanks for reporting! The DoS pattern is unlikely to occur in Wikipedia (or "deny" any service if it does occur), but I'll leave this ticket open in case someone wants to investigate and improve the regex.
@leveryd Did you find these by running the code on a wikidump, or via some other method?
Good question. I assumed it was some automated code checker / linter, because the user says "i did not see anywhere use 'RE_P7' and 'RE_P8' pattern".
If that linter is easy to use, I'd like to plug it into my editor too. Maybe even into CI. Sounds useful.
I use this tool mainly to check whether a pattern is bad or not.
I wrote some wrappers around it, such as a job to "find all constant strings", but my code is ugly right now. After I make it better, I will share it.
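For a rough idea of what such a wrapper might look like (this is my own sketch, not the reporter's actual tool): a "find all constant strings" pass over Python source can be built on the standard `ast` module, collecting literal patterns passed to `re.compile` so each one can then be fed to a ReDoS checker:

```python
import ast

def find_constant_patterns(source):
    """Collect constant string arguments passed to re.compile() calls.

    Minimal sketch: only handles a literal string given directly as the
    first argument, e.g. re.compile(r'...'); dynamically built patterns
    are ignored.
    """
    patterns = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "compile"
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "re"
                and node.args
                and isinstance(node.args[0], ast.Constant)
                and isinstance(node.args[0].value, str)):
            patterns.append(node.args[0].value)
    return patterns
```

Running this over a file like wikicorpus.py would surface RE_P7 and RE_P8 without ever executing the module, which matches how a purely static checker could flag unused patterns.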
If a linter finds code that's totally not used, pruning such code is usually good.
But finding some code (like regexes) that might perform poorly on crafted worst-case inputs, never yet seen in the real world, strikes me as such a low priority – for wikicorpus & most other parts of Gensim – that such reports should be discouraged as moot issues. Why? Because these code paths process trusted Wikipedia dumps, not attacker-controlled input, so the worst-case inputs don't arise in practice.
So, unless/until a problem like this is triggered by real data "in the wild", I'd assign this concern so low priority that this issue could be closed as "no fix planned".
More generally on wikitext processing options: as I may have mentioned before, offline Wikipedia projects like 'Kiwix' use a 'ZIM' file format that's pretty simple and already includes the HTML versions of Wikipedia articles, rather than Wikitext. Since those dumps have already expanded the Wikitext mess, with its many keywords/symbols, and stripping/transforming HTML tags is more tenable than parsing Wikitext, I'm surprised more projects don't prefer those dumps as their input. Adding a corpus iterator for that format could give Gensim+Wikipedia projects better training input (while sidestepping nasty Wikitext regex issues entirely).
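For illustration, here is a minimal sketch (my own, not an existing Gensim API) of the "stripping HTML tags" step, using only the standard library's `html.parser`. A real corpus iterator over ZIM files would additionally need a ZIM reader library, which isn't shown here:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text content, skipping script/style contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def html_to_text(html):
    """Strip tags from an HTML fragment and normalize whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join("".join(parser.parts).split())
```

No regexes with nested quantifiers anywhere, which is exactly the appeal: the tag structure is parsed by a real parser instead of pattern-matched.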
I see this report as near-zero priority too. "I'll leave this ticket open in case someone wants to investigate" was my attempt at a more diplomatic phrasing :) I didn't even label this as a "bug".
But I also wouldn't be opposed to someone cleaning up Wikicorpus, regexps and all. It's a valid and useful task. This is the first time I've heard of "Kiwix".