When using the script,
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/scripts/segment_wiki.py#L69
with the include_interlinks=True kwarg, I get some links that are very long and do not match any enwiki page titles. After investigating the first one, I found that it is inside an image caption. For example, in the page "Anarchism" one of the links is,
'link_title': 'individualist anarchist and social anarchist thinkers.'
'raw_anchor_text': 'individualist anarchist and social anarchist thinkers.'
In the page xml you can see the wikitext,
== Anarchist schools of thought ==
{{Main|Anarchist schools of thought}}
[[File:Portrait of Pierre Joseph Proudhon 1865.jpg|thumb|upright|Portrait of philosopher Pierre-Joseph Proudhon (1809–1865) by Gustave Courbet. Proudhon was the primary proponent of anarchist mutualism, and influenced many future [[individualist anarchist]] and social anarchist thinkers.]]
It looks like the nested double square brackets are throwing the parser off.
Example:
segment_wiki.segment_and_write_all_articles(
    in_fname, out_fname, min_article_character=10, include_interlinks=True)
I expect the interlink result to have the anchor text = "individualist anarchist", title="individualist anarchist".
Linux-4.4.0-1049-aws-x86_64-with-debian-stretch-sid
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0]
NumPy 1.14.0
SciPy 1.0.0
gensim 3.3.0
FAST_VERSION 1
Hello @galtay, thanks for the report, this looks like a bug, I agree.
@steremma can you fix this?
Thanks for the quick reply! :) This is using the Wikipedia dump enwiki-20180201-pages-articles-multistream.xml. I've attached the first 1000 lines as head.xml.log for debugging, in case it is useful (it's XML, but I changed the extension so GitHub would let me attach it).
Hello. I will try to reproduce and fix, hopefully within next week. @galtay thanks a lot for submitting!
This is actually a bug in filter_wiki, specifically in the regex that removes images and files, which cannot handle nested patterns. In this example we have:
[[File:Portrait of Pierre Joseph Proudhon 1865.jpg|thumb|upright|Portrait of philosopher Pierre-Joseph Proudhon (1809–1865) by Gustave Courbet. Proudhon was the primary proponent of anarchist mutualism, and influenced many future [[individualist anarchist]] and social anarchist thinkers.]]
which after removing the image (and only keeping the caption) should return:
Portrait of philosopher Pierre-Joseph Proudhon (1809–1865) by Gustave Courbet. Proudhon was the primary proponent of anarchist mutualism, and influenced many future [[individualist anarchist]] and social anarchist thinkers.
EDITED FOR MORE DETAILS
However, the regex RE_P15 used in remove_file opens at [[File: but closes at the first ]] it finds, because its [^]]* part cannot match past a ] character. Here that first ]] actually belongs to the nested link. As a result it returns:
Portrait of philosopher Pierre-Joseph Proudhon (1809–1865) by Gustave Courbet. Proudhon was the primary proponent of anarchist mutualism, and influenced many future [[individualist anarchist
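The rest of the text is then kept, including the closing ]], so the interlink extraction later sees [[individualist anarchist and social anarchist thinkers.]] as a single link.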
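The behaviour described above can be reproduced with the standard library alone. The following is a minimal sketch, not gensim's actual code: remove_file_sketch is a hypothetical stand-in that assumes remove_file keeps the text after the last | of each match, the file name and caption are shortened for readability, and the link-extraction regex at the end is likewise only an assumption used for illustration.

```python
import re

# Pattern quoted in this thread as the current RE_P15.
RE_P15 = re.compile(r'\[\[([fF]ile:|[iI]mage)[^]]*(\]\])', re.UNICODE)


def remove_file_sketch(text):
    """Hypothetical stand-in for remove_file: keep only the caption of each match."""
    for match in RE_P15.finditer(text):
        whole = match.group(0)
        # Drop the trailing "]]" and keep whatever follows the last "|".
        caption = whole[:-2].split('|')[-1]
        text = text.replace(whole, caption, 1)
    return text


# Shortened version of the "Anarchism" example (file name abbreviated).
wikitext = (
    "[[File:Proudhon.jpg|thumb|upright|Proudhon influenced many future "
    "[[individualist anarchist]] and social anarchist thinkers.]]"
)

result = remove_file_sketch(wikitext)
print(result)

# Assumed, simplified interlink extraction: everything between [[ and ]].
links = re.findall(r'\[\[(.*?)\]\]', result)
print(links)
```

The match ends at the ]] of the nested link, so the inner [[ survives while the outer ]] is left behind, and the two are then paired up as one long bogus link.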
Fixing these regexes is a lot harder because:
If someone can handle the regex fix (the file and image regexes in particular), that would be really nice. If not, I can get to it, but it might take me some more time.
My personal opinion (but I would prefer a maintainer to make that call) is to use a MediaWiki markup parser such as this one. Manually writing and maintaining regexes seems a bit hacky anyway. @menshikh-iv thoughts?
EDIT
Another option would be to define a recursive regular expression in order to capture the outer markup. This requires the third-party regex package, because the standard-library re module does not support recursive patterns.
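For comparison, capturing the outer markup does not strictly need a regex engine with recursion: the matching ]] can be found by counting bracket depth. The sketch below is only an illustration of that idea under stated assumptions, not a proposed patch. strip_file_links and FILE_OPEN are hypothetical names, only the File: and Image: prefixes are handled, and the caption is assumed to be the text after the last top-level |.

```python
import re

# Opening of a file/image link. Assumption: only these two prefixes matter.
FILE_OPEN = re.compile(r'\[\[(?:[fF]ile|[iI]mage):')


def strip_file_links(text):
    """Replace each [[File:...]] or [[Image:...]] with its caption, honouring nested [[...]]."""
    out = []
    pos = 0
    while True:
        m = FILE_OPEN.search(text, pos)
        if m is None:
            out.append(text[pos:])
            break
        out.append(text[pos:m.start()])

        depth = 1          # we are inside the outer [[ ... ]]
        i = m.end()
        last_pipe = None   # index of the last "|" seen at depth 1
        while i < len(text) and depth > 0:
            if text.startswith('[[', i):
                depth += 1
                i += 2
            elif text.startswith(']]', i):
                depth -= 1
                i += 2
            else:
                if text[i] == '|' and depth == 1:
                    last_pipe = i
                i += 1

        if depth != 0:
            # Unbalanced markup: leave the remainder untouched.
            out.append(text[m.start():])
            break

        caption_start = m.end() if last_pipe is None else last_pipe + 1
        out.append(text[caption_start:i - 2])  # i - 2 skips the closing "]]"
        pos = i
    return ''.join(out)


sample = (
    "[[File:Proudhon.jpg|thumb|upright|Proudhon influenced many future "
    "[[individualist anarchist]] and social anarchist thinkers.]] Next sentence."
)
cleaned = strip_file_links(sample)
print(cleaned)
```

Because pipes are only recorded at depth 1, a nested link of the form [[target|anchor]] inside the caption does not cut the caption short. The trade-off is a hand-written scanner instead of a single pattern.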
@steremma
Yes, this isn't an easy bug, I agree. About testing: you already have everything you need (at first glance)
[[]] That's enough for writing a concrete test case for the regexp.
About an external parser: we definitely don't want to add one more "core" dependency to gensim (though it would potentially be simpler to maintain, I agree).
Update before closing
The solution requires using the regex module or another regular expression engine that supports recursion. A fix that I found and manually tested (no guarantee) is to replace:
RE_P15 = re.compile(r'\[\[([fF]ile:|[iI]mage)[^]]*(\]\])', re.UNICODE)
with:
import regex
RE_P15 = regex.compile(r'\[\[(?:File:|Image:)?((?>[^\[\]]+|(?R))*)\]\]', regex.UNICODE)
after manually pip installing the regex package.
However, after discussion with @menshikh-iv, we decided that the bug is not significant enough to justify the additional dependency on regex, so this fix will not be merged into develop. This issue will therefore be closed for now.