Gensim: segment_all_articles(include_interlinks=True) errors in image links

Created on 9 Feb 2018 · 7 Comments · Source: RaRe-Technologies/gensim

When using the script,

https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/scripts/segment_wiki.py#L69

with the include_interlinks=True kwarg, I get some links that are very long and do not match any enwiki page titles. After investigating the first one, I found that it is inside an image caption. For example, in the page "Anarchism" one of the links is,

'link_title': 'individualist anarchist and social anarchist thinkers.'
'raw_anchor_text': 'individualist anarchist and social anarchist thinkers.'

In the page XML you can see the wikitext,

== Anarchist schools of thought ==
{{Main|Anarchist schools of thought}}
[[File:Portrait of Pierre Joseph Proudhon 1865.jpg|thumb|upright|Portrait of philosopher Pierre-Joseph Proudhon (1809–1865) by Gustave Courbet. Proudhon was the primary proponent of anarchist mutualism, and influenced many future [[individualist anarchist]] and social anarchist thinkers.]]

It looks like the nested double square brackets are throwing the parser off.

Steps/Code/Corpus to Reproduce

Example:

from gensim.scripts import segment_wiki

segment_wiki.segment_and_write_all_articles(
    in_fname, out_fname, min_article_character=10, include_interlinks=True)

Expected Results

I expect the interlink result to have the anchor text = "individualist anarchist", title="individualist anarchist".

Versions

Linux-4.4.0-1049-aws-x86_64-with-debian-stretch-sid
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0]
NumPy 1.14.0
SciPy 1.0.0
gensim 3.3.0
FAST_VERSION 1

bug difficulty medium

Source

galtay

All 7 comments

Hello @galtay, thanks for the report, this looks like a bug, I agree.

@steremma can you fix this?

menshikh-iv on 9 Feb 2018

Thanks for the quick reply! :) This is using the Wikipedia dump enwiki-20180201-pages-articles-multistream.xml. I've attached the first 1000 lines as head.xml.log (it's XML, but I changed the extension so GitHub would let me attach it) in case it is useful for debugging.

head.xml.log

galtay on 9 Feb 2018

Hello. I will try to reproduce and fix, hopefully within next week. @galtay thanks a lot for submitting!

steremma on 9 Feb 2018

This is actually a bug in the filter_wiki, specifically the regex that removes images or files as it cannot handle nested patterns. In the specific example we have:

[[File:Portrait of Pierre Joseph Proudhon 1865.jpg|thumb|upright|Portrait of philosopher Pierre-Joseph Proudhon (1809–1865) by Gustave Courbet. Proudhon was the primary proponent of anarchist mutualism, and influenced many future [[individualist anarchist]] and social anarchist thinkers.]]

which after removing the image (and only keeping the caption) should return:

Portrait of philosopher Pierre-Joseph Proudhon (1809–1865) by Gustave Courbet. Proudhon was the primary proponent of anarchist mutualism, and influenced many future [[individualist anarchist]] and social anarchist thinkers.

EDITED FOR MORE DETAILS
However, the regex RE_P15 used in remove_file opens at [[File: but closes at the first ]] it finds, which actually belongs to the nested link. As a result it returns:

Portrait of philosopher Pierre-Joseph Proudhon (1809–1865) by Gustave Courbet. Proudhon was the primary proponent of anarchist mutualism, and influenced many future [[individualist anarchist

then the rest of the text is kept, including the ending ]].
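
The truncation can be reproduced with the standard re module alone. The sketch below applies the RE_P15 pattern quoted later in this thread to a shortened version of the caption. The caption step (taking the last pipe-separated field of the match) is an assumption about how remove_file uses the match, not a copy of gensim's code.

```python
import re

# Pattern quoted in this thread as the current RE_P15.
RE_P15 = re.compile(r'\[\[([fF]ile:|[iI]mage)[^]]*(\]\])', re.UNICODE)

# Shortened version of the caption from the "Anarchism" page.
wikitext = (
    "[[File:Portrait.jpg|thumb|upright|Proudhon influenced many future "
    "[[individualist anarchist]] and social anarchist thinkers.]]"
)

match = RE_P15.search(wikitext)
matched = match.group(0)           # stops at the first "]]", i.e. the nested link
leftover = wikitext[match.end():]  # tail of the caption that the pattern never saw

# Assumption: the caption is the last "|"-separated field of the match,
# with the two closing brackets stripped.
caption = matched[:-2].split('|')[-1]
result = wikitext.replace(matched, caption, 1)

print(repr(leftover))  # ' and social anarchist thinkers.]]'
print(result)
# Proudhon influenced many future [[individualist anarchist and social anarchist thinkers.]]
```

The unbalanced [[ left in the result is then closed by the trailing ]], which would explain the overly long link title in the original report.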

Fixing these regexes is a lot harder because:

I didn't write them, and it will take me some time to wrap my head around them.
They are not tested and will be very hard to test.

If someone can handle the regex fix (the file and image regexes in particular), that would be really nice; if not, I can get to it, but it might take me some more time.

steremma on 9 Feb 2018

My personal opinion (but I would prefer a maintainer to make that call) is to use a wikimedia markup parser such as this one. Manually writing and maintaining regexes seems a bit hacky anyway. @menshikh-iv thoughts?

EDIT
Another option would be to define a recursive regular expression in order to capture the outer markup. This requires the third-party regex package, because the standard-library re module does not support recursive patterns.

steremma on 9 Feb 2018

@steremma
Yes, this isn't an easy bug, I agree. About testing: at first glance, you already have everything you need:

Example with nested [[]]
Current output (incorrect)
Expected output (correct)

That's enough to write a concrete test case for the regexp.
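
Such a test case could look roughly like the sketch below. The helpers find_block_end, split_top_level and strip_file_markup are hypothetical, not gensim functions. They locate the end of a [[File:...]] block by counting bracket depth, which is a different technique from the regex under discussion and is used here only so the test has something dependency-free to run against.

```python
import re

# Matches only the opening of a file/image block; the end is found by scanning.
FILE_OPEN = re.compile(r'\[\[(?:[fF]ile|[iI]mage):')


def find_block_end(text, start):
    """Return the index just past the ]] that balances the [[ at `start`, or None."""
    depth, i = 0, start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == '[[':
            depth, i = depth + 1, i + 2
        elif pair == ']]':
            depth, i = depth - 1, i + 2
            if depth == 0:
                return i
        else:
            i += 1
    return None


def split_top_level(body):
    """Split on '|' but ignore pipes inside nested [[...]] links."""
    parts, depth, start, i = [], 0, 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair == '[[':
            depth, i = depth + 1, i + 2
        elif pair == ']]':
            depth, i = depth - 1, i + 2
        else:
            if body[i] == '|' and depth == 0:
                parts.append(body[start:i])
                start = i + 1
            i += 1
    parts.append(body[start:])
    return parts


def strip_file_markup(text):
    """Hypothetical helper: replace each file/image block with its caption."""
    out, pos = [], 0
    while True:
        m = FILE_OPEN.search(text, pos)
        end = find_block_end(text, m.start()) if m else None
        if end is None:
            # No further block, or unbalanced markup: keep the rest untouched.
            out.append(text[pos:])
            return ''.join(out)
        out.append(text[pos:m.start()])
        out.append(split_top_level(text[m.start() + 2:end - 2])[-1])
        pos = end


# Example with nested [[]] (shortened caption from the "Anarchism" page).
EXAMPLE = ("[[File:Portrait.jpg|thumb|upright|Proudhon influenced many future "
           "[[individualist anarchist]] and social anarchist thinkers.]]")
# Current output (incorrect), as described earlier in the thread.
CURRENT_INCORRECT = ("Proudhon influenced many future "
                     "[[individualist anarchist and social anarchist thinkers.]]")
# Expected output (correct).
EXPECTED = ("Proudhon influenced many future "
            "[[individualist anarchist]] and social anarchist thinkers.")


def test_nested_link_in_caption():
    result = strip_file_markup(EXAMPLE)
    assert result == EXPECTED
    assert result != CURRENT_INCORRECT


test_nested_link_in_caption()
```

The same three fixtures (example, current output, expected output) could be reused unchanged against any regex-based fix.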

About an external parser: we definitely don't want to add one more "core" dependency to gensim (though it is potentially simpler to maintain, I agree).

menshikh-iv on 12 Feb 2018

Update before closing

The solution requires using the regex module or another regular expression engine that supports recursion. A fix that I found and manually tested (no guarantee) is to replace:
RE_P15 = re.compile(r'\[\[([fF]ile:|[iI]mage)[^]]*(\]\])', re.UNICODE)

with:
import regex
RE_P15 = regex.compile(r'\[\[(?:File:|Image:)?((?>[^\[\]]+|(?R))*)\]\]', regex.UNICODE)
after manually pip installing the regex package.

However, after discussion with @menshikh-iv, we decided that the bug is not significant enough to justify the additional dependency on regex, so this fix will not be merged into develop. The issue will be closed for now.

steremma on 15 Feb 2018
