Gensim: segment_wiki doesn't extract only plain text

Created on 12 Nov 2017  ·  7 Comments  ·  Source: RaRe-Technologies/gensim

The segment_wiki.py script doesn't filter formatting properly. Its plain-text output contains markup residue such as {| class=\"wikitable\" style=\"text-align: center, for example in the article Academy Award for Best Production Design, and in others as well.

This is contrary to the stated mission of the script.

bug difficulty medium

All 7 comments

This comes from the table header and its style attributes; it can safely be cut.
One way to find such cases: grep -o -P ".{0,50}class=.{0,70}" segmentwiki_output.json
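For anyone without a PCRE-enabled grep, the same search can be sketched in Python. This is only an illustration of the grep command above: the function name find_class_residue and the sample line are made up for the example, and the file is treated as raw text lines rather than parsed JSON.

```python
import re

# Same pattern as the grep command: up to 50 characters of context before
# "class=" and up to 70 characters after it.
CONTEXT_RE = re.compile(r".{0,50}class=.{0,70}")


def find_class_residue(lines):
    """Yield every snippet around 'class=' found in an iterable of text lines.

    Hypothetical helper, not part of gensim.
    """
    for line in lines:
        for match in CONTEXT_RE.finditer(line):
            yield match.group(0)


# Made-up sample line imitating the escaped JSON output quoted in the issue.
sample = ['{"section_texts": ["Winners\\n\\n{| class=\\"wikitable\\" style=\\"text-align: center"]}']
snippets = list(find_class_residue(sample))
for snippet in snippets:
    print(snippet)

# To scan a real output file instead:
# with open("segmentwiki_output.json", encoding="utf-8") as f:
#     for snippet in find_class_residue(f):
#         print(snippet)
```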

Yeah, eyeballing the search results for class=, or even just { and |, is probably enough -- there's a low chance these would appear in normal plain text.

Adding RE_P16 = re.compile(r'{\|.*class=.*', re.UNICODE) here will solve this problem.
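A quick way to see what this pattern does is to apply it to a small string. The sketch below uses the pattern exactly as proposed; the sample text is made up (modelled on the residue quoted in the issue), and removing matches with sub() is an assumption about how the pattern would be used.

```python
import re

# Pattern exactly as proposed in the comment above.
RE_P16 = re.compile(r'{\|.*class=.*', re.UNICODE)

# Made-up sample: a table header line between two ordinary lines.
text = 'Winners\n{| class="wikitable" style="text-align: center"\nNext line'

# '.' does not match a newline, so only the table header line is removed.
cleaned = RE_P16.sub('', text)
print(repr(cleaned))
```

Because the match stops at the end of the line, only the line that starts the table is affected; the rows that follow it are left as they are.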

@chaitaliSaini can you submit a PR with the fix & add tests?

Sure.

I don't think the problem can be solved just by checking for {| or class=[^=], because some other markup is also not removed, e.g.:

uctive votes of no confidence have been attempted, and only one has been successful.\n\n{\nborder=0 cellpadding=\"1\" cellspacing=\"1\"\n\n '''Date'''\n '''Opposition candidate (party)'''\n '''Chancellor (party)
nstable: italics''odd\u2013odd isotopes coloured pink\n\n  3 \nlithium      \n 2    \n \u2014 \n \n style=\"background:pink;\"\n\u00a0\n\n 11 \nsodium       \n 1     \n \u2014 \n\n\u00a0\n\u00a0\n\n 19 \npot



ructures such as N-substituted aziridines (quaternary ammonium salts are resolvable).\n\n |76px\n | style=\"font-size:200%\" |\u00a0\u21cc\u00a0\n |76px\n |-\n | colspan=3 style=\"font-size:small\" |Inversio



hich was carried out in November 2005 on a representative 1% of the total population.\n\n\n class=\"wikitable sortable\"\n Rank\n Name\n Boys for Every 100 Girls\n\n1 \n Jiangxi \n143\n\n2 \n Henan \n142\n\n3



ingement and an out of court settlement was reached. As Page explained:\n\n\n\n |- \n | rowspan='2' style='background-color: #fff' | ''Presence''\n | \"'''Nobody's Fault but Mine'''\"\n | Page, Plant \n | \"
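The point can be checked directly against the RE_P16 pattern suggested earlier in the thread. The sketch below is illustrative only: the strings are shortened, unescaped excerpts of the samples above, and comparing sub() output with the input is just one assumed way of testing whether the pattern touches them.

```python
import re

# Pattern as proposed earlier in this thread.
RE_P16 = re.compile(r'{\|.*class=.*', re.UNICODE)

# Shortened, unescaped excerpts of the samples quoted above.
samples = [
    '\n\n{\nborder=0 cellpadding="1" cellspacing="1"\n\n \'\'\'Date\'\'\'',
    ' | style="font-size:200%" |\u00a0\u21cc\u00a0\n |76px\n |-',
    '\n class="wikitable sortable"\n Rank\n Name',
    " |- \n | rowspan='2' style='background-color: #fff' | ''Presence''",
]

# True where the pattern leaves the excerpt untouched.
unchanged = [RE_P16.sub('', s) == s for s in samples]
print(unchanged)
```

None of these excerpts contains {| followed by class= on the same line, so the pattern has nothing to match and the residue stays in the output.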

@chaitaliSaini thanks, we probably need to go through all the preprocessing steps and fix the problem with HTML extraction.
