The segment_wiki.py script doesn't filter out wiki formatting properly. Its plain-text output contains rubbish like {| class="wikitable" style="text-align: center — for example in the article Academy Award for Best Production Design, but in others too.
This is contrary to the stated mission of the script.
This comes from the header part of a table (its attributes and styles); it can safely be cut.
A way to find this: grep -o -P ".{0,50}class=.{0,70}" segmentwiki_output.json
Yeah, eyeballing the search results for class= or even just { and | is probably enough -- there's a low chance these would appear in normal plain text.
Adding RE_P16 = re.compile(r'{\|.*class=.*', re.UNICODE) here should solve this problem.
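For reference, here's a minimal sketch of that idea. The name RE_P16 follows the existing RE_P0..RE_P15 pattern constants in gensim's wikicorpus module; the helper function and exact integration point are my own assumptions, not gensim API:

```python
import re

# Proposed pattern: matches a wiki-table opening line such as
# {| class="wikitable" style="text-align: center"
RE_P16 = re.compile(r'{\|.*class=.*', re.UNICODE)

def strip_table_header(text):
    """Remove table-opening lines that leak through the existing filters.

    Hypothetical helper for illustration; in gensim this logic would
    live in the remove_markup/filter_wiki preprocessing chain.
    """
    return RE_P16.sub('', text)

sample = 'Intro text\n{| class="wikitable" style="text-align: center"\nmore text'
cleaned = strip_table_header(sample)
```

Note this only catches the table-opening line itself, not the cell and row markup inside the table.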
@chaitaliSaini can you submit PR with fix & add tests?
Sure.
I don't think the problem can be solved just by checking for {| or class=[^=], because some other markup is also not removed. For example:
uctive votes of no confidence have been attempted, and only one has been successful.\n\n{\nborder=0 cellpadding=\"1\" cellspacing=\"1\"\n\n '''Date'''\n '''Opposition candidate (party)'''\n '''Chancellor (party)
nstable: italics''odd\u2013odd isotopes coloured pink\n\n 3 \nlithium \n 2 \n \u2014 \n \n style=\"background:pink;\"\n\u00a0\n\n 11 \nsodium \n 1 \n \u2014 \n\n\u00a0\n\u00a0\n\n 19 \npot
ructures such as N-substituted aziridines (quaternary ammonium salts are resolvable).\n\n |76px\n | style=\"font-size:200%\" |\u00a0\u21cc\u00a0\n |76px\n |-\n | colspan=3 style=\"font-size:small\" |Inversio
hich was carried out in November 2005 on a representative 1% of the total population.\n\n\n class=\"wikitable sortable\"\n Rank\n Name\n Boys for Every 100 Girls\n\n1 \n Jiangxi \n143\n\n2 \n Henan \n142\n\n3
ingement and an out of court settlement was reached. As Page explained:\n\n\n\n |- \n | rowspan='2' style='background-color: #fff' | ''Presence''\n | \"'''Nobody's Fault but Mine'''\"\n | Page, Plant \n | \"
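The examples above suggest a single regex won't be enough; a line-based cleanup over residual table/cell markup may be needed. A hedged sketch (the function name and the heuristics are my own, not anything in gensim):

```python
import re

# Heuristic: drop lines that consist of leftover wiki-table markup --
# table open/close, row and cell markers, or attribute declarations.
TABLE_RESIDUE = re.compile(
    r'^\s*(\{\||\|\}|\|-|\||!)'  # line starts with a table marker
    r'|(class=|style=|cellpadding=|cellspacing=|colspan=|rowspan=|border=)',
    re.UNICODE)

def drop_table_residue(text):
    """Remove lines that look like leftover table markup.

    Crude by design: a prose line that happens to start with '|' or
    contain 'style=' would also be dropped, which this heuristic
    accepts as a trade-off.
    """
    return '\n'.join(
        line for line in text.split('\n')
        if not TABLE_RESIDUE.search(line))

sample = ('Prose line.\n'
          '|- \n'
          "| rowspan='2' style='background-color: #fff'\n"
          'border=0 cellpadding="1" cellspacing="1"\n'
          'More prose.')
cleaned = drop_table_residue(sample)
```

This would catch the rowspan/cellpadding residue shown above, though a proper fix probably belongs earlier in the preprocessing chain.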
@chaitaliSaini thanks; we probably need to go through all the preprocessing steps and fix the problems with HTML extraction.