Openrefine: P304 as reference breaks the Wikidata schema

Created on 16 Oct 2020  路  9Comments  路  Source: OpenRefine/OpenRefine

Hi:

Here using OR v3.4 in Linux. Uploading data to Wikidata.

I found using the P304 brakes the schema: the UI doesn't provide any clear feedback and when looking at the logs I got this:

21:18:01.030 [                  command] Exception caught (15ms)
java.util.regex.PatternSyntaxException: Unknown inline modifier near index 6
(\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?)([-鈥揮(\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?))?(, ?(\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?)([-鈥揮(\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?))?)*
      ^
    at java.util.regex.Pattern.error(Pattern.java:1969)
    at java.util.regex.Pattern.group0(Pattern.java:2908)
    at java.util.regex.Pattern.sequence(Pattern.java:2065)
    at java.util.regex.Pattern.expr(Pattern.java:2010)
    at java.util.regex.Pattern.group0(Pattern.java:2919)
    at java.util.regex.Pattern.sequence(Pattern.java:2065)
    at java.util.regex.Pattern.expr(Pattern.java:2010)
    at java.util.regex.Pattern.compile(Pattern.java:1702)
    at java.util.regex.Pattern.<init>(Pattern.java:1352)
    at java.util.regex.Pattern.compile(Pattern.java:1028)
    at org.openrefine.wikidata.qa.scrutinizers.FormatScrutinizer.getPattern(FormatScrutinizer.java:68)
    at org.openrefine.wikidata.qa.scrutinizers.FormatScrutinizer.scrutinize(FormatScrutinizer.java:80)
    at org.openrefine.wikidata.qa.scrutinizers.SnakScrutinizer.scrutinizeSnakSet(SnakScrutinizer.java:71)
    at org.openrefine.wikidata.qa.scrutinizers.SnakScrutinizer.scrutinize(SnakScrutinizer.java:64)
    at org.openrefine.wikidata.qa.scrutinizers.StatementScrutinizer.scrutinize(StatementScrutinizer.java:36)
    at org.openrefine.wikidata.qa.EditInspector.inspect(EditInspector.java:107)
    at org.openrefine.wikidata.commands.PreviewWikibaseSchemaCommand.doPost(PreviewWikibaseSchemaCommand.java:92)
    at com.google.refine.RefineServlet.service(RefineServlet.java:187)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1166)
    at org.mortbay.servlet.UserAgentFilter.doFilter(UserAgentFilter.java:81)
    at org.mortbay.servlet.GzipFilter.doFilter(GzipFilter.java:132)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:938)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:755)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

I'm not sure if this happens with any value for P304. In my case the precise value was _69-91_

An example of the reference I pretend is, exactly, the one for P31 at Q100407249.

Without using P304 everything seems to work as expected.

Hope this helps.

bug vulnerability wikibase

Most helpful comment

Something was changed in the support of non-capturing groups (in the past the colon : was non always needed after (?, but now it is).

This may have been caused by a more recent change in the regexp engine used by Wikidata (now compatible with PCRE 7.7 or higher, and supporting the "PCRE v2" syntax, where the (? can be followed by more flags, notably since it supports now named subroutines and other newer features of PCRE v2).

I commented, and fixed that in the P304 talk page.

All 9 comments

Looking at the stack trace that's a clear bug: since these regular expressions are user-supplied, we should make sure that they are ignored if they are invalid. They should be ignored rather than reported to the user since the user is not responsible for maintaining Wikidata constraints.

The regular expression currently present on P304 is

(\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?)([-鈥揮(\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?))?(, ?(\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?)([-鈥揮(\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?))?)*

which is incorrect in all regex variants known to https://regex101.com/

Note that we should also evaluate whether there are security vulnerabilities here: there might be options in Pattern.compile to reject regular expressions whose compilation/evaluation would be too expensive (DoS-like attack).

I have the same issue though I was using page(s) as a qualifier. The preview tab states "Your schema is incomplete, fix it to see the preview."

@VDK sorry about this! You should be able to temporarily fix this by fixing the regular expression in the property constraint of "page(s)".

Just for the record:

The bug is reported at Wikidata P304 discussion page.

Something was changed in the support of non-capturing groups (in the past the colon : was non always needed after (?, but now it is).

This may have been caused by a more recent change in the regexp engine used by Wikidata (now compatible with PCRE 7.7 or higher, and supporting the "PCRE v2" syntax, where the (? can be followed by more flags, notably since it supports now named subroutines and other newer features of PCRE v2).

I commented, and fixed that in the P304 talk page.

Thanks! Let's leave this issue open since we can improve how such cases are handled in OpenRefine.

Note: the bot that updates the constraint report has still not run. So the statistics are still not updated. So yues we can leave it open for now, until we get fresher statistics. For now we can still check using the "live" SPARQL request.

when I was pinged, I supposed this was caause by [-鈥揮 with the en-dash in the character class (I was wrong: so this test did not change the error; I then realized that the syntax (? ... ) which was accepted in the past for non*capturing groups is no longer valid without the colon (?: ... ). i think this occured after the upgrade of PCRE to support extended PCRE (aka "PCRE v2", even if this was supported in PCRE 7.7+ when it added more flags after (?, and even required at least one flag character in [:P'<] or longer "named" flags like (?(DEFINE) ... ): it was then impossible to delimit arbitrary regexps after the opening (? for groups, and the explicit colon became required, or could otherwise generate sometimes validation errors, while in the past it was not necessary and the "non-capturing group" semantic was implied by default.).

Anyway, this means that the regexp validator is still not correct in Wikidata: the error is detected a long time after, when the regexp will start being used to check actual data: even if the regexp is now invalid, it is still accepted "as is", and it seems that the validator jut merely checks that parentheses, braces or brackets are correctly paired, it does not check operators with more complex structure and their flags.

@verdy-p thanks for the analysis! That might be worth reporting to the Wikidata team then.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

thadguidry picture thadguidry  路  3Comments

wetneb picture wetneb  路  3Comments

tfmorris picture tfmorris  路  3Comments

dantexier picture dantexier  路  4Comments

anchardo picture anchardo  路  3Comments