OpenRefine: Unicode support for regex

Created on 19 Oct 2018 · 3 Comments · Source: OpenRefine/OpenRefine

Hello,

As a user, I want regex facets to match accented characters inside words. Being able to apply the regex modifier 'u' would simplify my life. Strangely, it is possible to work around this limitation by using Jython when creating columns, but that remains cumbersome because you have to code each transformation rather than use a simple regex modifier.

As a solution, I wonder whether the u modifier could be enabled by default for projects based on UTF files. For the moment, this is very annoying.

Best regards,

bug help wanted localization logic

All 3 comments

Could you please be specific about the issue or the steps which can reproduce the problem?

Sure:
1- Go to https://regex101.com/
2- In the regex field, type: \w+ (word matching)
3- In the text field, type: "Éabc"
4- Use the modifier selector to the right of the regex field to enable U for Unicode.

With the modifier, the whole word Éabc is matched; without it, only abc is matched. The same thing happens inside OpenRefine with a UTF-8/UTF-16 file. It is not a bug but a matter of how the regex is interpreted. In fact, the problem is far more complex than this simple example. If you want a complete explanation of Unicode regexes, this document is exceptional: https://www.regular-expressions.info/unicode.html. I invite you to read the part about the dot operator to understand why a U modifier could be useful.

I found a way to get good results with the \p{} syntax, but sometimes it would be far simpler to use the modifier. By the way, this has been a very interesting problem in my own experience.

Thank you for your interest.
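
For readers wondering what the \p{} workaround mentioned above can look like, here is a minimal sketch in plain Java (the language OpenRefine is written in). It assumes a word means "Unicode letters, digits or underscore"; the class name `UnicodePropertyDemo` and the helper `replaceWords` are made up for illustration and are not part of OpenRefine. Non-ASCII characters are written as `\u` escapes so the source file encoding does not matter.

```java
import java.util.regex.Pattern;

public class UnicodePropertyDemo {
    // \p{L} matches any Unicode letter and \p{N} any Unicode number.
    // Together with "_" they approximate a Unicode-aware \w
    // without needing any regex flag.
    static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}_]+");

    // Hypothetical helper: replaces every "word" in the input.
    static String replaceWords(String input, String replacement) {
        return WORD.matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String text = "\u00C9abc \u00C9abc"; // "Éabc Éabc"
        System.out.println(replaceWords(text, "x")); // prints "x x"
    }
}
```

The drawback is the one described above: every \w, \b or \d in a pattern has to be rewritten by hand, whereas a modifier would change them all at once.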

I'd be tempted to say that we should enable UNICODE_CHARACTER_CLASS and UNICODE_CASE (for use in conjunction with the case-insensitive flag) by default, but in the meantime, you can use the embedded flag (?U) to enable Unicode character classes in your regex:

replace("Éabc Éabc", /\w+/,"x") gives "Éx Éx"

replace("Éabc Éabc", /(?U)\w+/,"x") gives "x x"
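
As a rough illustration of what those two flags do at the `java.util.regex` level, the sketch below compiles the same patterns with and without them. It is only an assumption about how such a default could be wired in, not OpenRefine's actual code; the class `UnicodeFlagsDemo` and its helpers are hypothetical names. The booleans are printed instead of the strings to avoid console encoding issues.

```java
import java.util.regex.Pattern;

public class UnicodeFlagsDemo {
    // Hypothetical helper: replace every \w+ run with "x" under the given flags.
    static String replaceWords(String input, int flags) {
        return Pattern.compile("\\w+", flags).matcher(input).replaceAll("x");
    }

    // Hypothetical helper: case-insensitive full match, plus any extra flags.
    static boolean matchesIgnoringCase(String regex, String input, int extraFlags) {
        int flags = Pattern.CASE_INSENSITIVE | extraFlags;
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String text = "\u00C9abc \u00C9abc"; // "Éabc Éabc"

        // Default: \w is ASCII-only, so the accented letter is left alone.
        System.out.println(replaceWords(text, 0).equals("\u00C9x \u00C9x")); // true

        // With UNICODE_CHARACTER_CLASS, \w also covers accented letters.
        System.out.println(
            replaceWords(text, Pattern.UNICODE_CHARACTER_CLASS).equals("x x")); // true

        // CASE_INSENSITIVE alone only folds ASCII letters: "é" does not match "É".
        System.out.println(matchesIgnoringCase("\u00E9abc", "\u00C9ABC", 0)); // false

        // Adding UNICODE_CASE makes case folding Unicode-aware.
        System.out.println(
            matchesIgnoringCase("\u00E9abc", "\u00C9ABC", Pattern.UNICODE_CASE)); // true
    }
}
```

The embedded (?U) used in the GREL examples above is the inline form of `Pattern.UNICODE_CHARACTER_CLASS`, which is why it can be used today without any change to OpenRefine.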
