OpenRefine: Menu and function for removing diacritics

Created on 18 Jan 2020 · 6 Comments · Source: OpenRefine/OpenRefine

Is your feature request related to a problem or area of OpenRefine? Please describe.

It could be useful to have a menu item and a GREL function to remove diacritics from strings.

Example:

"école" -> "ecole"

enhancement grel localization

All 6 comments

Isn't that already available as a fingerprint function? If not it could potentially be added as such since it is possible to call clustering functions from GREL.

I was thinking of something less aggressive than fingerprint: "L'école et les ecoles" -> "L'ecole et les ecoles"

@msaby see if the following helps you out:

  1. https://github.com/OpenRefine/OpenRefine/wiki/Recipes#9-encoding-issues
  2. https://github.com/OpenRefine/OpenRefine/wiki/Extending-Jython-with-pypi-modules#how-to-replace-diacritic-characters

This seems fairly easy to do now if we simply use stripAccents from Apache Commons Lang's StringUtils.
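As a rough illustration of the kind of transformation being discussed, here is a minimal Python 3 sketch. The name `strip_accents` is hypothetical and is not an OpenRefine or Apache API. It assumes that decomposing the text to Unicode NFD and dropping combining marks is close enough to the desired behaviour; the Apache implementation may handle some characters differently.

```python
import unicodedata


def strip_accents(text):
    """Remove diacritics by decomposing to NFD and dropping combining marks.

    Hypothetical helper for illustration only; letters that have no
    canonical decomposition (for example "ø") pass through unchanged.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

With this sketch, `strip_accents("L'école et les ecoles")` gives `"L'ecole et les ecoles"`: case, punctuation, and word order are left alone, which is the "less aggressive than fingerprint" behaviour asked for above.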

For labeling simplicity (translations), I suggest giving the new GREL function the same name, stripAccents().

I'd like to see a more general approach to text normalization than just removing diacritics. We also need to deal with normalizing the various composed vs decomposed forms. Other related issues include #409 and #650.

I'm removing the "good second issue" label until we have the design nailed down. One possible approach would be to create a normalize function with different "strengths" of normalization to apply (decomposition, diacritic removal, case folding, etc).
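To make the "strengths" idea concrete, here is one possible shape for such a function, sketched in Python 3. Everything in it is an assumption for discussion: the name `normalize`, the numeric `strength` levels, and the order of the steps are not part of any agreed design.

```python
import unicodedata


def normalize(text, strength=1):
    """Apply progressively stronger normalization (hypothetical levels).

    strength 1: canonical decomposition (NFD), so composed and
                decomposed forms of the same text compare equal
    strength 2: level 1 plus removal of combining diacritical marks
    strength 3: level 2 plus case folding
    """
    result = unicodedata.normalize("NFD", text)
    if strength >= 2:
        # Drop combining marks left behind by the decomposition
        result = "".join(ch for ch in result if not unicodedata.combining(ch))
    if strength >= 3:
        # str.casefold() is Python 3 only
        result = result.casefold()
    return result
```

Under these assumed levels, `normalize("École", 2)` gives `"Ecole"` and `normalize("École", 3)` gives `"ecole"`, while level 1 only unifies the composed and decomposed spellings of the same string.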

@tfmorris Sounds good Tom. I would always trust you for expertise with localization and international support anyways :-)
