This is definitely the week of technical questions on Stack Overflow.
In this case, I wonder whether the original poster hasn't pointed to a real anomaly. Indeed, fingerprint() does not give the same result on "école école ecole" as on "école école école". This is because the normalization of non-ASCII characters is done last.
Example:
1. remove leading and trailing whitespace
" école école école " -> "école école école"
2. change all characters to their lowercase representation
"éCole écoLe école" -> "école école école"
3. remove all punctuation and control characters
"école-école, école" -> "école école école"
4. split the string into whitespace-separated tokens
"école école école" -> ["école", "école", "école"]
5. sort the tokens and remove duplicates
["école", "école", "école"] -> ["école"]
BUT
["école", "école", "ecole"] -> ["ecole", "école"]
6. join the tokens back together
["école"] -> "école"
["ecole", "école"] -> "ecole école"
7. normalize extended western characters to their ASCII representation
"école" -> "ecole"
"ecole école" -> "ecole ecole"
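The steps above can be sketched in Python (this is only an approximation of OpenRefine's Java keyer; the ASCII folding in step 7 is approximated here with NFKD decomposition):

```python
import re
import unicodedata

def fingerprint(s):
    s = s.strip()                   # 1. remove leading/trailing whitespace
    s = s.lower()                   # 2. lowercase
    s = re.sub(r"[^\w\s]", "", s)   # 3. strip punctuation (approximation)
    tokens = s.split()              # 4. split on whitespace
    tokens = sorted(set(tokens))    # 5. sort and remove duplicates
    s = " ".join(tokens)            # 6. rejoin
    # 7. fold extended western characters to ASCII -- done LAST
    return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()

print(fingerprint("école école école"))  # -> "ecole"
print(fingerprint("école école ecole"))  # -> "ecole ecole"
```

Because folding happens after deduplication, "école" and "ecole" survive as distinct tokens and the two inputs end up with different keys.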
Shouldn't operation 7 be done earlier, to ensure more uniform results with n-gram strings containing diacritics?
@ettorerizza Thank you! It is really valuable to have someone dealing with questions on Stack Overflow and reporting the ones related to bugs as issues here.
I agree with your analysis and have proposed a PR accordingly. I think this change is pretty safe. Maybe the original author of this fingerprint method had some weird case in mind, but they did not include it in the unit tests.
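A sketch of what the reordering looks like (this illustrates the idea, not the actual PR; the folding is again approximated with NFKD):

```python
import re
import unicodedata

def fingerprint_reordered(s):
    # ASCII folding moved to the front, before tokenization
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()
    s = s.strip().lower()
    s = re.sub(r"[^\w\s]", "", s)
    return " ".join(sorted(set(s.split())))

# Both inputs now collapse to the same key:
print(fingerprint_reordered("école école école"))  # -> "ecole"
print(fingerprint_reordered("école école ecole"))  # -> "ecole"
```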
Thank you Ettore for migrating my questions to github ;-)
I think I found a few other strange things. Should I open issues directly on GitHub?
M. Saby
@msaby sure! we try to avoid GitHub for simple requests for help ("how do I do this thing in OpenRefine") but reports of bugs (or other "unexpected features") are always welcome!
@msaby But first, my dream would be that you accept my answers on Stack Overflow. I cannot sleep anymore ("but why doesn't he accept???") :p
@wetneb @stefanom was the original author of fingerprint(), and he would tell you that normalization is a tricky thing, even within a single language, because of differences in dialects and writing habits; and then there's history, where things change, i.e., the NEW normal :)
Having said that, it's probably because he was looking for efficiency: sorting and removing duplicate tokens first meant he didn't have to convert all of them to their ASCII equivalents.
@thadguidry there is no difference in efficiency here, it's really equivalent. But it's easy to get these things wrong, we just need to gather test cases to document the choice of order in the operations.
@wetneb I'm just saying that you will find that some folks (including me) might treat this as a false positive in some other languages. I'm fine with the changes since they take care of the 90% common cases anyway. David's tests did not account for @ettorerizza's case, and you have fixed that. Thanks.
@thadguidry I don't understand, what are the 10% cases the changes do not deal with? Do you have an example in mind?
"I'm just saying that you will find that some folks (including me) might treat this as a false positive in some other languages."
@thadguidry That's exactly what I said on SO. Imho, "ecole" and "école école école" are two strings far too different to be merged by a fingerprint, which is supposed to be the most conservative operation among the clustering algos. One might ask whether the deduplication of tokens is really indispensable.
@wetneb I am not a complete language expert so I don't know every use case.
@ettorerizza So the idea of fingerprint() is to generalize language string tokens. Some folks might not want to generalize, which I think is your use case... so you would not want to use fingerprint() then. fingerprint() removes duplicates to perform the generalization and normalization, and I would not call that conservative... not all users want it to happen, and that's fine; we have other algorithms to use. What makes two strings different is opinionated, which is why we have many algorithms available for clustering. But perhaps there's a need for one more? flatFingerprint() or some such, that does not remove duplicates? Dunno, up to you and what you want. But changing the functionality of fingerprint() to fit every use case is not something I want to do, bear that in mind.
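A hypothetical sketch of that flatFingerprint() idea, just to make the difference concrete (the name and behavior are assumptions from this thread, not an existing OpenRefine method):

```python
import re
import unicodedata

def flat_fingerprint(s):
    """Hypothetical variant: same steps, but duplicates are kept."""
    s = s.strip().lower()
    s = re.sub(r"[^\w\s]", "", s)
    tokens = sorted(s.split())      # sort, but do NOT deduplicate
    s = " ".join(tokens)
    return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()

# Repetition still distinguishes the strings:
print(flat_fingerprint("école école école"))  # -> "ecole ecole ecole"
print(flat_fingerprint("ecole"))              # -> "ecole"
```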
There may be room for a "safe clustering" based on fingerprint, something that could be advertised as "click on 'select all' and then on 'merge', your results will be correct in 99% of the cases."
At the same time, new string-similarity algorithms have been developed since 2012, or at least ported to Java. If you wish, I can make a list of those that deserve to be added to OR.
@ettorerizza that would be great!
@ettorerizza it can be 100% of the cases if it only deals with whitespace and nothing else... but then you have cases like mine and other scientists' that deal with the Unicode non-breaking space (U+00A0, decimal 160). Non-breaking means: don't break up the string on this space, because it's significant. I even put a recipe about this in year 1 of OpenRefine: https://github.com/OpenRefine/OpenRefine/wiki/Recipes#question-marks--showing-in-your-data
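The non-breaking-space pitfall is easy to demonstrate (Python here purely for illustration; its default whitespace split treats U+00A0 as whitespace, so a tokenizer that is not NBSP-aware breaks the token apart):

```python
value = "Hot\u00a0Dog"   # a single token joined by a non-breaking space

print(value.split())     # -> ['Hot', 'Dog']   (U+00A0 is split on)
print(value.split(" "))  # -> ['Hot\xa0Dog']   (only the ASCII space splits)
```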
Curious, @ettorerizza: which rules would you consider 'safe' to allow for your 'safe clustering'?
Try to capture exactly what you're thinking about in a new issue so we can track it.
Fingerprint in OpenRefine has always treated "ecole ecole ecole ecole" the same as "ecole", so please don't change it. Some people may rely on that... But if you think it is needed, you can always implement a new, more specific algo.
Normalization/transliteration is tricky, I know that, I work in a library ;-) But this case is not a tricky one; it is a consistency issue. For me, if "é" is supposed to be processed as "e" by fingerprint, that has to be true in every circumstance.