Describe the bug
Some of my synonyms worked and others did not, and I think I have identified a bug.
It seems that words with accents in the synonym table are never matched (and/or the query is de-accented before the synonyms are looked up (?))
To Reproduce
Steps to reproduce the behavior:
Add synonyms:
Expected behavior
A previous issue (#949) mentioned that there is no typo tolerance with synonyms. However, it now seems that the query is somehow de-accented before it is looked up in the synonyms table (?)
Hey @mzperix,
I just looked into the code base and it seems we forgot to unidecode the synonyms (standardize them: remove accents and lowercase words), so words in the query don't match the non-standardized synonyms.
I advise you to do that by hand: remove accents and lowercase the words on both sides, until we fix that.
We will fix that in the next release, thank you for this bug report :)
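For anyone applying the workaround above, here is a minimal sketch of normalizing a synonyms mapping by hand before sending it to the engine. The helper names `normalize` and `normalize_synonyms` are hypothetical, and the sketch assumes that Unicode decomposition plus lowercasing is close enough to what the engine does to the query. Characters that have no decomposition (for example `ø` or `ß`) are left untouched, so the result may still differ from the engine's own standardization.

```python
import unicodedata


def normalize(word):
    """Lowercase a word and strip combining accents (hypothetical helper)."""
    lowered = word.lower()
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", lowered)
    # Drop the combining marks and keep the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_synonyms(synonyms):
    """Normalize both sides (keys and values) of a synonyms mapping."""
    return {
        normalize(word): [normalize(alt) for alt in alternatives]
        for word, alternatives in synonyms.items()
    }


synonyms = {"Café": ["Bistró", "coffee shop"]}
print(normalize_synonyms(synonyms))
# prints {'cafe': ['bistro', 'coffee shop']}
```

If two original keys normalize to the same word, the later one overwrites the earlier one in this sketch, so it is worth checking the input for such collisions.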
Hi, I wanted to work on this as my first issue! I was planning on using this crate; however, I'm unsure of exactly where the search queries are parsed in the code base. Thanks!
As we did with facets (which are lowercased), we will need to store the synonyms in two different places. The list we currently store needs to be normalized (accents removed, lowercased), but we also need to keep the original user input and keep the two lists in sync, so that when the user requests the synonyms, the original versions are returned. I am currently implementing this.
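To illustrate the two-list idea, here is a rough sketch in Python. It is not the engine's actual data structure: the `SynonymStore` class, its method names, and the `normalize` helper are all hypothetical, and the normalization shown (Unicode decomposition plus lowercasing) is only an assumption about what "standardized" means here.

```python
import unicodedata


def normalize(word):
    """Lowercase and strip combining accents (assumed normalization)."""
    decomposed = unicodedata.normalize("NFKD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


class SynonymStore:
    """Hypothetical store that keeps original and normalized synonyms in sync."""

    def __init__(self):
        self._original = {}    # exactly what the user sent, returned on read
        self._normalized = {}  # what the search path looks up

    def set(self, synonyms):
        # Rebuild both maps in one place so they cannot drift apart.
        self._original = {w: list(alts) for w, alts in synonyms.items()}
        self._normalized = {}
        for word, alts in synonyms.items():
            bucket = self._normalized.setdefault(normalize(word), [])
            bucket.extend(normalize(alt) for alt in alts)

    def get(self):
        # The user sees their own spelling, accents and case included.
        return {w: list(alts) for w, alts in self._original.items()}

    def lookup(self, query_word):
        # The query word is normalized the same way as the stored keys.
        return self._normalized.get(normalize(query_word), [])


store = SynonymStore()
store.set({"Café": ["Bistró"]})
print(store.get())           # prints {'Café': ['Bistró']}
print(store.lookup("CAFE"))  # prints ['bistro']
```

Rebuilding both maps inside a single `set` call is the simplest way to keep them consistent. A real implementation would also need to decide what happens when two distinct original keys collapse to the same normalized key (this sketch merges their alternatives).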
In the end, it is not possible to do this in a straightforward manner without impacting users of non-Latin scripts. It requires work on the tokenizer. In the meantime, I suggest registering the synonyms in a lowercased and de-unicoded (accents removed) format.
Re-opening this issue, and linking it to the tokenizer tracking issue (#624)