There might be multiple ways to do this, and there are a few goals:
1. Return results for purpleair when searching for purple-air (see https://github.com/publiclab/plots2/issues/3677#issuecomment-432246848).
2. Return purple-air results when searching for purpleair (harder because there is no simple algorithmic way to add hyphens in, like maneatingcabbage => man-eating-cabbage or maneating-cabbage).
3. When searching for discussion lists, we could return mailing lists if we make a list of these term pairings. This could address the challenge of (2) above, but I'm not sure what the implementation looks like. It could be related to stemming/lemmatizing as in https://github.com/publiclab/plots2/issues/3666 and the solution in https://github.com/publiclab/plots2/pull/4533 by @shubhscoder.
Lemmatizer uses these dictionaries: https://github.com/yohasebe/lemmatizer/tree/master/lib/dict and I've opened an issue to ask about supplying custom dictionaries (this may also be helpful for different languages): https://github.com/yohasebe/lemmatizer/issues/5
However, we may not want to "reduce" a word to a common core; rather, we may want to add additional search terms. So, for example, we could simply supply a YAML file of pairs like:
purple-air: purpleair
purpleair: purple-air
h2s: hydrogen-sulfide
maybe in /config/initializers/matchwords.yml? - and we could match search terms against this in the transform() function here, adding them in:
So that a search for h2s would become a search for the query h2s hydrogen-sulfide
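To make the idea concrete, here is a minimal sketch of the lookup, not the actual transform() code. The helper name add_related_terms and the constants are made up for illustration, and the YAML is inlined rather than read from the proposed matchwords.yml so the snippet runs on its own:

```ruby
require "yaml"

# Hypothetical contents of the proposed matchwords.yml file.
MATCHWORDS_YAML = <<~YAML
  purple-air: purpleair
  purpleair: purple-air
  h2s: hydrogen-sulfide
YAML

# Parse once into a plain Hash of term => related term.
MATCHWORDS = YAML.safe_load(MATCHWORDS_YAML).freeze

# Hypothetical helper (not part of plots2): append the related term
# for every word of the query that has an entry in the pairs Hash.
def add_related_terms(query, pairs = MATCHWORDS)
  words = query.split
  related = words.map { |word| pairs[word.downcase] }.compact
  (words + related).uniq.join(" ")
end

puts add_related_terms("h2s") # prints "h2s hydrogen-sulfide"
```

Words without an entry pass through unchanged, so ordinary searches would behave as before.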
@shubhscoder this seems like it wouldn't be too difficult. Do you have any interest in implementing this?
Just noting that I'd like to start by writing a test for this as we did in the Lemmatizer PR - #4533 - where we start by showing a search result that /doesn't/ return correct results, and fails the test, then we implement the term-adding feature and are able to see the test pass. Thanks!
Yes @jywarren this is interesting and I would like to work on it. Thank you so much!
Thank you too!!
Wow, Lemmatizer added the extra feature and released a new version! Cool!
lem = Lemmatizer.new("sample.dict.txt")
@jywarren I have thought of the following solution:
These two functions could probably solve issues 1 and 2.
Secondly, in the search controller (plots2/app/controllers/search_controller.rb) we create a search_criteria object in set_search_criteria, which is called before any other function, and all the other controller functions use this object.
Now, instead, we would create an array of search_criteria objects, each holding a modified version of the query (using text_search), and then run a search on the query of each object.
Then we would simply concatenate the results of all the searches and remove duplicates. That way we would have all the expected search results.
The only thing I am not able to understand is the third issue, regarding mailing lists and discussion lists. Could you please clarify that?
If you have better ideas, we can use them as well. If you think the above solution could solve the problem, I will immediately make a PR for this and also write appropriate tests.
Thank you!
I'm sorry i missed this message! Reading through now. Just also noting that we could ask @publiclab/community-reps to compile a list of search term pairings in a doc, let's say:
OK, just to start with your initial 2 ideas; i think i suggest part of this in #4848, but:
a. non_hypenated_results (This would remove the hyphens from the query and return it as it is)
b. results_with_probable_hyphens
For a. (removing hyphens) i agree it can be automated. I like doing this as its own filter - 👍 - this is (1) above in my orig. post.
For b. I think we can treat this almost identically to related terms that don't share terms, so like "h2s: hydrogen-sulfide" can use a dict lookup just like "purpleair: purple-air". Sound right? - this is (2) above in my orig. post.
I think i'm saying that (2) from my orig post could be solved the same way as (3) - just with a key-value pair dict as you've done in #4848. So this looks great, and if we modify #4848 to include a "strip hyphens" filter we should be done and can start compiling a list of key/value pairs.
To your question of concatenating searches, I think it gets complicated when we try to paginate. This is complex enough that I think we can simply add the extra related terms onto the end of the search query, instead of trying to run separate searches.
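As a rough sketch of how the two pieces could combine into a single query string (this is not the code from #4848; strip_hyphens, expand_query, and the PAIRS hash are hypothetical names used only for illustration):

```ruby
# Hypothetical key/value pairings, standing in for the dictionary file.
PAIRS = {
  "purpleair" => "purple-air",
  "h2s" => "hydrogen-sulfide"
}.freeze

# Filter for (1): drop every hyphen from a term.
def strip_hyphens(term)
  term.delete("-")
end

# For each term, keep the original, add its hyphen-free form, and add
# any dictionary pairing; then join everything into one query string
# so only a single search has to be run and paginated.
def expand_query(query, pairs = PAIRS)
  query.split
       .flat_map { |term| [term, strip_hyphens(term), pairs[term]] }
       .compact
       .uniq
       .join(" ")
end

puts expand_query("purple-air") # prints "purple-air purpleair"
puts expand_query("h2s")        # prints "h2s hydrogen-sulfide"
```

Because the result is still one query, pagination would work the same way it does for any other search.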
One final thing -- we should allow suppressing the related results with a flag. Let's remember to open a new issue/PR for this, and not solve it now, but we should be able to pass in a new param like search/SEARCHTERM?related=false to disable all the related additions. This may help us debug, and help us see if our additions are causing us trouble. If it's easy to add to #4848, we can go ahead, but it's also something we can do in follow-up.
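A minimal sketch of what that flag could look like, assuming the parameter arrives as a string the way query-string values normally do. The helper search_query and the RELATED_TERMS hash are hypothetical and only stand in for whatever #4848 ends up using:

```ruby
# Hypothetical pairings; the real list would come from the dictionary file.
RELATED_TERMS = { "h2s" => "hydrogen-sulfide" }.freeze

# Hypothetical helper: params mimics a request's query-string parameters,
# where values arrive as strings ("false", not false).
def search_query(term, params = {})
  # related=false switches off all related-term additions.
  return term if params["related"] == "false"

  extras = term.split.map { |word| RELATED_TERMS[word] }.compact
  ([term] + extras).join(" ")
end

puts search_query("h2s")                           # prints "h2s hydrogen-sulfide"
puts search_query("h2s", { "related" => "false" }) # prints "h2s"
```

Defaulting to expansion when the parameter is absent keeps existing search links working unchanged.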
Thanks a million for your great work on this! It's really exciting and is a long-awaited feature.
@jywarren,
For a. (removing hyphens) i agree it can be automated. I like doing this as its own filter - - this is (1) above in my orig. post.
So do you want it to be included in the same dict file as well? Because we have implemented a strip hyphens filter in text_search that gets rid of all the hyphens in the query.
For b. I think we can treat this almost identically to related terms that don't share terms, so like "h2s: hydrogen-sulfide" can use a dict lookup just like "purpleair: purple-air". Sound right? - this is (2) above in my orig. post.
I think #4848 implements it the right way? We would just populate the dictionary, and searches like h2s: hydrogen sulphide should work fine, right?
So this looks great, and if we modify #4848 to include a "strip hyphens" filter we should be done and can start compiling a list of key/value pairs.
I have already included a strip hyphen filter in text_search in #4848. I think you also are talking about a similar filter. Please correct me if I am wrong.
To your question of concatenating searches, I think it gets complicated when we try to paginate. This is complex enough that I think we can simply add the extra related terms onto the end of the search query, instead of trying to run separate searches.
I had somehow handled the pagination in #4848 after concatenating results, but your idea is much better; it significantly reduces the number of database queries.
One final thing -- we should allow suppressing the related results with a flag. Let's remember to open a new issue/PR for this, and not solve it now, but we should be able to pass in a new param like search/SEARCHTERM?related=false to disable all the related additions. This may help us debug, and help us see if our additions are causing us trouble. If it's easy to add to #4848, we can go ahead, but it's also something we can do in follow-up.
I will open a follow-up issue and complete that as soon as we finish with #4848. Thank you so much for your help. Also, can you tell me when you will be online today? I'll try to come online at the same time, and we could finish this ASAP. #4848 is taking time because of our difference in time zones. 🙈
So do you want it be included in the same dict file as well? Because we have implemented a strip hyphens filter in text-search that gets rid of all the hyphens in the query.
I have already included a strip hyphen filter in text_search in #4848. I think you also are talking about a similar filter. Please correct me if I am wrong.
no no, i think you have it right and that's solved! Sorry!
I think #4848 implements it the right way? we would just populate the dictionary and searches like h2s: hydrogen sulphide should work fine right?
Yep!
OK, and sorry about the asynchronous nature of the progress here; i think to be honest it's kind of a necessity because (as i just demonstrated) sometimes I have to be away for some days (we had a big event!) and then have to come back. Thank you for sticking with it, though!! Again, amazing work!
No problem @jywarren, you have helped me a lot along the way. Thank you so much!