Plots2: Add extra search results for related terms

Created on 14 Feb 2019 · 10 Comments · Source: publiclab/plots2

There might be multiple ways to do this, and there are a few goals:

1. Return search results for non-hyphenated terms when entering hyphenated terms, like purpleair when searching for purple-air (see https://github.com/publiclab/plots2/issues/3677#issuecomment-432246848)
2. The reverse: returning results for hyphenated terms when entering non-hyphenated ones, so adding purple-air results when searching for purpleair (harder because there is no simple algorithmic way to add hyphens in, like maneatingcabbage => man-eating-cabbage or maneating-cabbage)
3. We could also simply add in search results from associated terms based on a key-value listing, so when searching for discussion lists we could return mailing lists if we make a list of these term pairings. This could address the challenge of (2) above, but I'm not sure what the implementation looks like.

Could be related to stemming/lemmatizing as in https://github.com/publiclab/plots2/issues/3666 and solution https://github.com/publiclab/plots2/pull/4533 by @shubhscoder.

Lemmatizer uses these dictionaries: https://github.com/yohasebe/lemmatizer/tree/master/lib/dict and I've opened an issue to ask about supplying custom dictionaries (this may also be helpful for different languages): https://github.com/yohasebe/lemmatizer/issues/5

However we may not want to "reduce" a word to a common core, rather, we may want to add additional search terms -- so for example, we could simply supply a YAML file of pairs like:

purple-air: purpleair
purpleair: purple-air
h2s: hydrogen-sulfide

maybe in /config/initializers/matchwords.yml? - and we could match search terms against this in the transform() function here, adding them in:

https://github.com/publiclab/plots2/blob/e3cf2112469c8be284e541579d2bc02b62d39e7a/app/services/search_criteria.rb#L44-L48

So that a search for h2s would become a search for the query h2s hydrogen-sulfide
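A minimal sketch of that idea, not the actual plots2 code: the pairings are inlined as a YAML string so the example runs on its own, and the names `MATCHWORDS_YAML`, `RELATED_TERMS` and `add_related_terms` are hypothetical. In the app, the pairs would presumably be loaded from the YAML file instead.

```ruby
require "yaml"

# Hypothetical pairings, inlined here in place of a file such as matchwords.yml.
MATCHWORDS_YAML = <<~YAML
  purple-air: purpleair
  purpleair: purple-air
  h2s: hydrogen-sulfide
YAML

RELATED_TERMS = YAML.safe_load(MATCHWORDS_YAML).freeze

# Appends the related term for each word in the query, if one is listed,
# and drops any word that would otherwise appear twice.
def add_related_terms(query, related = RELATED_TERMS)
  words = query.split
  extras = words.map { |word| related[word.downcase] }.compact
  (words + extras).uniq.join(" ")
end

puts add_related_terms("h2s")
puts add_related_terms("purple-air sensor")
```

Because the extra terms are only appended, the original query words still come first and a term with no pairing passes through unchanged.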

@shubhscoder this seems like it wouldn't be too difficult. Do you have any interest in implementing this?

Ruby help wanted search

jywarren


All 10 comments

Just noting that I'd like to start by writing a test for this as we did in the Lemmatizer PR - #4533 - where we start by showing a search result that /doesn't/ return correct results, and fails the test, then we implement the term-adding feature and are able to see the test pass. Thanks!

jywarren on 14 Feb 2019

Yes @jywarren this is interesting and I would like to work on it. Thank you so much!

shubhscoder on 15 Feb 2019

Thank you too!!

jywarren on 15 Feb 2019

Wow, Lemmatizer added the extra feature and released a new version! Cool!

lem = Lemmatizer.new("sample.dict.txt")

jywarren on 16 Feb 2019

🎉2

@jywarren I have thought of the following solution:

We add the following helpers in module TextSearch (https://github.com/publiclab/plots2/blob/master/lib/text_search.rb):
a. non_hypenated_results (this would remove the hyphens from the query and return the result as is)
b. results_with_probable_hyphens (as you suggested, we would have a file of key-value pairs mapping each word to its correctly hyphenated version; this function would thus return a hyphenated version of a run-together query)

These two functions could probably solve issues 1 and 2.
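As a rough illustration of the two helpers, here is a standalone sketch. It is not the code from the PR: the method names `non_hyphenated_query`, `query_with_probable_hyphens` and `query_variants`, and the `HYPHENATED_FORMS` table, are hypothetical, and a plain hash stands in for the key-value file.

```ruby
# Hypothetical lookup from run-together words to their hyphenated forms.
HYPHENATED_FORMS = {
  "purpleair" => "purple-air"
}.freeze

# (a) Removes every hyphen from the query.
def non_hyphenated_query(query)
  query.delete("-")
end

# (b) Swaps each word that has a known hyphenated form for that form;
# words without an entry are left untouched.
def query_with_probable_hyphens(query, forms = HYPHENATED_FORMS)
  query.split.map { |word| forms.fetch(word.downcase, word) }.join(" ")
end

# Collects the distinct variants of a query produced by the two helpers.
def query_variants(query)
  [query, non_hyphenated_query(query), query_with_probable_hyphens(query)].uniq
end

p query_variants("purple-air")
p query_variants("purpleair map")
```

Helper (a) needs no data at all, while (b) only works for words that someone has added to the table, which is why the reverse direction depends on a curated list.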

Secondly, in the search controller (plots2/app/controllers/search_controller.rb) we create an object of search_criteria in set_search_criteria, which is called before any other function, and all the other controller functions use this object.
Instead, we would now create an array of search_criteria objects, each holding a modified version of the query (using text_search), and then run a search on the query of each object.
Then we would simply concatenate the results of all the searches and remove duplicates. Thus we would have the search results for everything expected.

The only thing I am not able to understand is the third issue regarding mailing lists and discussion lists. Could you please clarify that?

If you have some better ideas we can use them as well. If you think that the above solution could probably solve the problem, I will immediately make a PR for this and also write appropriate tests.
Thank you!

shubhscoder on 16 Feb 2019

🚀1

I'm sorry i missed this message! Reading through now. Just also noting that we could ask @publiclab/community-reps to compile a list of search term pairings in a doc, let's say:

http://pad.publiclab.org/p/searches

jywarren on 21 Feb 2019

OK, just to start with your initial 2 ideas; i think i suggest part of this in #4848, but:

> a. non_hypenated_results (This would remove the hyphens from the query and return it as it is)
> b. results_with_probable_hyphens

For a. (removing hyphens) i agree it can be automated. I like doing this as its own filter - 👍 - this is (1) above in my orig. post.

For b. I think we can treat this almost identically to related terms that don't share terms, so like "h2s: hydrogen-sulfide" can use a dict lookup just like "purpleair: purple-air". Sound right? - this is (2) above in my orig. post.

I think i'm saying that (2) from my orig post could be solved the same way as (3) - just with a key-value pair dict as you've done in #4848. So this looks great, and if we modify #4848 to include a "strip hyphens" filter we should be done and can start compiling a list of key/value pairs.

To your question of concatenating searches, I think it gets complicated when we try to paginate. This is complex enough that I think we can simply add the extra related terms onto the end of the search query, instead of trying to run separate searches.

One final thing -- we should allow suppressing the related results with a flag. Let's remember to open a new issue/PR for this, and not solve it now, but we should be able to pass in a new param like search/SEARCHTERM?related=false to disable all the related additions. This may help us debug, and help us see if our additions are causing us trouble. If it's easy to add to #4848, we can go ahead, but it's also something we can do in follow-up.
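A small sketch of how such a flag could gate the additions. It assumes the request parameters arrive as a plain hash with string keys; `related_enabled?`, `build_query` and the `RELATED` table are hypothetical names for illustration, not existing plots2 code.

```ruby
# Hypothetical pairings used only for this example.
RELATED = { "h2s" => "hydrogen-sulfide" }.freeze

# Related terms stay on unless the caller passes related=false.
def related_enabled?(params)
  params.fetch("related", "true").to_s.downcase != "false"
end

# Returns the query to run: the bare term when related additions are
# suppressed, otherwise the term followed by any listed related terms.
def build_query(term, params = {})
  return term unless related_enabled?(params)

  extras = term.split.map { |word| RELATED[word.downcase] }.compact
  ([term] + extras).uniq.join(" ")
end

puts build_query("h2s")
puts build_query("h2s", "related" => "false")
```

Defaulting to enabled keeps existing search URLs working as before, and comparing the two outputs for the same term is one way to check whether the additions are causing trouble.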

Thanks a million for your great work on this! It's really exciting and is a long-awaited feature.

jywarren on 21 Feb 2019

@jywarren,

> For a. (removing hyphens) i agree it can be automated. I like doing this as its own filter - 👍 - this is (1) above in my orig. post.

So do you want it to be included in the same dict file as well? Because we have implemented a strip hyphens filter in text-search that gets rid of all the hyphens in the query.

> For b. I think we can treat this almost identically to related terms that don't share terms, so like "h2s: hydrogen-sulfide" can use a dict lookup just like "purpleair: purple-air". Sound right? - this is (2) above in my orig. post.

I think #4848 implements it the right way? We would just populate the dictionary, and searches like h2s: hydrogen sulphide should work fine, right?

> So this looks great, and if we modify #4848 to include a "strip hyphens" filter we should be done and can start compiling a list of key/value pairs.

I have already included a strip hyphen filter in text_search in #4848. I think you are also talking about a similar filter. Please correct me if I am wrong.

> To your question of concatenating searches, I think it gets complicated when we try to paginate. This is complex enough that I think we can simply add the extra related terms onto the end of the search query, instead of trying to run separate searches.

I had somehow handled the pagination in #4848 after concatenating results, but your idea is much better: it significantly reduces the number of database queries.

> One final thing -- we should allow suppressing the related results with a flag. Let's remember to open a new issue/PR for this, and not solve it now, but we should be able to pass in a new param like search/SEARCHTERM?related=false to disable all the related additions. This may help us debug, and help us see if our additions are causing us trouble. If it's easy to add to #4848, we can go ahead, but it's also something we can do in follow-up.

I will open a follow-up issue and complete that as soon as we finish with #4848. Thank you so much for your help. Also, can you tell me when you will be online today? I'll try to come online at the same time, and we could finish this asap. #4848 is taking time because of our difference in time zones. 🙈

shubhscoder on 22 Feb 2019

> So do you want it be included in the same dict file as well? Because we have implemented a strip hyphens filter in text-search that gets rid of all the hyphens in the query.
> I have already included a strip hyphen filter in text_search in #4848. I think you also are talking about a similar filter. Please correct me if I am wrong.

no no, i think you have it right and that's solved! Sorry!

> I think #4848 implements it the right way? we would just populate the dictionary and searches like h2s: hydrogen sulphide should work fine right?

Yep!

OK, and sorry about the asynchronous nature of the progress here; i think to be honest it's kind of a necessity because (as i just demonstrated) sometimes I have to be away for some days (we had a big event!) and then have to come back. Thank you for sticking with it, though!! Again, amazing work!

jywarren on 28 Feb 2019

> OK, and sorry about the asynchronous nature of the progress here; i think to be honest it's kind of a necessity because (as i just demonstrated) sometimes I have to be away for some days (we had a big event!) and then have to come back. Thank you for sticking with it, though!! Again, amazing work!

No problem @jywarren, you have helped me a lot along the way. Thank you so much!

shubhscoder on 28 Feb 2019
