Update: this is a long conversation and there are some next steps being broken out. Please continue to use this issue for brainstorming! Thanks :)
Original issue continues below:
The system by which autosuggest chooses and ranks content suggestions is mysterious, and seems like a black box.
Autosuggest displays at most 15 results of assorted content types, but does not provide an overview of Public Lab resources on a topic.
What did you expect to see that you didn't?
I expect to understand what the results mean.
The Search box in the menu bar
It is actually a black box! Full text search is a complex problem and we solve it with the "fulltext" module of MySQL, our database system; some pretty arcane (but thorough) documentation is here: https://dev.mysql.com/doc/refman/5.7/en/fulltext-search.html
It does seem we can tune/adjust it, though. There is, for example, a "natural language" option which attempts to algorithmically determine "relevance" -- https://dev.mysql.com/doc/refman/5.7/en/fulltext-natural-language.html
We use this fulltext feature on this line:
It does look like we could "turn on" natural language mode by making that say:
Revision.where('MATCH(node_revisions.body, node_revisions.title) AGAINST(? IN NATURAL LANGUAGE MODE)', query)
We may also need to then add ordering by relevance -- so, I /think/ that would be:
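# quote the query before interpolating it, rather than concatenating raw user input into the SQL
Revision.select('node_revisions.body, node_revisions.title, MATCH(node_revisions.body, node_revisions.title) AGAINST(' + Revision.connection.quote(query.to_s) + ' IN NATURAL LANGUAGE MODE) AS score')
.where('MATCH(node_revisions.body, node_revisions.title) AGAINST(? IN NATURAL LANGUAGE MODE)', query)
.order('score DESC')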
It might take some testing out.
Would you like to try this out? I have to point out that I do NOT know what will happen. The documentation for "natural language" says:
Every correct word in the collection and in the query is weighted according to its significance in the collection or query. Thus, a word that is present in many documents has a lower weight, because it has lower semantic value in this particular collection. Conversely, if the word is rare, it receives a higher weight. The weights of the words are combined to compute the relevance of the row. This technique works best with large collections.
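To make the quoted idea a bit more concrete, here is a toy Ruby sketch of "rare words weigh more". It is not MySQL's actual relevance formula, just a simplified inverse-document-frequency illustration; the `DOCS` collection and the method names are made up for this example.

```ruby
# Toy illustration only: words found in many documents get a lower weight
# than rare words. This is NOT the formula MySQL uses internally.

# Hypothetical mini-collection standing in for revision bodies.
DOCS = [
  'spectrometer calibration with a lamp',
  'balloon mapping kit',
  'spectrometer workshop in schools',
  'open hour on spectrometer design'
].freeze

# Weight of a word: log(total docs / docs containing the word).
def word_weight(word, docs)
  containing = docs.count { |d| d.split.include?(word) }
  return 0.0 if containing.zero?

  Math.log(docs.size.to_f / containing)
end

# Relevance of a document: sum of the weights of the query words it contains.
def relevance(query, doc, docs)
  words = doc.split
  query.split.select { |w| words.include?(w) }.sum { |w| word_weight(w, docs) }
end

# Documents ordered from most to least relevant for the query.
def rank(query, docs)
  docs.sort_by { |d| -relevance(query, d, docs) }
end
```

With this collection, a search for "spectrometer schools" would put the schools workshop first, because "schools" is rare while "spectrometer" appears in most documents.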
As to the second issue, --
...but do not provide an overview of Public Lab resources on a topic.
I expect to understand what the results mean.
How might we break this down a bit? Do you mean that you'd like to show a mix of types, or that you'd like to show explanatory information about what different types are?
Thanks!
I tested the above query and it does run, although again, I'm not super clear on how it works. But it'd be pretty easy to put it into production if you'd like!
What I'd like to see in the auto-suggest is a list of search terms based on weight (popular, busy pages first). On the results page I would like to see keyword results weighted by relevance (popularity, whether the word in question is included in a tag or a title, etc), and then sorted by type (note, profile, question, comment, etc). I would then like to be able to search within the keyword results (say, I'm interested in spectrometers, but would like to narrow down my search to find examples of how they've been used in schools)
Hi, Bronwen, thanks. Let's break this into separate features:
Thanks! This is super helpful.
And for the second one up there, do you mean not "relevance" as is defined in my comment above about "natural language search" but a definition of popularity such as "likes" or "views"?
I think we'd probably want to create a rubric for relevance that could include likes/views, but also weights results based on KIND of page (a wiki page with the search term in the title might always show up higher on a list than, say, a comment).
One example where we're struggling with kinds of results is a search for "open hour". On our website, this search brings up 15 research notes in the auto suggest, and two research notes on the keyword search, but none of them direct to our Open Hour page. I do think a popularity ranking would help with this, and might be simpler than introducing a semantic search feature, but I can see either offering improvements.
When I perform the same search on Google (without boolean operators), I see a list of results that starts with our main open hour page, followed by items tagged with "openhour" and "open-hour", followed by links to pages for individual open hours. This would seem to be a sensible rubric for page-type sorting (providing that it's still possible to browse or narrow searches for all occurrences of a search term on our site).
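A rubric like the one described above could be sketched roughly as follows. This is only an illustration: the `Result` struct, its fields, and every weight are hypothetical placeholders, not anything that exists in the codebase.

```ruby
# Sketch of a relevance rubric mixing page type, title/tag matches, and
# popularity. All weights and field names here are hypothetical.
Result = Struct.new(:title, :type, :tags, :views, :likes)

TYPE_WEIGHTS = { wiki: 3.0, note: 2.0, question: 2.0, comment: 1.0 }.freeze

def rubric_score(result, term)
  normalized = term.downcase
  # Start from the weight for the kind of page (unknown kinds get 1.0).
  score = TYPE_WEIGHTS.fetch(result.type, 1.0)
  # A title match counts for a lot.
  score += 5.0 if result.title.downcase.include?(normalized)
  # Treat "open hour", "openhour" and "open-hour" as the same tag.
  squashed = normalized.gsub(/[\s-]/, '')
  score += 2.0 if result.tags.any? { |t| t.downcase.gsub(/[\s-]/, '') == squashed }
  # Dampen popularity with a logarithm so busy pages do not swamp everything.
  score + Math.log(1 + result.views + 10 * result.likes)
end

# Results ordered from highest to lowest rubric score.
def rank_by_rubric(results, term)
  results.sort_by { |r| -rubric_score(r, term) }
end
```

The point of the design is that a wiki page with the term in its title can outrank a much more viewed note, while popularity still breaks ties between similar pages.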
Cool - super helpful. I think there's probably a way to do a more complex
ranking (maybe not Google-level PageRank, but something); however, I wonder if
we could take a few proposals, make them testable, and examine the results.
For example it'd be pretty easy to set up views-based or likes-based
ordering, and not much harder to do natural language relevance as I
outlined above. If we made an option to view results for a given search
query in all three, we could see which seems to work better for us.
If that sounds good, we can start those code changes and have something to
look at in a week or so; what do you think of that as a next step? We could
tackle this iteratively and look at more advanced search rubrics as a
follow-up?
Thanks!!
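As a rough sketch of the "compare all three" idea: one set of hits, with the ordering picked by a parameter. The `Hit` struct, its fields, and the `order_by` values are assumptions for illustration; in the real app the relevance score would presumably come from the MATCH ... AGAINST query rather than being stored on the object.

```ruby
# Sketch: one search, three selectable orderings, so the same query can be
# compared side by side. Field names are hypothetical.
Hit = Struct.new(:title, :views, :likes, :relevance)

# Each ordering maps a hit to a sort key (negated so larger values come first).
ORDERINGS = {
  'views'     => ->(hit) { -hit.views },
  'likes'     => ->(hit) { -hit.likes },
  'relevance' => ->(hit) { -hit.relevance }
}.freeze

# Falls back to relevance ordering when order_by is not recognized.
def order_hits(hits, order_by)
  key = ORDERINGS.fetch(order_by) { ORDERINGS['relevance'] }
  hits.sort_by(&key)
end
```

Usage would be something like `order_hits(hits, 'likes')`, with the ordering name taken from a URL parameter.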
Ah, sorry for the late response, but I think that it would be great to try some of these. I think at some point we're going to need the ability to work with boolean operators (whether that's through additional search fields or allowing for more than one word or phrase in the field), but I think any of these options would help get us closer to understanding where things are going haywire in the existing search. Plus-one to trying all three!
Work now ongoing in #2518 -- this will result in:
Soon!
(update: now live on the site!)
Hi, this needs some review and reorganization now that the above searches work -- @bronwen9 and @ebarry -- thanks for your help so far! Some additional steps might be:
Also just cleaning up the lead of this issue a bit or starting a new one with our next steps clearly laid out would be helpful! Thanks!
As the dynamic search work is upcoming (as per your original schedule), I'm not sure if this one is on your radar, @milaaraujo and @stefannibrasil -- what do you think?
We have a few things to finish this week; we are planning to start working on improving the search next week!
@ebarry @bronwen9 @jywarren we started addressing some of your concerns in #3295. Please keep in mind that this PR is mostly on the front-end, but it will help with our planning! :)
I have some notes to share with you, but I need to organize them better first xD
So I left some maybe not super helpful comments on https://github.com/publiclab/plots2/pull/3286 -- and just pulling it back here, I want to highlight that one of the questions we try to answer may need to be:
What is the best default sorting AND default search type for /each result type/ -- acknowledging that the best ordering for nodes might not make sense for profiles.
Make sense?
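One way to picture per-type defaults is a small lookup table. Everything in this sketch is hypothetical: the type names, the sort and mode values, and the `search_options` helper are placeholders to show the shape, not recommendations for what the defaults should be.

```ruby
# Sketch: default sort and search mode per result type, with overrides.
# All names and values below are placeholders.
SEARCH_DEFAULTS = {
  nodes:    { sort: :relevance,       mode: :natural_language },
  profiles: { sort: :recent_activity, mode: :boolean }
}.freeze

# Used for any result type without its own entry.
FALLBACK = { sort: :relevance, mode: :natural_language }.freeze

# Returns the defaults for a result type, with any explicit overrides applied.
def search_options(result_type, overrides = {})
  SEARCH_DEFAULTS.fetch(result_type, FALLBACK).merge(overrides)
end
```

For example, `search_options(:nodes, sort: :views)` would keep the default mode for nodes but switch the ordering.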