galaxy 🚀 - improve tool panel search

utvalg_999 019

hexylena on 8 Aug 2016

@erasche since recently Galaxy now searches tool IDs

screenshot 2016-08-08 13 49 51

I think improvements might be made regarding the interchangeability of ' ', '_', '-'

martenson on 8 Aug 2016

@martenson that's great! +1 for allowing users to substitute in ' ' for the _. I know I look for my tools by ID sometimes and fail to find them.

hexylena on 8 Aug 2016

+1 to that, requiring users to know _'s is unfortunate. Is the input not tokenized and matched? (I guess not, if the broken string doesn't match?)
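
A minimal sketch of the tokenization idea raised here, assuming that ' ', '_' and '-' should all act as separators. The function names and the tool ID `fastq_quality_filter` are hypothetical and only illustrate the matching rule; this is not how Galaxy's search is implemented.

```python
import re


def normalize_query(query):
    """Lowercase the text and split it on spaces, underscores and hyphens."""
    # ' ', '_' and '-' are treated as interchangeable separators.
    return [token for token in re.split(r"[\s_\-]+", query.lower()) if token]


def matches(query, tool_id):
    """Return True if every query token is one of the tool ID's tokens."""
    id_tokens = set(normalize_query(tool_id))
    return all(token in id_tokens for token in normalize_query(query))


# Hypothetical tool ID, used only for illustration.
print(matches("fastq quality-filter", "fastq_quality_filter"))
```

With this rule a user no longer has to know whether the ID uses underscores.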

dannon on 8 Aug 2016

@martenson RFC: Things I would like to see indexed and available to search, along with my feelings on their boosts:

tool name (5)
tool id (4)
tool help text (2)
tool parameter helps (0.3)
tool input data formats (1)
tool output data formats (0.6)

I just find myself frustrated when I cannot find the tool I want or the results are very limited because of what is searched upon. Of course, I do not know what the state of 16.07/dev is, have not gotten there yet.

hexylena on 8 Aug 2016

@erasche we have these boosts on Main and these are the defaults

# tool_name_boost = 9
# tool_section_boost = 3
# tool_description_boost = 2
# tool_label_boost = 1
# tool_stub_boost = 5
# tool_help_boost = 0.5
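
To make the effect of these defaults concrete, here is a toy scoring sketch. It assumes a deliberately simplified model in which a field's boost is added once per query term found in that field; Galaxy's real scoring goes through Whoosh and is more involved. The `boosted_score` helper is hypothetical, and the example tool reuses the "UCSC Main" / "table browser" name and description that appear later in this thread.

```python
# Default boosts quoted above.
DEFAULT_BOOSTS = {
    "name": 9.0,
    "section": 3.0,
    "description": 2.0,
    "label": 1.0,
    "stub": 5.0,
    "help": 0.5,
}


def boosted_score(query, tool, boosts=DEFAULT_BOOSTS):
    """Add a field's boost once for each query term found in that field."""
    terms = query.lower().split()
    score = 0.0
    for field, boost in boosts.items():
        text = tool.get(field, "").lower()
        score += boost * sum(1 for term in terms if term in text)
    return score


tool = {"name": "UCSC Main", "description": "table browser"}
print(boosted_score("ucsc main", tool))  # both terms hit the name field
print(boosted_score("table", tool))      # one term hits the description
```

Under this simplified model a name hit always outweighs a description or help hit, which is the behaviour people in the thread expect from the boosts.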

martenson on 8 Aug 2016

@martenson hey that's most of the things I need. In that case, then it would be nice to have more space to display this and where the search actually "hit". Apologies, have not been following along with this stuff closely enough to make informed comments.

hexylena on 8 Aug 2016

we have the 'hit' information but I did not figure out a good place to display it - related to the limited canvas

martenson on 8 Aug 2016

I would add in that it would be nice to have the underlying tool (binary) name be part of the search, if wrapped under a slightly different tool name or short label of some type.

Related utilized/dependent binaries would be included in this. (lower "boost" probably)

jennaj on 8 Aug 2016

xref https://github.com/galaxyproject/galaxy/issues/1084

martenson on 11 Oct 2018

Digging through Main usage metrics, any improvement to tool panel search should be well worth it.

screenshot 2018-10-11 11 19 36

martenson on 11 Oct 2018

from @erasche

https://usegalaxy.eu/api/tools?q=compute+an+expression - 0 results
https://usegalaxy.eu/api/tools?q=compute+expression - numerous results + correct one
https://usegalaxy.eu/api/tools?q=compute+an - numerous results + correct one
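
A small sketch for reproducing comparisons like the three above from a script. It builds the same query URLs with the standard library. `search_tools` assumes the endpoint returns a JSON list of tool IDs, as the responses quoted elsewhere in this thread suggest; the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://usegalaxy.eu/api/tools"  # endpoint quoted in the thread


def search_url(query, base_url=BASE_URL):
    """Build the search URL; spaces are encoded as '+'."""
    return base_url + "?" + urllib.parse.urlencode({"q": query})


def search_tools(query, base_url=BASE_URL):
    """Fetch and parse the response (assumed to be a JSON list of tool IDs)."""
    with urllib.request.urlopen(search_url(query, base_url), timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the URLs; call search_tools(query) to hit the server.
    for query in ("compute an expression", "compute expression", "compute an"):
        print(query, "->", search_url(query))
```

Calling `len(search_tools(query))` for each query would give the result counts to compare.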

martenson on 5 Dec 2018

Another concrete issue:

https://usegalaxy.eu/api/tools?q=peakachu returns 2 results, neither are displayed on frontend. client issue.

hexylena on 18 Jan 2019

@erasche I cannot reproduce
screenshot 2019-01-18 11 02 38

martenson on 18 Jan 2019

Firefox on linux, cannot repro in chrome.

hexylena on 18 Jan 2019

two subsequent searches for peakachu yielded different results for me in firefox, the first does not show in client, the second does

["toolshed.g2.bx.psu.edu/repos/rnateam/peakachu/peakachu/0.1.0.1", "toolshed.g2.bx.psu.edu/repos/rnateam/peakachu/peakachu/0.1.0.0"]

["toolshed.g2.bx.psu.edu/repos/rnateam/peakachu/peakachu/0.1.0.1", "toolshed.g2.bx.psu.edu/repos/rnateam/peakachu/peakachu/0.1.0.0", "toolshed.g2.bx.psu.edu/repos/rnateam/peakachu/peakachu/0.1.0.2"]
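
The difference between the two responses above can be checked mechanically. A minimal sketch, using the two lists exactly as quoted; `diff_results` is a hypothetical helper.

```python
def diff_results(first, second):
    """Return (only_in_first, only_in_second), keeping response order."""
    first_set, second_set = set(first), set(second)
    only_first = [tool for tool in first if tool not in second_set]
    only_second = [tool for tool in second if tool not in first_set]
    return only_first, only_second


PREFIX = "toolshed.g2.bx.psu.edu/repos/rnateam/peakachu/peakachu/"
first = [PREFIX + "0.1.0.1", PREFIX + "0.1.0.0"]
second = [PREFIX + "0.1.0.1", PREFIX + "0.1.0.0", PREFIX + "0.1.0.2"]

only_first, only_second = diff_results(first, second)
print(only_second)  # the version missing from the first response
```

Here only the second response contains version 0.1.0.2.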

martenson on 18 Jan 2019

@erasche I can also reproduce it ~50% of the time in the UI, on both Firefox and Chrome on Linux. One of the web handlers probably hasn't reloaded the toolbox.

nsoranzo on 18 Jan 2019

xref new issue for the display bug: https://github.com/galaxyproject/galaxy/issues/7238

martenson on 18 Jan 2019

another search term returning unexpected results: ncbi

browser might not matter, same results using chrome or safari under mac osx (but didn't test firefox)

usegalaxy.org == finds "get data > NCBI bam" download tool but not "get data > NCBI fastq". This server doesn't include "get data > NCBI pileup" anymore (tool routinely failed -- data usually too large plus any represents ambiguous scientific content)
usegalaxy.eu == doesn't find any of these three (all are present under "get data")
usegalaxy.org.au == finds all three (under "get data")
usegalaxy.be == doesn't find any of these three (all are present under "get data")

jennaj on 20 Feb 2019

another search term: convert
It did not find the convert tool (Text Manipulation>Convert delimiters to TAB)

It works

https://usegalaxy.eu/ :+1:

Tested with Firefox.

I just discovered the boosts! What should I set to get results like usegalaxy.eu's?

# tool_name_boost = 9
# tool_section_boost = 3
# tool_description_boost = 2
# tool_label_boost = 1
# tool_stub_boost = 5
# tool_help_boost = 0.5

FredericBGA on 24 May 2019

@FredericBGA .eu's boosts are here https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/group_vars/gxconfig.yml#L1076 but they're pretty aggressive / strange compared to other sites'

hexylena on 24 May 2019

@erasche thank you! The link in Martin's post above is broken. I will try something between the defaults and .eu's values.

FredericBGA on 24 May 2019

@FredericBGA we have this on Main atm

  tool_name_boost: 12
  tool_section_boost: 5

We should probably experiment with tool_enable_ngram_search

I created a PR to mimic EU and enable ngram too: https://github.com/galaxyproject/usegalaxy-playbook/pull/228

martenson on 24 May 2019

We use tool_enable_ngram_search: true, which works fine.
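
For readers unfamiliar with n-gram search, a rough sketch of the general idea: words are broken into overlapping character fragments so that a partial query can match a longer name. The sizes 3 and 4 and both helper names are assumptions chosen for illustration; this is not Galaxy's or Whoosh's implementation.

```python
def char_ngrams(word, min_size=3, max_size=4):
    """Return every character n-gram of `word` with min_size <= n <= max_size."""
    word = word.lower()
    return {
        word[start:start + size]
        for size in range(min_size, max_size + 1)
        for start in range(len(word) - size + 1)
    }


def ngram_match(query, name, min_size=3, max_size=4):
    """True if all of the query's n-grams also occur in the name's n-grams."""
    query_grams = char_ngrams(query, min_size, max_size)
    return bool(query_grams) and query_grams <= char_ngrams(name, min_size, max_size)


print(ngram_match("blast", "blastp"))
print(ngram_match("fastq", "blastp"))
```

The trade-off is more hits per query, which matches the "finds everything, but more results" experience reported for EU.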

nsoranzo on 24 May 2019

thank you all for sharing your config with me!
It works now, with:
tool_name_boost: 20

FredericBGA on 27 May 2019

tool_name_boost: 20

I wonder if Main specifically would benefit from that much higher boost. Searches are still a bit unpredictable and the results too limited imho. Martin is probably on that already... this is not new and we've tried a few variations already, but it could still use some tuning.

It has to be frustrating to search for a tool and not find it, as the stats he posted above back up.

jennaj on 31 May 2019

@jennaj I have proposed radical change for Test here: https://github.com/galaxyproject/usegalaxy-playbook/pull/228

The search stats above do not include anything about the results, it is a boolean for 'has the user searched at least once?'.

martenson on 31 May 2019

Great, I really do like the EU search results. Finds everything, and even though outputs more results, totally worth it imo. Glad we are exploring that.

The stats sort of indicate that people using tool searches spend more time in a Galaxy session. That suggests they are new and/or running tools directly from the history, and are possibly spending more time "hunting" for tools (the "hard way", e.g. expanding/scrolling through everything). Non-tool-searching sessions look like they might be biased toward those running workflows: quick login/logout, no tool searches, and a higher bounce rate because they get whatever they want done more quickly. But maybe I am reading too much into that :)

jennaj on 31 May 2019

New issue reported when looking for blast and expecting to find blastp.

tool_search_limit = 160
https://usegalaxy.eu/api/tools?q=blast reports 160 elements, including blastp
Searching for blast on usegalaxy.eu only shows a couple of results

Screenshot at time of reporting

hexylena on 20 Jun 2019

an idea: A hybrid approach where the search result limit is high but will cut off at certain hit score if there are enough results. This could prune the less important results.
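
A minimal sketch of that hybrid cutoff, assuming hits arrive as `(tool_id, score)` pairs. The function name and the default values for `min_results` and `score_cutoff` are hypothetical; `hard_limit=160` simply mirrors the `tool_search_limit` mentioned above.

```python
def prune_hits(hits, min_results=10, score_cutoff=1.0, hard_limit=160):
    """Keep the best hits, dropping low scorers once enough results exist.

    hits: iterable of (tool_id, score) pairs in any order.
    """
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)[:hard_limit]
    kept = []
    for tool_id, score in ranked:
        # Once min_results are collected, stop at the first weak hit.
        if len(kept) >= min_results and score < score_cutoff:
            break
        kept.append((tool_id, score))
    return kept


hits = [("a", 5.0), ("b", 0.2), ("c", 3.0), ("d", 0.1)]
print(prune_hits(hits, min_results=2, score_cutoff=1.0))
```

A query with few strong hits still returns `min_results` entries, while a query with many hits is trimmed at the score threshold.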

martenson on 24 Jun 2019

searching for full name + title doesn't find the tool. Search select lines and it is returned.

hexylena on 30 Aug 2019

also found https://github.com/galaxyproject/galaxy/issues/3276 when searching for this one, think that's one of the points in this thread somewhere.

hexylena on 30 Aug 2019

@erasche I think the point made in the linked ticket is an important one. People don't care about the actual tool order in whatever tool panel they are working with. They want to find the tool. The ranking is an intuitive way to do a search -- could even be a toggle in the GUI.

Could even be extended if ranked (_exact_ tool name match): eg: "show all" vs "I'm feeling lucky" type of thing

jennaj on 11 Sep 2019

Agreed. I understand (what I assume was) the original intention to help users find the section on their own later, but on eu with 2k tools, they will basically always use search. Would love to have a ranking.

hexylena on 12 Sep 2019

meantime on Main:

search for fastq: the first result in the API response is the fastqc tool, but the panel never shows it
search for fastqc: the multiqc tools appear in the middle of the response, but the panel never shows them

martenson on 8 Oct 2019

Main/usegalaxy.org

search for ncbi does not find any of the "NCBI SRA" Get data tools: https://toolshed.g2.bx.psu.edu/view/iuc/sra_tools/f5ea3ce9b9b0

Eu/usegalaxy.eu

search for ncbi also does not find any of the "NCBI SRA" Get data tools, but does find other Get Data tools from NCBI not in the same tool suite (none of those are loaded at .org)

jennaj on 11 Nov 2019

On eu:

The following two queries return different results:

Importantly, the first doesn't include random_lines1, the tool I'm looking for.

hexylena on 9 Jan 2020

On eu, a search for snpeff does not find any of the following tools:

SnpEff eff
SnpEff download
Snpeff databases

Works just fine on .org.

wm75 on 10 Jan 2020

... and, possibly related to @hexylena's example just above:
trailing whitespace in a search term seems to mess up results completely

This one on main AND eu

wm75 on 10 Jan 2020

a question: is there really no way to express an AND between search terms?

wm75 on 10 Jan 2020

@wm75 not at the moment, can you please provide examples of searches that don't behave as you'd expect?
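
To illustrate the difference being asked about, here is a toy sketch of OR versus AND semantics over whole words in tool names. It uses the three SnpEff tool names from earlier in the thread; the `search` helper and its `require_all` flag are hypothetical and unrelated to Galaxy's actual query parser.

```python
def search(query, names, require_all=False):
    """Match tool names word by word: OR by default, AND if require_all."""
    terms = query.lower().split()
    combine = all if require_all else any
    hits = []
    for name in names:
        words = set(name.lower().split())
        if terms and combine(term in words for term in terms):
            hits.append(name)
    return hits


NAMES = ["SnpEff eff", "SnpEff download", "Snpeff databases"]
print(search("snpeff download", NAMES))                    # OR: any term matches
print(search("snpeff download", NAMES, require_all=True))  # AND: every term must match
```

With OR, every extra term can only widen the result list; with AND, it narrows it.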

martenson on 10 Jan 2020

Odd result on EU:

Multiqc appears in two sections

image

Searching for it yields only one:
image

Contrast with fastqc which appears in two and searches yield two (of the same version)

image

hexylena on 21 Jan 2020

xref: https://github.com/galaxyproject/galaxy/issues/10030

martenson on 4 Aug 2020

"UCSC main" is unfindable on EU: https://usegalaxy.eu/api/tools?q=ucsc+main doesn't include ucsc_table_direct1, but it does include 150 other things. @bgruening

It is found on .org, but it is nowhere near the top hit for a search on the exact tool title.

hexylena on 29 Oct 2020

on EU searching ucsc has 52 results and "Main" is the last one 😭

martenson on 29 Oct 2020

right? saw that one too, and :joy: seems very odd given our boosts. https://github.com/usegalaxy-eu/infrastructure-playbook/blob/45c98a0baec381ccb0acd6cca78016985bd58fe4/group_vars/gxconfig.yml#L1190

hexylena on 29 Oct 2020

@hexylena maybe https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/webapps/galaxy/config_schema.yml#L1955 could improve things?

martenson on 29 Oct 2020

Possibly! I just expected the tool_name boost to have the biggest effect. I would love to debug the internals sometime, and see what scores x boost are being returned for each of these results that are doing 'better' than the direct text match. Like, if those are returning first, clearly they say "ucsc main" dozens of times in their descriptions or something?

hexylena on 29 Oct 2020

is it possible that exact matches overflow in score ? This is a search for ucsc:
Screenshot 2020-10-29 at 11 54 15

mvdbeek on 29 Oct 2020

@mvdbeek neat! How did you obtain that?

hexylena on 29 Oct 2020

Paused at a breakpoint in https://github.com/dannon/galaxy/blob/c0d1a915a056b89b24f567664e7c02daf40deb2e/lib/galaxy/tools/search/__init__.py#L222

mvdbeek on 29 Oct 2020

Ahh ok, wondered if it was a secret api I was missing.

hexylena on 29 Oct 2020

so I booted up a copy of the app against EU because I always feel worried about reproducing locally with the v. different toolboxes. This looks odd to me:

(Pdb) galaxy_app.toolbox_search.parser.parse('*' + 'ucsc main' + '*')
Or([Wildcard('name', '*ucsc'), Wildcard('old_id', '*ucsc'), Wildcard('description', '*ucsc'), Wildcard('section', '*ucsc'), Wildcard('help', '*ucsc'), Wildcard('labels', '*ucsc'), Wildcard('stub', '*ucsc'), Prefix('name', 'main'), Prefix('old_id', 'main'), Prefix('description', 'main'), Prefix('section', 'main'), Prefix('help', 'main'), Prefix('labels', 'main'), Prefix('stub', 'main')])

why does only ucsc stay prefixed with *, while main loses its one?
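
The parse output above suggests the query is split on whitespace before wildcards are interpreted, so wrapping the whole string only puts a leading `*` on the first term and a trailing `*` on the last. A small sketch of wrapping each term instead, which produces the `'*ucsc* *main*'` form used in a later query in this session; `wrap_terms` is a hypothetical helper.

```python
def wrap_terms(query):
    """Wrap every whitespace-separated term in '*', not the string as a whole."""
    return " ".join("*" + term + "*" for term in query.split())


print("*" + "ucsc main" + "*")  # whole-string wrapping
print(wrap_terms("ucsc main"))  # per-term wrapping
```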

(Pdb) for idx, hit in enumerate(galaxy_app.toolbox_search.searcher.search(galaxy_app.toolbox_search.parser.parse('*ucsc main*'), limit=400)): print((idx, hit, hit.score) if 'ucsc' in hit['id'] else None)
...
(296, <Hit {'id': 'ucsc_table_direct1'}>, 0.4618992716030244)
(297, <Hit {'id': 'ucsc_table_direct_archaea1'}>, 0.22288633588616671)
None

or without *

(Pdb) for idx, hit in enumerate(galaxy_app.toolbox_search.searcher.search(galaxy_app.toolbox_search.parser.parse('ucsc main'), limit=400)): print((idx, hit, hit.score) if 'ucsc' in hit['id'] else None)
...
None
(103, <Hit {'id': 'ucsc_table_direct1'}>, 0.4371187999893639)
None
None
None
(107, <Hit {'id': 'ucsc_table_direct_archaea1'}>, 0.1950255439003959)

trying out the individual fields of a search, seems like description is a negative in this case:

(Pdb) for hit in galaxy_app.toolbox_search.searcher.search(MultifieldParser(['name', 'old_id', 'description'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.9)).parse('*ucsc* *main*'), limit=40): print(hit, hit.score)
<Hit {'id': 'vcf_to_maf_customtrack1'}> 1.9662951360360124
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ncbi_blast_plus/ncbi_rpstblastn_wrapper/2.10.1+galaxy0'}> 0.8341075747686874
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ncbi_blast_plus/ncbi_rpsblast_wrapper/2.10.1+galaxy0'}> 0.8341075747686874
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/emboss_5/EMBOSS: shuffleseq87/5.0.0.1'}> 0.8341075747686874
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/emboss_5/EMBOSS: notseq61/5.0.0'}> 0.8341075747686874
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/peterjc/tmhmm_and_signalp/tmhmm2/0.0.16'}> 0.8341075747686874
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ucsc_custom_track/build_ucsc_custom_track_1/1.0.0'}> 0.4456963370285034
<Hit {'id': 'ucsc_table_direct1'}> 0.44038679685244836
<Hit {'id': 'ucsc_table_direct_archaea1'}> 0.20059770229755006
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/seurat_export_cellbrowser/seurat_export_cellbrowser/3.1.1+galaxy0'}> 0.1750118444883364
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_links/2.29.2'}> 0.13424605571325632
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/ucsc_cell_browser/ucsc_cell_browser/0.7.10+galaxy0'}> 0.11250224762822575
<Hit {'id': 'bwtool-lift'}> 0.05599209141540291

tool | name | description
--- | --- | ---
vcf_to_maf_customtrack1 | VCF to MAF Custom Track | for display at UCSC
ucsc_table_direct1 | UCSC Main | table browser

feels very odd that vcf scores higher.

hexylena on 29 Oct 2020

Some more debugging

(Pdb) print(galaxy_app.toolbox_search.searcher.search(MultifieldParser(['name', 'old_id', 'description'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.9)).parse('*ucsc main*'), limit=40, terms=True).termdocs)
{('name', b'ucsc'): array('I', [101, 1460, 2546]), ('description', b'ucsc'): array('I', [967, 1215, 1559, 2255, 2427]), ('description', b'maintaining'): array('I', [2122]), ('name', b'main'): array('I', [2546])}

So that's matching maintaining (hmm. I get why, but surely that should score lower than an exact word-boundary match?)
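
One way to express the "exact word should beat a partial match" intuition is a per-term quality factor. This is only a sketch: the function and its three weights are hypothetical values picked for illustration, not something Whoosh or Galaxy provides.

```python
def match_quality(term, token, exact=1.0, prefix=0.5, substring=0.25):
    """Grade how well an indexed token matches a query term."""
    term, token = term.lower(), token.lower()
    if token == term:
        return exact      # whole-word match, e.g. 'main' vs 'main'
    if token.startswith(term):
        return prefix     # e.g. 'main' vs 'maintaining'
    if term in token:
        return substring  # e.g. 'main' vs 'domain'
    return 0.0


for token in ("main", "maintaining", "domain", "ucsc"):
    print(token, match_quality("main", token))
```

Multiplying a hit's score by such a factor would push `maintaining` below a tool whose name contains the word `main` itself.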

and doc 2546 which hits both main + ucsc is indeed our tool:

(Pdb) print(list(galaxy_app.toolbox_search.searcher.search(MultifieldParser(['name', 'old_id', 'description'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.9)).parse('*ucsc main*'), limit=40, terms=True))[1].docnum)
2546

aha (ish)

(Pdb) for hit in galaxy_app.toolbox_search.searcher.search(MultifieldParser(['name', 'old_id', 'description'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.9)).parse('*ucsc main*'), limit=40, terms=True): print(hit, hit.score, hit.docnum, hit.matched_terms())
<Hit {'id': 'vcf_to_maf_customtrack1'}> 1.7205082440315107 967 [('description', b'ucsc')]
<Hit {'id': 'ucsc_table_direct1'}> 0.44322627964299954 2546 [('name', b'main'), ('name', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ucsc_custom_track/build_ucsc_custom_track_1/1.0.0'}> 0.38998429489994046 1215 [('description', b'ucsc')]
<Hit {'id': 'ucsc_table_direct_archaea1'}> 0.1755229895103563 101 [('name', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/emboss_5/EMBOSS: shuffleseq87/5.0.0.1'}> 0.1736500008575602 2122 [('description', b'maintaining')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/seurat_export_cellbrowser/seurat_export_cellbrowser/3.1.1+galaxy0'}> 0.15313536392729438 1559 [('description', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_links/2.29.2'}> 0.11746529874909928 2427 [('description', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/ucsc_cell_browser/ucsc_cell_browser/0.7.10+galaxy0'}> 0.09843946667469754 1460 [('name', b'ucsc')]
<Hit {'id': 'bwtool-lift'}> 0.048993079988477545 2255 [('description', b'ucsc')]

Changing the OrGroup factor from 0.1 to 0.9 doesn't make a big difference. Oddly, I've specified old_id in the MultifieldParser, but there are no ID matches? I'd expect

<Hit {'id': 'ucsc_table_direct1'}> 0.44322627964299954 2546 [('name', b'main'), ('name', b'ucsc'), ('old_id', b'ucsc')]

but old_id isn't anywhere there? It's when help is included that the results become garbage:

<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/xpath/xpath/1.0.0'}> 5.7824381765403645 1006 [('help', b'maintainers')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/jjohnson/rsem/rsem_prepare_reference/1.1.17'}> 5.74646047844554 676 [('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_get_communitytype/mothur_get_communitytype/1.39.5.0'}> 5.6337872716676785 1183 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_links/2.29.2'}> 5.628083628215019 2427 [('description', b'ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_lefse/mothur_lefse/1.39.5.0'}> 5.6050434590571285 1771 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/samtools_merge/samtools_merge/1.9'}> 5.4913612182626235 947 [('help', b'maintains')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_classify_rf/mothur_classify_rf/1.36.1.0'}> 5.4325805833938325 2391 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_pcr_seqs/mothur_pcr_seqs/1.39.5.0'}> 5.3072620955901515 121 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_get_mimarkspackage/mothur_get_mimarkspackage/1.39.5.0'}> 5.187549416742254 1622 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_merge_files/mothur_merge_files/1.39.5.0'}> 5.187549416742254 1767 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_primer_design/mothur_primer_design/1.39.5.0'}> 5.18227372171205 286 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/bgruening/openbabel/ctb_subsearch/0.1'}> 5.105499296205204 1339 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_fastq_info/mothur_fastq_info/1.39.5.0'}> 5.105499296205204 1414 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_make_lookup/mothur_make_lookup/1.39.5.0'}> 5.0794508304082395 1662 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_cluster_classic/mothur_cluster_classic/1.39.5.0'}> 5.0794508304082395 1710 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_make_fastq/mothur_make_fastq/1.39.5.0'}> 5.0794508304082395 1723 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_chimera_vsearch/mothur_chimera_vsearch/1.39.5.1'}> 5.039159295530664 161 [('help', b'main_page')]

so they're all matching on the term main, even though EU's balances should preclude these getting ANY points:

(Pdb) galaxy_app.toolbox_search.searcher.weighting.weightings['help']._field_B
{'help': 1.0}
(Pdb) galaxy_app.toolbox_search.searcher.weighting.weightings['name']._field_B
{'name': 40.0}
(Pdb) galaxy_app.toolbox_search.searcher.weighting.weightings['description']._field_B
{'description': 40.0}
(Pdb) galaxy_app.toolbox_search.searcher.weighting.weightings['name']._field_B
{'name': 40.0}

So constructing my own weightings

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), name=BM25F(name_B=float(1.0)), help=BM25F(name_B=float(1.0)))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.99)).parse('*ucsc main*'), limit=40, terms=True): print(hit, hit.score, hit.docnum, hit.matched_terms())
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ucsc_custom_track/build_ucsc_custom_track_1/1.0.0'}> 20.208461736081546 1215 [('description', b'ucsc'), ('help', b'ucsc'), ('help', b'main')]
...
<Hit {'id': 'vcf_to_maf_customtrack1'}> 9.356772484789413 967 [('description', b'ucsc'), ('help', b'ucsc')]
...
<Hit {'id': 'ucsc_table_direct1'}> 8.288424974617481 2546 [('name', b'main'), ('name', b'ucsc')]

vs

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), name=BM25F(name_B=float(2.0)), help=BM25F(name_B=float(1.0)))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.99)).parse('*ucsc main*'), limit=40, terms=True): print(hit, hit.score, hit.docnum, hit.matched_terms())
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ucsc_custom_track/build_ucsc_custom_track_1/1.0.0'}> 20.208461736081546 1215 [('description', b'ucsc'), ('help', b'ucsc'), ('help', b'main')]
...
<Hit {'id': 'vcf_to_maf_customtrack1'}> 9.356772484789413 967 [('description', b'ucsc'), ('help', b'ucsc')]
....
<Hit {'id': 'ucsc_table_direct1'}> 5.63599403557272 2546 [('name', b'main'), ('name', b'ucsc')]

so name boost of 2 is worse than a name boost of 1? ucsc_table_direct1 goes from 8 to 5? Swapping the weights for name=1, help=2

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), name=BM25F(name_B=float(1.0)), help=BM25F(name_B=float(2.0)))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.99)).parse('*ucsc main*'), limit=40, terms=True): print(hit, hit.score, hit.docnum, hit.matched_terms())
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ucsc_custom_track/build_ucsc_custom_track_1/1.0.0'}> 20.208461736081546 1215 [('description', b'ucsc'), ('help', b'ucsc'), ('help', b'main')]
<Hit {'id': 'wig_to_bigWig'}> 12.279176289129241 1072 [('help', b'_ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/seurat_export_cellbrowser/seurat_export_cellbrowser/3.1.1+galaxy0'}> 11.821254960210167 1559 [('description', b'ucsc'), ('help', b'ucsc'), ('help', b'maintained')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/ucsc_cell_browser/ucsc_cell_browser/0.7.10+galaxy0'}> 10.435593913196673 1460 [('name', b'ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/ebi_metagenomics_run_downloader/ebi_metagenomics_run_downloader/0.1.0'}> 10.190349240221241 2105 [('help', b'maintains')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_flankbed/2.29.2'}> 9.717608110624447 1931 [('help', b'ucsc'), ('help', b'main')]
<Hit {'id': 'vcf_to_maf_customtrack1'}> 9.356772484789413 967 [('description', b'ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_links/2.29.2'}> 9.255265497863771 2427 [('description', b'ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/bgruening/replace_column_by_key_value_file/replace_column_with_key_value_file/0.1'}> 8.858243038340541 2046 [('help', b'ucsc')]
<Hit {'id': 'ucsc_table_direct1'}> 8.288424974617481 2546 [('name', b'main'), ('name', b'ucsc')]

like, are boosts inverse? Fixing description to 40, name=1 returns ucsc_table_direct1 with the same score but vcf_to_maf_customtrack1 is finally gone?

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), name=BM25F(name_B=float(1.0)), description=BM25F(description_B=float(40.0)))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.99)).parse('*ucsc main*'), limit=40, terms=True): print(hit, hit.score, hit.docnum, hit.matched_terms())
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ucsc_custom_track/build_ucsc_custom_track_1/1.0.0'}> 14.699808940243486 1215 [('description', b'ucsc'), ('help', b'ucsc'), ('help', b'main')]
<Hit {'id': 'wig_to_bigWig'}> 12.279176289129241 1072 [('help', b'_ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/ucsc_cell_browser/ucsc_cell_browser/0.7.10+galaxy0'}> 10.435593913196673 1460 [('name', b'ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/ebi_metagenomics_run_downloader/ebi_metagenomics_run_downloader/0.1.0'}> 10.190349240221241 2105 [('help', b'maintains')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_flankbed/2.29.2'}> 9.717608110624447 1931 [('help', b'ucsc'), ('help', b'main')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/bgruening/replace_column_by_key_value_file/replace_column_with_key_value_file/0.1'}> 8.858243038340541 2046 [('help', b'ucsc')]
<Hit {'id': 'ucsc_table_direct1'}> 8.288424974617481 2546 [('name', b'main'), ('name', b'ucsc')]

Got ucsc main above for the first time:

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), name=BM25F(name_B=float(0.1)), description=BM25F(description_B=float(0.1)))).search(MultifieldParser(['name', 'old_id', 'description', 'section'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.99)).parse('*ucsc main*'), limit=40, terms=True): print(hit, hit.score, hit.docnum, hit.matched_terms())
<Hit {'id': 'ucsc_table_direct1'}> 12.409585144761694 2546 [('name', b'main'), ('name', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/emboss_5/EMBOSS: shuffleseq87/5.0.0.1'}> 6.3525702972838705 2122 [('description', b'maintaining')]
<Hit {'id': 'vcf_to_maf_customtrack1'}> 6.111259177622381 967 [('description', b'ucsc')]

With... both terms boosted to 0.1. This seems like black magic?

hexylena on 29 Oct 2020

Boosts shouldn't be inverse: https://whoosh.readthedocs.io/en/latest/schema.html?highlight=boost#field-boosts (I am sorry I do not have time atm to dive into this)

martenson on 29 Oct 2020

My thought too after reading the doc!! but, it definitely seems to be behaving like it is? it's the only time I can get ucsc_table_direct1 to have a high score (10+) is whenever I do name=0.1, desc=0.1, rest=1
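
One possible reading of these results, offered as an assumption rather than a confirmed diagnosis: if the `name_B`-style keyword arguments set BM25F's per-field `B` parameter (the `_field_B` attribute name seen above hints at this), they control length normalization, not a multiplicative boost. The sketch below uses the textbook BM25 term formula with made-up inputs (`idf`, `tf` and the field lengths are hypothetical) to show how the score moves as `B` grows.

```python
def bm25(idf, tf, field_length, avg_field_length, B=0.75, K1=1.2):
    """Textbook BM25 term score.

    B controls length normalization, K1 controls term-frequency saturation.
    """
    norm = (1 - B) + B * field_length / avg_field_length
    return idf * (tf * (K1 + 1)) / (tf + K1 * norm)


# A field slightly longer than average: a larger B lowers the score.
for B in (0.1, 1.0, 2.0, 40.0):
    score = bm25(idf=1.0, tf=1, field_length=3, avg_field_length=2, B=B)
    print(B, round(score, 3))
```

In this formula a larger `B` lowers the score of any field longer than average and has no effect when the field length equals the average, which would look like an "inverse boost" from the outside. Whether this is what happens inside `MultiWeighting` is not established in this thread.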

hexylena on 29 Oct 2020

I am circling around a bug in whoosh's MultiWeighting class, which alters the scores in a nonsensical way. Haven't finished this though.

mvdbeek on 29 Oct 2020

Compare the results for 'snpeff eff':

0.1name/desc → 25

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), old_id=BM25F(old_id_B=1.0), name=BM25F(name_B=0.1), section=BM25F(section_B=1.0), description=BM25F(description_B=0.1), labels=BM25F(labels_B=1.0), stub=BM25F(stub_B=1.0), help=BM25F(help_B=1.0))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.90)).parse('*snpeff eff*'.lower()), limit=40, terms=True): print((hit, hit.score, hit.docnum, hit.matched_terms()))
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/snpeff_sars_cov_2/snpeff_sars_cov_2/4.5covid19'}>, 35.753471381397226, 448, [('help', b'snpeff'), ('name', b'eff'), ('name', b'snpeff'), ('help', b'effect'), ('help', b'effects')])
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff/4.3+T.galaxy1'}>, 33.047009486777924, 832, [('help', b'snpeff'), ('name', b'eff'), ('name', b'snpeff'), ('help', b'effect'), ('help', b'effects'), ('help', b'eff')])
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/jjohnson/snpeff_to_peptides/snpeff_to_peptides/0.0.1'}>, 25.2224812642123, 1511, [('help', b'_snpeff'), ('help', b'snpeff'), ('name', b'snpeff'), ('help', b'effects'), ('help', b'eff')])
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff_databases/4.3+T.galaxy2'}>, 25.031613016029844, 1223, [('help', b'snpeff'), ('name', b'snpeff'), ('help', b'eff')])

vs

10.0 name/desc → 19

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), old_id=BM25F(old_id_B=1.0), name=BM25F(name_B=10.0), section=BM25F(section_B=1.0), description=BM25F(description_B=10.0), labels=BM25F(labels_B=1.0), stub=BM25F(stub_B=1.0), help=BM25F(help_B=1.0))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.90)).parse('*snpeff eff*'.lower()), limit=40, terms=True): print((hit, hit.score, hit.docnum, hit.matched_terms()))
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/snpeff_sars_cov_2/snpeff_sars_cov_2/4.5covid19'}>, 22.617891323807093, 448, [('help', b'snpeff'), ('name', b'eff'), ('name', b'snpeff'), ('help', b'effect'), ('help', b'effects')])
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff/4.3+T.galaxy1'}>, 19.91142942918779, 832, [('help', b'snpeff'), ('name', b'eff'), ('name', b'snpeff'), ('help', b'effect'), ('help', b'effects'), ('help', b'eff')])

edit: sorry, had an old help boost.

Or the query "select lines that match an expression"

0.1/0.1 → Grep1 = 40.0, 1st place

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), old_id=BM25F(old_id_B=1.0), name=BM25F(name_B=0.1), section=BM25F(section_B=1.0), description=BM25F(description_B=0.1), labels=BM25F(labels_B=1.0), stub=BM25F(stub_B=1.0), help=BM25F(help_B=1.0))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.90)).parse('*select lines that match an expression*'.lower()), limit=40, terms=True): print((hit, hit.score, hit.docnum, hit.matched_terms()))
(<Hit {'id': 'Grep1'}>, 40.085088543825, 621, [('help', b'match'), ('help', b'lines'), ('help', b'expression'), ('description', b'expression'), ('description', b'lines'), ('description', b'match'), ('name', b'select'), ('help', b'select')])

40/40 → Grep1 = 16, 2nd place

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), old_id=BM25F(old_id_B=1.0), name=BM25F(name_B=40.0), section=BM25F(section_B=1.0), description=BM25F(description_B=40.0), labels=BM25F(labels_B=1.0), stub=BM25F(stub_B=1.0), help=BM25F(help_B=1.0))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.90)).parse('*select lines that match an expression*'.lower()), limit=40, terms=True): print((hit, hit.score, hit.docnum, hit.matched_terms()))
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_grep_tool/1.1.1'}>, 18.069423636987338, 1181, [('help', b'match'), ('help', b'lines'), ('help', b'expressions'), ('help', b'expression'), ('help', b'select')])
(<Hit {'id': 'Grep1'}>, 16.10481479336669, 621, [('help', b'match'), ('help', b'lines'), ('help', b'expression'), ('description', b'expression'), ('description', b'lines'), ('description', b'match'), ('name', b'select'), ('help', b'select')])

hexylena on 29 Oct 2020

@mvdbeek did you have any more information about what that issue was with whoosh?

hexylena on 5 Nov 2020

So we deployed the new boosts on eu, to see how those work. I.... think they're a huge improvement? I was discussing with @shiltemann and her test query was 'group', expecting the full match of Grouping1 to be found. We need some way to rank by "this term or terms constitutes the entire name field", but I'm not sure how we'd accomplish that given that we currently break into individual words :/
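
A sketch of one way to rank "the query is the entire name" first, applied as a re-sort after the normal search. It assumes hits are available as `(tool_name, score)` pairs; the helper name, the example names and the scores are all hypothetical.

```python
def rank_with_exact_name(query, hits):
    """Sort hits so names equal to the whole query come first, then by score.

    hits: list of (tool_name, score) pairs.
    """
    wanted = " ".join(query.lower().split())

    def sort_key(hit):
        name, score = hit
        is_exact = " ".join(name.lower().split()) == wanted
        # False sorts before True, so exact matches lead; then highest score.
        return (not is_exact, -score)

    return sorted(hits, key=sort_key)


# Hypothetical names and scores, for illustration only.
hits = [("Datamash", 9.0), ("Group", 2.0), ("Group by", 4.0)]
print(rank_with_exact_name("group", hits))
```

Because the comparison is done on the whole name rather than on individual words, it sidesteps the per-word tokenization used by the index.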

hexylena on 6 Nov 2020

@bgruening provides 'tail-to-head' which doesn't return useful things (but don't know about before.) and same for tail

@wm75 provides

only exception I found so far is mimodd vcf, which only returns general vcf stuff as top hits. Strangely, reversing the words to vcf mimodd does much better.

hexylena on 6 Nov 2020
