Galaxy: improve tool panel search

Created on 29 Apr 2016 · 61 Comments · Source: galaxyproject/galaxy

reported by @jennaj
Given the number of tools on Main, the search results need to be better, mainly:

  • give more results and let people scroll
  • give more weight to name and section
  • let people search for tool IDs
  • make search understand hyphens

I am trying to address the first two (for Main) with: https://github.com/galaxyproject/usegalaxy-playbook/pull/19

areUI-UX

All 61 comments

@erasche Galaxy recently started searching tool IDs

screenshot 2016-08-08 13 49 51

I think improvements might be made regarding the interchangeability of ' ', '_', '-'

@martenson that's great! +1 for allowing users to substitute in ' ' for the _. I know I look for my tools by ID sometimes and fail to find them.

+1 to that, requiring users to know _'s is unfortunate. Is the input not tokenized and matched? (I guess not, if the broken string doesn't match?)
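
A minimal sketch of the separator idea above, in plain Python rather than Galaxy's actual search code. The helper names (`normalize_separators`, `tokens`, `matches`) are hypothetical; the only assumption is that ' ', '_' and '-' should all count as the same token boundary in both the query and the tool ID.

```python
import re

# Hypothetical helpers, not part of Galaxy: treat ' ', '_' and '-' as one separator.
_SEPARATORS = re.compile(r"[\s_\-]+")


def normalize_separators(text):
    """Lower-case the text and collapse runs of space, '_' and '-' into one space."""
    return _SEPARATORS.sub(" ", text.lower()).strip()


def tokens(text):
    """Split text into tokens after separator normalization."""
    return normalize_separators(text).split()


def matches(query, tool_id):
    """Return True when every query token is one of the tool ID's tokens."""
    id_tokens = set(tokens(tool_id))
    return all(tok in id_tokens for tok in tokens(query))


print(matches("ucsc table", "ucsc_table_direct1"))
print(matches("tail to head", "tail-to-head"))
```

With this kind of normalization a user who types spaces still reaches an ID written with underscores or hyphens.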

@martenson RFC: Things I would like to see indexed and available to search, along with my feelings on their boosts:

  • tool name (5)
  • tool id (4)
  • tool help text (2)
  • tool parameter helps (0.3)
  • tool input data formats (1)
  • tool output data formats (0.6)

I just find myself frustrated when I cannot find the tool I want or the results are very limited because of what is searched upon. Of course, I do not know what the state of 16.07/dev is, have not gotten there yet.

@erasche we have these boosts on Main and these are the defaults

# tool_name_boost = 9
# tool_section_boost = 3
# tool_description_boost = 2
# tool_label_boost = 1
# tool_stub_boost = 5
# tool_help_boost = 0.5

@martenson hey that's most of the things I need. In that case, then it would be nice to have more space to display this and where the search actually "hit". Apologies, have not been following along with this stuff closely enough to make informed comments.

we have the 'hit' information but I did not figure out a good place to display it - related to the limited canvas

I would add in that it would be nice to have the underlying tool (binary) name be part of the search, if wrapped under a slightly different tool name or short label of some type.

Related utilized/dependent binaries would be included in this. (lower "boost" probably)

Digging through Main usage metrics, any improvement to tool panel search should be well worth it

screenshot 2018-10-11 11 19 36

Another concrete issue:

https://usegalaxy.eu/api/tools?q=peakachu returns 2 results, neither are displayed on frontend. client issue.

@erasche I cannot reproduce
screenshot 2019-01-18 11 02 38

Firefox on linux, cannot repro in chrome.

two subsequent searches for peakachu yielded different results for me in firefox, the first does not show in client, the second does

["toolshed.g2.bx.psu.edu/repos/rnateam/peakachu/peakachu/0.1.0.1", "toolshed.g2.bx.psu.edu/repos/rnateam/peakachu/peakachu/0.1.0.0"]
["toolshed.g2.bx.psu.edu/repos/rnateam/peakachu/peakachu/0.1.0.1", "toolshed.g2.bx.psu.edu/repos/rnateam/peakachu/peakachu/0.1.0.0", "toolshed.g2.bx.psu.edu/repos/rnateam/peakachu/peakachu/0.1.0.2"]

@erasche I can also reproduce it ~50% of the time in the UI, on both Firefox and Chrome on Linux. One of the web handlers probably hasn't reloaded the toolbox.

xref new issue for the display bug: https://github.com/galaxyproject/galaxy/issues/7238

another search term returning unexpected results: ncbi

browser might not matter, same results using chrome or safari under mac osx (but didn't test firefox)

  • usegalaxy.org == finds "get data > NCBI bam" download tool but not "get data > NCBI fastq". This server doesn't include "get data > NCBI pileup" anymore (tool routinely failed -- data usually too large plus any represents ambiguous scientific content)

  • usegalaxy.eu == doesn't find any of these three (all are present under "get data")

  • usegalaxy.org.au == finds all three (under "get data")

  • usegalaxy.be == doesn't find any of these three (all are present under "get data")

another search term: convert
It did not find the convert tool (Text Manipulation>Convert delimiters to TAB)

It works

Tries made with Firefox.

I just discovered the boosts! What should I set to get results like usegalaxy.eu's?

# tool_name_boost = 9
# tool_section_boost = 3
# tool_description_boost = 2
# tool_label_boost = 1
# tool_stub_boost = 5
# tool_help_boost = 0.5

@FredericBGA .eu's boosts are here https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/group_vars/gxconfig.yml#L1076 but they're pretty aggressive / strange compared to other sites'

@erasche thank you! The link in Martin's post above is broken. I will try something between the defaults and .eu's values.

@FredericBGA we have this on Main atm

  tool_name_boost: 12
  tool_section_boost: 5

We should probably experiment with tool_enable_ngram_search

I created a PR to mimic EU and enable ngram too: https://github.com/galaxyproject/usegalaxy-playbook/pull/228

We use tool_enable_ngram_search: true, which works fine.

thank you all for sharing your config with me!
It works now, with:
tool_name_boost: 20

tool_name_boost: 20

I wonder whether Main specifically would benefit from that much higher boost. Searches are still a bit unpredictable and the results too limited imho. Martin is probably on that already... this is not new, and we've tried a few variations already, but it could still use some tuning.

It has to be frustrating to search for a tool and not find it, as the stats he posted above back up.

@jennaj I have proposed radical change for Test here: https://github.com/galaxyproject/usegalaxy-playbook/pull/228

The search stats above do not include anything about the results, it is a boolean for 'has the user searched at least once?'.

Great, I really do like the EU search results. Finds everything, and even though outputs more results, totally worth it imo. Glad we are exploring that.

The stats sort of indicate that people using tool searches spend more time in a Galaxy session. That suggests they are new and/or running tools directly from the history, and possibly spending more time "hunting" for tools (the "hard way", e.g. expanding/scrolling through everything). Sessions without tool searches look like they might be biased toward people running workflows: quick login/logout, no tool searches, and a higher bounce rate because they get whatever they want done quicker. But maybe I am reading too much into that :)

New issue reported when looking for blast and expecting to find blastp.


Screenshot at time of reporting

an idea: A hybrid approach where the search result limit is high but will cut off at certain hit score if there are enough results. This could prune the less important results.
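
A rough sketch of that hybrid cutoff in plain Python. The function name `prune_hits` and the default values for `min_results`, `max_results` and `score_ratio` are made up for illustration; none of this is an existing Galaxy option.

```python
def prune_hits(hits, min_results=10, max_results=200, score_ratio=0.25):
    """Keep a generous number of hits, but drop weak ones when enough strong ones exist.

    hits is a list of (tool_id, score) pairs. A hit counts as strong when its
    score is at least score_ratio times the best score.
    """
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)[:max_results]
    if len(ranked) <= min_results:
        return ranked
    cutoff = ranked[0][1] * score_ratio
    strong = [hit for hit in ranked if hit[1] >= cutoff]
    # Only apply the score cutoff when it still leaves enough results.
    return strong if len(strong) >= min_results else ranked[:min_results]


example = [("a", 10.0), ("b", 8.0), ("c", 1.0), ("d", 0.5)]
print(prune_hits(example, min_results=2))
```

The relative cutoff (a fraction of the top score) is one possible choice; an absolute score threshold would be harder to tune because scores change with the boosts.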

image

searching for full name + title doesn't find the tool. Search select lines and it is returned.

also found https://github.com/galaxyproject/galaxy/issues/3276 when searching for this one, think that's one of the points in this thread somewhere.

@erasche I think the point made in the linked ticket is an important one. People don't care about the actual tool order in whatever tool panel they are working with. They want to find the tool. The ranking is an intuitive way to do a search -- could even be a toggle in the GUI.

Could even be extended if ranked (_exact_ tool name match): eg: "show all" vs "I'm feeling lucky" type of thing

Agreed. I understand (what I assume was) the original intention to help users find the section on their own later, but on eu with 2k tools, they will basically always use search. Would love to have a ranking.

meantime on Main:

search for fastq has first result in response the fastqc tool, panel never shows it
search for fastqc has middle result the multiqc tools, panel never shows it

Main/usegalaxy.org

search for ncbi does not find any of the "NCBI SRA" Get data tools: https://toolshed.g2.bx.psu.edu/view/iuc/sra_tools/f5ea3ce9b9b0

Eu/usegalaxy.eu

search for ncbi also does not find any of the "NCBI SRA" Get data tools, but does find other Get Data tools from NCBI not in the same tool suite (none of those are loaded at .org)

On eu:

The following two queries return different results:

Importantly, the first doesn't include random_lines1, the tool I'm looking for.

On eu, a search for snpeff does not find any of the following tools:

  • SnpEff eff
  • SnpEff download
  • Snpeff databases

Works just fine on .org.

... and, possibly related to @hexylena's example just above:
trailing whitespace in a search term seems to mess up results completely
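
One possible cause, assuming the raw query is wrapped as `'*' + query + '*'` before parsing (as in the pdb session further down) and that the parser splits on whitespace: a trailing space would leave a bare `*` as its own term, which matches everything. The plain-Python sketch below only shows the string handling; `naive_wrap`, `clean_query` and `safe_wrap` are hypothetical names.

```python
def naive_wrap(query):
    """Wrap the raw query in wildcards without cleaning it first."""
    return "*" + query + "*"


def clean_query(query):
    """Strip leading/trailing whitespace and collapse inner runs to one space."""
    return " ".join(query.split())


def safe_wrap(query):
    """Clean the query before wrapping; return '' for a blank query."""
    cleaned = clean_query(query)
    return "*" + cleaned + "*" if cleaned else ""


print(naive_wrap("snpeff ").split())
print(safe_wrap("snpeff "))
```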

This one on main AND eu

a question: is there really no way to express an AND between search terms?

@wm75 not at the moment, can you please provide examples of searches that don't behave as you'd expect?
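
For illustration, this is what AND semantics would mean, sketched in plain Python over a tiny in-memory list instead of the real index. The two records reuse the names and descriptions quoted later in this thread; `search_all_terms` is a hypothetical helper.

```python
TOOLS = [
    {"id": "vcf_to_maf_customtrack1", "name": "VCF to MAF Custom Track",
     "description": "for display at UCSC"},
    {"id": "ucsc_table_direct1", "name": "UCSC Main",
     "description": "table browser"},
]


def search_all_terms(query, tools):
    """Return IDs of tools where every query term occurs in the name or description."""
    terms = query.lower().split()
    results = []
    for tool in tools:
        text = (tool["name"] + " " + tool["description"]).lower()
        if all(term in text for term in terms):
            results.append(tool["id"])
    return results


print(search_all_terms("ucsc main", TOOLS))
```

With OR semantics both tools match "ucsc main"; with AND only the tool that contains both terms survives.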

Odd result on EU:

Multiqc appears in two sections

[image]

Searching for it yields only one:
[image]

Contrast with fastqc which appears in two and searches yield two (of the same version)

[image]

"UCSC main" is unfindable on EU: https://usegalaxy.eu/api/tools?q=ucsc+main doesn't include ucsc_table_direct1, but it does include 150 other things. @bgruening

It does on .org, but not nearly the top hit for a search on the exact tool title

On EU, searching ucsc has 52 results and "Main" is the last one 😭

Possibly! I just expected the tool_name boost to have the biggest effect. I would love to debug the internals sometime and see what scores x boosts are being returned for each of the results that are doing 'better' than the direct text match. Like, if those are returning first, clearly they say "ucsc main" dozens of times in their descriptions or something?

is it possible that exact matches overflow in score ? This is a search for ucsc:
Screenshot 2020-10-29 at 11 54 15

@mvdbeek neat! How did you obtain that?

Ahh ok, wondered if it was a secret api I was missing.

so I booted up a copy of the app against EU because I always feel worried about reproducing locally with the v. different toolboxes. This looks odd to me:

(Pdb) galaxy_app.toolbox_search.parser.parse('*' + 'ucsc main' + '*')
Or([Wildcard('name', '*ucsc'), Wildcard('old_id', '*ucsc'), Wildcard('description', '*ucsc'), Wildcard('section', '*ucsc'), Wildcard('help', '*ucsc'), Wildcard('labels', '*ucsc'), Wildcard('stub', '*ucsc'), Prefix('name', 'main'), Prefix('old_id', 'main'), Prefix('description', 'main'), Prefix('section', 'main'), Prefix('help', 'main'), Prefix('labels', 'main'), Prefix('stub', 'main')])

why does only ucsc stay prefixed with *, and main loses its one?
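
A likely reading, assuming the parser splits the query on whitespace before interpreting wildcards: `'*ucsc main*'` becomes the two terms `*ucsc` and `main*`. The first has a leading star, so it stays a Wildcard; the second has only a trailing star, and `Prefix('name', 'main')` is presumably just the parsed form of `main*`. So neither term is wrapped on both sides. The plain-Python sketch below (hypothetical helper names, no Whoosh calls) shows the difference between wrapping the whole query and wrapping each term.

```python
def wrap_whole(query):
    """Put one '*' before the first term and one after the last term."""
    return "*" + query + "*"


def wrap_each_term(query):
    """Put '*' on both sides of every whitespace-separated term."""
    return " ".join("*" + term + "*" for term in query.split())


print(wrap_whole("ucsc main").split())
print(wrap_each_term("ucsc main"))
```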

(Pdb) for idx, hit in enumerate(galaxy_app.toolbox_search.searcher.search(galaxy_app.toolbox_search.parser.parse('*ucsc main*'), limit=400)): print((idx, hit, hit.score) if 'ucsc' in hit['id'] else None)
...
(296, <Hit {'id': 'ucsc_table_direct1'}>, 0.4618992716030244)
(297, <Hit {'id': 'ucsc_table_direct_archaea1'}>, 0.22288633588616671)
None

or without *

(Pdb) for idx, hit in enumerate(galaxy_app.toolbox_search.searcher.search(galaxy_app.toolbox_search.parser.parse('ucsc main'), limit=400)): print((idx, hit, hit.score) if 'ucsc' in hit['id'] else None)
...
None
(103, <Hit {'id': 'ucsc_table_direct1'}>, 0.4371187999893639)
None
None
None
(107, <Hit {'id': 'ucsc_table_direct_archaea1'}>, 0.1950255439003959)

trying out the individual fields of a search, seems like description is a negative in this case:

(Pdb) for hit in galaxy_app.toolbox_search.searcher.search(MultifieldParser(['name', 'old_id', 'description'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.9)).parse('*ucsc* *main*'), limit=40): print(hit, hit.score)
<Hit {'id': 'vcf_to_maf_customtrack1'}> 1.9662951360360124
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ncbi_blast_plus/ncbi_rpstblastn_wrapper/2.10.1+galaxy0'}> 0.8341075747686874
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ncbi_blast_plus/ncbi_rpsblast_wrapper/2.10.1+galaxy0'}> 0.8341075747686874
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/emboss_5/EMBOSS: shuffleseq87/5.0.0.1'}> 0.8341075747686874
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/emboss_5/EMBOSS: notseq61/5.0.0'}> 0.8341075747686874
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/peterjc/tmhmm_and_signalp/tmhmm2/0.0.16'}> 0.8341075747686874
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ucsc_custom_track/build_ucsc_custom_track_1/1.0.0'}> 0.4456963370285034
<Hit {'id': 'ucsc_table_direct1'}> 0.44038679685244836
<Hit {'id': 'ucsc_table_direct_archaea1'}> 0.20059770229755006
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/seurat_export_cellbrowser/seurat_export_cellbrowser/3.1.1+galaxy0'}> 0.1750118444883364
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_links/2.29.2'}> 0.13424605571325632
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/ucsc_cell_browser/ucsc_cell_browser/0.7.10+galaxy0'}> 0.11250224762822575
<Hit {'id': 'bwtool-lift'}> 0.05599209141540291

tool | name | description
--- | --- | ---
vcf_to_maf_customtrack1 | VCF to MAF Custom Track | for display at UCSC
ucsc_table_direct1 | UCSC Main | table browser

feels very odd that vcf scores higher.

Some more debugging

(Pdb) print(galaxy_app.toolbox_search.searcher.search(MultifieldParser(['name', 'old_id', 'description'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.9)).parse('*ucsc main*'), limit=40, terms=True).termdocs)
{('name', b'ucsc'): array('I', [101, 1460, 2546]), ('description', b'ucsc'): array('I', [967, 1215, 1559, 2255, 2427]), ('description', b'maintaining'): array('I', [2122]), ('name', b'main'): array('I', [2546])}

So that's matching maintaining (hmm, I get why, but surely that should score lower than an exact word-boundary match?)
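
One way to express "exact word match beats prefix match", as a hedged sketch in plain Python. The function `match_quality` and its weights are invented for illustration and are not how Whoosh scores wildcard or prefix hits.

```python
def match_quality(term, token):
    """Return a multiplier in [0, 1] describing how well term matches token.

    Exact matches get 1.0, prefix matches are scaled by how much of the token
    the term covers, other substring matches get half of that, no match gets 0.
    """
    term, token = term.lower(), token.lower()
    if token == term:
        return 1.0
    coverage = len(term) / len(token) if token else 0.0
    if token.startswith(term):
        return coverage
    if term in token:
        return 0.5 * coverage
    return 0.0


print(match_quality("main", "main"))
print(match_quality("main", "maintaining"))
```

Multiplying a hit's field score by such a factor would push `maintaining` below `main` for the query term main.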

and doc 2546 which hits both main + ucsc is indeed our tool:

(Pdb) print(list(galaxy_app.toolbox_search.searcher.search(MultifieldParser(['name', 'old_id', 'description'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.9)).parse('*ucsc main*'), limit=40, terms=True))[1].docnum)
2546

aha (ish)

(Pdb) for hit in galaxy_app.toolbox_search.searcher.search(MultifieldParser(['name', 'old_id', 'description'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.9)).parse('*ucsc main*'), limit=40, terms=True): print(hit, hit.score, hit.docnum, hit.matched_terms())
<Hit {'id': 'vcf_to_maf_customtrack1'}> 1.7205082440315107 967 [('description', b'ucsc')]
<Hit {'id': 'ucsc_table_direct1'}> 0.44322627964299954 2546 [('name', b'main'), ('name', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ucsc_custom_track/build_ucsc_custom_track_1/1.0.0'}> 0.38998429489994046 1215 [('description', b'ucsc')]
<Hit {'id': 'ucsc_table_direct_archaea1'}> 0.1755229895103563 101 [('name', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/emboss_5/EMBOSS: shuffleseq87/5.0.0.1'}> 0.1736500008575602 2122 [('description', b'maintaining')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/seurat_export_cellbrowser/seurat_export_cellbrowser/3.1.1+galaxy0'}> 0.15313536392729438 1559 [('description', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_links/2.29.2'}> 0.11746529874909928 2427 [('description', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/ucsc_cell_browser/ucsc_cell_browser/0.7.10+galaxy0'}> 0.09843946667469754 1460 [('name', b'ucsc')]
<Hit {'id': 'bwtool-lift'}> 0.048993079988477545 2255 [('description', b'ucsc')]

Changing the OrGroup factor from 0.1 to 0.9 doesn't make a big difference. Oddly, I've specified old_id in the MultifieldParser, but there are no ID matches? I'd expect

<Hit {'id': 'ucsc_table_direct1'}> 0.44322627964299954 2546 [('name', b'main'), ('name', b'ucsc'), ('old_id', b'ucsc')]

but old_id isn't anywhere there? It's when help is included that the results become garbage:

<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/xpath/xpath/1.0.0'}> 5.7824381765403645 1006 [('help', b'maintainers')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/jjohnson/rsem/rsem_prepare_reference/1.1.17'}> 5.74646047844554 676 [('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_get_communitytype/mothur_get_communitytype/1.39.5.0'}> 5.6337872716676785 1183 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_links/2.29.2'}> 5.628083628215019 2427 [('description', b'ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_lefse/mothur_lefse/1.39.5.0'}> 5.6050434590571285 1771 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/samtools_merge/samtools_merge/1.9'}> 5.4913612182626235 947 [('help', b'maintains')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_classify_rf/mothur_classify_rf/1.36.1.0'}> 5.4325805833938325 2391 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_pcr_seqs/mothur_pcr_seqs/1.39.5.0'}> 5.3072620955901515 121 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_get_mimarkspackage/mothur_get_mimarkspackage/1.39.5.0'}> 5.187549416742254 1622 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_merge_files/mothur_merge_files/1.39.5.0'}> 5.187549416742254 1767 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_primer_design/mothur_primer_design/1.39.5.0'}> 5.18227372171205 286 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/bgruening/openbabel/ctb_subsearch/0.1'}> 5.105499296205204 1339 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_fastq_info/mothur_fastq_info/1.39.5.0'}> 5.105499296205204 1414 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_make_lookup/mothur_make_lookup/1.39.5.0'}> 5.0794508304082395 1662 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_cluster_classic/mothur_cluster_classic/1.39.5.0'}> 5.0794508304082395 1710 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_make_fastq/mothur_make_fastq/1.39.5.0'}> 5.0794508304082395 1723 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_chimera_vsearch/mothur_chimera_vsearch/1.39.5.1'}> 5.039159295530664 161 [('help', b'main_page')]

so they're all matching on the term main, even though EU's balances should preclude these getting ANY points:

(Pdb) galaxy_app.toolbox_search.searcher.weighting.weightings['help']._field_B
{'help': 1.0}
(Pdb) galaxy_app.toolbox_search.searcher.weighting.weightings['name']._field_B
{'name': 40.0}
(Pdb) galaxy_app.toolbox_search.searcher.weighting.weightings['description']._field_B
{'description': 40.0}
(Pdb) galaxy_app.toolbox_search.searcher.weighting.weightings['name']._field_B
{'name': 40.0}

So constructing my own weightings

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), name=BM25F(name_B=float(1.0)), help=BM25F(name_B=float(1.0)))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.99)).parse('*ucsc main*'), limit=40, terms=True): print(hit, hit.score, hit.docnum, hit.matched_terms())
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ucsc_custom_track/build_ucsc_custom_track_1/1.0.0'}> 20.208461736081546 1215 [('description', b'ucsc'), ('help', b'ucsc'), ('help', b'main')]
...
<Hit {'id': 'vcf_to_maf_customtrack1'}> 9.356772484789413 967 [('description', b'ucsc'), ('help', b'ucsc')]
...
<Hit {'id': 'ucsc_table_direct1'}> 8.288424974617481 2546 [('name', b'main'), ('name', b'ucsc')]

vs

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), name=BM25F(name_B=float(2.0)), help=BM25F(name_B=float(1.0)))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.99)).parse('*ucsc main*'), limit=40, terms=True): print(hit, hit.score, hit.docnum, hit.matched_terms())
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ucsc_custom_track/build_ucsc_custom_track_1/1.0.0'}> 20.208461736081546 1215 [('description', b'ucsc'), ('help', b'ucsc'), ('help', b'main')]
...
<Hit {'id': 'vcf_to_maf_customtrack1'}> 9.356772484789413 967 [('description', b'ucsc'), ('help', b'ucsc')]
....
<Hit {'id': 'ucsc_table_direct1'}> 5.63599403557272 2546 [('name', b'main'), ('name', b'ucsc')]

so name boost of 2 is worse than a name boost of 1? ucsc_table_direct1 goes from 8 to 5? Swapping the weights for name=1, help=2

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), name=BM25F(name_B=float(1.0)), help=BM25F(name_B=float(2.0)))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.99)).parse('*ucsc main*'), limit=40, terms=True): print(hit, hit.score, hit.docnum, hit.matched_terms())
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ucsc_custom_track/build_ucsc_custom_track_1/1.0.0'}> 20.208461736081546 1215 [('description', b'ucsc'), ('help', b'ucsc'), ('help', b'main')]
<Hit {'id': 'wig_to_bigWig'}> 12.279176289129241 1072 [('help', b'_ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/seurat_export_cellbrowser/seurat_export_cellbrowser/3.1.1+galaxy0'}> 11.821254960210167 1559 [('description', b'ucsc'), ('help', b'ucsc'), ('help', b'maintained')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/ucsc_cell_browser/ucsc_cell_browser/0.7.10+galaxy0'}> 10.435593913196673 1460 [('name', b'ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/ebi_metagenomics_run_downloader/ebi_metagenomics_run_downloader/0.1.0'}> 10.190349240221241 2105 [('help', b'maintains')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_flankbed/2.29.2'}> 9.717608110624447 1931 [('help', b'ucsc'), ('help', b'main')]
<Hit {'id': 'vcf_to_maf_customtrack1'}> 9.356772484789413 967 [('description', b'ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_links/2.29.2'}> 9.255265497863771 2427 [('description', b'ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/bgruening/replace_column_by_key_value_file/replace_column_with_key_value_file/0.1'}> 8.858243038340541 2046 [('help', b'ucsc')]
<Hit {'id': 'ucsc_table_direct1'}> 8.288424974617481 2546 [('name', b'main'), ('name', b'ucsc')]

like, are boosts inverse? Fixing description to 40, name=1 returns ucsc_table_direct1 with the same score but vcf_to_maf_customtrack1 is finally gone?

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), name=BM25F(name_B=float(1.0)), description=BM25F(description_B=float(40.0)))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.99)).parse('*ucsc main*'), limit=40, terms=True): print(hit, hit.score, hit.docnum, hit.matched_terms())
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ucsc_custom_track/build_ucsc_custom_track_1/1.0.0'}> 14.699808940243486 1215 [('description', b'ucsc'), ('help', b'ucsc'), ('help', b'main')]
<Hit {'id': 'wig_to_bigWig'}> 12.279176289129241 1072 [('help', b'_ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/ucsc_cell_browser/ucsc_cell_browser/0.7.10+galaxy0'}> 10.435593913196673 1460 [('name', b'ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/ebi_metagenomics_run_downloader/ebi_metagenomics_run_downloader/0.1.0'}> 10.190349240221241 2105 [('help', b'maintains')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_flankbed/2.29.2'}> 9.717608110624447 1931 [('help', b'ucsc'), ('help', b'main')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/bgruening/replace_column_by_key_value_file/replace_column_with_key_value_file/0.1'}> 8.858243038340541 2046 [('help', b'ucsc')]
<Hit {'id': 'ucsc_table_direct1'}> 8.288424974617481 2546 [('name', b'main'), ('name', b'ucsc')]

Got ucsc main above for the first time:

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), name=BM25F(name_B=float(0.1)), description=BM25F(description_B=float(0.1)))).search(MultifieldParser(['name', 'old_id', 'description', 'section'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.99)).parse('*ucsc main*'), limit=40, terms=True): print(hit, hit.score, hit.docnum, hit.matched_terms())
<Hit {'id': 'ucsc_table_direct1'}> 12.409585144761694 2546 [('name', b'main'), ('name', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/emboss_5/EMBOSS: shuffleseq87/5.0.0.1'}> 6.3525702972838705 2122 [('description', b'maintaining')]
<Hit {'id': 'vcf_to_maf_customtrack1'}> 6.111259177622381 967 [('description', b'ucsc')]

With... both terms boosted to 0.1. This seems like black magic?

Boosts shouldn't be inverse: https://whoosh.readthedocs.io/en/latest/schema.html?highlight=boost#field-boosts (I am sorry I do not have time atm to dive into this)

My thought too after reading the docs!! But it definitely seems to behave that way? The only time I can get ucsc_table_direct1 to have a high score (10+) is when I set name=0.1, desc=0.1, rest=1.

I am circling around a bug in whoosh's MultiWeighting class, which alters the scores in a nonsensical way. Haven't finished this, though.
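
A possible alternative explanation, offered tentatively: as far as the Whoosh scoring documentation describes it, keyword arguments of the form `fieldname_B` passed to `BM25F` set the per-field B free parameter (length normalization, normally between 0 and 1), not a boost. If that is what `name_B=40.0` does here, a larger value would penalize the field more, not less. The sketch below reimplements the standard BM25 term formula in plain Python without calling Whoosh; the input numbers are illustrative and not taken from the EU index.

```python
def bm25_term_score(idf, tf, field_len, avg_field_len, B=0.75, K1=1.2):
    """Standard BM25 score of one term in one field.

    B controls length normalization (0 = none, 1 = full); it is not a boost.
    """
    norm = (1 - B) + B * field_len / avg_field_len
    return idf * (tf * (K1 + 1)) / (tf + K1 * norm)


# Illustrative numbers only: one occurrence of the term in a 4-token field,
# in a collection whose average length for that field is 3 tokens.
for B in (0.1, 1.0, 2.0, 40.0):
    score = bm25_term_score(idf=1.0, tf=1, field_len=4, avg_field_len=3, B=B)
    print("B=%5.1f score=%.3f" % (B, score))
```

For a field longer than average the score drops as B grows, which would be consistent with ucsc_table_direct1 scoring about 12 at 0.1, 8 at 1.0 and 5.6 at 2.0 in the sessions above. For fields shorter than average and B well above 1, the normalization term can even go negative, so values outside 0..1 are not meaningful in this formula.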

Compare the results for 'snpeff eff':

0.1name/desc → 25

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), old_id=BM25F(old_id_B=1.0), name=BM25F(name_B=0.1), section=BM25F(section_B=1.0), description=BM25F(description_B=0.1), labels=BM25F(labels_B=1.0), stub=BM25F(stub_B=1.0), help=BM25F(help_B=1.0))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.90)).parse('*snpeff eff*'.lower()), limit=40, terms=True): print((hit, hit.score, hit.docnum, hit.matched_terms()))
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/snpeff_sars_cov_2/snpeff_sars_cov_2/4.5covid19'}>, 35.753471381397226, 448, [('help', b'snpeff'), ('name', b'eff'), ('name', b'snpeff'), ('help', b'effect'), ('help', b'effects')])
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff/4.3+T.galaxy1'}>, 33.047009486777924, 832, [('help', b'snpeff'), ('name', b'eff'), ('name', b'snpeff'), ('help', b'effect'), ('help', b'effects'), ('help', b'eff')])
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/jjohnson/snpeff_to_peptides/snpeff_to_peptides/0.0.1'}>, 25.2224812642123, 1511, [('help', b'_snpeff'), ('help', b'snpeff'), ('name', b'snpeff'), ('help', b'effects'), ('help', b'eff')])
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff_databases/4.3+T.galaxy2'}>, 25.031613016029844, 1223, [('help', b'snpeff'), ('name', b'snpeff'), ('help', b'eff')])

vs

10.0 name/desc → 19

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), old_id=BM25F(old_id_B=1.0), name=BM25F(name_B=10.0), section=BM25F(section_B=1.0), description=BM25F(description_B=10.0), labels=BM25F(labels_B=1.0), stub=BM25F(stub_B=1.0), help=BM25F(help_B=1.0))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.90)).parse('*snpeff eff*'.lower()), limit=40, terms=True): print((hit, hit.score, hit.docnum, hit.matched_terms()))
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/snpeff_sars_cov_2/snpeff_sars_cov_2/4.5covid19'}>, 22.617891323807093, 448, [('help', b'snpeff'), ('name', b'eff'), ('name', b'snpeff'), ('help', b'effect'), ('help', b'effects')])
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff/4.3+T.galaxy1'}>, 19.91142942918779, 832, [('help', b'snpeff'), ('name', b'eff'), ('name', b'snpeff'), ('help', b'effect'), ('help', b'effects'), ('help', b'eff')])

edit: sorry, had an old help boost.

Or the query "select lines that match an expression"

0.1/0.1 → Grep1 = 40.0, 1st place

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), old_id=BM25F(old_id_B=1.0), name=BM25F(name_B=0.1), section=BM25F(section_B=1.0), description=BM25F(description_B=0.1), labels=BM25F(labels_B=1.0), stub=BM25F(stub_B=1.0), help=BM25F(help_B=1.0))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.90)).parse('*select lines that match an expression*'.lower()), limit=40, terms=True): print((hit, hit.score, hit.docnum, hit.matched_terms()))
(<Hit {'id': 'Grep1'}>, 40.085088543825, 621, [('help', b'match'), ('help', b'lines'), ('help', b'expression'), ('description', b'expression'), ('description', b'lines'), ('description', b'match'), ('name', b'select'), ('help', b'select')])

40/40 → Grep1 = 16, 2nd place

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), old_id=BM25F(old_id_B=1.0), name=BM25F(name_B=40.0), section=BM25F(section_B=1.0), description=BM25F(description_B=40.0), labels=BM25F(labels_B=1.0), stub=BM25F(stub_B=1.0), help=BM25F(help_B=1.0))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.90)).parse('*select lines that match an expression*'.lower()), limit=40, terms=True): print((hit, hit.score, hit.docnum, hit.matched_terms()))
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_grep_tool/1.1.1'}>, 18.069423636987338, 1181, [('help', b'match'), ('help', b'lines'), ('help', b'expressions'), ('help', b'expression'), ('help', b'select')])
(<Hit {'id': 'Grep1'}>, 16.10481479336669, 621, [('help', b'match'), ('help', b'lines'), ('help', b'expression'), ('description', b'expression'), ('description', b'lines'), ('description', b'match'), ('name', b'select'), ('help', b'select')])

@mvdbeek did you have any more information about what that issue was with whoosh?

So we deployed the new boosts on eu, to see how those work. I.... think they're a huge improvement? I was discussing with @shiltemann and her test query was 'group', expecting the full match of Grouping1 to be found. We need some way to rank by "this term or terms constitutes the entire name field", but I'm not sure how we'd accomplish that given that we currently break into individual words :/
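
A minimal sketch of one way to do that, as a post-processing step on the hit list rather than inside the index: hits whose whole name equals the query are moved to the front, and everything else keeps its score order. `rerank_exact_name_first` is a hypothetical helper, and the names and scores in the example are assumed for illustration (including that Grouping1's name is "Group").

```python
def rerank_exact_name_first(query, hits):
    """Sort hits so that exact full-name matches come first, then by score.

    hits is a list of dicts with 'id', 'name' and 'score' keys.
    """
    wanted = " ".join(query.lower().split())

    def sort_key(hit):
        name = " ".join(hit["name"].lower().split())
        # False sorts before True, so exact matches lead; higher scores next.
        return (name != wanted, -hit["score"])

    return sorted(hits, key=sort_key)


hits = [
    {"id": "some_grouping_tool", "name": "Group by column", "score": 9.0},
    {"id": "Grouping1", "name": "Group", "score": 4.0},
]
print([hit["id"] for hit in rerank_exact_name_first("group", hits)])
```

Because the comparison uses the stored name as a whole, it does not depend on how the index tokenizes the field.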

@bgruening provides 'tail-to-head' which doesn't return useful things (but don't know about before.) and same for tail

@wm75 provides

The only exception I found so far is mimodd vcf, which only returns general vcf stuff as top hits. Strangely, reversing the word order to vcf mimodd does much better.
