Incubator-superset: Filter Box Caching Incorrectly for Multi-Query Use Case

Created on 6 Jun 2019 · 19Comments · Source: apache/incubator-superset

A filter box that generates multiple queries (multiple filter controls) does not generate the correct cache entries for each filter control. It appears that only 1 (the first) query gets cached correctly. A correct cache entry should have a single query and associated results. What I see for most filter box cache entries are multiple queries from the filter box. So when the filter box gets reloaded, it only finds one of the cached queries and has cache misses for the rest. In each successive reload, the cache misses decrease by one. I suspect only the first query is generating the correct cache entry.

I am thinking the issue might be related to 'class FilterBoxViz' and/or 'run_extra_queries'

Expected results

If I have a filter box with 16 filter controls, then a reload of the page within the cache TTL should generate 16 cache hits.

Actual results

Only 1 cache hit and 15 cache misses. I see 15 more logs like
2019-06-06 14:09:09,893:INFO:root:Caching 1180 chars at key c4dceda1ab0aef7047ef06ebb6063d88

Screenshots

Here is my filter box (16 controls)
Screen Shot 2019-06-06 at 9 59 17 AM

How to reproduce the bug

Set logging to INFO. I am using REDIS cache but probably any cache would do.

Load a dashboard containing a filter box with more than 1 filter control. Note occurrences of
Caching xxxx chars at key yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
Reload the page and note occurrences of "Caching abcd chars at key ..." and occurrences of "Serving from cache"
Note that there is only 1 cache hit for the filter box and that superset generated different cache keys then it did in step 1
Inspect the cache for the cache keys observed in step 1 and note that they contain multiple queries...they should contain a single query.

Environment

(please complete the following information):

superset version: 0.26.3
python version: 2.7
node.js version: not installed
npm version: `not installed

Checklist

Make sure these boxes are checked before submitting your issue - thank you!

[x] I have checked the superset logs for python stacktraces and included it here as text if there are any.
[x ] I have reproduced the issue with at least the latest released version of superset.
[x] I have checked the issue tracker for the same issue and I haven't found one similar.

Additional context

Add any other context about the problem here.

#bug

Source

smokemonster99

All 19 comments

Issue-Label Bot is automatically applying the label #bug to this issue, with a confidence of 0.95. Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback!

Links: app homepage, dashboard and code for this bot.

issue-label-bot[bot] on 6 Jun 2019

👍1

Took a quick 👀 at the code and things look ok. 0.26.3 is old, I'm guessing this is addressed on master, can you or someone else confirm/infirm on master

mistercrunch on 7 Jun 2019

We're looking to upgrade to latest in next few weeks, I will be checking it then and will reply back.

smokemonster99 on 7 Jun 2019

I have upgraded to 0.33.rc1 in my test environment and should be able to test today

smokemonster99 on 21 Jun 2019

@mistercrunch I have verified the same issue exists, this bug seriously impacts performance when dashboards use filter boxes with multiple queries.

smokemonster99 on 21 Jun 2019

I can take a look at this.

villebro on 23 Jun 2019

👍1

Hmm, I was unable to replicate the bug with three filters on master. Can you rebuild a Filter Box from scratch on a recent build and verify that the same bug still persists?

villebro on 26 Jun 2019

Sure, I reproduced it on v33.0. I have a filter box with 8 filters then i have 5 charts on the same dashboard. I will get you logs shortly. I wonder if you need more filters to reproduce it...

smokemonster99 on 26 Jun 2019

Here are 3 log files.
first-db-load-zero-loaded-from-cache.txt
This one is logs from a fresh load of the dashboard with no items in cache. You see 13 'loaded_from_source' as expected.

2nd-db-load-6fromcache-7fromsource.txt
This file is the next dashboard load and here you see 6 items loaded from cache and 7 from source. All the items should have loaded from cache.

3rd-db-load-7fromcache-6fromsource.txt
This next dashboard load you see 7 items loaded from cache and 6 from source.

If I was to continue then the next load would serve 8 from cache. Again I have a dashboard with 8 filter queries and 5 charts. I believe it is the filter box queries generating different cache keys instead of same cache key. I am not setting any filter values just reloading the db.

Dashboard image
Screen Shot 2019-06-26 at 4 27 07 PM

smokemonster99 on 26 Jun 2019

Thanks @smokemonster99 will take a look.

villebro on 29 Jun 2019

Ok I think I see the problem. You've defined your filter box to for the last two hours without time grain. This gets baked into the underlying query, which is in practice accurate to the second, hence the once per second changing cache key. The time filter and granularity should answer the question: for what time period and to which granularity should the retrieved values in the dropdowns be set? Default values are a dashboard property, and are set there (you choose the values, then save the dashboard, after which the chosen filters are the defaults).

villebro on 29 Jun 2019

On second thought that's not it. If you have a dev box where you can easily change the code, adding some extra logging in the following place and collecting a new round of logs should show exactly how those keys are generated:

diff --git a/superset/viz.py b/superset/viz.py
index efbf892c..49989c67 100644
--- a/superset/viz.py
+++ b/superset/viz.py
@@ -357,6 +357,7 @@ class BaseViz(object):
         cache_dict['time_range'] = self.form_data.get('time_range')
         cache_dict['datasource'] = self.datasource.uid
         json_data = self.json_dumps(cache_dict, sort_keys=True)
+        logging.info(json_data)
         return hashlib.md5(json_data.encode('utf-8')).hexdigest()

     def get_payload(self, query_obj=None):

villebro on 29 Jun 2019

v33-filterbox-cache-issuewithdebug.txt

I made the change and performed same test. It is all in 1 file but search for 'NOTE' to see my notes. As I noticed before I see multiple queries (SELECT statements) in a single cache key which might point to the issue in some way, not sure.

To save you time, you can focus on cache key 6910e9e73779023aa4dbcd4e23138aec which got generated on the first reload.
thanks!

smokemonster99 on 1 Jul 2019

I'm trying to understand why there are prequeries here, that seems wrong.

mistercrunch on 2 Jul 2019

Hi @villebro do you need me to perform any further tests on this? What is next step? Thanks for looking at this!

smokemonster99 on 16 Jul 2019

I noticed the cache keys with prequeries were the ones with later cache misses. I removed 'prequeries' from the cache key inside method cache_key using "del cache_dict['prequeries']". Now I get the desired behavior caching wise...I get 13 cache hits on first reload (and no cache misses)

But I am not sure how or why prequeries are getting set and if this hacky change could cause issues with caching of non-filter box charts. Thoughts?

smokemonster99 on 17 Jul 2019

Hmm, I am not familiar with prequeries, and I'm having a very hard time understanding how they are created in this case. But tit seems like the query for the first dropdown is being added as a prequery to some of the following dropdowns' query_objs. I will keep looking, but if anyone can add context that would be helpful.

villebro on 18 Jul 2019

I think I was able to replicate, will investigate.

villebro on 18 Jul 2019

👍1

Very interesting regression introduced by #4163. I have a working fix, will try to put through a PR soon.

villebro on 18 Jul 2019

Was this page helpful?

0 / 5 - 0 ratings