Redash: Memory leaking from version 7?

Created on 20 Aug 2019  路  9Comments  路  Source: getredash/redash

Issue Summary

After upgrading from redash5 to redash7, we notice redash webserver is leaking memory linear compared to redash5.

Steps to Reproduce

Deployed redash7 in kubernetes, watching metrics.

Technical details:

See https://github.com/helm/charts/pull/5071#issuecomment-522905076 for my upgrade notes to existing merge request about a public redash helm chart for kubernetes.

Last hours:
2019-08-20-09-29-27-window-select

Last 30 days: (you can see when we did the upgrade at 6. august. The restarts is basically pods being restarting due to kubernetes evicting em to memory usage, using more then requested etc)

2019-08-20-10-23-56-window-select

Most helpful comment

That doesn't answer the graphs memory usage.

Sure. I just mentioned that to give context to why we implemented the restart...

Thank you for the tips on how to hunt this down. We will probably wait with it until the Python 3 migration is done and some other ongoing things, but definitely would like to address it.

All 9 comments

Oh, I just noticed that https://github.com/getredash/redash/commit/f1651688601c191ca09012b9029fc59ff85d0870 is already commited to master.

I'll try the latest docker image and redeploy our redash7 to see if setting max requests and max request jitter has an effect. Even if it fixes the problem, I would say redash7 still has a memory leak issue.

But this is a "decent" work around for now, I'll post back how it goes.

Hm, https://hub.docker.com/r/redash/redash/tags doesn't seem to have any newer builds for 7.x that contains https://github.com/getredash/redash/commit/f1651688601c191ca09012b9029fc59ff85d0870 as I understand it.

Is there anything special that is required for this upgrade without maybe database migrations?
I see nothing in https://version.redash.io/ or https://github.com/getredash/redash/blob/master/CHANGELOG.md for redash8

Feature for max requests and max request jitter comes from https://github.com/getredash/redash/pull/4013

2019-08-21-08-17-46-window-select

This seems to be a work around for the memory leak at least.

Last 30 days: (you can see when we did the upgrade at 6. august.

Was it the only change that happened around this time?

Btw, for reference, Gunicorn can accept additional command line options via an en var:

Settings can be specified by using environment variable GUNICORN_CMD_ARGS. All available command line arguments can be used. For example, to specify the bind address and number of workers:

$ GUNICORN_CMD_ARGS="--bind=127.0.0.1 --workers=3" gunicorn app:app

So you didn't have to upgrade to benefit from the new options.

@arikfr : That could be that I didn't need to upgrade to get the GUNICORN_CMD_ARGS, but what I found at the time was https://github.com/getredash/redash/commit/f1651688601c191ca09012b9029fc59ff85d0870 .

And having workers restarting clearly indicates something is leaking somewhere. Where, I don't know.

We are happy enough with having workers being restarted.

In regards if there was other things that happened around this time, only thing I can say is that
there probably were a few "start migration over again" because I did a silly mistake or two while upgrading, but the end results is what matters here, clearly shows pods increasing in memory usage until they were killed by out of memory.

Graphs are directly from our production environment.

ATM we are on some alpha build of redash8 I think which had a docker image with https://github.com/getredash/redash/commit/f1651688601c191ca09012b9029fc59ff85d0870 in it, and it has been working perfectly. So much gracious for a great product!

And having workers restarting clearly indicates something is leaking somewhere.

The reason we added auto restarts was because Python doesn't release memory. As query results are loaded into memory before being served, if a large query result was served it might cause the memory usage to go up.

@arikfr That doesn't answer the graphs memory usage. Earlier versions still had to load result sets into memory before serving em, and we didn't see the growing of memory usage as the application lived on in redash5. So yes, python doesn't release the memory, but that due to it's "garbage collector" cannot do its job I believe.

I suspect there is something holding references, so refcounting never gets cleared for the "garbage collector" to kick in and do its job. Circular references comes to mind. ( https://stackoverflow.com/questions/2428301/should-i-worry-about-circular-references-in-python )

Best way to hunt down the cause would be by bisecting between the two major versions, but you would need to make a reproducible test first.

sys.getrefcount(a) can be a tool in the search for it maybe.

The one who wants to devote time into finding exactly the cause, is a super hero! This work is both tricky and cumbersome.

For other finding this issue, restarting the gunicorn workers is a work-around working just fine.

(Again, Thanks for a superb product)

That doesn't answer the graphs memory usage.

Sure. I just mentioned that to give context to why we implemented the restart...

Thank you for the tips on how to hunt this down. We will probably wait with it until the Python 3 migration is done and some other ongoing things, but definitely would like to address it.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

dclong picture dclong  路  4Comments

susodapop picture susodapop  路  3Comments

koooge picture koooge  路  3Comments

Leedorado picture Leedorado  路  3Comments

arikfr picture arikfr  路  3Comments