Synapse: Unbounded increases in GC time

Created on 30 Mar 2020  路  23Comments  路  Source: matrix-org/synapse

Description

After updating to 1.12.0, average event send times have gone up. Metrics are showing GC time going up a lot more than before, around 1 second per day. Memory usage hasn't changed significantly though. @lub:imninja.net's metrics showed the same problem when I asked in #synapse:matrix.org.

The problem did not happen with the 1.12 release candidate, which leads me to believe it could have something to do with the Twisted update.

image
image
image

Version information

  • Version: 1.12.0 (https://mau.dev/maunium/synapse)
  • Install method: Docker
  • Platform: buster-slim, tested both with and without jemalloc
performance upstream-bug

All 23 comments

image

Although I don't see such a clear difference in other metrics, but maybe I just have less load

image

I use matrixdotorg/synapse:latest. Also had the rc1 on release day, but it seems it didn't cause that behaviour

I tried downgrading to the previous Twisted version, but that didn't seem to help, so I guess it wasn't that after all.


New graphs

image
image
image

For comparison, here are 30 day graphs from before 1.12. The GC time still went up and didn't come down, but the growth was much slower:


Graphs from before 1.12

image
image
image

It's hard to see why GC time would increase when memory usage and the number of objects being collected stays the same.

The only change in 1.12.0 which is vaguely related to GC is #6953. Could you patch out that change and see if it helps at all?

After something like a week of uptime synapse managed to use all the RAM on my server. A restart later (a little after 12:00) all was smooth again. I can provide more precise information if needed.

image

image

image

I am the only user on my server and I am in a few big rooms. Also I only started sending messages today, it was on #synapse:matrix.org

I think I reverted it in https://mau.dev/maunium/synapse, but still seems to grow quickly:

image

@tulir yeah; it was a long shot tbh. I have no real idea what could cause this. Are you sure it didn't happen with 1.12.0rc1 ?

It feels like something has changed outside of synapse. I don't suppose you've had a python version change or something?

I guess it could just reflect a different traffic pattern from your clients/federated servers.

I am seeing similar issues, at first I thought it might have been related to open file handles and/or use_presence: true (now false) in our configuration but those only seemed to help a bit. The only way around this is periodically restarting matrix-synapse.

I was starting to suspect maybe the version of Python we are running may not be good for the Twisted update (we are on Debian 9 with Python v3.5.3 and matrix-synapse-py3 on 1.12.3).

Unfortunately, we weren't scraping Prometheus metrics until we got onto the 1.12.x branch so we cannot compare GC collection time from a prior release.

So, my gc times are still growing linearly:

image

It seems mostly the federation reader and federation sender are affected. My caches are already full and are not growing anymore:

image

But it seems like memory usage for the federation workers is still climbing:

image

This may suggest, that there is a resource leak regarding DNS or connections or similar. Could of course be something else, but I would expect the memory usage of the federation sender to flatten off after a few days. In any case, the GC times shouldn't grow linearly over time, which they do at the moment.

as I've said previously, I don't think this is a regression: it's more likely that there was some change in the federation which coincided with the upgrade and which made synapse want to chew more memory.

We're not seeing any sort of leaking on the servers we administer, so I guess the issues being seen here are either due to misconfigured caches, or due to some strange interaction of environment/configuration.

It is certainly true that synapse can be a memory hog, and it's a bit of a black-box in terms of what it is doing. Making the caching more intelligent is on the medium-term roadmap, and as part of that work I'd like to add some analytics which reveal how much memory is being used in caches (and how much elsewhere).

However, in the shorter term, it's hard for us to do much to investigate individual cases of unexpectedly high memory usage. The only thing I can really suggest is that folks have a play with python memory-profiling tools to see if you can find out where the memory is being used.

@richvdh How often are you restarting your federation workers? In my case I can simply plot a line through my increased GC times and memory usage for my federation workers. This is while cache size is up to its max value after a few hours, the GC increases are over the period of a few days. A week is pretty much the limit, since at some point my server simply runs out of memory. I can work around that by restarting my synapse/workers every few days. Of the people I've asked, about half of them see the same pattern. Some don't, but many of them don't even have metrics enabled. So while this may not affect everyone, it does at least affect a few people with very different configurations.

I also wouldn't jump to this being cache related. As I said, my caches don't grow after a certain size, so this can be any kind of resource leak or it is an unbounded cache. I'll see, if I can figure something out with a memory profiler at some point. I'm not sure, if this is a regression or has only become noticeable recently, but unbounded increase in memory use and GC times does sound like a bug to me.

@richvdh How often are you restarting your federation workers?

we have no problems with running them for up to a couple of weeks. There's no particular change in GC time over this period.

If its limited to federation senders, it could be #7828 which has a fix in the forthcoming v1.20.0 release.

If its limited to federation senders, it could be #7828 which has a fix in the forthcoming v1.20.0 release.

Yeah, I think I linked to a similar issue before. We'll see, the graphs don't match up exactly, but both are growing, that is true.

If you can run you workers for a few weeks, then it really seems to be system specific. Usually synapse updates tend to restart my workers often enough, that memory use doesn't become an issue, so I thought that may mask the effects.

I'm fairly certain now, that #7828 is unrelated, after running the 1.20 rcs for a while now (unless the fix didn't actually land yet):

grafik

Didn't have the time to profile yet though.

I think I'm actually seeing something similar to this on my personal HS:

image

I'll have a poke about to see if I can figure out what's going on.

I've found some interesting things:

Firstly, as time goes by, there is an accumulation of objects being tracked by the garbage collector, as returned by len(gc.get_objects()):

image

This is consistent with GC taking more and more time, as the GC must inspect each of those objects on each gen2 garbage collection.

The majority of these objects are tuples:

>>> objgraph.show_most_common_types(5,gc.get_objects())
tuple       879076
dict        73675
set         62514
list        60363
ProfileInfo 43990

And more interestingly still, objgraph fails to find any reference to most of those tuples:

>>> unreffed = objgraph.get_leaking_objects()
>>> sum(1 for t in unreffed if isinstance(t, tuple))
844957

Of those leaking tuples, they are almost all 1- or 2-tuples:

>>> collections.Counter(len(t) for t in unreffed if isinstance(t, tuple))
Counter({1: 531264, 2: 313618, 3: 30, 4: 26, 5: 9, 7: 3, 6: 2, 8: 1, 9: 1, 11: 1, 13: 1, 15: 1})

the 1-tuples almost all contain a zope.interface.declarations.Implements:

>>> onetups = [t for t in unreffed if isinstance(t, tuple) and len(t) == 1]
>>> implonetups = [t for t in onetups if isinstance(t[0], zope.interface.declarations.Implements)]
>>> len(implonetups)
531032

and the 2-tuples are almost all (zope.interface.interface.InterfaceClass, zope.interface.declarations.Implements) pairs:

>>> impltwotups = [
...     t
...     for t in twotups
...     if isinstance(t[0], zope.interface.interface.InterfaceClass)
...     and isinstance(t[1], zope.interface.declarations.Implements)
... ]
>>> len(impltwotups)
313596

so, to recap, of the 1.4M objects being tracked by the garbage collector in my synapse process, 845000 of them (60%) are tuples containing zope.interface.declarations.Implements objects which appear to be unreferenced by any other python object. This suggests some sort of leak in the C parts of zope.interface, though it's also possible that we or Twisted are misusing zope in some way that causes this.

It's maybe a little early to say, but disabling zope's C extensions (by moving _zope_interface_coptimizations.cpython-35m-x86_64-linux-gnu.so aside) seems to stop the leak, or at least significantly change the behaviour:

About 15 hours after restart:

>>> len(gc.get_objects())
540302
>>> unreffed = objgraph.get_leaking_objects()
>>> objgraph.show_most_common_types(5,unreffed)
dict   1696
list   949
CType  103
tuple  95
method 46

Next step is to see if I can repro this behaviour outside of synapse, I think.

Downgrading to zope.interface<5 also seems to solve it. The release date of v5 and this issue starting match pretty much exactly

image

@tulir aha, seems we have a culprit then. Will continue investigating and raise an issue with zope once I have more details.

Looks like zope released a 5.1.1 with the fix. Do we want to do anything about this? (Blacklist versions or anything?) Or is this "fixed" from our POV?

I'm just running with updated zope to confirm. Assuming it works I don't think we need to go as far as blacklisting versions; the most we might want to do is suggest in a changelog or somewhere that people upgrade.

Yup looks good to me. Thanks to @jamadden for jumping on this from the zope side.

Thanks @richvdh and everyone here for finding and reporting this! Apologies for the trouble it caused.

Was this page helpful?
0 / 5 - 0 ratings