Element-web: Riot/Web locks Chrome 80 solid for ~40s with ~12 cores of kernel

Created on 19 Feb 2020 · 27 Comments · Source: vector-im/element-web

Every hour or so I'm getting spectacular deadlocks on my account, which beachballs Chrome 80 on Catalina for 30-40s at a time. JS heap is ~300MB, but it looks like something may be leaking sprites? This began ~1 week ago.

Hitting the box with System Trace in Instruments (given Catalina has broken dtrace...) shows that Chrome is thrashing away doing dispatches which turn into kevents for managing a CoreImage cache?

Screenshot 2020-02-13 at 16 54 04

Screenshot 2020-02-18 at 19 48 30

Labels: bug, p1, performance

All 27 comments

Looks like a DOM leak - I have 127,134 detached div elements hanging around.

They look like:
image

I took a (1.5GB) heap dump.

The unreleased https://github.com/facebook/react/pull/17666 looks like it might be related, so i'm trying on react@next.

The multi-minute pauses were driving me nuts, so I went on a leak mission with help from @t3chguy and @jryans. Conclusions are:

we have leaks in:

  • [x] selectionchange in BasicMessageComposer (which leaks the whole DOM tree whenever you change rooms, due to a race between focus/blur - see the sketch after this list)
  • [x] pillifyLinks (which leaks the whole DOM tree whenever you render a pill in the timeline, via BaseAvatar's onClientSync for the avatar in the pill)
  • velociraptor (for read receipts / RRs) (perhaps, although it might just be for RRs in flight)
  • ...and possibly everywhere we do a ratelimitedfunction or timed promise (at least until the timeout fires, although this might have been me failing to GC). I specifically noticed _updateE2eStatus in RoomTile and the lazy-loading (LL) _membersPromise in js-sdk.
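
As a concrete illustration of the first item - the names and details here are a sketch rather than the actual matrix-react-sdk code - the selectionchange failure mode is roughly: a document-level listener is added when the composer gains focus, and if it isn't removed with the same function reference on blur/unmount (or a focus/blur race re-adds it after removal), document keeps the handler alive, the handler keeps this alive, and this keeps the whole unmounted DOM tree of the old room alive.

    // Illustrative sketch only, not the real BasicMessageComposer.
    import React from 'react';

    class Composer extends React.Component {
        componentDidMount() {
            // 'selectionchange' only fires on document, so the listener has
            // to live there rather than on the component's own node.
            document.addEventListener('selectionchange', this._onSelectionChange);
        }

        componentWillUnmount() {
            // If this removal is skipped (or raced by focus/blur), `document`
            // retains the handler, the handler retains `this`, and `this`
            // retains the whole detached React/DOM tree for the old room.
            document.removeEventListener('selectionchange', this._onSelectionChange);
        }

        _onSelectionChange = () => {
            this.setState({ selection: document.getSelection().toString() });
        };

        render() {
            return <div contentEditable={true} />;
        }
    }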

This looks to be a dup of https://github.com/vector-im/riot-web/issues/10890.

This probably deserves a writeup similar to https://blog.discordapp.com/discord-react-memory-leak-565c89763e8 to explain the methodology, so that future readers can avoid the pain of working out how to drive the dev tools to find the leak. My notes are:

  • take heap snapshots using the memory profiler.
  • pick something which really shouldn't ever leak (e.g. MessageComposer or RightPanel) - not detached HTMLDivElements
  • pick something which isn't leaking much (leaking 250K detached divs is hard to track, but 8 messagecomposers is much easier to reason about)
  • spot the granularity of what's leaking by looking at other objects which leak a similar number (e.g. 8 right panels, 8 messagecomposers, etc)
  • perhaps check when the leaks happen, using the allocation timeline mode of the memory profiler.
  • use roughly consistent repro steps (view the same series of rooms each time, in this instance)
  • hit the GC button before you capture a heap (is this really needed?)
  • look at the various top level retain chains for each leaked object, not just the first one (which by default will be the shortest path to the GC root, but may not be the most informative)
  • look at the reactInternalFiber objects in the retain graph ("reveal in summary window") to figure out precisely which React nodes are retaining

    • for instance, it turns out that the BaseAvatars retaining the composer are consistently the ones for @thelink:mozilla.org, @haxen:matrix.org and me.

  • look at the __reactInternalInstance of the detached DIV in the retain graph

    • its outerText is: "14:36 TheLink: https://github.com/vector-im/riot-web/issues/12475 Layout problems in german i18n · Issue #12475 · vector-im/riot-web - GitHubTruncated buttons and overflowing labels."

    • this tells us which div has leaked; it's the BaseAvatar in the pill. Check where this pill comes from, and discover pillifyLinks manually messing with the DOM => leak (see the sketch after these notes).

  • comment out any dodgy-looking retainers in the code (e.g. promises which claim to be holding onto the object). These could be real leaks, or could be side-effects of the actual leak (i.e. legit stuff which would get cleaned up if the real leak wasn't leaking).

    • all our RateLimitFuncs might be leaking however, if they're not cleaned up properly on unmount?

    • likewise promises for things like LL loading?

  • when reading the retain chains, keep a close eye out for recurring bits of non-React code.
  • by elimination, find the real leak (in this instance, BasicMessageComposer's _onSelectionChange being leaked, holding onto a this which in turn retains the whole React-rendered DOM, due to sloppy handling of the selectionchange event listener).
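
To make the pillifyLinks mechanism above concrete - again a rough sketch with illustrative names, not the actual implementation - the pattern is rendering a React subtree into a hand-made DOM node with ReactDOM.render; React then has no idea when the surrounding timeline tile is torn down, so unless something explicitly calls ReactDOM.unmountComponentAtNode, the pill subtree - and anything it subscribes to, such as the avatar's client 'sync' listener - keeps the whole detached tree reachable.

    // Illustrative sketch only, not the real pillifyLinks.
    import React from 'react';
    import ReactDOM from 'react-dom';

    function pillifyLink(anchorNode, PillComponent, props) {
        const container = document.createElement('span');
        anchorNode.replaceWith(container);

        // React now owns `container`, but nothing owns this React root: when
        // the timeline tile is later removed from the DOM, the root is never
        // unmounted, so the pill (plus any listeners it registered, e.g. an
        // avatar's 'sync' handler on the client) retains the detached tree.
        ReactDOM.render(React.createElement(PillComponent, props), container);

        // The fix is to keep hold of the container and unmount it when the
        // tile goes away:
        return () => ReactDOM.unmountComponentAtNode(container);
    }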

stuff which didn't work:

  • moving to react@next (although it might have helped debugging a bit by reducing the number of spurious references between fiber nodes, thanks to stuff like https://github.com/facebook/react/pull/16115 and https://github.com/facebook/react/pull/17666)
  • inspecting the inspector (it's probably easier to write a script to walk a heap dump than to script Chrome dev tools to do it - see the sketch below)
  • using React Dev Tools: https://github.com/facebook/react/issues/16138 or similar meant that the dev tools' own references made it even harder to read the retainer graphs.
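
Expanding on the "write a script to walk a heap dump" idea: here's a minimal sketch of what such a script might look like, assuming Node.js and Chrome's .heapsnapshot JSON format. It reads the field layout from the snapshot's own meta section, counts objects by constructor name, and prints the most numerous ones. For a 1.5GB dump you'd want a streaming JSON parser rather than readFileSync, and error handling is omitted.

    // count-constructors.js - rough sketch, not battle-tested.
    const fs = require('fs');

    const snap = JSON.parse(fs.readFileSync(process.argv[2], 'utf8'));
    const fields = snap.snapshot.meta.node_fields;             // e.g. ['type', 'name', 'id', ...]
    const stride = fields.length;                               // numbers per node in the flat array
    const typeIdx = fields.indexOf('type');
    const nameIdx = fields.indexOf('name');
    const typeNames = snap.snapshot.meta.node_types[typeIdx];  // enum for the 'type' field

    const counts = new Map();
    for (let i = 0; i < snap.nodes.length; i += stride) {
        const type = typeNames[snap.nodes[i + typeIdx]];
        // 'object' covers JS instances (RightPanel, MessageComposer, ...);
        // 'native' covers DOM wrappers such as detached div elements.
        if (type !== 'object' && type !== 'native') continue;
        const name = snap.strings[snap.nodes[i + nameIdx]];
        counts.set(name, (counts.get(name) || 0) + 1);
    }

    // Print the 20 most numerous constructors.
    [...counts.entries()]
        .sort((a, b) => b[1] - a[1])
        .slice(0, 20)
        .forEach(([name, n]) => console.log(`${n}\t${name}`));

Running it against snapshots taken before and after the repro steps, and diffing the counts, is a cheap way to spot which constructors are growing.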

@turt2live why is riot/web locking Chrome for a minute every few hours not a fire?

As far as I know it's only affecting 1 person and not widespread. It doesn't sound like a drop-everything-and-fix-it issue, but definitely one that should be p1 for when the maintainer role is out of fires to put out.

It was also missing from the board, and was presumed not to be a fire given it hadn't been tracked fully.

This should be affecting anyone with a relatively big account. I suspect it's got worse recently, either due to a Chrome upgrade (such that it now takes out the whole browser rather than just killing the tab), or possibly a recent version of React 16 making a retained full DOM tree have a larger footprint. I suspect it's also the root cause of some of the symptoms described at https://github.com/vector-im/riot-web/issues/10890.

So yes, i'm afraid that crashing my browser every few hours counts as a drop everything and fix it, which is why I dropped everything and fixed it.

The answer to it being missing from the board is to go put it on the board rather than unhelpfully untag it.

As someone with an updated Chrome and a large account, I can't say I've seen this as frequently as you've described.

(let the record show that I failed to read key parts of information which do in fact make this a fire rather than arbitrary performance issues)

thanks.

https://github.com/matrix-org/matrix-react-sdk/pull/4111 fixes the two main leaks, but leaves Velociraptor alone. This seems to result in 3x additional DOM trees being leaked, although it's not immediately clear from the snapshot that it's Velociraptor which is retaining them. Empirically commenting out Velociraptor in RR's render method makes them go away.

Leaving this open until it's confirmed that 4111 fixes the main problem.

The Velociraptor leaks tend to look like this (when they show up), ftr:

Screenshot 2020-02-23 at 00 51 55

Worth trying to update velocity? Looks like you guys forked it in 2015 and haven't updated it since.

hm, i thought we went back to mainline in matrix-org/matrix-react-sdk#2808.

i'm only seeing a few trees leaked from velocity atm (whereas previously they were leaking linearly over time), plus the plan as per https://github.com/vector-im/riot-web/issues/10890#issuecomment-543686594 is to remove it entirely. So not sure it's worth trying to patch up velocity further right now.

well, good news: with velociraptor commented out, i'm fairly reliably sitting with one or two messagecomposers (with the 2nd typically being linked to some promise/debounce/ratelimit which is hanging around, or possibly fiber's backbuffer).

bad news: with it enabled it's hovering between 4 and 6 leaked trees. not the end of the world, but annoying.

So, I've had one instance of Chrome deadlocking since this while running riot.im/develop. Unfortunately, on snapshotting afterwards I found 10 leaked DOM trees, all with a tree like this:

Screenshot 2020-02-24 at 22 41 40

The line of MessageComposer which retains is the ref of:

    <SendMessageComposer
        ref={(c) => this.messageComposerInput = c}

...but I don't think this is necessarily a leak (although I do recall something about refs leaking).

I wasn't seeing this at all on a local branch with Velociraptor commented out, so I think it may be related - but tracking it here.

I suggest we keep this open (but de-fired) to gather more datapoints on tree leaks.

just had another; lasted 2 minutes:

Screenshot 2020-02-25 at 09 39 06

This time with 9 DOM trees leaked (a mix of velocity and the retain failure mode from the previous comment), but only 1,200 detached divs (as opposed to the 250K-odd I've seen in the past).

The Chrome task manager claims the tab is using 950MB of RAM total, and 280MB of GPU RAM - which seems relatively high but not crazy.

So I think whatever the problem is, it's still happening (although I am very tempted to suggest we take the time to remove velociraptor, otherwise debugging stuff like this is a nightmare).

...and again. feels like it's happening about as often as before the mem leak fixes \o/

...and again. dmesg has:

X86PlatformPlugin fSMC_PLimitNotification
X86PlatformPlugin fSMC_PLimitNotification
X86PlatformPlugin::evaluateHardLimit calling cgDispatchCallback
X86PlatformPlugin::evaluateHardLimit calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit calling cgDispatchCallback
X86PlatformPlugin::evaluateHardLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateHardLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateHardLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateHardLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
[IGFB][INFO   ] [Flip_Transaction] Disabling Render Compression
[IGFB][INFO   ] [Flip_Transaction] CCS Param update: AuxOffset:0x0 AuxStride:0x0
X86PlatformPlugin fSMC_PLimitNotification
X86PlatformPlugin fSMC_PLimitNotification
X86PlatformPlugin::evaluateHardLimit calling cgDispatchCallback
X86PlatformPlugin::evaluateHardLimit calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit calling cgDispatchCallback
X86PlatformPlugin::evaluateHardLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateHardLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateHardLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateHardLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback

...which stopped once it recovered. This sounds like the Intel GPU cocking up.

going to try resetting SMC :|

Didn't help. 3 or 4 times a day I spend my life staring at this. I'm not even sure it's riot/web at this point :/

Screenshot 2020-02-28 at 11 40 50

What happens if you spend your day in not-Chrome as a test?

yup, already switched :)

...and it happened again. without any Riot/Web open.

I've switched back to Firefox, although will leave this open a bit longer just in case FF starts melting too.

also not fixed as far as I'm aware?

Yes, it's not fixed - I just had it again after absentmindedly opening Riot in Chrome for the first time in a while.

(please don't close unfixed bugs...)

I've since had it a few times on other webapps without Riot being open in Chrome at all - in other words, this is some bad interaction between Chrome, Catalina, and possibly my GPU. Therefore this whole wild goose chase really looks to have nothing to do with Riot. (Meanwhile I've switched to Firefox and/or Riot Nightly when using Riot itself).

Hello, my Mac is experiencing the same problem. Have you found a fix for it? I'm worried that it may be caused by GPU malfunction or overheating.

I think it would be best for you to file a new issue with more details on what you're seeing, as it could be something entirely different, system-specific, etc.

Almost certain it's a V8/macOS issue; running any Chromium/Electron app causes heavy power consumption and lagging, including Chrome, Chromium, Brave (all Chromium) and VSCode (Electron). Running macOS 10.15.4 (19E287) on a 2015 MBP.
