nvim --version:
NVIM v0.2.2
Build type: RelWithDebInfo
LuaJIT 2.0.4
Compilation: /usr/bin/x86_64-linux-gnu-gcc -g -O2 -fPIE -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -DDISABLE_LOG -Wdate-time -D_FORTIFY_SOURCE=2 -Wconversion -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=1 -DNVIM_UNIBI_HAS_VAR_FROM -O2 -g -DMIN_LOG_LEVEL=3 -Og -g -Wall -Wextra -pedantic -Wno-unused-parameter -Wstrict-prototypes -std=gnu99 -Wvla -fstack-protector-strong -fdiagnostics-color=auto -DINCLUDE_GENERATED_DECLARATIONS -D_GNU_SOURCE -I/build/neovim-HJnLxJ/neovim-0.2.2/build/config -I/build/neovim-HJnLxJ/neovim-0.2.2/src -I/usr/include -I/usr/include -I/usr/include -I/usr/include -I/usr/include -I/usr/include -I/usr/include -I/build/neovim-HJnLxJ/neovim-0.2.2/build/src/nvim/auto -I/build/neovim-HJnLxJ/neovim-0.2.2/build/include
Compiled by [email protected]
Vim (version: ) behaves differently? Haven't checked. Issue seems to be new with 0.2.2.
$TERM: screen-256color

Just start neovim with the following config cutdown.
nvim -u nvim-leak.vim
nvim-leak.vim
call plug#begin('~/.vim/plugged')
Plug 'vim-airline/vim-airline'
Plug 'enricobacis/vim-airline-clock'
Plug 'airblade/vim-gitgutter'
set updatetime=250
let g:gitgutter_realtime = 1
let g:gitgutter_eager = 1
call plug#end()
Leaks memory. About 1MB a second.
If either vim-airline-clock or vim-gitgutter is removed, memory is stable. With both together, memory increases uncontrollably.
Expected behavior: not leaking memory.
Are you sure this isn't a gitgutter bug? Timers need to be stopped manually if they are repeating.
If it's not reproducible in some older version of Nvim, bisecting to the problem commit will help a lot.
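Re: "Timers need to be stopped manually if they are repeating" above, a minimal sketch of what that means (the names here are illustrative, not from either plugin):
" A {'repeat': -1} timer fires forever; Nvim never stops it on its own.
function! MyTick(timer) abort
  " per-tick work goes here
endfunction
let g:my_timer = timer_start(250, 'MyTick', {'repeat': -1})
" The owner must eventually clean it up:
call timer_stop(g:my_timer)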
gitgutter by itself works ok. vim-airline-clock by itself works ok. Together something happens.
I had this setup working well in the past. Stepped away for a few months, updated to neovim 0.2.2 and :PlugUpdate to grab the latest plugins, that's when I noticed this issue.
vim-airline-clock hasn't been updated in a year. But both neovim and gitgutter have been, since Nov 2017. The combination of plugins causing a problem make me lean towards a change in neovim. But I don't know enough about how plugins interoperate to be sure.
Ah, I thought this sounded familiar. This was reported before (in chat, I think), regarding the same plugin, vim-airline-clock. https://github.com/enricobacis/vim-airline-clock/issues/3
Does this bring the memory back down?
:call timer_stopall()
:call garbagecollect()
:call timer_stopall()
immediately brings the memory back down, with no further increase.
That suggests the leak is in the plugin. Try adding some logging statements in the plugin or vim-airline itself, to see if timer_start() is being called many times. Or maybe its callback is doing something weird.
But the problem only happens if both of these plugins are active at the same time. They behave normally when they're the only ones active.
Won't both plugins claim to be working properly when they work in isolation?
Without steps to reproduce anything is possible.
It's easily reproducible.
nvim -u nvim-leak.vim
10,000 lines of plugin code doesn't point to a case that can be isolated as a Nvim bug. If timer_stopall frees the memory, there doesn't seem to be a low-level leak. Maybe Nvim isn't stopping timers (after they timeout) for some reason, but so far there's no evidence of that. Only evidence that some big plugins are creating lots of timers.
So someone needs to add logging to these plugins to see how often they create timers, and how often they stop them.
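One low-tech way to gather that data (the log path and placement are illustrative) is to paste a line like this next to each timer_start()/timer_stop() call inside the plugins:
" Appends a timestamped line every time the surrounding code runs:
call writefile(['timer_start at ' . strftime('%T')], '/tmp/timers.log', 'a')
" Then watch it with: tail -f /tmp/timers.log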
It would appear that vim-airline-clock does its timer logic here: https://github.com/enricobacis/vim-airline-clock/blob/95d452ec07b12d884b508ba8be686c0f913df6ac/autoload/airline/extensions/clock.vim#L36
It creates a single timer that calls a function every tick, forever. vim-gitgutter doesn't have "timer_start" anywhere in the codebase. Yet the two don't play nice together, for some reason.
There are plenty of other plugins that also use timer_start that appear to behave normally. I just didn't include them in the cutdown.
Could this be an interaction between these three? Seems like the 3 of them put together (across all plugins) cause problems.
What is the call to kill a single timer? I don't think this is an issue with an explosion of timers.
Rather it appears that one plugin creates a global {'repeat':-1} timer that calls a function in another plugin, which holds references to yet another plugin. Each tick of the global timer holds on to more and more memory through this chain of events.
Does a global {'repeat':-1} timer clean up resources generated in a prior tick? Killing all timers allows for a GC. How do I kill just 1 timer to prove this theory?
Rather it appears that one plugin creates a global {'repeat':-1} timer that calls a function in another plugin, which holds references to yet another plugin.
Something like that.
How do I kill just 1 timer to prove this theory?
Get the timer id. You need to modify the plugin code.
Does a global {'repeat':-1} timer clean up resources generated in a prior tick?
If the timer callback is a closure (or partial in general) it could unexpectedly hold on to resources (that the closure could use in principle, even if it actually doesn't). Otherwise it shouldn't.
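To illustrate the kind of capture meant here, a hypothetical sketch:
" The lambda closes over this function's scope, so `bigdata` stays
" reachable for as long as the timer exists, even though the callback
" never touches it.
function! s:StartPinnedTimer() abort
  let bigdata = range(1000000)
  return timer_start(250, {timer -> 0}, {'repeat': -1})
endfunction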
It could be an idea to test with g:gitgutter_realtime set to zero, as CursorHold/CursorHoldI function as a very limited timer of sorts.
Does a global {'repeat':-1} timer clean up resources generated in a prior tick?
If the timer callback is a closure (or partial in general) it could unexpectedly hold on to resources (that the closure could use in principle, even if it actually doesn't). Otherwise it shouldn't.
Even if the closure did close over other values, that garbage should be collected when the execution context used within the timer drops out of scope. I would assume that a new execution context is used for each invocation of the closure when the {'repeat':-1} option is used. Otherwise, the timer itself could pin application memory, which is never the intent of a timer.
function! airline#extensions#clock#timerfn(timer)
  call airline#update_statusline()
endfunction
let g:airline#extensions#clock#timer = timer_start(
      \ g:airline#extensions#clock#updatetime,
      \ 'airline#extensions#clock#timerfn', {'repeat': -1})
This seems pretty straightforward, and this plugin has been unchanged for a while. But I'm not a vim or vim plugin author, so my abilities here are limited. We have determined that if timers are killed memory is released. So either a timer is holding onto memory (which it shouldn't), or an extraordinary number of timers are being created (or both). Because the identical code works without an additional plugin loaded, I think it's unlikely to be a leak caused by too many timers. Rather, a combination of airline#update_statusline() generating way too much garbage and the neovim timer not letting go of it. This same ecosystem of plugins and neovim worked under neovim 0.2.0 and possibly 0.2.1.
It would be great if someone with experience on the neovim team could take a few minutes and do some of this triage for me, and either identify the issue with timers holding plugin memory, or help point the finger to a particular plugin, so I can chase that author down.
Absent that, could someone point me to documentation on the timer_start() call, the call to kill an individual timer, and whichever call logs to somewhere.
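For reference, the calls in question (the function names are real; see :help timer_start. The rest of this sketch is illustrative):
" timer_start({ms}, {callback} [, {opts}]) returns a timer id.
let s:id = timer_start(1000, 'MyTick', {'repeat': -1})
call timer_pause(s:id, 1)  " suspend the timer without freeing it
call timer_stop(s:id)      " kill just this one timer
call timer_stopall()       " kill every timer
" For ad-hoc logging, :echomsg writes to the :messages history.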
It could be an idea to test with g:gitgutter_realtime set to zero, as CursorHold/CursorHoldI function as a very limited timer of sorts.
g:gitgutter_enabled = 0 still has this problem. g:gitgutter_realtime = 0 also has no positive impact.
Because the identical code works without an additional plugin loaded, I think it's unlikely to be a leak caused by too many timers.
Also, if the code were creating new timers, it would keep creating them after the old timers were killed. But once killed, no new memory growth happened. Unless a timer within gitgutter was creating new timers; that doesn't appear to be the case, because gitgutter doesn't make a single timer_start call.
I would assume that a new execution context is used for each invocation of the closure when the {'repeat':-1} option is used.
The only such "context" I could think of is the ordinary function invocation context, when the function returns it is gone. If no closure is used, which isn't in this case, the timer _should_ be perfectly equivalent to call airline#extensions#clock#timerfn(number) invoked at the regular interval (say by external RPC).
I would try both timer_stop() and timer_pause(), which should be equivalent (but maybe aren't, if the timer implementation indeed has a bug).
call timer_stop(g:airline#extensions#clock#timer) does indeed free the memory and stop the growth.
I don't know the proper way to call timer_pause()
call timer_pause(g:airline#extensions#clock#timer, 5000) stops the growth, but does not free the memory
That is correct (it only checks if the 2nd arg is non-zero, so it works) and very interesting. So keeping the timer in the table seems to pin some resources. Beyond the non-existent closure, the timer struct only holds on to libuv resources. But I would expect a uv handle leak to be on the order of KB per second, not megabytes.
Taking a harder look... I removed gitgutter from my cutdown vimrc and there does appear to be a very slow leak, even without gitgutter. I didn't even notice at first. I also removed gitgutter from my large vimrc and the growth is still there. So gitgutter itself was only one possible contributor, not the only contributor.
Leads me to assume that much more than a uv handle is actually being pinned. And that airline without any extensions is pretty light.
Trying to reproduce with the script. I don't see any leak just from time passing, but easily 0.5-1MB for a mode-switch pair i<esc>. But I also get this with _only_ airline and neither airline-clock nor gitgutter (completely commented out from the script). I guess the underlying bug is in either airline or nvim itself, and the airline-clock+gitgutter combination just makes the airline update more frequently.
Time for some heap analysis...
Unfortunately that was probably only ASAN's own allocations. Using a logging profiler (heaptrack) instead, I can't reproduce any leak at all, even with the original script.
What commands should I throw at my install? I can repro pretty easily here.
Also, what branch are you on? I'm having problems building master locally, so still on whatever the Ubuntu 16 maintainers are calling 0.2.2-1.
I finally got HEAD to build locally. Still same problem. Built with included (auto-dl'd) deps. How can I generate/submit a heap dump?
Vim (version: ) behaves differently? Haven't checked. Issue seems to be new with 0.2.2.
It should be checked in Vim.
Sigh. My setup is for neovim. Plugins break horribly when I launch regular vim. There's probably a way to fix this, but I'm short on time atm. Now that I have neovim building locally, I'm going to try prior versions 0.2.1 and 0.2.0 and see if I can get the issue to go away that way. I'm pretty sure this was introduced recently.
v0.2.0 - no problem!
v0.2.1 - leak
v0.2.2 - leak
In addition to leaking memory, there's also some moderate CPU usage (~15%) in the problem versions.
0% while idle and timers not running (all versions). If that helps anyone triage anything.
0% while timers running, no memory growth, v0.2.0
v0.2.0 - no problem!
v0.2.1 - leak
Can you bisect between these two tags to see when it was introduced?
It should be checked in Vim.
In the plugin issues linked earlier, there was mention of it being seen in both editors.
Can you bisect between these two tags to see when it was introduced?
I can try. But there are a zillion commits per release. If someone could point me to particularly interesting commits, that would be nice. Otherwise, I can try my binary-search fu.
I hear there's a git-bisect command but I've never used it. Anyone have a quick primer or is Google my friend?
git bisect start v0.2.1 v0.2.0
git bisect bad will mark the current commit as bad and jump to the next commit. Otherwise use git bisect good to mark it good and jump to the next commit. Use git bisect skip to move to another commit without marking the current one as good/bad.

Here are the bisect results:
```06ec15121a4da439684cfa1909d920c3f22f2894 is the first bad commit
commit 06ec15121a4da439684cfa1909d920c3f22f2894
Author: James McCoy <jamessan@jamessan.com>
Date: Sat Apr 8 22:50:50 2017 -0400
Bump Vim version to 8.0 and applied patches
:040000 040000 d0cac841eacf7453526514093906e63d96779af2 ef9a0c2276831237aaf6461feb4e78ee23689d61 M scripts
:040000 040000 e8f707704c364560202c31d78ec3d507549eccc1 2e6001ec9b87c7f21ddab32efc233a582c01b3d3 M src
```
Direct link: https://github.com/neovim/neovim/commit/06ec15121a4da439684cfa1909d920c3f22f2894
That is just version related stuff, though it is possible the plugin could behave differently on a version/patch check.
The plugin guys say it isn't possible for them to get worse per tick. If they have a bug, things should be bad during a tick, then get cleaned up, then get bad again. The only way for things to get worse per tick of a single timer would be a container issue.
Does neovim change logic based upon vim patch or version? Maybe the patch-specific logic within the plugin exposes some issue within neovim. Worst of both worlds. Where is my :crystal_ball: ?
Does neovim change logic based upon vim patch or version? Maybe the patch-specific logic within the plugin exposes some issue within neovim
Sounds like a reach. Are you sure the parent of 06ec15121a4da439684cfa1909d920c3f22f2894 doesn't have the issue?
Does neovim change logic based upon vim patch or version?
No. Either the plugin changed the behavior, or the bisect got troubled by indeterminism (we could easily imagine a codebase state where the bug _sometimes_ triggers).
Sounds like a reach. Are you sure the parent of 06ec151 doesn't have the issue?
Lemme double check that...parent is dabd9ef44bb30b403e5a13ab1b6534cef6d8d8d0
Nope...issue isn't in dabd9ef44bb30b403e5a13ab1b6534cef6d8d8d0
and is in 06ec15121a4da439684cfa1909d920c3f22f2894
Search the plugin(s) for checks like version >= 800 and has('patch-8...').
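One way to run that search from inside the editor (the plugin path is illustrative):
:vimgrep /v:version\|has('patch-/ ~/.vim/plugged/**/*.vim
:copen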
Looking...
There is switching, but mostly on v:version >= 800 && has("job") / elseif has("nvim"), and that should drop into the nvim branch since has("job") == 0.
Try reverting the changes to version.h and rebuild Nvim.
Issue goes away with reverted version.h. Seems to be the version >= 800 test in the plugin and its associated logic. Going to retest the version.h hack against master.
But has('job') is false in neovim, so the v:version check shouldn't matter... Are there any <= 704 or < 800 checks?
There are additional 704 related checks. Lemme try and enumerate them.
Can confirm that if master pretends to be 704, the issue goes away.
Maybe remove the check in vim-airline-clock and redo the bisect? I suppose you could test all the way back to #4624 (https://github.com/neovim/neovim/pull/4624/commits/61e8adb25e6f37c3c9df0dd8f4fea5e23b62bacd) unless something else is missing.
I think the real issue is within the core vim-airline (and interaction with nvim) and not the clock, but I'll try that too.
Maybe remove the check in vim-airline-clock
Ooh, didn't notice that. That makes sense, since it doesn't set up the timer if v:version < 800.
There's a bunch of version based logic in vim-airline. I'm trying to find stuff that would be added/subtracted between 704 and 800.
I would first start with changing vim-airline-clock. That's most likely to help.
Ugh. Wait. The problem could be in prior versions too; we just never start our problem timer if version < 800. We've already established that we don't have a problem without this timer.
Yea...cross post.
Hacked up the clock. The problem is present in neovim all the way back to at least 0.1.7 (I had trouble building versions earlier than v0.1.7).
Seems there's been a problem for a while; it was only highlighted recently because the clock's version check started exposing it.
Now the "problem" could also be airline doing something that leaks if the timer pins memory across ticks. I couldn't follow the airline logic. But there did appear to be events and other things being pushed to common queue/stack/whatever, possibly with references to files and io buffers. I suppose it's possible that if neovim doesn't use a fresh thread for each invocation of the tick that the plugin is unintentionally generating things that are never cleaned up, since this wasn't an execution path the author originally intended.
But that's a wild guess. The airline guys argue that it's still neovim pinning within their timer in this case, no matter what they generate and fail to clean up within a single tick. So they might have a problem, but it wouldn't be a problem if the timer allowed the memory to be freed.
Sigh.
The airline guys argue that it's still neovim pinning within their timer in this case
Even though there was talk in one of the issues that it could reproduce in Vim?
The airline guys argue that it's still neovim pinning within their timer
Please stop spreading FUD. To be clear, airline does not use any timer. Any problem that comes up from using a timer in either Vim or Neovim must be solved by vim/neovim itself. In any case memory shouldn't leak, and this is not a thing that can be fixed by airline.
That's what I said. Their == neovim. Maybe it could have been "its timer". Sorry for the confusion.
The airline guys argue that it's still neovim pinning within their timer
Again, the neovim timer uses no "threads" or any sort of context that is supposed to be able to own or "pin" memory. Timer invocations are ordinary VimL function calls if a closure is not used.
I don't understand. Something associated with whatever represents an individual timer is preventing memory allocated as a result of that timer's execution from being reclaimed. Maybe there isn't an intentional timer context, but there appears to be an unintentional one.
A plugin-level hack could be to stop the currently running timer every time the timer ticks and then reregister itself under a new ID as part of the tick logic, decrementing the iteration count as it goes.
That might free this non-existent context, as well as prove it somehow exists. I'll try and hack the clock and see how that goes.
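A sketch of that re-registration idea (hypothetical; the function and variable names are illustrative):
" Each tick schedules a fresh one-shot timer instead of relying on a
" single {'repeat': -1} timer, so no timer object survives across ticks.
function! s:Tick(timer) abort
  call airline#update_statusline()
  let s:clock_timer = timer_start(250, function('s:Tick'))
endfunction
let s:clock_timer = timer_start(250, function('s:Tick'))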
Here we go...
function! airline#extensions#clock#timerfn(timer)
  call airline#update_statusline()
endfunction
let g:airline#extensions#clock#timer = timer_start(
      \ 1000,
      \ 'airline#extensions#clock#timerfn', {'repeat': -1})
" let g:airline#extensions#clock#timer = timer_start(
"       \ g:airline#extensions#clock#updatetime,
"       \ 'airline#extensions#clock#timerfn', {'repeat': -1})
The variable g:airline#extensions#clock#updatetime was 250. If updatetime is 250 I see the leak. If it's 1000 I don't. What happens within the timing system if the function doesn't complete before the next tick should have been scheduled?
https://github.com/neovim/neovim/issues/6829 gives an overview of the differences. Sounds like there may be some more detail in 2b. that needs to be looked at.
It is possible that a timer callback gets "chained" immediately after the previous one like in that issue, but I wouldn't expect that with the 250ms airline timer, and certainly not consistently.
However this code in os_inchar looks a bit suspicious
if ((result = inbuf_poll((int)p_ut)) == kInputNone) {
if (trigger_cursorhold() && !typebuf_changed(tb_change_cnt)) {
create_cursorhold_event();
} else {
before_blocking();
result = inbuf_poll(-1);
}
}
Here before_blocking() invokes GC. It is not the _only_ entry point for GC, but it suggests that under _some_ conditions, having events arrive more often than 'updatetime' inhibits GC. Probably GC should be invoked already before the first inbuf_poll, possibly taking the time first and later subtracting it from p_ut (or not bothering, as timing is not exact anyway).
Would be a good time to refactor CursorHold to use a libuv timer instead of the os_inchar special-case.
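In the meantime, if GC starvation is what's happening, a quick plugin-side probe (hypothetical sketch) is to force a collection from the timer callback itself and see whether memory stays flat:
" If the automatic GC in before_blocking() is starved by events arriving
" faster than 'updatetime', an explicit collection per tick should keep
" memory flat.
function! s:TickWithGC(timer) abort
  call airline#update_statusline()
  call garbagecollect()
endfunction
let s:probe = timer_start(250, function('s:TickWithGC'), {'repeat': -1})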
Ah, I thought this sounded familiar. This was reported before (in chat, I think), regarding the same plugin, vim-airline-clock.
Right, it was me who reported it on IRC; I still have the screenshots from that time. There was no gitgutter involved; I got it with plain vim-airline and vim-airline-clock alone.
It takes many hours to leak 16GB, then the OOM killer kicks in and it's a nightmare.
Related: https://github.com/vim/vim/pull/3961
Can someone port that patch to Neovim?
vim-airline@53aa65 seems to fix the issue (or slow it way down) for the above nvim-leak.vim. I had gitgutter@1e60667 and airline-clock@a752ae8, then bisected to that commit in vim-airline.
Looks like the change eliminates that 'opts' local. Would be interested to hear how that plays into the memory leak. The memory isn't reclaimed once the variable goes out of scope?
I assume the merge of vim-patch 8.1.1007 still needs to be addressed.
Here's a minimal repro, derived from the test cases in vim-patch 8.1.1007. For me, sourcing the snippet below would cause memory to jump ~20 MB each time.
func! s:f(...)
  " All these cause issues
  let x = a:000
  " let x = a:
  " let x = l:
  " Maybe some more...?
  " This one doesn't
  " let x = g:
endfunc
for i in range(10000)
  call s:f(0)
endfor
I attempted porting that patch, but its tests use Vim's job API, and they didn't expose the bug when I finished, so I must not have done that great a job of it. I think it might just be easier to add the coverage in Lua; the snippet below did work for me.
-- Assumes Nvim's functional test framework (busted specs with the
-- standard helpers module):
local helpers = require('test.functional.helpers')(after_each)
local eval, source = helpers.eval, helpers.source

local getmem = function()
  local pid = eval('getpid()')
  local handle = io.popen('ps -o rss= '..pid)  -- resident set size, in KB
  return tonumber(handle:read('*a'))
end

local start_mem = getmem()
source([[
  func! s:f(...)
    let x = a:000
  endfunc
  for i in range(10000)
    call s:f(0)
  endfor
]])
local memory_increase = getmem() - start_mem
assert.is_true(memory_increase < 1000)
I did a survey of existing leak tests in the suite, and it wasn't quite clear to me what strategy they were using to detect leaks. I assume they rely on valgrind? For example, this one:
test/functional/api/window_spec.lua|75| 18: it('does not leak memory when using invalid window ID with invalid pos',
So I was wondering whether using ps (and some Windows equivalent) is what would be desired, or whether we can rely on valgrind to complain, or use some other strategy to reliably cover the bug.