This is a follow-up to https://golang.org/cl/12876047/ (time: lower level interface to Timer: embedding, compact interface callback with fast callback).

Timers can be heavily used in networking applications. The current implementation has, at the least, problems with scalability:

```
$ ./time.test -test.run=none -test.bench=StartStop -test.benchtime=1s -test.cpu=1,2,4,8,16,32
PASS
BenchmarkStartStop      10000000    214 ns/op
BenchmarkStartStop-2     5000000    515 ns/op
BenchmarkStartStop-4     5000000    735 ns/op
BenchmarkStartStop-8     2000000    804 ns/op
BenchmarkStartStop-16    5000000    708 ns/op
BenchmarkStartStop-32    5000000    679 ns/op
```

Some spot optimizations can be applied as well. Probably a more efficient data structure can be used, but it's not clear to me how to do better than the current 4-ary heap.

FTR, here is the BenchmarkStartStop profile with 8 procs:

```
+  13.75%  time.test  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
+  11.25%  time.test  time.test          [.] runtime.lock
+  11.15%  time.test  time.test          [.] runtime.xchg
+   6.89%  time.test  time.test          [.] runtime.procyield
+   6.32%  time.test  [kernel.kallsyms]  [k] _raw_spin_lock
+   4.06%  time.test  time.test          [.] runtime.cas
+   3.49%  time.test  [kernel.kallsyms]  [k] gup_pte_range
+   1.87%  time.test  time.test          [.] runtime.deltimer
+   1.80%  time.test  [kernel.kallsyms]  [k] get_futex_key
+   1.71%  time.test  [kernel.kallsyms]  [k] put_page
+   1.58%  time.test  [kernel.kallsyms]  [k] try_to_wake_up
+   1.55%  time.test  [kernel.kallsyms]  [k] __wait_on_bit_lock
+   1.42%  time.test  time.test          [.] flushptrbuf
+   1.38%  time.test  [kernel.kallsyms]  [k] get_user_pages_fast
+   1.38%  time.test  time.test          [.] siftup
+   1.22%  time.test  [kernel.kallsyms]  [k] copy_user_generic_string
+   1.19%  time.test  time.test          [.] runtime.casp
+   1.10%  time.test  [kernel.kallsyms]  [k] unlock_page
+   1.04%  time.test  [kernel.kallsyms]  [k] get_futex_key_refs
+   1.01%  time.test  time.test          [.] addtimer
+   1.00%  time.test  [kernel.kallsyms]  [k] drop_futex_key_refs
+   0.98%  time.test  [kernel.kallsyms]  [k] prepare_to_wait_exclusive
+   0.97%  time.test  [kernel.kallsyms]  [k] __wake_up_bit
+   0.94%  time.test  [kernel.kallsyms]  [k] __wake_up_common
+   0.81%  time.test  [kernel.kallsyms]  [k] audit_filter_syscall
+   0.75%  time.test  [kernel.kallsyms]  [k] __schedule
+   0.72%  time.test  time.test          [.] runtime.mallocgc
+   0.72%  time.test  time.test          [.] siftdown
```
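For context, here is a minimal sketch of the kind of start/stop benchmark quoted above (an illustration, not the time package's actual benchmark code): every AfterFunc/Stop pair takes and releases the global timer mutex, which is why per-op time stops improving as -test.cpu grows.

```go
package timers_test

import (
	"testing"
	"time"
)

// BenchmarkStartStop starts and immediately stops a timer on every
// iteration. With a single global timer heap and mutex, parallel
// workers serialize on the same lock, so per-op time grows with the
// CPU count instead of shrinking.
func BenchmarkStartStop(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			t := time.AfterFunc(time.Hour, func() {})
			t.Stop()
		}
	})
}
```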
@aclements @RLH @randall77
Optimization of the timer heap can give up to 10%. On some of our prod servers we see 5-8% of time in siftdown.
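For reference, a simplified sketch of siftdown on a 4-ary heap of expiration times, in the spirit of (but not copied from) the runtime's siftdown; the bare []int64 representation is an assumption for illustration:

```go
// siftdown restores the heap property after the value at index i grew.
// A node's children live at indices 4*i+1 .. 4*i+4; the wider fanout
// makes the heap shallower, so fewer levels are touched per operation.
func siftdown(when []int64, i int) {
	w := when[i]
	n := len(when)
	for {
		c := 4*i + 1 // leftmost child
		if c >= n {
			break
		}
		// Pick the smallest of the (up to four) children.
		m := c
		for j := c + 1; j < c+4 && j < n; j++ {
			if when[j] < when[m] {
				m = j
			}
		}
		if w <= when[m] {
			break
		}
		when[i] = when[m] // pull the smaller child up and descend
		i = m
	}
	when[i] = w
}
```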
But the real problem is the global mutex. We need to:

1. Make timers per-P, the same as goroutine run queues.
2. Integrate timers with the network poller and get rid of the timer thread, so that timers become just another source of events for netpoll.
3. Distribute timers and netpoll descriptors across Ps.

Here is a prototype for 1 and 2:
https://go-review.googlesource.com/#/c/5760
Distribution is somewhat tricky. We will need hierarchical epoll descriptors, and kqueue/IOCP do not support hierarchical descriptors AFAICT. For timers we will need to somehow block on the global minimum time, rather than on the per-P minimum time.
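To make "block on the minimum time" concrete, here is a hedged sketch of deriving a poller's timeout from the earliest timer deadline (the name pollDelay and the next == 0 convention are illustrative, not the runtime's):

```go
// pollDelay converts the earliest timer deadline into a poll timeout,
// using the convention that next == 0 means "no timers pending".
// Blocking in the poller for this long lets a single wait cover both
// "next I/O event" and "next timer", with no dedicated timer thread.
func pollDelay(next, now int64) (delay int64) {
	if next == 0 {
		return -1 // no timers: block until an I/O event arrives
	}
	delay = next - now
	if delay < 0 {
		delay = 0 // earliest timer already expired: poll without blocking
	}
	return delay
}
```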
@dvyukov How does CL 5760 "get rid of the timer thread"? There is a timerproc
goroutine which remains after the change. I am probably just too dumb to see it, but could you point me to where the timer thread was?
timerproc blocks in notetsleepg, which consumes a thread; so even though timerproc is a goroutine, it always ties up an OS thread.
Thanks for the quick explanation. If I am reading it correctly, at least notetsleepg will hand off the P to another thread right away, so it just consumes the thread and doesn't block the P for a whole sysmon tick.
Yes, it hands off the P, but it still consumes a thread and causes 2 context switches per timer.
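A rough userland analogy for that per-timer cost (everything here is made up for the sketch, not runtime code): firing an event through a dedicated goroutine pays for a wakeup plus a handoff back, the userland equivalent of the two context switches mentioned above.

```go
package handoff_test

import "testing"

// BenchmarkHandoff round-trips an event through a dedicated goroutine:
// one switch to wake it, one switch to hand the result back. A timer
// fired by a central timer thread pays a similar toll per expiration.
func BenchmarkHandoff(b *testing.B) {
	req := make(chan struct{})
	done := make(chan struct{})
	go func() {
		for range req {
			done <- struct{}{}
		}
	}()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		req <- struct{}{} // wake the dedicated goroutine
		<-done            // wait for the handoff back
	}
	close(req)
}
```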
Change https://golang.org/cl/171883 mentions this issue: runtime: switch to using new timer code
Change https://golang.org/cl/171829 mentions this issue: runtime, syscall, time: prepare for adding timers to P's
Change https://golang.org/cl/171827 mentions this issue: runtime: add wasm support for timers on P's
Change https://golang.org/cl/171826 mentions this issue: runtime: initial scheduler changes for timers on P's
Change https://golang.org/cl/171821 mentions this issue: runtime: change netpoll to take an amount of time to block
Change https://golang.org/cl/171828 mentions this issue: runtime: handle timers on P's in procresize/(*pp).destroy
Change https://golang.org/cl/171884 mentions this issue: runtime, syscall, time: remove old timer code
Change https://golang.org/cl/203890 mentions this issue: runtime: clear js idle timeout before new one and after event handler
Change https://golang.org/cl/204045 mentions this issue: runtime: record stub netpoll initialization, add lock around note
Change https://golang.org/cl/204280 mentions this issue: runtime: use correct state machine in addAdjustedTimers
Change https://golang.org/cl/204717 mentions this issue: runtime: use atomic.Cas to change timerRemoved to timerWaiting
Change https://golang.org/cl/204800 mentions this issue: runtime: wake netpoller when dropping P, don't sleep too long in sysmon
Change https://golang.org/cl/206938 mentions this issue: runtime: acquire timersLocks around moveTimers
Change https://golang.org/cl/207348 mentions this issue: runtime: release timersLock while running timer
The ideas suggested here have been implemented for Go 1.14. Any further timer speedups can be done in new issues.
Change https://golang.org/cl/209580 mentions this issue: runtime: use current P's race context in timer code
After reviewing proc.go (and https://github.com/golang/go/issues/18507 and https://github.com/golang/go/commit/c05b06a12d005f50e4776095a60d6bd9c2c91fac), am I right in assuming that the name "netpoll" comes from its origin in polling only for network I/O, and that the event loop now uses the same "netpoll" mechanism to poll for file I/O and timers too?
Yes. Interruptible sleep is often modelled using select, kqueue, or epoll, all of which started life as ways to poll a set of non-blocking file descriptors, of which a network fd was the most common, since cross-platform async file I/O is a unicorn.
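As a concrete illustration of epoll doubling as an interruptible sleep (a Linux-only sketch using the raw syscall package; this is not the runtime's code): with no descriptors registered, EpollWait simply blocks for the timeout, which is exactly the primitive a poller needs to wait for "next I/O event or next timer, whichever comes first".

```go
//go:build linux

package main

import (
	"fmt"
	"syscall"
	"time"
)

func main() {
	// An epoll instance with no registered descriptors.
	epfd, err := syscall.EpollCreate1(0)
	if err != nil {
		panic(err)
	}
	defer syscall.Close(epfd)

	// With nothing to report, EpollWait sleeps for the full timeout,
	// but it would return early if a registered fd became ready.
	events := make([]syscall.EpollEvent, 1)
	start := time.Now()
	n, err := syscall.EpollWait(epfd, events, 50) // timeout in milliseconds
	fmt.Printf("events=%d err=%v slept=%v\n", n, err, time.Since(start))
}
```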