Go: runtime: make timers faster

Created on 24 Aug 2013 · 26 comments · Source: golang/go

This is a follow-up to:
https://golang.org/cl/12876047/
time: lower-level interface to Timer: embedding, compact interface callback with fast callback

Timers can be heavily used in networking applications. At the least, the current implementation has scalability problems:

$ ./time.test -test.run=none -test.bench=StartStop -test.benchtime=1s
-test.cpu=1,2,4,8,16,32
PASS
BenchmarkStartStop  10000000           214 ns/op
BenchmarkStartStop-2     5000000           515 ns/op
BenchmarkStartStop-4     5000000           735 ns/op
BenchmarkStartStop-8     2000000           804 ns/op
BenchmarkStartStop-16    5000000           708 ns/op
BenchmarkStartStop-32    5000000           679 ns/op
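
As a rough sketch of what a StartStop-style benchmark exercises (a hypothetical reconstruction, not the actual test from the time package): each iteration arms a timer and immediately stops it, so the measured cost is dominated by addtimer/deltimer traffic on the shared timer heap.

```go
package time_test

import (
	"testing"
	"time"
)

// BenchmarkStartStop arms a timer with a deadline far in the future and
// immediately stops it, so the timer never fires and Stop always removes
// it from the heap. Run with: go test -bench StartStop -cpu 1,2,4,8
func BenchmarkStartStop(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			time.AfterFunc(time.Hour, func() {}).Stop()
		}
	})
}
```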

Some spot optimizations can be applied as well. A more efficient data structure could probably be used, but it's not clear to me how to do better than the current 4-ary heap.
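
For reference, a minimal sketch of 4-ary heap sift operations, the kind of code behind the siftup and siftdown entries visible in the profile below (field and function names are illustrative, not the runtime's). The parent of node i is (i-1)/4 and its children are 4*i+1 through 4*i+4, so the tree is shallower than a binary heap at the cost of comparing up to four children per level.

```go
package main

import "fmt"

type timer struct{ when int64 }

// siftup moves t[i] toward the root until the heap order is restored.
func siftup(t []*timer, i int) {
	tmp := t[i]
	for i > 0 {
		p := (i - 1) / 4 // parent index in a 4-ary heap
		if tmp.when >= t[p].when {
			break
		}
		t[i] = t[p]
		i = p
	}
	t[i] = tmp
}

// siftdown moves t[i] toward the leaves, swapping with the smallest of
// up to four children at each level.
func siftdown(t []*timer, i int) {
	tmp := t[i]
	for {
		c := i*4 + 1 // leftmost child
		if c >= len(t) {
			break
		}
		m := c
		for j := c + 1; j < c+4 && j < len(t); j++ {
			if t[j].when < t[m].when {
				m = j
			}
		}
		if tmp.when <= t[m].when {
			break
		}
		t[i] = t[m]
		i = m
	}
	t[i] = tmp
}

func main() {
	// Insert a few deadlines, then pop-min repeatedly.
	var h []*timer
	for _, w := range []int64{30, 10, 20, 40, 5} {
		h = append(h, &timer{when: w})
		siftup(h, len(h)-1)
	}
	for len(h) > 0 {
		fmt.Println(h[0].when) // prints 5 10 20 30 40
		h[0] = h[len(h)-1]
		h = h[:len(h)-1]
		if len(h) > 0 {
			siftdown(h, 0)
		}
	}
}
```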

For the record, here is the BenchmarkStartStop profile with 8 procs:

+  13.75%  time.test  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
+  11.25%  time.test  time.test          [.] runtime.lock
+  11.15%  time.test  time.test          [.] runtime.xchg
+   6.89%  time.test  time.test          [.] runtime.procyield
+   6.32%  time.test  [kernel.kallsyms]  [k] _raw_spin_lock
+   4.06%  time.test  time.test          [.] runtime.cas
+   3.49%  time.test  [kernel.kallsyms]  [k] gup_pte_range
+   1.87%  time.test  time.test          [.] runtime.deltimer
+   1.80%  time.test  [kernel.kallsyms]  [k] get_futex_key
+   1.71%  time.test  [kernel.kallsyms]  [k] put_page
+   1.58%  time.test  [kernel.kallsyms]  [k] try_to_wake_up
+   1.55%  time.test  [kernel.kallsyms]  [k] __wait_on_bit_lock
+   1.42%  time.test  time.test          [.] flushptrbuf
+   1.38%  time.test  [kernel.kallsyms]  [k] get_user_pages_fast
+   1.38%  time.test  time.test          [.] siftup
+   1.22%  time.test  [kernel.kallsyms]  [k] copy_user_generic_string
+   1.19%  time.test  time.test          [.] runtime.casp
+   1.10%  time.test  [kernel.kallsyms]  [k] unlock_page
+   1.04%  time.test  [kernel.kallsyms]  [k] get_futex_key_refs
+   1.01%  time.test  time.test          [.] addtimer
+   1.00%  time.test  [kernel.kallsyms]  [k] drop_futex_key_refs
+   0.98%  time.test  [kernel.kallsyms]  [k] prepare_to_wait_exclusive
+   0.97%  time.test  [kernel.kallsyms]  [k] __wake_up_bit
+   0.94%  time.test  [kernel.kallsyms]  [k] __wake_up_common
+   0.81%  time.test  [kernel.kallsyms]  [k] audit_filter_syscall
+   0.75%  time.test  [kernel.kallsyms]  [k] __schedule
+   0.72%  time.test  time.test          [.] runtime.mallocgc
+   0.72%  time.test  time.test          [.] siftdown

_Label: Performance_

All 26 comments

Comment 1:

_Labels changed: added go1.3maybe._

Comment 2:

_Labels changed: added release-none, removed go1.3maybe._

Comment 3:

_Labels changed: added repo-main._

@aclements @RLH @randall77

Optimizing the timer heap can give up to 10%. On some of our production servers we see 5-8% of time in siftdown.

But the real problem is the global mutex. We need:

  1. Merge timers into netpoll (epoll/kqueue/IOCP can wait with timeout).
  2. Get rid of the timer thread (this will also reduce timer latency, because currently we need 2 thread context switches to handle a timer).
  3. Distribute netpoll+timers per P.

Here is a prototype for 1 and 2:
https://go-review.googlesource.com/#/c/5760

Distribution is somewhat tricky. We would need hierarchical epoll descriptors, and kqueue/IOCP do not support hierarchical descriptors AFAICT. For timers we would need to somehow block on the global minimum time, rather than on a per-P minimum time.
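
A minimal, Linux-only userspace sketch of ideas 1 and 2 above, assuming a hypothetical poll loop in which the time until the earliest timer becomes the epoll timeout (this is not the runtime's code; the fixed nextTimer stands in for the top of the timer heap):

```go
package main

import (
	"fmt"
	"syscall"
	"time"
)

func main() {
	epfd, err := syscall.EpollCreate1(0)
	if err != nil {
		panic(err)
	}
	defer syscall.Close(epfd)
	// Real code would register network fds with syscall.EpollCtl here.

	nextTimer := time.Now().Add(50 * time.Millisecond) // earliest deadline

	events := make([]syscall.EpollEvent, 16)
	for {
		// Sleep at most until the earliest timer expires.
		timeout := int(time.Until(nextTimer) / time.Millisecond)
		if timeout < 0 {
			timeout = 0
		}
		n, err := syscall.EpollWait(epfd, events, timeout)
		if err == syscall.EINTR {
			continue
		}
		if err != nil {
			panic(err)
		}
		if n == 0 && !time.Now().Before(nextTimer) {
			fmt.Println("timer fired from the poll loop; no timer thread")
			return
		}
		// Handle the n ready network events here.
	}
}
```

With this shape, the same wakeup that services network readiness also services expired timers, which is what removes the dedicated timer thread.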

@dvyukov How does CL 5760 "get rid of the timer thread"? There is a timerproc goroutine which remains after the change. I am probably just too dumb to see it, but could you point me to where the timer thread was?

timerproc blocks in notetsleepg, which consumes a thread, so timerproc effectively always occupies a thread.

Thanks for the quick explanation. If I am reading it correctly, notetsleepg will at least hand off the P to another thread right away, so it only consumes the thread rather than blocking the P for a whole sysmon tick.

Yes, it hands off the P, but it still consumes a thread and causes 2 context switches per timer.
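
To make the two-wakeup cost concrete, here is a hedged miniature of the pre-Go-1.14 shape built on channels rather than runtime notes (purely illustrative; the real timerproc is runtime code using notetsleepg, and the sorted slice stands in for the 4-ary heap):

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

type timerReq struct {
	when time.Time
	f    func()
}

// timerproc is a single dedicated goroutine that owns the timer heap and
// sleeps until the next deadline. Every firing crosses scheduling
// contexts twice: once to wake this sleeper, once to run the callback.
func timerproc(add <-chan timerReq) {
	var heap []timerReq
	for {
		var fire <-chan time.Time // nil (blocks forever) when no timers
		if len(heap) > 0 {
			fire = time.After(time.Until(heap[0].when))
		}
		select {
		case r := <-add: // wakeup #1: an adder prods the timer goroutine
			heap = append(heap, r)
			sort.Slice(heap, func(i, j int) bool {
				return heap[i].when.Before(heap[j].when)
			})
		case <-fire: // wakeup #2: deliver the expired timer
			heap[0].f()
			heap = heap[1:]
		}
	}
}

func main() {
	add := make(chan timerReq)
	go timerproc(add)
	done := make(chan struct{})
	add <- timerReq{time.Now().Add(10 * time.Millisecond), func() { close(done) }}
	<-done
	fmt.Println("fired via the dedicated timer goroutine")
}
```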

Change https://golang.org/cl/171883 mentions this issue: runtime: switch to using new timer code

Change https://golang.org/cl/171829 mentions this issue: runtime, syscall, time: prepare for adding timers to P's

Change https://golang.org/cl/171827 mentions this issue: runtime: add wasm support for timers on P's

Change https://golang.org/cl/171826 mentions this issue: runtime: initial scheduler changes for timers on P's

Change https://golang.org/cl/171821 mentions this issue: runtime: change netpoll to take an amount of time to block

Change https://golang.org/cl/171828 mentions this issue: runtime: handle timers on P's in procresize/(*pp).destroy

Change https://golang.org/cl/171884 mentions this issue: runtime, syscall, time: remove old timer code

Change https://golang.org/cl/203890 mentions this issue: runtime: clear js idle timeout before new one and after event handler

Change https://golang.org/cl/204045 mentions this issue: runtime: record stub netpoll initialization, add lock around note

Change https://golang.org/cl/204280 mentions this issue: runtime: use correct state machine in addAdjustedTimers

Change https://golang.org/cl/204717 mentions this issue: runtime: use atomic.Cas to change timerRemoved to timerWaiting

Change https://golang.org/cl/204800 mentions this issue: runtime: wake netpoller when dropping P, don't sleep too long in sysmon

Change https://golang.org/cl/206938 mentions this issue: runtime: acquire timersLocks around moveTimers

Change https://golang.org/cl/207348 mentions this issue: runtime: release timersLock while running timer

The ideas suggested here have been implemented for Go 1.14. Any further timer speedups can be done in new issues.

Change https://golang.org/cl/209580 mentions this issue: runtime: use current P's race context in timer code

After reviewing proc.go (and https://github.com/golang/go/issues/18507 and https://github.com/golang/go/commit/c05b06a12d005f50e4776095a60d6bd9c2c91fac), am I right in assuming that the naming of "netpoll" is due to its origin for only polling for network I/O, but now the event loop uses the same "netpoll" mechanism to poll for file I/O and timers too?

Yes. Interruptible sleep is often modelled using select, kqueue, or epoll, all of which started life as ways to poll a set of non-blocking file descriptors, of which a network fd was the most common, since cross-platform async file I/O is a unicorn.
