Go-ipfs: Large number of context switches

Created on 20 Apr 2016 · 5 comments · Source: ipfs/go-ipfs

IRC user A124 reported a large amount of lag caused by an excess of context switches in the ipfs daemon.

When IPFS is running, there are CPU context switch spikes, from an average of 3k for the whole system to 16k; they are fairly regular in pattern (once every few seconds).
08:13   <A124>  There is also a problem in the CPU scheduler, which is (finally, after years) being brought to light, and a patch should finally go in. That does consume a lot of CPU when context switches land on different cores and so on.
08:14   <A124>  So the combination of those spikes + a bad scheduler (the default CFS on most systems) results in spikes and jitter in realtime applications, plus wastes a lot of CPU.
08:23   <A124>  Yeah, I should find the reference. The CPU scheduler work is presented at this year's conference. But older kernels plus a bad application would still suffer.
08:24   <A124>  Their measured performance hit was 10-20% usually, 20% for databases and up to 138x for scientific corner cases.
08:24   <A124>  www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf
08:24   < *>    A124 still had it open since last week.
08:29   < *>    A124 notes that he wanted to do a multithreaded client for his application, but after getting more background decided to use polling, as a lot of connections and threads, and hence context switching, makes the whole application both slower and more demanding on the system and CPU. ... Depending on the type of application and scenario, other people also use a Threads + Fibers combination. (A few threads, each with a few fibers.) Fibers are not pre-emptive, so less scheduling overhead.
08:30   < *>    A124 notes that according to https://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/09/2012-lpc-Low-Latency-Sockets-slides-brandeburg.pdf, page 7, software execution and context switching are fields of opportunity. (The CPU side being solved by the mentioned patches.)

He also built some graphs of the issue: https://imgur.com/uN0yF9d&wMp6t3y

This issue is to track this problem.

topic/libp2p

All 5 comments

whyrusleeping's general perf debugging guidelines for go-ipfs:

When you see ipfs doing something (using lots of cpu, memory, or otherwise
being weird), the first thing you want to do is gather all the relevant
profiling information.

  • goroutine dump

    • curl localhost:5001/debug/pprof/goroutine\?debug=2 > ipfs.stacks

  • 30 second cpu profile

    • curl localhost:5001/debug/pprof/profile > ipfs.cpuprof

  • heap trace dump

    • curl localhost:5001/debug/pprof/heap > ipfs.heap

  • system information

    • ipfs diag sys > ipfs.sysinfo

Bundle all that up and include a copy of the ipfs binary that you are running
(having the exact same binary is important, as it contains the debug info).
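
If you want to script that collection, a minimal Go sketch along these lines (not part of go-ipfs; it just replays the curl commands above against the default API address, localhost:5001) will fetch the three pprof dumps:

package main

import (
    "io"
    "log"
    "net/http"
    "os"
)

// fetch downloads a single debug endpoint into a local file.
func fetch(url, dest string) error {
    resp, err := http.Get(url)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    out, err := os.Create(dest)
    if err != nil {
        return err
    }
    defer out.Close()

    _, err = io.Copy(out, resp.Body)
    return err
}

func main() {
    base := "http://localhost:5001/debug/pprof/"
    targets := []struct{ path, dest string }{
        {"goroutine?debug=2", "ipfs.stacks"},
        {"profile", "ipfs.cpuprof"}, // blocks for ~30s while the CPU profile is sampled
        {"heap", "ipfs.heap"},
    }
    for _, t := range targets {
        log.Printf("fetching %s%s -> %s", base, t.path, t.dest)
        if err := fetch(base+t.path, t.dest); err != nil {
            log.Fatalf("%s: %v", t.path, err)
        }
    }
    // `ipfs diag sys > ipfs.sysinfo` is still simplest to run from the CLI.
}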

You can investigate yourself if you feel intrepid:

Analysing the stack dump

The first thing I look for is hung goroutines: any goroutine that has been stuck
for over a minute will note that in the trace. It looks something like:

goroutine 2306090 [semacquire, 458 minutes]:
sync.runtime_Semacquire(0xc8222fd3e4)
    /home/whyrusleeping/go/src/runtime/sema.go:47 +0x26
sync.(*Mutex).Lock(0xc8222fd3e0)
    /home/whyrusleeping/go/src/sync/mutex.go:83 +0x1c4
gx/ipfs/QmedFDs1WHcv3bcknfo64dw4mT1112yptW1H65Y2Wc7KTV/yamux.(*Session).Close(0xc8222fd340, 0x0, 0x0)
    /home/whyrusleeping/gopkg/src/gx/ipfs/QmedFDs1WHcv3bcknfo64dw4mT1112yptW1H65Y2Wc7KTV/yamux/session.go:205 +0x55
gx/ipfs/QmWSJzRkCMJFHYUQZxKwPX8WA7XipaPtfiwMPARP51ymfn/go-stream-muxer/yamux.(*conn).Close(0xc8222fd340, 0x0, 0x0)
    /home/whyrusleeping/gopkg/src/gx/ipfs/QmWSJzRkCMJFHYUQZxKwPX8WA7XipaPtfiwMPARP51ymfn/go-stream-muxer/yamux/yamux.go:39 +0x2d
gx/ipfs/QmZK81vcgMhpb2t7GNbozk7qzt6Rj4zFqitpvsWT9mduW8/go-peerstream.(*Conn).Close(0xc8257a2000, 0x0, 0x0)
    /home/whyrusleeping/gopkg/src/gx/ipfs/QmZK81vcgMhpb2t7GNbozk7qzt6Rj4zFqitpvsWT9mduW8/go-peerstream/conn.go:156 +0x1f2
created by gx/ipfs/QmZK81vcgMhpb2t7GNbozk7qzt6Rj4zFqitpvsWT9mduW8/go-peerstream.(*Conn).GoClose
    /home/whyrusleeping/gopkg/src/gx/ipfs/QmZK81vcgMhpb2t7GNbozk7qzt6Rj4zFqitpvsWT9mduW8/go-peerstream/conn.go:131 +0xab

At the top, you can see that this goroutine (number 2306090) has been waiting
to acquire a semaphore for 458 minutes. That seems bad. Looking at the rest of
the trace, we see the exact line it's waiting on is line 47 of runtime/sema.go.
That's not particularly helpful, so we move on. Next we see that the call was made
by line 205 of yamux/session.go, in the Close method of yamux.Session. This
one appears to be the issue.

Given that information, I would look for another goroutine that might be
holding the semaphore in question in the rest of the stack dump. I won't go
into that here.
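
For intuition, though, here is a tiny standalone toy (not ipfs code) that reproduces the same pattern: one goroutine takes a mutex and never releases it, a second blocks in Lock, and the goroutine dump shows the second one parked inside sync.(*Mutex).Lock (tagged semacquire on the Go version in the trace above):

package main

import (
    "os"
    "runtime/pprof"
    "sync"
    "time"
)

func main() {
    var mu sync.Mutex
    mu.Lock() // main takes the lock and never releases it

    go func() {
        mu.Lock() // this goroutine blocks here forever
        defer mu.Unlock()
    }()

    time.Sleep(time.Second) // give the goroutine time to block

    // Equivalent of curl .../debug/pprof/goroutine?debug=2: dump every stack,
    // including the goroutine stuck in Lock. In a real investigation you would
    // then search the dump for whatever is still holding the lock.
    pprof.Lookup("goroutine").WriteTo(os.Stdout, 2)
}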

There are a few different reasons that goroutines can be hung. semacquire
means we're waiting to take a lock or semaphore. select means that the
goroutine is hanging in a select statement and none of the cases are yielding
anything. chan receive and chan send are waiting for a channel to be
received from or sent on, respectively. IO wait generally means that we are
waiting on a socket to read or write data, although it _can_ mean we are
waiting on a very slow filesystem. If you see any of those tags _without_ an
", X minutes" suffix, that generally means there isn't a problem; you just caught
that goroutine in the middle of a short wait for something. If the wait time is
over a few minutes, that either means that goroutine doesn't do much, or
something is pretty wrong.
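
The other wait reasons are just as easy to see with a toy program; the sketch below (again, nothing to do with ipfs) parks one goroutine on a bare channel receive and another in a select, then prints the dump so the chan receive and select tags show up:

package main

import (
    "os"
    "runtime/pprof"
    "time"
)

func main() {
    never := make(chan struct{}) // nothing is ever sent on this channel

    go func() {
        <-never // parks with the "chan receive" wait reason
    }()

    go func() {
        other := make(chan int)
        select { // parks with the "select" wait reason
        case <-never:
        case <-other:
        }
    }()

    time.Sleep(time.Second) // let both goroutines block

    pprof.Lookup("goroutine").WriteTo(os.Stdout, 2)
}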

Analysing the CPU Profile

The Go team wrote an excellent article on profiling Go programs. If you've
already gathered the above information, you can skip down to where they start
talking about go tool pprof. My go-to method of analysing these is to run the
'web' command, which generates an SVG dot graph and opens it in your browser.
This is the quickest way to point out where the hot spots in the code are.
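
As a self-contained illustration of that workflow (toy code unrelated to ipfs; the file and function names are made up), the program below burns CPU in one hot function and writes a profile in the same format as ipfs.cpuprof; running go tool pprof over it and typing web should show busyWork dominating the graph:

package main

import (
    "log"
    "os"
    "runtime/pprof"
)

// busyWork is the deliberate hot spot; it should dominate the profile.
func busyWork() float64 {
    sum := 0.0
    for i := 1; i < 200000000; i++ {
        sum += 1.0 / float64(i)
    }
    return sum
}

func main() {
    f, err := os.Create("cpu.prof")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    if err := pprof.StartCPUProfile(f); err != nil {
        log.Fatal(err)
    }
    defer pprof.StopCPUProfile()

    log.Println(busyWork())

    // Analyse with something like:
    //   go build -o hotspot main.go && ./hotspot
    //   go tool pprof hotspot cpu.prof   (then type 'web' at the prompt)
}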

Other

If you have any questions, or want me to analyse some weird go-ipfs behaviour,
just let me know, and be sure to include all the profiling information
mentioned at the top.

@whyrusleeping This would be great to have in a doc! That way you can just point to it. Want me to do that?

I remember now what happened the last time we ran into this: it was reuseport. To test this, set IPFS_REUSEPORT to false and run the daemon. If that solves the problem, then we really should pay attention to the reuseport impl. There were some updates here: https://github.com/golang/go/issues/9661

@RichardLitt if you want to make that into a doc then by all means please do :)

Closing as this issue isn't directly actionable. It's a symptom with many complicated causes.
