Go-ipfs: Canceling an add operation causes a panic in the datastore (no data loss)

Created on 16 Mar 2018 · 16 comments · Source: ipfs/go-ipfs

Canceling (with Ctrl-C) an operation that adds content while using the badger datastore sometimes results in a panic inside the badger code; the files that were already added are not lost.

This seems to happen because the database is closed (setting its mt value to nil) while another goroutine is still adding blocks (and tries to increment the reference count on the now-nil object).

I'll research this further to see whether an issue should be filed in the badger repo, or whether this is just an invalid use case that breaks the API contract.
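
For illustration, a minimal standalone sketch of the same class of race, assuming the upstream badger API at github.com/dgraph-io/badger/v4 rather than the gx-vendored version used here: one goroutine keeps issuing reads while another closes the DB underneath it. Depending on timing this can panic instead of returning a clean error.

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v4"
)

func main() {
	db, err := badger.Open(badger.DefaultOptions("/tmp/badger-close-race"))
	if err != nil {
		log.Fatal(err)
	}

	done := make(chan struct{})
	go func() {
		defer close(done)
		for i := 0; i < 100000; i++ {
			// Reader goroutine: still issuing gets while the DB is closed below.
			_ = db.View(func(txn *badger.Txn) error {
				_, err := txn.Get([]byte("some-key"))
				if err != nil && err != badger.ErrKeyNotFound {
					return err
				}
				return nil
			})
		}
	}()

	// Closing while the reader is still running mirrors the teardown in the
	// stack trace below.
	if err := db.Close(); err != nil {
		log.Println("close:", err)
	}
	<-done
}
```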


Command and stack trace

~/go/src/github.com/ipfs/go-ipfs$ ipfs add  -r --pin=false  .
[...]
added QmNMMNbfW7kJ3iLs69uptyytk9gRtUrjqacZPbh2JPVsn8 go-ipfs/bin/gx-go
 8.52 MB / 130.04 MB [============>----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------]   6.55% 4s^C
Error: context canceled
23:44:48.818 ERROR core/comma: context canceled add.go:337
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x10a7ae3]

goroutine 308 [running]:
gx/ipfs/QmdKhi5wUQyV9i3GcTyfUmpfTntWjXu8DcyT9HyNbznYrn/badger/skl.(*Skiplist).IncrRef(...)
    /home/user/go/src/gx/ipfs/QmdKhi5wUQyV9i3GcTyfUmpfTntWjXu8DcyT9HyNbznYrn/badger/skl/skl.go:87
gx/ipfs/QmdKhi5wUQyV9i3GcTyfUmpfTntWjXu8DcyT9HyNbznYrn/badger.(*DB).getMemTables(0xc42015e000, 0x0, 0x0, 0x0, 0x0)
    /home/user/go/src/gx/ipfs/QmdKhi5wUQyV9i3GcTyfUmpfTntWjXu8DcyT9HyNbznYrn/badger/db.go:434 +0x103
gx/ipfs/QmdKhi5wUQyV9i3GcTyfUmpfTntWjXu8DcyT9HyNbznYrn/badger.(*DB).get(0xc42015e000, 0xc42001c4b0, 0x47, 0x47, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /home/user/go/src/gx/ipfs/QmdKhi5wUQyV9i3GcTyfUmpfTntWjXu8DcyT9HyNbznYrn/badger/db.go:452 +0x7d
gx/ipfs/QmdKhi5wUQyV9i3GcTyfUmpfTntWjXu8DcyT9HyNbznYrn/badger.(*Txn).Get(0xc420466000, 0xc420476500, 0x3f, 0x40, 0x3f, 0x40, 0xc420136390)
    /home/user/go/src/gx/ipfs/QmdKhi5wUQyV9i3GcTyfUmpfTntWjXu8DcyT9HyNbznYrn/badger/transaction.go:369 +0x19e
gx/ipfs/Qmbjb3c2KRPVNZWSvQED8zAf12Brdbp3ksSnGdsJiytqUs/go-ds-badger.(*datastore).Has(0xc4200441f0, 0xc420476480, 0x3f, 0x1e08b00, 0x0, 0x0)
    /home/user/go/src/gx/ipfs/Qmbjb3c2KRPVNZWSvQED8zAf12Brdbp3ksSnGdsJiytqUs/go-ds-badger/datastore.go:109 +0x146
gx/ipfs/QmRhjB5Mnha4k6VH6qRFNabAVkxpbqC7bVw2daMKLHPXXN/go-ds-measure.(*measure).Has(0xc4204aa000, 0xc420476480, 0x3f, 0x1e08b00, 0x0, 0x0)
    /home/user/go/src/gx/ipfs/QmRhjB5Mnha4k6VH6qRFNabAVkxpbqC7bVw2daMKLHPXXN/go-ds-measure/measure.go:148 +0xde
gx/ipfs/QmRhjB5Mnha4k6VH6qRFNabAVkxpbqC7bVw2daMKLHPXXN/go-ds-measure.(*measure).Has(0xc4204aa1c0, 0xc420476480, 0x3f, 0x9fe000, 0x0, 0x0)
    /home/user/go/src/gx/ipfs/QmRhjB5Mnha4k6VH6qRFNabAVkxpbqC7bVw2daMKLHPXXN/go-ds-measure/measure.go:148 +0xde
gx/ipfs/Qmc74pRHvndTDAB5nXztWAV7vs5j1obvCb9ejfQzXp9USX/retry-datastore.(*Datastore).Has.func1(0x3f, 0xc425db3bd0)
    /home/user/go/src/gx/ipfs/Qmc74pRHvndTDAB5nXztWAV7vs5j1obvCb9ejfQzXp9USX/retry-datastore/datastore.go:59 +0x4e
gx/ipfs/Qmc74pRHvndTDAB5nXztWAV7vs5j1obvCb9ejfQzXp9USX/retry-datastore.(*Datastore).runOp(0xc420446e10, 0xc420143c18, 0xc420476480, 0x3f)
    /home/user/go/src/gx/ipfs/Qmc74pRHvndTDAB5nXztWAV7vs5j1obvCb9ejfQzXp9USX/retry-datastore/datastore.go:21 +0x2b
gx/ipfs/Qmc74pRHvndTDAB5nXztWAV7vs5j1obvCb9ejfQzXp9USX/retry-datastore.(*Datastore).Has(0xc420446e10, 0xc420476480, 0x3f, 0xc420476480, 0x3f, 0x38)
    /home/user/go/src/gx/ipfs/Qmc74pRHvndTDAB5nXztWAV7vs5j1obvCb9ejfQzXp9USX/retry-datastore/datastore.go:57 +0x9a
gx/ipfs/QmPpegoMqhAEqjncrzArm7KVWAkCm78rqL2DPuNjhPrshg/go-datastore/keytransform.(*ktds).Has(0xc420464120, 0xc420476180, 0x38, 0x30, 0xc420476180, 0x38)
    /home/user/go/src/gx/ipfs/QmPpegoMqhAEqjncrzArm7KVWAkCm78rqL2DPuNjhPrshg/go-datastore/keytransform/keytransform.go:48 +0x76
gx/ipfs/QmPpegoMqhAEqjncrzArm7KVWAkCm78rqL2DPuNjhPrshg/go-datastore/namespace.(*datastore).Has(0xc420446f00, 0xc420476180, 0x38, 0xc420011a90, 0xc427a5cff0, 0x22)
    <autogenerated>:1 +0x51
github.com/ipfs/go-ipfs/blocks/blockstore.(*blockstore).Has(0xc420464140, 0xc426093710, 0xc420470000, 0x7f28c35096c8, 0x0)
    /home/user/go/src/github.com/ipfs/go-ipfs/blocks/blockstore/blockstore.go:184 +0x41
github.com/ipfs/go-ipfs/blocks/blockstore.(*arccache).Has(0xc420132400, 0xc426093710, 0x1344e60, 0xc420498001, 0xc420498400)
    /home/user/go/src/github.com/ipfs/go-ipfs/blocks/blockstore/arc_cache.go:77 +0x6c
github.com/ipfs/go-ipfs/blocks/blockstore.(*gcBlockstore).Has(0xc4202bb380, 0xc426093710, 0x8, 0xc420498400, 0x0)
    <autogenerated>:1 +0x46
github.com/ipfs/go-ipfs/blockservice.(*blockService).AddBlocks(0xc4202be8a0, 0xc420498080, 0x8, 0x8, 0xc4270cce70, 0x8)
    /home/user/go/src/github.com/ipfs/go-ipfs/blockservice/blockservice.go:154 +0x2af
github.com/ipfs/go-ipfs/merkledag.(*dagService).AddMany(0xc420161cb0, 0x1d11440, 0xc425edb540, 0xc420050200, 0x8, 0x20, 0xc4200601c0, 0xc420170f50)
    /home/user/go/src/github.com/ipfs/go-ipfs/merkledag/merkledag.go:54 +0x154
gx/ipfs/Qme5bWv7wtjUNGsK2BNGVUFPKiuxWrsqrtvYwCLRw8YFES/go-ipld-format.(*Batch).asyncCommit.func1(0x1d11440, 0xc425edb540, 0xc420050200, 0x8, 0x20, 0xc4200639e0, 0x1d15720, 0xc420161cb0)
    /home/user/go/src/gx/ipfs/Qme5bWv7wtjUNGsK2BNGVUFPKiuxWrsqrtvYwCLRw8YFES/go-ipld-format/batch.go:96 +0x8a
created by gx/ipfs/Qme5bWv7wtjUNGsK2BNGVUFPKiuxWrsqrtvYwCLRw8YFES/go-ipld-format.(*Batch).asyncCommit
    /home/user/go/src/gx/ipfs/Qme5bWv7wtjUNGsK2BNGVUFPKiuxWrsqrtvYwCLRw8YFES/go-ipld-format/batch.go:94 +0xbc

P2 kind/bug topic/badger topic/commands

All 16 comments

@schomatis Good find!

There might be a fix for this already in https://github.com/dgraph-io/badger/pull/437

As discussed in dgraph-io/badger#437, that PR partially addressed the issue here.

The main problem seems to be that IPFS closes a DB it is still operating on when the command is canceled. I would like to trace the goroutine/program flow that leads to this situation; any pointers would be very appreciated.

@schomatis thanks for investigating.

Ipfs add spins off this goroutine here: https://github.com/ipfs/go-ipfs/blob/master/core/commands/add.go#L330

And when you send a ctrl+c, it invokes these handlers: https://github.com/ipfs/go-ipfs/blob/master/cmd/ipfs/main.go#L391

Cancelling that context cancels the entire program's context tree, which triggers a shutdown of the core.IpfsNode, eventually reaching this function: https://github.com/ipfs/go-ipfs/blob/master/core/core.go#L543 (which is where the datastore gets closed)

@whyrusleeping Thanks for the details on how IPFS shuts down. This seems to be more of an architecture decision than a technical one, and I'm lacking the high-level understanding of IPFS to add real value here.

Independently of the Badger issue discussed here, this is a problem that any of the services being closed will have: there are two opposing forces, the teardown on one side and IPFS operations like the add command on the other, which consume those services and may (or may not) detect the context cancellation at a later time (or never). The central authority (is that the IpfsNode?) should keep a record of which operations are consuming the services and wait for them to stop before shutting down (at least on the first attempt, with a single Ctrl+C, which tries to end things gracefully). (I'm using the generic term "operations" because I don't know exactly what goes on inside the IPFS daemon or an offline command: commands are run, HTTP requests are processed, etc.)
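
A minimal sketch of that bookkeeping idea, with all names hypothetical (this is not the IpfsNode API): each operation registers itself, and teardown waits for the registered operations before closing the services they consume.

```go
package node

import "sync"

// trackedNode is a hypothetical stand-in for the "central authority"
// mentioned above; it only tracks in-flight operations.
type trackedNode struct {
	ops sync.WaitGroup
}

// beginOp registers a running operation and returns the function the
// operation must call when it finishes.
func (n *trackedNode) beginOp() (done func()) {
	n.ops.Add(1)
	return n.ops.Done
}

// teardown waits for all registered operations before closing the services
// they consume (here reduced to a single closeDatastore callback).
func (n *trackedNode) teardown(closeDatastore func() error) error {
	n.ops.Wait()
	return closeDatastore()
}

// An operation such as `add` would wrap its work like this:
//
//	done := node.beginOp()
//	defer done()
//	// ... write blocks through the datastore ...
```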

@schomatis I think what we should probably be doing on ipfs shutdown is to wait for:

  • bitswap to shut down fully
  • the dht to shut down fully
  • the API to shut down fully

    • We have a log of all actively running API commands: https://github.com/ipfs/go-ipfs/blob/master/commands/reqlog.go

    • the ipfs diag cmds command prints them out.

    • If we cancel the master context, it should shut down the api server, and cancel all the contexts for all running commands. We could then just wait for all the commands in the reqlog to complete.

And then close the repo (incl badger)

That should prevent the issues we're seeing here, I think.
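
A rough sketch of that shutdown ordering, with placeholder interfaces standing in for bitswap, the DHT, the API server (with its reqlog), and the repo; none of these are the real go-ipfs types.

```go
package shutdown

import "io"

// Waiter is a hypothetical stand-in for a subsystem that can be asked to stop
// and then waited on (bitswap, the DHT, the API server and its reqlog).
type Waiter interface {
	Stop()
	Wait() // blocks until the subsystem has fully shut down
}

// Shutdown stops the subsystems in the order listed above and only then
// closes the repo (which is what closes badger).
func Shutdown(bitswap, dht, api Waiter, repo io.Closer) error {
	for _, s := range []Waiter{bitswap, dht, api} {
		s.Stop()
		s.Wait()
	}
	return repo.Close()
}
```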

@whyrusleeping From what I'm seeing in the code, LogRequest is only used in the HTTP handler (ServeHTTP), but does this also work for an offline command (like the one in this issue)? The "running operations" bookkeeping should happen at the core level (is that IpfsNode?).

@schomatis ah, very good point. If the daemon isn't running, then that won't be registered there... this might need a bit more surgery to fix, then.

We should have a way of waiting on actions to complete even in the case of a command running without a daemon. It might be easy enough to just add a single entry to the request log. I think the reqlog instance is still created whether or not you have a daemon. So all we have to do for that, I think, is change the executor (see makeExecutor in cmd/ipfs/main.go) to add the operation to the log when it's invoked.
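
A sketch of that executor change, with simplified stand-ins for the cmds types (the real go-ipfs-cmds Execute also receives a response emitter and an environment, so the signatures here are illustrative only): the wrapper adds the request to the log before running it and marks it done afterwards, so offline commands get the same bookkeeping as daemon requests.

```go
package cli

// Simplified stand-ins for the cmds library types.
type Request struct{ Path []string }

type Executor interface {
	Execute(req *Request) error
}

type RequestLog interface {
	// Add records the request and returns the function to call when it ends.
	Add(req *Request) (done func())
}

// loggingExecutor wraps another executor and records each request in the
// request log for the duration of its execution.
type loggingExecutor struct {
	inner Executor
	log   RequestLog
}

func (e *loggingExecutor) Execute(req *Request) error {
	done := e.log.Add(req)
	defer done()
	return e.inner.Execute(req)
}
```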

If you'd like, I can have someone else who already has some experience in the commands lib take a look, or if you're feeling adventurous, go right ahead :)

P.S. I'm going to bump the severity of this issue. I think it's actually a bit more important than a P3.

From what I understand of the code you mentioned (and please correct me, as I'm having a hard time understanding the commands library architecture), there are two scenarios for an IPFS command: either it runs as a stand-alone process (the offline mode), or it acts as an (HTTP) client that forwards the request to the IPFS daemon (the online mode, a client/server architecture). In the online mode the request is currently logged on the server side (ServeHTTP), which seems correct (the client is not executing the request, just forwarding it). In the offline mode the request (which is currently not logged) should be logged by the Execute function of the cmds.executor structure (in the go-ipfs-cmds repo), using the Environment argument that holds the request logger. Is that correct?

My confusion arises from some terminology I'm not grasping correctly. The request log actually logs only requests under execution. This might sound like a nitpick, but in the client/server architecture both the client and the server receive a command request (the client from the CLI and the server from the HTTP client), yet only the server should log it (because it is the one executing it). It could also be argued that the client does part of the execution (preprocessing and forwarding the request), but that should be explicitly clarified (IMO) in the documentation. In a similar way, the Executor interface (with the Execute command) obtained from makeExecutor can be either the real executor of the command (offline) or just a client that requests (but does not execute) it (online).

Also, the structure containing the request log is named Context in the commands package; it implements the Environment interface (in the cmds package of the go-ipfs-cmds repo), which has a Context() function that returns a context.Context (that is, Go's Context). That structure always seems to be asserted to commands.Context without error checking, even though functions like makeExecutor treat the environment variable as a plain interface{}.

If you'd like, I can have someone else who already has some experience in the commands lib take a look, or if you're feeling adventurous, go right ahead :)

I don't mind doing the heavy lifting on this issue, but I would certainly benefit from guidance on the architecture of the commands library from someone with more experience in this part of the code base.

@keks From what I'm seeing in the log you have a lot of experience with the commands library architecture; would you mind taking a look at this issue and telling me what you think? Also, if you could clarify some of my doubts (see the post above), that would really help me decide how to proceed here.

Hmm, I think this is caused by a flaw in the context cancellation model: you can only cancel everything at once and cannot wait until something has finished canceling.
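
That limitation shows up in a small self-contained example (not tied to any go-ipfs code): cancellation is fire-and-forget, so a caller that needs to know when a worker has actually finished cleaning up has to add its own completion signal.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// worker simulates a service that reacts to cancellation but needs some time
// to finish cleanly; it signals completion on its own done channel because
// the context alone cannot express "I am done canceling".
func worker(ctx context.Context, done chan<- struct{}) {
	defer close(done)
	<-ctx.Done()
	time.Sleep(100 * time.Millisecond) // pretend to flush and close resources
	fmt.Println("worker: clean shutdown finished")
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})
	go worker(ctx, done)

	cancel() // cancels everything at once...
	<-done   // ...so we must wait separately before closing shared services
	fmt.Println("main: safe to close the datastore now")
}
```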

@schomatis The request log actually only logs requests that come in through the HTTP interface. CLI requests are not logged, so this only affects client-server connections. Yes, this happens through the environment, which is passed into the library using the buildEnv function defined in main.go. This design allows the user of the cmds lib to hook into the handlers, which is currently used mostly for logging requests. Since go-ipfs only passes that in once, we can be pretty sure that all the environments we get are oldcmds.Context; otherwise it's a severe programming error and we're dealing with undefined state, so panicking is okay when that type assertion fails.
Yes, the executor can either be the local executor, which directly calls the command's Run function, or the HTTP client, which redirects the request to the server. Note that the cmd's PostRun function is called on the caller's side, not on the server side.

Maybe one way to tackle this is to add a method Shutdown() error to oldcmds.Context, which stops things in the right order, and call that from cmds when the context is canceled. I've mostly moved that part around since I started working on the code, though, so I'm not 100% sure it works the way I expect.

Did that answer your questions? If not, please ask away :)

This also touches on a similar problem we have with ipfs shutdown.

We sometimes close the HTTP API before sending the response, and/or don't close the connection to the API client cleanly, causing weird errors on the CLI and test failures in sharness.

@keks Thanks for the explanation. Regarding your proposed solution, could you expand on it some more? I don't see how this new Shutdown() method would interact with the existing IpfsNode.teardown() one.

OK, from what I understand there is no easy fix here, and the problem goes beyond the Badger datastore,

this is caused by a flaw in the context cancellation model: you can only cancel everything at once and can not wait until something is done canceling.

it just manifested itself in this particular use case of the Badger service shutdown.

I'm opening another issue to address this more general problem and when that is resolved we can see how to leverage it in the Badger case.

@schomatis Sorry I'm only responding now.

Oh, there already is a function like that. I don't know that part of the codebase well so I can't really comment.

We sometimes close the HTTP API before sending the response, and/or don't close the connection to the API client cleanly, causing weird errors on the CLI and test failures in sharness.

Note: this has now been fixed.

Can't we just take a read lock when writing and a write lock when closing? I believe the core issue here is that contexts exist for cancellation of requests, not closing of services (and we're abusing them all over the place).
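
As an illustration of that locking idea, a sketch around a datastore-like wrapper (not the actual go-ds-badger code): regular operations take the read lock, Close takes the write lock, so Close waits for in-flight operations and later operations fail cleanly instead of panicking.

```go
package guard

import (
	"errors"
	"sync"
)

var ErrClosed = errors.New("datastore closed")

// guardedDS wraps some underlying store with a close guard. Regular
// operations take the read lock (many may run concurrently); Close takes the
// write lock, so it waits for in-flight operations and blocks new ones.
type guardedDS struct {
	mu     sync.RWMutex
	closed bool
	put    func(key, value []byte) error // stand-in for the real backend
	close  func() error
}

func (d *guardedDS) Put(key, value []byte) error {
	d.mu.RLock()
	defer d.mu.RUnlock()
	if d.closed {
		return ErrClosed
	}
	return d.put(key, value)
}

func (d *guardedDS) Close() error {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.closed {
		return nil
	}
	d.closed = true
	return d.close()
}
```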
