Go: database/sql: add extra locking when calling driver

Created on 21 Jul 2017 · 31Comments · Source: golang/go

What version of Go are you using (`go version`)?

go version go1.8.3 darwin/amd64

What did you do?

Unlike Conn or Stmt there's no guidance on Tx indicating how the stdlib uses implementations. This seems to cause bugs in driver implementations, as a Tx is expected to handle concurrent calls from multiple goroutines unlike most of the other interfaces in the package.

This came up as I was debugging a panic inside of lib/pq and discovered it was a data race due to the authors not assuming that their Tx would have to deal with calls from multiple goroutines (the repro is here. It slowly adjusts timeouts to try and make the Context time out as results are coming back). Out of curiosity, I looked at go-sql-driver/mysql and the lack of a race there seems like a happy accident - the Context expiring occasionally causes an internal error to bubble up, since the authors appear to aggressively check their invariants.

I'm not sure whether this is a documentation problem or if there's something database/sql could do to make life easier for driver implementors.

FrozenDueToAge NeedsDecision

Source

blinsay

Most helpful comment

I've marked this for consideration for a 1.9 release.

ianlancetaylor on 3 Oct 2017

👍4

All 31 comments

/cc @kardianos @bradfitz

odeke-em on 21 Jul 2017

In my case I have a pool of database connections. I open a transaction and run queries. If I use BeginTx() I get an crash panic overtime.

Part of why I need a transaction is so I can use postgresql statement_timeout. The only way to use it with a connection pool is to use a transaction.

If I just use Begin() without context it never crashes. In my case though I need to be able to cancel the transaction if conditions are met in another goroutine.

chrispassas on 30 Aug 2017

Hey y'all,

It looks like driver libs are still tossing out panics in Go 1.9. The underlying problem with the way the stdlib and the drivers are misunderstanding each other seems like it's still there. This comment has some more info (race detector output, panic traces, and a reliable repro) on how lib/pq and database/sql interact to panic.

blinsay on 30 Aug 2017

This might not be a documentation issue but a driver issue.
row.Next locks on closemu, but doesn't lock on the driver connection lock. If DB.Next operations also locked on the dc, then we would probably correct this issue. Could someone apply the following diff locally and see if it fixes it? I need to do a fuller evaluation of what is going on as well.

In database/sql/sql.go

diff --git a/src/database/sql/sql.go b/src/database/sql/sql.go
index c609fe4..4f2c9f9 100644
--- a/src/database/sql/sql.go
+++ b/src/database/sql/sql.go
@@ -2454,6 +2454,9 @@ func (rs *Rows) nextLocked() (doClose, ok bool) {
        if rs.lastcols == nil {
                rs.lastcols = make([]driver.Value, len(rs.rowsi.Columns()))
        }
+       rs.dc.Lock()
+       defer rs.dc.Unlock()
+
        rs.lasterr = rs.rowsi.Next(rs.lastcols)
        if rs.lasterr != nil {
                // Close the connection if there is a driver error.
@@ -2503,6 +2506,8 @@ func (rs *Rows) NextResultSet() bool {
                doClose = true
                return false
        }
+       rs.dc.Lock()
+       defer rs.dc.Unlock()
        rs.lasterr = nextResultSet.NextResultSet()
        if rs.lasterr != nil {
                doClose = true

I need to do a full evaluation of this solution. But you could try it out and see if the issue goes away.

kardianos on 31 Aug 2017

I applied the patch and that seems to fix the panic and satisfy the race detector!

Now that the log lines aren't mostly race detector output/panic stacktrace I've started noticing that quite a lot of the errors that come back are sql.ErrNoRows. That seems surprising to me.

Here's some sample data after running the repro for a few minutes:

$ ~/oss/go/bin/go run -race repro.go &> repro.log

# wait a while, grab some coffee

$ grep -E -o "error=.*" < repro.log | sort | uniq -c
  46 error="context deadline exceeded"
   1 error="driver: bad connection"
3017 error="pq: canceling statement due to user request"
  42 error="sql: Rows are closed"
1369 error="sql: no rows in result set"

blinsay on 1 Sep 2017

👍1

@blinsay Good catch. Next step will be to determine why it is reporting no rows sometimes. I suspect it may be driver issue of returning the wrong error (NoRows rather then ctx.Error), but I haven't looked.

kardianos on 1 Sep 2017

Change https://golang.org/cl/65731 mentions this issue: database/sql: prevent race in driver by locking dc in Next

gopherbot on 24 Sep 2017

👍2

Once fixed can this be released in go 1.9.1 or 1.9.2? This is causing me problems in my application and waiting for 1.10 would be hard.

Thank you

chrispassas on 25 Sep 2017

I'm not that familiar with the Go release process. This issue appears to be fixed but sitting in Code Review. What comes next? Is there any hope of getting this in a 1.9.x release?

https://go-review.googlesource.com/c/go/+/65731

chrispassas on 2 Oct 2017

👍1

I've marked this for consideration for a 1.9 release.

ianlancetaylor on 3 Oct 2017

👍4

Every time we touch the locking in database/sql it seems like we introduce a deadlock. What is the condition under which the problem being fixed happens? Are we sure that the expected benefit here outweighs the very real chance of introducing a deadlock in some other case?

rsc on 13 Oct 2017

Also, is this a regression from Go 1.8? If not, then we really should leave it alone and let the ordinary Go 1.10 testing happen.

rsc on 13 Oct 2017

👎3 👍1

@rsc I don't believe this is a regression from go1.8.

The benefit is it prevents real races in real programs.

In this case the risk of deadlock seems non-existent. Unlike previous issues, this just using the existing locking in *driverConn as it is meant to be used, locking for the short period of time around driver connection calls. The only way this could introduce a deadlock is if a driver calls some public sql API referencing the same connection from within the driver.Next, which would be a blatant driver issue.

I offer no opinion on when it should be released.

kardianos on 13 Oct 2017

This problem is affecting our production application. A 1.9.2 fix would be appreciated.

chrispassas on 13 Oct 2017

👍2

What is the condition under which the problem being fixed happens?

@rsc the panic seems like it happens when a Context is cancelled as a query is returning results. I hit it organically while introducing a context.Context to some Postgres queries so that they could respect http.Request deadlines. With reasonable timeouts, we hit the panic every couple hours.

blinsay on 13 Oct 2017

In my case I hit the panic every minute. I can’t use context at all because of how frequently it panics.

chrispassas on 13 Oct 2017

Change https://golang.org/cl/71450 mentions this issue: Revert "database/sql: prevent race in driver by locking dc in Next"

gopherbot on 18 Oct 2017

Change https://golang.org/cl/71433 mentions this issue: database/sql: ensure all driver interfaces are called under single lock

gopherbot on 18 Oct 2017

Change https://golang.org/cl/71510 mentions this issue: [release-branch.go1.9] database/sql: prevent race in driver by locking dc in Next

gopherbot on 18 Oct 2017

@blinsay and @chrispassas and anyone else seeing these crashes: I am assuming you are using Linux. Can you please try https://golang-release-staging.storage.googleapis.com/go1.9.2/go1.9.2rc2.linux-amd64.tar.gz and see if the problem disappears? Thank you.

rsc on 19 Oct 2017

@rsc I just tried with 1.9.1 and can reproduce the crash reliably in a few seconds.

Using 1.9.2rc2 I've let it run for a few minutes 3 times with 0 crashes.

chrispassas on 19 Oct 2017

@chrispassas, thanks very much. Based on that, I'm comfortable with including that fix in the final Go 1.9.2.

rsc on 19 Oct 2017

🎉2

@rsc Thank you, I'm looking forward to having it!

chrispassas on 19 Oct 2017

OK, I am going to approve the cherry-pick to Go 1.9.2 over in #22314 and leave this bug open to track additional locking fixes in Go 1.10.

rsc on 20 Oct 2017

I'm reopening this because the commit that auto-closed it was reverted afterwards by 292366e7162a030cade7b177eb4ad55bd887d25f

Reason for revert: Fails to fix all the locking issues.

AFAICT this issue has not been addressed for 1.10. Feel free to close if that's not correct. (I was brought here by some serious Tx+context races/crashes in both 1.9.2 and 1.10beta1.)

cespare on 21 Dec 2017

The fix that went into version 1.9.2 did fix the crashes for QueryContext. If I use ExecContext and PrepareContext I do get crashes fairly often. It looked like right after 1.9.2 was tagged they went back and fixed the muteness for 1.10.

If I can get some time this week I'll do some test with 1.10 beta 1 to see if it still crashes.

chrispassas on 21 Dec 2017

The fix for this did get in shortly after that revert. Yes, please verify @chrispassas , but as far as I know this is fixed.

kardianos on 21 Dec 2017

@kardianos can you please link the fix here?

cespare on 21 Dec 2017

https://github.com/golang/go/commit/1126d1483f0397648905fcd4590ae45352cabd69
https://golang.org/cl/71433

kardianos on 21 Dec 2017

@kardianos thanks, I missed that.

After running my crashing code with on tip + CL 85175 (for #23208) I don't see any issues.

I'm going to re-close this as we don't seem to have any reason to believe that there's still a locking issue here. We can revisit if @chrispassas finds something.

cespare on 22 Dec 2017

I have tested version 1.9.2 and QueryContext is good but ExecContext will trigger a depending how often your calling it.

I still need to test that 1.10-beta1 ExecContext is OK.

chrispassas on 22 Dec 2017

Was this page helpful?

0 / 5 - 0 ratings