go version)?go version go1.8.3 darwin/amd64
Unlike Conn or Stmt there's no guidance on Tx indicating how the stdlib uses implementations. This seems to cause bugs in driver implementations, as a Tx is expected to handle concurrent calls from multiple goroutines unlike most of the other interfaces in the package.
This came up as I was debugging a panic inside of lib/pq and discovered it was a data race due to the authors not assuming that their Tx would have to deal with calls from multiple goroutines (the repro is here. It slowly adjusts timeouts to try and make the Context time out as results are coming back). Out of curiosity, I looked at go-sql-driver/mysql and the lack of a race there seems like a happy accident - the Context expiring occasionally causes an internal error to bubble up, since the authors appear to aggressively check their invariants.
I'm not sure whether this is a documentation problem or if there's something database/sql could do to make life easier for driver implementors.
/cc @kardianos @bradfitz
In my case I have a pool of database connections. I open a transaction and run queries. If I use BeginTx() I get an crash panic overtime.
Part of why I need a transaction is so I can use postgresql statement_timeout. The only way to use it with a connection pool is to use a transaction.
If I just use Begin() without context it never crashes. In my case though I need to be able to cancel the transaction if conditions are met in another goroutine.
Hey y'all,
It looks like driver libs are still tossing out panics in Go 1.9. The underlying problem with the way the stdlib and the drivers are misunderstanding each other seems like it's still there. This comment has some more info (race detector output, panic traces, and a reliable repro) on how lib/pq and database/sql interact to panic.
This might not be a documentation issue but a driver issue.
row.Next locks on closemu, but doesn't lock on the driver connection lock. If DB.Next operations also locked on the dc, then we would probably correct this issue. Could someone apply the following diff locally and see if it fixes it? I need to do a fuller evaluation of what is going on as well.
In database/sql/sql.go
diff --git a/src/database/sql/sql.go b/src/database/sql/sql.go
index c609fe4..4f2c9f9 100644
--- a/src/database/sql/sql.go
+++ b/src/database/sql/sql.go
@@ -2454,6 +2454,9 @@ func (rs *Rows) nextLocked() (doClose, ok bool) {
if rs.lastcols == nil {
rs.lastcols = make([]driver.Value, len(rs.rowsi.Columns()))
}
+ rs.dc.Lock()
+ defer rs.dc.Unlock()
+
rs.lasterr = rs.rowsi.Next(rs.lastcols)
if rs.lasterr != nil {
// Close the connection if there is a driver error.
@@ -2503,6 +2506,8 @@ func (rs *Rows) NextResultSet() bool {
doClose = true
return false
}
+ rs.dc.Lock()
+ defer rs.dc.Unlock()
rs.lasterr = nextResultSet.NextResultSet()
if rs.lasterr != nil {
doClose = true
I need to do a full evaluation of this solution. But you could try it out and see if the issue goes away.
I applied the patch and that seems to fix the panic and satisfy the race detector!
Now that the log lines aren't mostly race detector output/panic stacktrace I've started noticing that quite a lot of the errors that come back are sql.ErrNoRows. That seems surprising to me.
Here's some sample data after running the repro for a few minutes:
$ ~/oss/go/bin/go run -race repro.go &> repro.log
# wait a while, grab some coffee
$ grep -E -o "error=.*" < repro.log | sort | uniq -c
46 error="context deadline exceeded"
1 error="driver: bad connection"
3017 error="pq: canceling statement due to user request"
42 error="sql: Rows are closed"
1369 error="sql: no rows in result set"
@blinsay Good catch. Next step will be to determine why it is reporting no rows sometimes. I suspect it may be driver issue of returning the wrong error (NoRows rather then ctx.Error), but I haven't looked.
Change https://golang.org/cl/65731 mentions this issue: database/sql: prevent race in driver by locking dc in Next
Once fixed can this be released in go 1.9.1 or 1.9.2? This is causing me problems in my application and waiting for 1.10 would be hard.
Thank you
I'm not that familiar with the Go release process. This issue appears to be fixed but sitting in Code Review. What comes next? Is there any hope of getting this in a 1.9.x release?
I've marked this for consideration for a 1.9 release.
Every time we touch the locking in database/sql it seems like we introduce a deadlock. What is the condition under which the problem being fixed happens? Are we sure that the expected benefit here outweighs the very real chance of introducing a deadlock in some other case?
Also, is this a regression from Go 1.8? If not, then we really should leave it alone and let the ordinary Go 1.10 testing happen.
@rsc I don't believe this is a regression from go1.8.
The benefit is it prevents real races in real programs.
In this case the risk of deadlock seems non-existent. Unlike previous issues, this just using the existing locking in *driverConn as it is meant to be used, locking for the short period of time around driver connection calls. The only way this could introduce a deadlock is if a driver calls some public sql API referencing the same connection from within the driver.Next, which would be a blatant driver issue.
I offer no opinion on when it should be released.
This problem is affecting our production application. A 1.9.2 fix would be appreciated.
What is the condition under which the problem being fixed happens?
@rsc the panic seems like it happens when a Context is cancelled as a query is returning results. I hit it organically while introducing a context.Context to some Postgres queries so that they could respect http.Request deadlines. With reasonable timeouts, we hit the panic every couple hours.
In my case I hit the panic every minute. I can鈥檛 use context at all because of how frequently it panics.
Change https://golang.org/cl/71450 mentions this issue: Revert "database/sql: prevent race in driver by locking dc in Next"
Change https://golang.org/cl/71433 mentions this issue: database/sql: ensure all driver interfaces are called under single lock
Change https://golang.org/cl/71510 mentions this issue: [release-branch.go1.9] database/sql: prevent race in driver by locking dc in Next
@blinsay and @chrispassas and anyone else seeing these crashes: I am assuming you are using Linux. Can you please try https://golang-release-staging.storage.googleapis.com/go1.9.2/go1.9.2rc2.linux-amd64.tar.gz and see if the problem disappears? Thank you.
@rsc I just tried with 1.9.1 and can reproduce the crash reliably in a few seconds.
Using 1.9.2rc2 I've let it run for a few minutes 3 times with 0 crashes.
@chrispassas, thanks very much. Based on that, I'm comfortable with including that fix in the final Go 1.9.2.
@rsc Thank you, I'm looking forward to having it!
OK, I am going to approve the cherry-pick to Go 1.9.2 over in #22314 and leave this bug open to track additional locking fixes in Go 1.10.
I'm reopening this because the commit that auto-closed it was reverted afterwards by 292366e7162a030cade7b177eb4ad55bd887d25f
Reason for revert: Fails to fix all the locking issues.
AFAICT this issue has not been addressed for 1.10. Feel free to close if that's not correct. (I was brought here by some serious Tx+context races/crashes in both 1.9.2 and 1.10beta1.)
The fix that went into version 1.9.2 did fix the crashes for QueryContext. If I use ExecContext and PrepareContext I do get crashes fairly often. It looked like right after 1.9.2 was tagged they went back and fixed the muteness for 1.10.
If I can get some time this week I'll do some test with 1.10 beta 1 to see if it still crashes.
The fix for this did get in shortly after that revert. Yes, please verify @chrispassas , but as far as I know this is fixed.
@kardianos can you please link the fix here?
@kardianos thanks, I missed that.
After running my crashing code with on tip + CL 85175 (for #23208) I don't see any issues.
I'm going to re-close this as we don't seem to have any reason to believe that there's still a locking issue here. We can revisit if @chrispassas finds something.
I have tested version 1.9.2 and QueryContext is good but ExecContext will trigger a depending how often your calling it.
I still need to test that 1.10-beta1 ExecContext is OK.
Most helpful comment
I've marked this for consideration for a 1.9 release.