go version)?1.12.8
I haven't been able to run with the latest version, but examination of the code shows that the problem is still there.
Very rarely, I have observed problems resulting from transactions rolled back as a result of context cancellation. Namely, I have observed writes from presumably aborted transactions persist nonetheless. I added enough debugging output to the mysql driver to observe the following sequence:
driver.Tx.Rollback()
driver.Stmt.ExecContext()
driver.Conn.Close()
The driver documentation at https://golang.org/pkg/database/sql/driver/#ConnBeginTx:
// If the context is canceled by the user the sql package will
// call Tx.Rollback before discarding and closing the connection.
Furthermore, the documentation promises that driver connections won't be used concurrently by multiple goroutines:
Conn is a connection to a database. It is not used concurrently by multiple goroutines.
I believe that contract was violated and that there is a gap between the "rollback" and the "close" in which a statement was allowed to execute. Since MySQL defaults to autocommit, that statement which we thought was executing within a transaction (which got rolled back) actually executed outside the transaction and committed.
I believe the bug is in tx.rollback in sql.go:
// rollback aborts the transaction and optionally forces the pool to discard
// the connection.
func (tx *Tx) rollback(discardConn bool) error {
if !atomic.CompareAndSwapInt32(&tx.done, 0, 1) {
return ErrTxDone
}
var err error
withLock(tx.dc, func() {
err = tx.txi.Rollback()
})
if err != driver.ErrBadConn {
tx.closePrepared()
}
if discardConn {
err = driver.ErrBadConn
}
tx.close(err)
return err
}
The connection is locked to call driver.Tx.Rollback(), but nothing prevents a statement from executing before we get to tx.close(err).
Standalone repro:
https://gist.github.com/dadkins-at-dropbox/5b051b9ccb30f115c332c6037d03d63d
package main
import (
"context"
"database/sql"
"fmt"
"log"
"os"
"time"
_ "github.com/go-sql-driver/mysql"
)
func main() {
if len(os.Args) != 2 {
log.Fatalf("usage: %s <dbname>", os.Args[0])
}
dbName := os.Args[1]
db, err := sql.Open("mysql", "root@/"+dbName)
if err != nil {
log.Fatal(err)
}
defer db.Close()
db.SetMaxOpenConns(100)
db.SetMaxIdleConns(100)
if _, err := db.Exec("drop table if exists t"); err != nil {
log.Fatal(err)
}
if _, err := db.Exec("create table t (id int not null auto_increment primary key)"); err != nil {
log.Fatal(err)
}
ins, err := db.Prepare("insert into t (id) values (null)")
if err != nil {
log.Fatal(err)
}
sel, err := db.Prepare("select count(*) from t")
if err != nil {
log.Fatal(err)
}
for i := 0; i < 10; i++ {
go func() {
for i := 0; ; i++ {
runTx(db, ins)
}
}()
}
for range time.Tick(time.Second) {
var n int
r := sel.QueryRow()
if err := r.Scan(&n); err != nil {
fmt.Println("select", err)
continue
}
fmt.Printf("%d rows (should be 0)\n", n)
}
}
func runTx(db *sql.DB, stmt *sql.Stmt) error {
ctx, cancel := context.WithCancel(context.Background())
quit := make(chan struct{})
done := make(chan struct{})
tx, err := db.BeginTx(ctx, nil)
if err != nil {
fmt.Println("begin", err)
return err
}
defer tx.Rollback()
stmt = tx.Stmt(stmt)
go func() {
timer := time.NewTimer(time.Microsecond)
defer timer.Stop()
select {
case <-quit:
case <-timer.C:
}
cancel()
close(done)
}()
defer func() {
close(quit)
<-done
}()
for i := 0; i < 10; i++ {
if _, err := stmt.Exec(); err != nil {
return err
}
}
_ = tx.Rollback()
return nil
}
outputs
1 rows (should be 0)
3 rows (should be 0)
4 rows (should be 0)
5 rows (should be 0)
^Csignal: interrupt
when you should never succeed in inserting any rows, since the transaction is always rolled back.
$ go version
go version go1.13.1 darwin/amd64
You used stmt.Exec(). It doesn't use transaction tx.
You should use tx.Exec() instead.
stmt = tx.Stmt(stmt) associates the prepared statement with the transaction, and that's the one the repro uses.
I'm sorry, I've missed it. I confirm your repro.
@bradfitz @kardianos @kevinburke
Here's another log trace, via the mysql driver, of driver.Tx.Rollback being called in the middle of a prepared driver.Stmt.ExecContext and wreaking havoc, because that driver is not expecting those calls to be interleaved:
```
[mysql] 2019/10/10 23:25:59 connection.go:577: [0xc000348fc0]stmt.ExecContext
* COM_STMT_EXECUTE
[mysql] 2019/10/10 23:25:59 packets.go:155: [0xc000348fc0] :write len=10 seq=0
[mysql] 2019/10/10 23:25:59 packets.go:156: [0xc000348fc0]
00000000 17 01 00 00 00 00 01 00 00 00 |..........|
* column count
[mysql] 2019/10/10 23:25:59 packets.go:45: [0xc000348fc0] :read len=1 seq=1
[mysql] 2019/10/10 23:25:59 packets.go:80: [0xc000348fc0]
00000000 01 |.|
* column def
[mysql] 2019/10/10 23:25:59 packets.go:45: [0xc000348fc0] :read len=37 seq=2
[mysql] 2019/10/10 23:25:59 packets.go:80: [0xc000348fc0]
00000000 03 64 65 66 00 00 00 0f 43 4f 4e 4e 45 43 54 49 |.def....CONNECTI|
00000010 4f 4e 5f 49 44 28 29 00 0c 3f 00 15 00 00 00 08 |ON_ID()..?......|
00000020 a1 00 00 00 00 |.....|
* EOF
[mysql] 2019/10/10 23:25:59 packets.go:45: [0xc000348fc0] :read len=5 seq=3
[mysql] 2019/10/10 23:25:59 packets.go:80: [0xc000348fc0]
00000000 fe 00 00 03 00 |.....|
* RACE: stmt.ExecContext is still active, shouldn't be calling tx.Rollback
[mysql] 2019/10/10 23:25:59 transaction.go:30: [0xc000348fc0]tx.Rollback
[mysql] 2019/10/10 23:25:59 packets.go:454: busy buffer
resets mc.sequence = 0 *
* packet header for row...
[mysql] 2019/10/10 23:25:59 packets.go:45: [0xc000348fc0] :read len=10 seq=4
ErrPktSyncMul? since mc.sequence got reset **
* disaster, we've treating the packet body as a new packet header, and the sequence number matches! we're going to lunch now while we wait to read 10MB that's never coming.
[mysql] 2019/10/10 23:25:59 packets.go:45: [0xc000348fc0] :read len=10027008 seq=0
It would be great if someone more familiar with this code could take a look. I think I've pretty much nailed down the problem, but I'm not really in a good position to propose a fix to this code which doesn't introduce new problems or break some intricate assumptions.
I've confirmed the following race:
Goroutine 1:
(Stmt).ExecContext begins a prepared statement in a transaction.
(Tx).grabConn obtains tx.closemu in read mode and returns a valid (*driveConn)
Goroutine 2:
(Tx).awaitDone awakes on canceled context and calls (Tx).rollback
(Tx).rollback atomically changes tx.done from 0 to 1
(Tx).rollback calls tx.txi.Rollback() while holding (*driverConn) mutex.
Goroutine 1:
(*Stmt).ExecContext calls resultFromStatement with *driverConn and *driverStmt
resultFromStatement locks ds, which is actually dc.mutex.
resultFromStatement executes statement with driverStmt.ExecContext
resultFromStatement unlocks dc.mutex.
Goroutine 2:
(Tx).rollback closes prepared statements and calls (Tx).close
(Tx).close calls tx.cancel(), which is a no-op.
(Tx).close calls tx.releaseConn() while holding tx.closemu.
It definitely allows a statement to execute on a connection after a rollback caused by context cancellation, which under mysql autocommits that statement unexpectedly.
Examining the code, it seems like we should hold tx.closemu for the duration of tx.Close or tx.Rollback, certainly while we're changing the state of tx.done. I tried the following workaround:
diff --git a/src/database/sql/sql.go b/src/database/sql/sql.go
index 0f5bbc01c9..abc892dc08 100644
--- a/src/database/sql/sql.go
+++ b/src/database/sql/sql.go
@@ -2021,8 +2021,8 @@ var ErrTxDone = errors.New("sql: transaction has already been committed or rolle
func (tx *Tx) close(err error) {
tx.cancel()
- tx.closemu.Lock()
- defer tx.closemu.Unlock()
+ //tx.closemu.Lock()
+ //defer tx.closemu.Unlock()
tx.releaseConn(err)
tx.dc = nil
@@ -2087,6 +2087,10 @@ func (tx *Tx) Commit() error {
}
return tx.ctx.Err()
}
+
+ tx.closemu.Lock()
+ defer tx.closemu.Unlock()
+
if !atomic.CompareAndSwapInt32(&tx.done, 0, 1) {
return ErrTxDone
}
@@ -2104,6 +2108,12 @@ func (tx *Tx) Commit() error {
// rollback aborts the transaction and optionally forces the pool to discard
// the connection.
func (tx *Tx) rollback(discardConn bool) error {
+ // Problem: We can't lock tx.closemu while an oustanding
+ // Rows holding tx.closemu in R mode.
+ // We must call tx.cancel to release that first.
+ tx.cancel()
+ tx.closemu.Lock()
+ defer tx.closemu.Unlock()
if !atomic.CompareAndSwapInt32(&tx.done, 0, 1) {
return ErrTxDone
}
It was unsuccessful.
First, I broke Issues 18719 https://github.com/golang/go/issues/18719. The fix for that issue (https://go-review.googlesource.com/c/go/+/35550/) introduced the closemu mutex, which was supposed to be held exclusively while closing a tx and releasing its conn back to the pool, while queries hold closemu in read mode to prevent the close from occurring mid-query. Its test inserted a hook into tx.grabConn which expects to be able to wait for tx.done to change from 0 to 1 while it sits there holding tx.closemu in read mode. That no longer works, but the test could be changed.
The fix to Issue 18719 caused another issue: https://github.com/golang/go/issues/20575. Namely, Rows objects hold closemu in read mode, which prevents us from rolling back a transaction without first calling (Rows).Close. The fix for that issue (https://go-review.googlesource.com/c/go/+/44812/) sticks a call to tx.cancel() in tx.close() prior to obtaining tx.closemu, which the (*Rows) code catches in order to unlock tx.closemu. My attempted fix broke this, because I moved the call to tx.closemu.Lock() earlier.
I can somewhat workaround this problem by inserting an extra call to tx.cancel() in tx.Rollback, but that's hacky and introduces a race between tx.rollback and tx.awaitDone also calling tx.rollback.
I think I'm close to a reasonable solution, but I don't quite know how to resolve the problems with tx.closemu. We need the exclusion between the various queries we can run on a transaction, and commit/rollback. But, perhaps a mutex is not the right solution to hold across multiple calls while iterating over a resultset. Or perhaps context cancellation is not the right mechanism to reclaim a connection while iterating over a rowset.
I now have a plausible solution, and a test case that it fixes (in addition to the repro program already posted). I'm awaiting approval from my employer to submit the change for review, but here it is in the meantime if anyone's interested. My company only uses even numbered versions of Go, so I'd like to get this fix into 1.14 if possible:
diff --git a/src/database/sql/sql.go b/src/database/sql/sql.go
index 0f5bbc01c9..a0fd05423c 100644
--- a/src/database/sql/sql.go
+++ b/src/database/sql/sql.go
@@ -2018,12 +2018,10 @@ var ErrTxDone = errors.New("sql: transaction has already been committed or rolle
// close returns the connection to the pool and
// must only be called by Tx.rollback or Tx.Commit.
+//
+// The transaction context must already have been canceled,
+// and closemu must locked.
func (tx *Tx) close(err error) {
- tx.cancel()
-
- tx.closemu.Lock()
- defer tx.closemu.Unlock()
-
tx.releaseConn(err)
tx.dc = nil
tx.txi = nil
@@ -2090,6 +2088,18 @@ func (tx *Tx) Commit() error {
if !atomic.CompareAndSwapInt32(&tx.done, 0, 1) {
return ErrTxDone
}
+
+ // cancel is called after done transitions from 0 to 1, but
+ // before we attempt to grab closemu, since context cancellation
+ // will cause any operations holding closemu in read mode to
+ // release their locks.
+ tx.cancel()
+
+ // closemu must be acquired exclusively before rolling back to
+ // allow any oustanding operations to finish.
+ tx.closemu.Lock()
+ defer tx.closemu.Unlock()
+
var err error
withLock(tx.dc, func() {
err = tx.txi.Commit()
@@ -2107,6 +2117,18 @@ func (tx *Tx) rollback(discardConn bool) error {
if !atomic.CompareAndSwapInt32(&tx.done, 0, 1) {
return ErrTxDone
}
+
+ // cancel is called after done transitions from 0 to 1, but
+ // before we attempt to grab closemu, since context cancellation
+ // will cause any operations holding closemu in read mode to
+ // release their locks.
+ tx.cancel()
+
+ // closemu must be acquired exclusively before rolling back to
+ // allow any oustanding operations to finish.
+ tx.closemu.Lock()
+ defer tx.closemu.Unlock()
+
var err error
withLock(tx.dc, func() {
err = tx.txi.Rollback()
Your patch looks good to me. Would you create a pull request?
I have a pull request ready to go, as soon as I get the okay regarding
company CLA.
-Dan
On Fri, Nov 8, 2019 at 4:49 AM Inada Naoki notifications@github.com wrote:
Your patch looks good to me. Would you create a pull request?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
https://github.com/golang/go/issues/34775?email_source=notifications&email_token=ANNV6MCDQ5L5PHYFIAFE7LTQSVN47A5CNFSM4I6WCSP2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDRMTGY#issuecomment-551733659,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/ANNV6MH2P5HO7SX4RSZUR5LQSVN47ANCNFSM4I6WCSPQ
.
Change https://golang.org/cl/216240 mentions this issue: database/sql: prevent Tx statement from committing after rollback
CL Ready, just waiting for a reviewer.
I've got you!
Will this be released as a patch? Are there any mitigations we can apply in the meantime? Having some DB queries executed instead of rolled-back is potentially catastrophic.
@ire-and-curses so this issue has existed since Go1.12 and unfortunately because it wasn't a regression but rather a long standing bug, the policy is that we can only cherry pick to prior releases only regressions and security issues. Go1.15 release candidates and betas will be made between now and August 2020 when Go1.15 shall be released. Sorry about that.
You can however, clone Go, add the patch to a branch and build Go, and backport the respective fixes you'd like.
@odeke-em I believe this fix satisfies the requirements for a backport according to the minor version release policy because the issue is a serious issue that is causing databases to get into bad state where no work around is possible and updating to the latest release Go 1.14.3 does not fix the issue.
If patching Go is the suggested work around that suggests there is no work around in an application. Unfortunately building custom Go isn't an option for everyone as our product is open source and so we'd need to ask everyone who uses it to build with a custom Go.
@gopherbot please consider this for backport because this issue breaks production applications resulting in unexpected database state and cannot be fixed with a work around including updating the version of Go to the latest version.
Backport issue(s) opened: #39101 (for 1.14).
Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.
Awesome, thank you @leighmcculloch for the advocacy and for citing the backporting clause, it will be useful for folks to get it on Go1.14. The cherry pick has been approved and we shall get it up and approved shortly.
Awesome, thank you @leighmcculloch for the advocacy and for citing the backporting clause, it will be useful for folks to get it on Go1.14. The cherry pick has been approved and we shall get it up and approved shortly.
We're hitting this bug in production. Really looking forward to seeing it fixed in 1.14.5.
Change https://golang.org/cl/242101 mentions this issue: [release-branch.go1.14] database/sql: backport 3 Tx rollback related CLs
Change https://golang.org/cl/242102 mentions this issue: [release-branch.go1.14] database/sql: backport 3 Tx rollback related CLs
Based on https://github.com/golang/go/issues/34775#issuecomment-628115836:
this issue has existed since Go1.12 and unfortunately because it wasn't a regression but rather a long standing bug
The backport policy requires that a fix must either be backported to both 1.14 and 1.13, or neither (see https://github.com/golang/go/issues/34536#issuecomment-572188389).
I've opened #40205 for this fix to be considered for backporting to 1.13 as well.
Change https://golang.org/cl/242522 mentions this issue: [release-branch.go1.13] database/sql: backport 5 Tx rollback related CLs
Thank you @dmitshur! I've mailed CL 242522 for the Go1.13.14 backport.
Thank you for doing that work, @odeke-em. The 1.13 backport issue is still in CherryPickCandidate state, so the release managers will need to decide whether it should be approved for backport. Having the backport CL ready makes that process easier.
Most helpful comment
Awesome, thank you @leighmcculloch for the advocacy and for citing the backporting clause, it will be useful for folks to get it on Go1.14. The cherry pick has been approved and we shall get it up and approved shortly.