For any application performing transactions, it's important that the application be able to configure an explicit limit on how long a transaction thread will block trying to perform the transaction.
Enhancement Request: Add support for an SQL variable like this:
SET transaction_timeout = 2500
This will then cause an immediate retry exception (and transaction rollback) if any statement (or the overall transaction?) takes longer than 2500 milliseconds to complete.
It is important that the corresponding SQLError has either existing the "retry" error code ("CR000") or a new, well-documented timeout error code; the application will likely end-up treating these in the same way as retry errors.
Note that this is only a partial solution to the overall problem of controlling timeouts: there also needs to be a timeout configured on the client's PostgreSQL driver itself (for example, there could be a network hang between the local client and the CockroachDB node it's talking to via JDBC). This part of the problem is not within CockroachDB's realm of control, however, it would be nice and helpful for SQL users if CockroachDB also provided a clearly documented way to configure the local PostgreSQL driver timeout. E.g., there are some timeout parameters shown here.
Thanks.
Discussion thread: https://groups.google.com/forum/#!topic/cockroach-db/FpBemFJM4w8
@archiecobbs Looks like postgres only supports statement_timeout. Would that be sufficient, or would you want separate control for an overall transaction timeout?
@petermattis Having statement_timeout would certainly be a big improvement, assuming COMMIT TRANSACTION counts as a statement.
Right now the amount of time you could get stuck is (theoretically) unbounded, and this would fix that.
Yes, COMMIT TRANSACTION is a statement. Internally, the corresponding EndTransaction request already contains a deadline so we can ensure that we positively know a transaction was aborted when the deadline was exceeded.
FWIW, a unit test I'm running in Java against Cockroach hangs forever. This is the situation that motivated creating this issue. Why this test is hanging whereas it doesn't hang with other databases (e.g., MySQL) is a related question.
Here's the stack trace of the thread that's stuck forever waiting for some response:
"pool-1-thread-4" prio=5 tid=0x00007fa033acf000 nid=0x5d03 runnable [0x000000011c011000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:143)
at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:112)
at org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:70)
at org.postgresql.core.PGStream.ReceiveChar(PGStream.java:283)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1799)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:200)
- locked <0x00000007ad634380> (a org.postgresql.core.v3.QueryExecutorImpl)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:424)
at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:161)
at org.postgresql.jdbc.PgPreparedStatement.executeUpdate(PgPreparedStatement.java:133)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.postgresql.ds.PGPooledConnection$StatementHandler.invoke(PGPooledConnection.java:426)
at com.sun.proxy.$Proxy13.executeUpdate(Unknown Source)
at org.jsimpledb.kv.sql.SQLKVTransaction.update(SQLKVTransaction.java:296)
at org.jsimpledb.kv.sql.SQLKVTransaction.put(SQLKVTransaction.java:131)
- locked <0x00000007ad63f1e0> (a org.jsimpledb.kv.cockroach.CockroachKVTransaction)
at org.jsimpledb.kv.test.KVDatabaseTest$Writer.run(KVDatabaseTest.java:877)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Sorry I don't have more debug info (I'm go-ignorant). Let me know if you want instructions on how to reproduce.
Which version of cockroachdb are you running? We had a deadlock in last week's beta that will be fixed in the new version being released today. If that doesn't fix it, please file a separate issue with more information about what exactly you're trying to do, and any relevant bits of the server logs (by default in cockroach-data/logs/)
Version I have is:
$ cockroach version
Build Tag: beta-20160505
Build Time: 2016/05/12 00:16:46
Platform: darwin amd64
Go Version: go1.6
C Compiler: 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)
I'll upgrade tomorrow, try again, and then update this issue.
Thanks.
I'm getting the same behavior with this version:
Build Tag: beta-20160512
Build Time: 2016/05/13 01:21:09
Platform: darwin amd64
Go Version: go1.6
C Compiler: 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)
Two different tests are hanging. The first one hangs sometimes, the second one hangs always.
Test setup:
CREATE TABLE IF NOT EXISTS "KV" (
"kv_key" BYTES PRIMARY KEY NOT NULL,
"kv_value" BYTES NOT NULL
);
DELETE FROM "KV";
Test 1: testConflictingTransactions():
Open two transactions T1 and T2
T1: SELECT "kv_value" FROM "KV" WHERE "kv_key" = ? where ? = 0x10
T2: SELECT "kv_value" FROM "KV" WHERE "kv_key" = ? where ? = 0x10
T1: UPSERT INTO "KV" ("kv_key", "kv_value") VALUES (?, ?) where ?1 = 0x10, ?2 = 0x01
T2: UPSERT INTO "KV" ("kv_key", "kv_value") VALUES (?, ?) where ?1 = 0x10, ?2 = 0x02
At this point T2 hangs - sometimes:
"pool-1-thread-4" prio=5 tid=0x00007fd25a994000 nid=0x5d03 runnable [0x0000000116f98000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:143)
at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:112)
at org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:70)
at org.postgresql.core.PGStream.ReceiveChar(PGStream.java:283)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1799)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:200)
- locked <0x00000007ad6381c0> (a org.postgresql.core.v3.QueryExecutorImpl)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:424)
at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:161)
at org.postgresql.jdbc.PgPreparedStatement.executeUpdate(PgPreparedStatement.java:133)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.postgresql.ds.PGPooledConnection$StatementHandler.invoke(PGPooledConnection.java:426)
at com.sun.proxy.$Proxy13.executeUpdate(Unknown Source)
at org.jsimpledb.kv.sql.SQLKVTransaction.update(SQLKVTransaction.java:296)
at org.jsimpledb.kv.sql.SQLKVTransaction.put(SQLKVTransaction.java:131)
- locked <0x00000007ad643020> (a org.jsimpledb.kv.cockroach.CockroachKVTransaction)
at org.jsimpledb.kv.test.KVDatabaseTest$Writer.run(KVDatabaseTest.java:890)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Test 2: testNonconflictingTransactions():
Open 10 transactions T1, T2, ..., T10
T1: SELECT "kv_key", "kv_value" FROM "KV" WHERE "kv_key" >= ? ORDER BY "kv_key" ASC LIMIT 1 where ? = 0x00
T1: UPSERT INTO "KV" ("kv_key", "kv_value") VALUES (?, ?) where ?1 = 0x80, ?2 = 0x02
T2: SELECT "kv_key", "kv_value" FROM "KV" WHERE "kv_key" >= ? ORDER BY "kv_key" ASC LIMIT 1 where ? = 0x01
Test hangs at this point:
"pool-1-thread-8" prio=5 tid=0x00007fd1fda8c800 nid=0x6503 runnable [0x000000011e378000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:143)
at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:112)
at org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:70)
at org.postgresql.core.PGStream.ReceiveChar(PGStream.java:283)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1799)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:200)
- locked <0x00000007ad638468> (a org.postgresql.core.v3.QueryExecutorImpl)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:424)
at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:161)
at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:114)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.postgresql.ds.PGPooledConnection$StatementHandler.invoke(PGPooledConnection.java:426)
at com.sun.proxy.$Proxy13.executeQuery(Unknown Source)
at org.jsimpledb.kv.sql.SQLKVTransaction.query(SQLKVTransaction.java:276)
at org.jsimpledb.kv.sql.SQLKVTransaction.queryKVPair(SQLKVTransaction.java:250)
at org.jsimpledb.kv.sql.SQLKVTransaction.getAtLeast(SQLKVTransaction.java:99)
- locked <0x00000007adee7da8> (a org.jsimpledb.kv.cockroach.CockroachKVTransaction)
at org.jsimpledb.kv.test.KVDatabaseTest$Reader.call(KVDatabaseTest.java:861)
at org.jsimpledb.kv.test.KVDatabaseTest$Reader.call(KVDatabaseTest.java:842)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
To run these tests yourself:
git clone [email protected]:archiecobbs/jsimpledbcd jsimpledbmvn installpom.xml, and uncomment the <cockroachURL> line (edit to suit)cd jsimpledb-kv-cockroachcockroach sql -e 'create database if not exists jsimpledb'mvn testFYI, test produces same hanging behavior with beta-20160519.
Created separate issue #7556 for the deadlock problem.
Some more related comments in https://github.com/cockroachdb/cockroach/issues/7556#issuecomment-229828519
@petermattis should this be 1.0? Could you assign if so, remove the 1.0 milestone otherwise.
Given our plumbing of context everywhere, this seems relatively straightforward to do on the surface.
Because this feature is not already on our 1.1 roadmap, I'm moving this to "Later."
This looks like a good starter project.
Nominating this for 1.2; it should be fairly easy. cc @dianasaur323
ok, thanks @bdarnell. I think it would address another user issue that occurred today as well. Will map it into 1.2.
@vivekmenezes I'm going to need someone to get assigned to this at some point.
Rationale
Users currently have issues with transactions that hang - it's unclear if they are still running or if they are just hanging. We need to allow users to set a max transaction timeout so that any transactions running longer than they should should be killed. Since write queries tend to impact cluster health more critically than long running read queries, we should allow users to set different behavior for different types of queries.
Feature Scope
_Strech_
PM Acceptance Testing
Session and transaction cancellation RFC is a work in progress: https://github.com/cockroachdb/cockroach/pull/17252
Once that is implemented this should be relatively easy to accomplish
I'd like to make a distinction between read-only and read/write transactions. Long-running write transactions are expensive, but long-running reads are mostly OK. Assuming we distinguish between the two, I'd like a non-null default for write transactions (1 minute?). It's fine if the clock starts at the first write instead of the real start of the transaction. See also #10188.
@bdarnell point taken. I've updated the acceptance criteria comment, and I think 1 minute is good for now. Regarding the ability to set this by session mentioned in #10188, I'm assuming that's a stretch, and that we would prefer to do this at a cluster level first. I've updated the comment for that as well.
Starting with a cluster-level setting is fine with me.
@andreimatei any chance this interests you for 1.2?
Not particularly, to be totally honest :). Not sure if I'll have time.
Besides that, it seems to me that what we want to do is not specified well enough in this issue. Diana talks about "transactions that hang" and the discussion is around "killing transactions". But it is generally not _transactions_ that hang, but queries.
There's two cases:
I'd like more clarity on which of these cases we want to address, and how. 2) would be helped by killing of B, but better yet it could be helped by allowing A to push B after a while. 1) could be helped by cancelling the _query_.
So it seems to me that we (may) want a _query_ timeout, and a _transaction push_ timeout.
@andreimatei This was a feature that was marked down as not requiring any PM scoping, hence the lack of actual feature definition. That being said, I think the approach here would be to do something similar to what PostgreSQL does with statement_timeout, lock_timeout, and idle_in_transaction_session_timeout. It looks like statement_timeout and lock_timeout would meet your 2 use cases, whereas idle_in_transaction_session_timeout would meet the transaction use case.
Essentially, we want to provide a way to give users a hammer they can use to knock down transactions (or queries) that are taking longer than one would expect. Does that provide some more context?
I like idle_in_transaction_session_timeout as a session variable (if we start the clock after the first write, as Ben said). I will try to implement that as part of my current sql session rewrite. I think it will fit easy enough, whereas with today's code I don't know where I'd put it.
I somewhat like statement_timeout as a session variable. In lack of better cancelation mechanisms, I think it makes enough sense. Let's keep this in mind for the future; it's probably not as urgent if we have idle_in_transaction_session_timeout.
lock_timeout I don't like. Seems like quite a random knob to me. I'm not sure why I'd ever use it.
I'm in the camp that the highest value toggle for us is idle_in_transaction_session_timeout. It sounds like you are in the same camp! In that case, upwards and onwards!
What is the incremental work left on this ticket after implementing statement_timeout, documented here: https://www.cockroachlabs.com/docs/dev/set-vars.html#supported-variables? Is this done?
@andreimatei to comment and close
Statement timeouts and transaction timeouts are different; adding statement_timeout doesn't address the need for a transaction timeout. This transaction is primarily about transaction timeouts (perhaps using the postgres variable idle_in_transaction_session_timeout)
We now have a statement_timeout and a idle_in_session_timeout (introduced in 20.2) we should round out the set with this idle_in_transaction_session_timeout.
Confusingly, this variable is only implemented now as a no-op:
[email protected]:59509/movr> SET idle_in_transaction_session_timeout = 10;
ERROR: invalid value for parameter "idle_in_transaction_session_timeout": "10"
SQLSTATE: 22023
DETAIL: this parameter is currently recognized only for compatibility and has no effect in CockroachDB.
HINT: Available values: 0
Despite showing up in SHOW ALL:
variable | value
------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------
idle_in_session_timeout | 0
idle_in_transaction_session_timeout | 0