No results are sporadically returned from query by foreign key while inserts are concurrently performed. This happens sporadically and sometimes the query result is valid.
This could be some kind of issue in the read restart handling. I was trying to reproduce an issue with read restart errors in my application, but instead I found this issue. That's why the test case name refers "read restart bug".
Test case is AbstractReadRestartBugTest
To reproduce:
git clone https://github.com/lhotari/yugabyte-bugs-repro
cd yugabyte-bugs-repro
Run the test with YB 2.1.2.0-b10:
./gradlew readRestartBug
The test uses https://www.testcontainers.org/ to orchestrate Docker with Junit tests. There's YugaByte integration in https://github.com/lhotari/dbcontainers . That also contains an integration to Spring.
to run test with YB 2.0.12.0-b10
./gradlew -PyugabyteVersion=2.0.12.0-b10 readRestartBug
to run test with Postgres 11, the test passes
./gradlew readRestartBugPostgres
The problem is that a query doesn鈥檛 return any results when there is a concurrent insert happening and the query should return results.
Some explanation of the test case.
Tables in test
CREATE TABLE my_table (id UUID primary key, created timestamp not null);
CREATE TABLE my_child (id UUID primary key, my_table_id UUID, created timestamp not null, FOREIGN KEY (my_table_id) REFERENCES my_table (id));
one thread is doing inserts
UUID my_table_id = UUID.randomUUID();
System.out.println("i:" + i + " my_table_id:" + my_table_id);
transactionTemplate.execute(status -> {
jdbcOperations.update("INSERT INTO my_table values (?, ?)", my_table_id, new Date());
return null;
});
for (int j = 0; j < 100; j++) {
transactionTemplate.execute(status -> {
System.out.print(".");
jdbcOperations.update("INSERT INTO my_child values (?, ?, ?)", UUID.randomUUID(), my_table_id, new Date());
return null;
});
activeParentId.set(my_table_id);
...
another thread is doing selects in a loop:
UUID my_table_id = activeParentId.get();
if (my_table_id != null) {
System.out.println("querying " + my_table_id);
transactionTemplate.execute(status -> {
List<Map<String, Object>> results = jdbcOperations.queryForList("select * from my_child where my_table_id = ?", my_table_id);
if (results.isEmpty()) {
throw new IllegalStateException("There should have been results for " + my_table_id);
}
System.out.println("ok " + my_table_id);
return null;
});
} else {
Thread.sleep(100L);
}
The select
select * from my_child where my_table_id = ?
doesn鈥檛 return results in all cases.
see the source code of AbstractReadRestartBugTest for more details.
The test is implemented in an abstract class so that the same test could be easily run on Postgres 11. The same test passes on Postgres.
The fact that the query is by a foreign key doesn't seem to have an impact to the end result.
I also tested by removing the FOREIGN KEY (my_table_id) REFERENCES my_table (id) part from my_child table DDL and the result is the same.
I made an experiment to change the UUID fields to char(36) fields (see https://github.com/lhotari/yugabyte-bugs-repro/commit/8e4b8ba3eeddb1821073a64b3fc88a271aeded4f). After this I got a different failure. The tserver process crashed.
2020-03-27 12:51:26.312 UTC [47] LOG: server process (PID 201) was terminated by signal 11: Segmentation fault
2020-03-27 12:51:26.312 UTC [47] DETAIL: Failed process was running: select * from my_child where my_table_id = $1
2020-03-27 12:51:26.312 UTC [47] LOG: terminating any other active server processes
2020-03-27 12:51:26.312 UTC [322] WARNING: terminating connection because of crash of another server process
2020-03-27 12:51:26.312 UTC [322] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-03-27 12:51:26.312 UTC [322] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-03-27 12:51:26.312 UTC [312] WARNING: terminating connection because of crash of another server process
2020-03-27 12:51:26.312 UTC [312] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-03-27 12:51:26.312 UTC [312] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-03-27 12:51:26.312 UTC [238] WARNING: terminating connection because of crash of another server process
2020-03-27 12:51:26.312 UTC [238] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-03-27 12:51:26.312 UTC [238] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-03-27 12:51:26.312 UTC [301] WARNING: terminating connection because of crash of another server process
2020-03-27 12:51:26.312 UTC [301] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-03-27 12:51:26.312 UTC [301] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-03-27 12:51:26.312 UTC [220] WARNING: terminating connection because of crash of another server process
2020-03-27 12:51:26.312 UTC [220] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-03-27 12:51:26.312 UTC [220] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-03-27 12:51:26.312 UTC [267] WARNING: terminating connection because of crash of another server process
2020-03-27 12:51:26.312 UTC [267] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-03-27 12:51:26.312 UTC [267] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-03-27 12:51:26.312 UTC [291] WARNING: terminating connection because of crash of another server process
2020-03-27 12:51:26.312 UTC [291] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-03-27 12:51:26.312 UTC [291] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-03-27 12:51:26.313 UTC [322] WARNING: terminating connection because of crash of another server process
2020-03-27 12:51:26.313 UTC [322] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-03-27 12:51:26.313 UTC [322] HINT: In a moment you should be able to reconnect to the database and repeat your command.
F0327 12:51:26.313344 323 ybc_pggate.cc:77] void yb::pggate::YBCDestroyPgGate() can only be called once
2020-03-27 12:51:26.313 UTC [188] WARNING: terminating connection because of crash of another server process
2020-03-27 12:51:26.313 UTC [188] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-03-27 12:51:26.313 UTC [188] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-03-27 12:51:26.313 UTC [279] WARNING: terminating connection because of crash of another server process
2020-03-27 12:51:26.313 UTC [279] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-03-27 12:51:26.313 UTC [279] HINT: In a moment you should be able to reconnect to the database and repeat your command.
W0327 12:51:26.314390 189 reactor.cc:717] pggate_ybclient_R001: Failed to schedule process outbound queue task
W0327 12:51:26.315039 194 outbound_call.cc:171] Failed to schedule invoking callback on response for request yb.tserver.TabletServerService.UpdateTransaction to 4adc4b0e86d3: Aborted (yb/rpc/thread_pool.cc:249): Service is shutting down
W0327 12:51:26.316138 194 rpc.cc:170] Aborted (yb/rpc/rpc.cc:169): Failed to schedule: 0x00007fa134006280 -> UpdateTransaction: tablet_id: "1ccd1c6fdb2a46d19eba53964ccd4cfc" state { transaction_id: "\373\321\374@~*GI\250,yO\225\362\317\365" status: PENDING } propagated_hybrid_time: 6493444039943536640, retrier: { task_id: -1 state: kScheduling deadline: 124116.755s }
W0327 12:51:26.316169 194 transaction.cc:864] fbd1fc40-7e2a-4749-a82c-794f95f2cff5: Send heartbeat failed: Aborted (yb/rpc/rpc.cc:169): Failed to schedule: 0x00007fa134006280 -> UpdateTransaction: tablet_id: "1ccd1c6fdb2a46d19eba53964ccd4cfc" state { transaction_id: "\373\321\374@~*GI\250,yO\225\362\317\365" status: PENDING } propagated_hybrid_time: 6493444039943536640, retrier: { task_id: -1 state: kScheduling deadline: 124116.755s }
2020-03-27 12:51:26.336 UTC [379] FATAL: the database system is in recovery mode
Fatal failure details written to /mnt/disk0/yb-data/tserver/logs/yb-tserver.FATAL.details.2020-03-27T12_51_26.pid322.txt
F20200327 12:51:26 ../../../../../../src/yb/yql/pggate/ybc_pggate.cc:77] void yb::pggate::YBCDestroyPgGate() can only be called once
@ 0x7fa15072674c yb::LogFatalHandlerSink::send()
@ 0x7fa14c03ad16 google::LogMessage::SendToLog()
@ 0x7fa14c03879a google::LogMessage::Flush()
@ 0x7fa14c03b869 google::LogMessageFatal::~LogMessageFatal()
@ 0x7fa15034e0c7 YBCDestroyPgGate
@ 0xa063eb YBOnPostgresBackendShutdown
@ 0x85be7b quickdie
@ 0x7fa14f74dba0 (unknown)
@ 0x7fa14ee829f3 __GI_epoll_wait
@ 0x7fa15074e7cf boost::asio::detail::epoll_reactor::run()
@ 0x7fa1484a5c81 yb::rpc::IoThreadPool::Impl::Execute()
@ 0x7fa1507b3faf yb::Thread::SuperviseThread()
@ 0x7fa14f745694 start_thread
@ 0x7fa14ee8241d __clone
@ (nil) (unknown)
*** Check failure stack trace: ***
@ 0x7fa150724b31 yb::(anonymous namespace)::DumpStackTraceAndExit()
@ 0x7fa14c038c4d google::LogMessage::Fail()
@ 0x7fa14c03adbd google::LogMessage::SendToLog()
@ 0x7fa14c03879a google::LogMessage::Flush()
@ 0x7fa14c03b869 google::LogMessageFatal::~LogMessageFatal()
@ 0x7fa15034e0c7 YBCDestroyPgGate
@ 0xa063eb YBOnPostgresBackendShutdown
@ 0x85be7b quickdie
@ 0x7fa14f74dba0 (unknown)
@ 0x7fa14ee829f3 __GI_epoll_wait
@ 0x7fa15074e7cf boost::asio::detail::epoll_reactor::run()
@ 0x7fa1484a5c81 yb::rpc::IoThreadPool::Impl::Execute()
@ 0x7fa1507b3faf yb::Thread::SuperviseThread()
@ 0x7fa14f745694 start_thread
@ 0x7fa14ee8241d __clone
@ (nil) (unknown)
2020-03-27 12:51:26.374 UTC [47] LOG: all server processes terminated; reinitializing
I wonder why the behavior is different when switching to use char(36) fields instead of UUID?
@frozenspider I did an experiment where the crash goes a way, see this one:
https://github.com/lhotari/yugabyte-bugs-repro/commit/f7ae7f152730ac7fd89ddcaed7ae3d68122ede1f#diff-3c5ac48ca8be8c67006721243e32e109R112
if (USE_STATIC_SQL) {
// PASSES WITH THIS QUERY
results = jdbcOperations.queryForList("select * from my_child where my_table_id = '" + my_table_id + "'");
} else {
// FAILS WITH THIS QUERY
results = jdbcOperations.queryForList("select * from my_child where my_table_id = ?", my_table_id);
}
When the select query is a static query without dynamic parameters, the test completes successfully.
I added this target to the repro repo Gradle build to run it:
./gradlew readRestartBugStaticSql
It seems that this issue is about SQL parameter binding and not about read restarts at all.
Thx a lot @lhotari for narrowing down the test case so much! Really appreciated.
I continued the experiments and noticed that I was finally able to reproduce the "read restart" issue that was my original goal. This happens when the Postgres JDBC driver is configured to use "simple" in the "preferQueryMode" parameter. I added a target to the repro repo for that as well in https://github.com/lhotari/yugabyte-bugs-repro/commit/f23f54fc76d9fcc39dff3d79a2ec27f57e96e9d2 .
Running that reproduces the "ERROR: Query error: Restart read required at: { read: { physical: 1585316553533711 } local_limit: { physical: 1585316553541734 } global_limit:
Postgres JDBC driver's "preferQueryMode" documentation:
preferQueryMode = String
Specifies which mode is used to execute queries to database: simple means ('Q' execute, no parse, no bind, text mode only), extended means always use bind/execute messages, extendedForPrepared means extended for prepared statements only, extendedCacheEverything means use extended protocol and try cache every statement (including Statement.execute(String sql)) in a query cache. extended | extendedForPrepared | extendedCacheEverything | simple
The default is extended
Postgres docs for Simple query and Extended query at the protocol level: https://www.postgresql.org/docs/11/protocol-flow.html#id-1.10.5.7.4 .
Perhaps a read restart crashes the extended query mode and the failure doesn't get passed to the client?
Thanks @lhotari, I'm looking into it right now. I've created #4094 to track these issues separately.
(Looks like there's some bug in how we track data sent to the client in simple mode specifically - for some reason we don't reset the "flushed" flag and thus cannot do the restart...)
A (small) follow-up: yb-tserver crash was caused by #3779 (recently fixed)
As for empty results returned by a prepared statement with UUID column - I was able to reproduce this with local installation and simplified test, will start looking deeper into that soon
Found a cause of no results being returned when binding a UUID (or char(36)): #4254 (fix for this is currently being reviewed).
As such, I'm closing this issue in favor of #4254.
Most helpful comment
Thx a lot @lhotari for narrowing down the test case so much! Really appreciated.