Yugabyte-db: [YSQL] No results are sporadically returned from query by foreign key while inserts are concurrently performed

Created on 26 Mar 2020  路  9Comments  路  Source: yugabyte/yugabyte-db

No results are sporadically returned from query by foreign key while inserts are concurrently performed. This happens sporadically and sometimes the query result is valid.

This could be some kind of issue in the read restart handling. I was trying to reproduce an issue with read restart errors in my application, but instead I found this issue. That's why the test case name refers "read restart bug".

Test case is AbstractReadRestartBugTest

To reproduce:

  1. Clone the repo:
git clone https://github.com/lhotari/yugabyte-bugs-repro
cd yugabyte-bugs-repro
  1. Run the test with YB 2.1.2.0-b10:

    ./gradlew readRestartBug

The test uses https://www.testcontainers.org/ to orchestrate Docker with Junit tests. There's YugaByte integration in https://github.com/lhotari/dbcontainers . That also contains an integration to Spring.

to run test with YB 2.0.12.0-b10

./gradlew -PyugabyteVersion=2.0.12.0-b10 readRestartBug

to run test with Postgres 11, the test passes

./gradlew readRestartBugPostgres

The problem is that a query doesn鈥檛 return any results when there is a concurrent insert happening and the query should return results.

Some explanation of the test case.

Tables in test

CREATE TABLE my_table (id UUID primary key, created timestamp not null);
CREATE TABLE my_child (id UUID primary key, my_table_id UUID, created timestamp not null, FOREIGN KEY (my_table_id) REFERENCES my_table (id));

one thread is doing inserts

                        UUID my_table_id = UUID.randomUUID();
                        System.out.println("i:" + i + " my_table_id:" + my_table_id);
                        transactionTemplate.execute(status -> {
                            jdbcOperations.update("INSERT INTO my_table values (?, ?)", my_table_id, new Date());
                            return null;
                        });
                        for (int j = 0; j < 100; j++) {
                            transactionTemplate.execute(status -> {
                                System.out.print(".");
                                jdbcOperations.update("INSERT INTO my_child values (?, ?, ?)", UUID.randomUUID(), my_table_id, new Date());
                                return null;
                            });
                            activeParentId.set(my_table_id);
...

another thread is doing selects in a loop:

                    UUID my_table_id = activeParentId.get();
                    if (my_table_id != null) {
                        System.out.println("querying " + my_table_id);
                        transactionTemplate.execute(status -> {
                            List<Map<String, Object>> results = jdbcOperations.queryForList("select * from my_child where my_table_id = ?", my_table_id);
                            if (results.isEmpty()) {
                                throw new IllegalStateException("There should have been results for " + my_table_id);
                            }
                            System.out.println("ok " + my_table_id);
                            return null;
                        });
                    } else {
                        Thread.sleep(100L);
                    }

The select

select * from my_child where my_table_id = ?

doesn鈥檛 return results in all cases.

see the source code of AbstractReadRestartBugTest for more details.
The test is implemented in an abstract class so that the same test could be easily run on Postgres 11. The same test passes on Postgres.

areysql communitrequest

Most helpful comment

Thx a lot @lhotari for narrowing down the test case so much! Really appreciated.

All 9 comments

The fact that the query is by a foreign key doesn't seem to have an impact to the end result.
I also tested by removing the FOREIGN KEY (my_table_id) REFERENCES my_table (id) part from my_child table DDL and the result is the same.

I made an experiment to change the UUID fields to char(36) fields (see https://github.com/lhotari/yugabyte-bugs-repro/commit/8e4b8ba3eeddb1821073a64b3fc88a271aeded4f). After this I got a different failure. The tserver process crashed.

2020-03-27 12:51:26.312 UTC [47] LOG:  server process (PID 201) was terminated by signal 11: Segmentation fault
2020-03-27 12:51:26.312 UTC [47] DETAIL:  Failed process was running: select * from my_child where my_table_id = $1
2020-03-27 12:51:26.312 UTC [47] LOG:  terminating any other active server processes
2020-03-27 12:51:26.312 UTC [322] WARNING:  terminating connection because of crash of another server process
2020-03-27 12:51:26.312 UTC [322] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-03-27 12:51:26.312 UTC [322] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-03-27 12:51:26.312 UTC [312] WARNING:  terminating connection because of crash of another server process
2020-03-27 12:51:26.312 UTC [312] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-03-27 12:51:26.312 UTC [312] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-03-27 12:51:26.312 UTC [238] WARNING:  terminating connection because of crash of another server process
2020-03-27 12:51:26.312 UTC [238] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-03-27 12:51:26.312 UTC [238] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-03-27 12:51:26.312 UTC [301] WARNING:  terminating connection because of crash of another server process
2020-03-27 12:51:26.312 UTC [301] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-03-27 12:51:26.312 UTC [301] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-03-27 12:51:26.312 UTC [220] WARNING:  terminating connection because of crash of another server process
2020-03-27 12:51:26.312 UTC [220] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-03-27 12:51:26.312 UTC [220] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-03-27 12:51:26.312 UTC [267] WARNING:  terminating connection because of crash of another server process
2020-03-27 12:51:26.312 UTC [267] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-03-27 12:51:26.312 UTC [267] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-03-27 12:51:26.312 UTC [291] WARNING:  terminating connection because of crash of another server process
2020-03-27 12:51:26.312 UTC [291] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-03-27 12:51:26.312 UTC [291] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-03-27 12:51:26.313 UTC [322] WARNING:  terminating connection because of crash of another server process
2020-03-27 12:51:26.313 UTC [322] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-03-27 12:51:26.313 UTC [322] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
F0327 12:51:26.313344   323 ybc_pggate.cc:77] void yb::pggate::YBCDestroyPgGate() can only be called once
2020-03-27 12:51:26.313 UTC [188] WARNING:  terminating connection because of crash of another server process
2020-03-27 12:51:26.313 UTC [188] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-03-27 12:51:26.313 UTC [188] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-03-27 12:51:26.313 UTC [279] WARNING:  terminating connection because of crash of another server process
2020-03-27 12:51:26.313 UTC [279] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-03-27 12:51:26.313 UTC [279] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
W0327 12:51:26.314390   189 reactor.cc:717] pggate_ybclient_R001: Failed to schedule process outbound queue task
W0327 12:51:26.315039   194 outbound_call.cc:171] Failed to schedule invoking callback on response for request yb.tserver.TabletServerService.UpdateTransaction to 4adc4b0e86d3: Aborted (yb/rpc/thread_pool.cc:249): Service is shutting down
W0327 12:51:26.316138   194 rpc.cc:170] Aborted (yb/rpc/rpc.cc:169): Failed to schedule: 0x00007fa134006280 -> UpdateTransaction: tablet_id: "1ccd1c6fdb2a46d19eba53964ccd4cfc" state { transaction_id: "\373\321\374@~*GI\250,yO\225\362\317\365" status: PENDING } propagated_hybrid_time: 6493444039943536640, retrier: { task_id: -1 state: kScheduling deadline: 124116.755s }
W0327 12:51:26.316169   194 transaction.cc:864] fbd1fc40-7e2a-4749-a82c-794f95f2cff5: Send heartbeat failed: Aborted (yb/rpc/rpc.cc:169): Failed to schedule: 0x00007fa134006280 -> UpdateTransaction: tablet_id: "1ccd1c6fdb2a46d19eba53964ccd4cfc" state { transaction_id: "\373\321\374@~*GI\250,yO\225\362\317\365" status: PENDING } propagated_hybrid_time: 6493444039943536640, retrier: { task_id: -1 state: kScheduling deadline: 124116.755s }
2020-03-27 12:51:26.336 UTC [379] FATAL:  the database system is in recovery mode
Fatal failure details written to /mnt/disk0/yb-data/tserver/logs/yb-tserver.FATAL.details.2020-03-27T12_51_26.pid322.txt
F20200327 12:51:26 ../../../../../../src/yb/yql/pggate/ybc_pggate.cc:77] void yb::pggate::YBCDestroyPgGate() can only be called once
    @     0x7fa15072674c  yb::LogFatalHandlerSink::send()
    @     0x7fa14c03ad16  google::LogMessage::SendToLog()
    @     0x7fa14c03879a  google::LogMessage::Flush()
    @     0x7fa14c03b869  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fa15034e0c7  YBCDestroyPgGate
    @           0xa063eb  YBOnPostgresBackendShutdown
    @           0x85be7b  quickdie
    @     0x7fa14f74dba0  (unknown)
    @     0x7fa14ee829f3  __GI_epoll_wait
    @     0x7fa15074e7cf  boost::asio::detail::epoll_reactor::run()
    @     0x7fa1484a5c81  yb::rpc::IoThreadPool::Impl::Execute()
    @     0x7fa1507b3faf  yb::Thread::SuperviseThread()
    @     0x7fa14f745694  start_thread
    @     0x7fa14ee8241d  __clone
    @              (nil)  (unknown)

*** Check failure stack trace: ***
    @     0x7fa150724b31  yb::(anonymous namespace)::DumpStackTraceAndExit()
    @     0x7fa14c038c4d  google::LogMessage::Fail()
    @     0x7fa14c03adbd  google::LogMessage::SendToLog()
    @     0x7fa14c03879a  google::LogMessage::Flush()
    @     0x7fa14c03b869  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fa15034e0c7  YBCDestroyPgGate
    @           0xa063eb  YBOnPostgresBackendShutdown
    @           0x85be7b  quickdie
    @     0x7fa14f74dba0  (unknown)
    @     0x7fa14ee829f3  __GI_epoll_wait
    @     0x7fa15074e7cf  boost::asio::detail::epoll_reactor::run()
    @     0x7fa1484a5c81  yb::rpc::IoThreadPool::Impl::Execute()
    @     0x7fa1507b3faf  yb::Thread::SuperviseThread()
    @     0x7fa14f745694  start_thread
    @     0x7fa14ee8241d  __clone
    @              (nil)  (unknown)
2020-03-27 12:51:26.374 UTC [47] LOG:  all server processes terminated; reinitializing

I wonder why the behavior is different when switching to use char(36) fields instead of UUID?

@frozenspider I did an experiment where the crash goes a way, see this one:
https://github.com/lhotari/yugabyte-bugs-repro/commit/f7ae7f152730ac7fd89ddcaed7ae3d68122ede1f#diff-3c5ac48ca8be8c67006721243e32e109R112

if (USE_STATIC_SQL) {
   // PASSES WITH THIS QUERY 
   results = jdbcOperations.queryForList("select * from my_child where my_table_id = '" + my_table_id + "'");
} else {
    // FAILS WITH THIS QUERY
    results = jdbcOperations.queryForList("select * from my_child where my_table_id = ?", my_table_id);
}

When the select query is a static query without dynamic parameters, the test completes successfully.

I added this target to the repro repo Gradle build to run it:

./gradlew readRestartBugStaticSql

It seems that this issue is about SQL parameter binding and not about read restarts at all.

Thx a lot @lhotari for narrowing down the test case so much! Really appreciated.

I continued the experiments and noticed that I was finally able to reproduce the "read restart" issue that was my original goal. This happens when the Postgres JDBC driver is configured to use "simple" in the "preferQueryMode" parameter. I added a target to the repro repo for that as well in https://github.com/lhotari/yugabyte-bugs-repro/commit/f23f54fc76d9fcc39dff3d79a2ec27f57e96e9d2 .
Running that reproduces the "ERROR: Query error: Restart read required at: { read: { physical: 1585316553533711 } local_limit: { physical: 1585316553541734 } global_limit: in_txn_limit: serial_no: 0 }" type of failure that I have been struggling with.

Postgres JDBC driver's "preferQueryMode" documentation:

preferQueryMode = String

Specifies which mode is used to execute queries to database: simple means ('Q' execute, no parse, no bind, text mode only), extended means always use bind/execute messages, extendedForPrepared means extended for prepared statements only, extendedCacheEverything means use extended protocol and try cache every statement (including Statement.execute(String sql)) in a query cache. extended | extendedForPrepared | extendedCacheEverything | simple

The default is extended

Postgres docs for Simple query and Extended query at the protocol level: https://www.postgresql.org/docs/11/protocol-flow.html#id-1.10.5.7.4 .
Perhaps a read restart crashes the extended query mode and the failure doesn't get passed to the client?

Thanks @lhotari, I'm looking into it right now. I've created #4094 to track these issues separately.
(Looks like there's some bug in how we track data sent to the client in simple mode specifically - for some reason we don't reset the "flushed" flag and thus cannot do the restart...)

A (small) follow-up: yb-tserver crash was caused by #3779 (recently fixed)
As for empty results returned by a prepared statement with UUID column - I was able to reproduce this with local installation and simplified test, will start looking deeper into that soon

Found a cause of no results being returned when binding a UUID (or char(36)): #4254 (fix for this is currently being reviewed).
As such, I'm closing this issue in favor of #4254.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

rkarthik007 picture rkarthik007  路  5Comments

joeblew99 picture joeblew99  路  5Comments

rahuldesirazu picture rahuldesirazu  路  3Comments

ndeodhar picture ndeodhar  路  5Comments

rhzs picture rhzs  路  5Comments