Spanner: v0.37.4 (as far as I know, it also occurs with the latest version)
Alpine Docker on GKE
If possible, this client library should retry a transaction when it fails with a Session not found error.
We sometimes get Session not found errors.
This client library retries transactions only when an Abort error occurs, so when the session it has taken is no longer active, it just returns the NotFound error to the caller without retrying.
Since this library creates a pool of sessions, I think it should retry transactions that fail with Session not found by taking another session. Are there any problems with taking another session from the pool?
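To make the request concrete, here is a minimal sketch of the proposed behavior. It does not use the real client library: `pool`, `session`, `errSessionNotFound`, `runWithSessionRetry` and `useSession` are all hypothetical names made up for illustration, and the actual session pool is more involved.

```go
package main

import (
	"errors"
	"fmt"
)

// errSessionNotFound is a hypothetical sentinel standing in for the
// NotFound "Session not found" error returned by the server.
var errSessionNotFound = errors.New("session not found")

// session is a hypothetical stand-in for a pooled Spanner session.
type session struct {
	id      string
	deleted bool // simulates a session that was deleted server side
}

// pool is a minimal FIFO pool used only for this illustration.
type pool struct {
	idle []*session
}

// take removes and returns the session at the head of the queue.
func (p *pool) take() (*session, error) {
	if len(p.idle) == 0 {
		return nil, errors.New("pool exhausted")
	}
	s := p.idle[0]
	p.idle = p.idle[1:]
	return s, nil
}

// put returns a healthy session to the tail of the queue.
func (p *pool) put(s *session) { p.idle = append(p.idle, s) }

// runWithSessionRetry runs fn on a pooled session. If fn fails with
// errSessionNotFound, the dead session is dropped (not returned to the
// pool) and fn is retried on another session, up to maxAttempts times.
func runWithSessionRetry(p *pool, maxAttempts int, fn func(*session) error) error {
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		s, err := p.take()
		if err != nil {
			return err
		}
		err = fn(s)
		if errors.Is(err, errSessionNotFound) {
			lastErr = err
			continue // discard s and try the next session
		}
		p.put(s)
		return err
	}
	return lastErr
}

// useSession simulates a transaction body that fails on a deleted session.
func useSession(s *session) error {
	if s.deleted {
		return errSessionNotFound
	}
	return nil
}

func main() {
	p := &pool{idle: []*session{{id: "s1", deleted: true}, {id: "s2"}}}
	err := runWithSessionRetry(p, 3, useSession)
	fmt.Println("error:", err, "idle sessions left:", len(p.idle))
}
```

The important detail in this sketch is that the dead session is never put back into the pool, so a retry cannot pick it up again.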
The client library creates and maintains a session pool, and takes measures to keep these sessions alive by doing a GetSession ping call for sessions that have not been used for a while. Still it is possible that Cloud Spanner (or another user) deletes a session server side, causing a Session not found error when using the client library.
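As a rough illustration of the keep-alive idea described above (not the library's actual implementation), the sketch below pings every session that has been idle for longer than a threshold and drops sessions whose ping fails. `pooledSession`, `keepAlive`, `demo`, the 30-minute threshold and the `ping` callback are assumptions for this example; in the real client the ping would be a GetSession call.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// pooledSession is a hypothetical record of a session and its last use.
type pooledSession struct {
	id       string
	lastUsed time.Time
}

// keepAlive pings every session that has been idle for at least
// idleThreshold. Sessions whose ping fails are dropped; all other
// sessions are returned, with lastUsed refreshed for the pinged ones.
func keepAlive(sessions []*pooledSession, now time.Time, idleThreshold time.Duration, ping func(id string) error) []*pooledSession {
	alive := make([]*pooledSession, 0, len(sessions))
	for _, s := range sessions {
		if now.Sub(s.lastUsed) >= idleThreshold {
			if err := ping(s.id); err != nil {
				continue // the server no longer knows this session
			}
			s.lastUsed = now
		}
		alive = append(alive, s)
	}
	return alive
}

// demo runs keepAlive on three sessions and returns the surviving IDs.
func demo() []string {
	now := time.Now()
	sessions := []*pooledSession{
		{id: "fresh", lastUsed: now.Add(-1 * time.Minute)},
		{id: "idle-ok", lastUsed: now.Add(-40 * time.Minute)},
		{id: "idle-gone", lastUsed: now.Add(-50 * time.Minute)},
	}
	// ping is a stand-in for GetSession; it fails for the deleted session.
	ping := func(id string) error {
		if id == "idle-gone" {
			return errors.New("session not found")
		}
		return nil
	}
	var ids []string
	for _, s := range keepAlive(sessions, now, 30*time.Minute, ping) {
		ids = append(ids, s.id)
	}
	return ids
}

func main() {
	fmt.Println(demo())
}
```

Even with such a loop, a session can be deleted between two pings, which is why a ping alone cannot rule out the error.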
This could be mitigated by retrying failed transactions and other server calls that operate on a session by first taking a new session and then retrying. This is however not completely straightforward, as there are roughly 4 different categories of calls that we would need to consider, in order of increasing complexity:
1. executePartitionedUpdate.
2. Session not found. These are already executed in a retry-loop, and the Session not found condition could be added to the conditions that should cause the transaction to be retried, but with the additional requirement that the transaction needs to take a new session.
3. BeginTransaction call of a MultiUseReadOnlyTransaction. Read-only transactions are normally not executed within a retry loop, but the initial BeginTransaction call could be retried on a new session.
4. MultiUseReadOnlyTransaction that fails with a Session not found error would need a new read-only transaction to be created on a different session with the same read timestamp as the session that failed, and the query must be retried on the new transaction. The client library must execute any subsequent query for the transaction on the new read-only transaction.
@110y Do you have any specific use case that reproduces this problem more frequently than others (or even always)?
@olavloite
Do you have any specific use case that reproduces this problem more frequently than others (or even always)?
Unfortunately, I do not understand why this problem occurs...
One of my cases is:
- ReadWriteTransaction.
- Session not found errors rarely happen (recently, there are 2 errors per week).
The documentation says:
There are two ways to delete a session:
- A client can delete a session.
- The Cloud Spanner database service can delete a session when the session is idle for more than 1 hour.
As far as I understand, if we are continuously sending requests to Spanner, no session should be idle for more than 1 hour, since this library manages sessions in a FIFO manner and a used session is pushed back to the session pool. That's why I'm puzzled by this problem, and I'm guessing that Spanner deletes sessions even though their idle time is not over 1 hour.
Do you have any thoughts or suggestions?
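The FIFO argument above can be checked with a back-of-the-envelope calculation. This is only a sketch under stated assumptions: sessions are handed out in strict FIFO order, each request takes exactly one session, and the request rate is steady. The pool size of 100 is a made-up number for illustration, and `maxIdleFIFO` is a hypothetical helper, not part of the client library.

```go
package main

import (
	"fmt"
	"time"
)

// maxIdleFIFO estimates the longest time a session sits unused when
// poolSize sessions are handed out in strict FIFO order and one session
// is taken per request at a steady rate of qps requests per second.
func maxIdleFIFO(poolSize int, qps float64) time.Duration {
	return time.Duration(float64(poolSize) / qps * float64(time.Second))
}

func main() {
	idle := maxIdleFIFO(100, 1) // assumed: 100 sessions at 1 QPS
	fmt.Println(idle, idle < time.Hour)
	// prints "1m40s true"
}
```

Under these assumptions every session is reused long before the 1 hour idle limit, which is why the errors are surprising.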
@110y To my understanding, it is possible (although not very common) that Cloud Spanner deletes sessions that have been idle for less than 1 hour, which could cause this problem. The reason I asked whether you had a specific use case that would always (or often) cause this problem, was to check whether you were running into some unknown bug in the session pool. Considering your error rate of 1-2 errors per week at 1QPS, I don't think that this is a specific bug in the Go session pool.
The Java client library for Cloud Spanner added a protection against this problem a couple of months ago along the lines that I mentioned above. I'll have a look to see if it is feasible to add this protection for the Go client library as well.
@olavloite
As a trial, I sent a CL that makes ReadWriteTransaction retry on a Session not found error (case 2 you mentioned above).
@olavloite
Could you please take a look at this CL, which fixes the problem for ReadWriteTransaction?
@110y
Sorry for the delay on this. We had an integration test that started failing after one of the related changes for this was merged (not this one, the one for SingleUse transactions). It now looks like those failures were unrelated to these changes, but I wanted to make sure before proceeding with this. I'll take another look at this change asap.
@olavloite
I've updated my CL based on your comments.
Could you please take another look?
@olavloite We are still suffering from this error frequently. It happens hundreds of times every day. What's the status of this issue?
@olavloite
I've updated my CL: https://code-review.googlesource.com/c/gocloud/+/45910/ to follow the latest master.
Could you please take another look and merge the change if it gets a +1?
@110y and @kazegusuri
Thanks for updating your CL and sorry for taking so long to merge this. I'll have a look at this this morning and try to get it in ASAP.
@olavloite
By the way, do you have a plan to revive this CL, which makes ROTxn retry on a SessionNotFound error?
@110y Yes (and additional transaction types).
@olavloite Thank you for handling the Spanner issues. It seems many issues have been fixed since the last Spanner client release, v1.1.0. Do you have a plan to cut a new release for Spanner?
@hengfengli Do you know if there is a date planned for the next release?
I'll ask @skuruppu and make a release soon.
@hengfengli, yes please cut a release.
The release has been cut so closing the issue. Please refer to the release notes.
@skuruppu I still got a Session not found error on v1.5.1. Maybe it's because I am running on the emulator?
@kanekv
I assume that you are getting this during a test, as you wrote that you are using the emulator. Is that correct?
The protection against Session not found errors was added because the Cloud Spanner backend can sometimes delete sessions without the client knowing about it. The emulator does not do that, at least not to my knowledge. At the same time, there are some known differences between the emulator and the Cloud Spanner backend. One of them is that the emulator reports some errors differently than Cloud Spanner. This also applies to Session not found errors.
So if you have a test that explicitly deletes a session on the emulator, and then tries to use that session in the client, then this error is explainable.
@olavloite Not sure why it happened. I didn't run tests; I had an app connected to an emulator instance, and after coming back a couple of hours later (making a request after some idle time) it started throwing this error. Maybe it is indeed specific to the emulator.
@kanekv Thanks for the quick reply. That is interesting information, though. This seems to indicate that the client library is not keeping sessions alive on the emulator. In addition to the Session not found retry protection, the client library also contains a background process that keeps all idle sessions alive. It seems like that one is not working as it should with the emulator. I'll open a separate issue and have a quick look at it. I guess it hasn't really been noticed yet, as most users use the emulator for tests, and those are not running long enough to ever cause a session to time out.