Stackexchange.redis: Commited transaction with task not completed in Redis 2.x

Created on 7 Sep 2018  Â·  18Comments  Â·  Source: StackExchange/StackExchange.Redis

I have updated a project from Redis 1.2.6 to Redis 2.0.495 under target framework net462. Since the update, some tasks that are part of a transaction are not completed after executing the transaction. I suspect it is a 2.x bug since it was working consistently before, and I haven't seen a related release note (do tell if I missed it).
Here is a simplified code sample:

```ITransaction transaction = this.Database.CreateTransaction();
ConditionResult keyNotExists = transaction.AddCondition(Condition.KeyNotExists(someKey));
Task hashSetTask = transaction.HashSetAsync(someKey, someValues);
Task expireTask = transaction.KeyExpireAsync(someKey, ttl);

bool committed = await transaction.ExecuteAsync();
Debug.Assert(hashSetTask.IsCompleted, "This fails sometimes");
Debug.Assert(expireTask.IsCompleted, "This also fails sometimes");
```
Both Debug.Asserts fail very frequently (as I run tests), but they don't fail all of the time. In addition, the asserts also fail when I put them inside an "if(commited)" conditional. Thank you.
[Edit] I forgot to mention I'm using Redis 3.2.100 (00000000/0) 64 bit

transactions

Most helpful comment

This is very high on my priority list to investigate, thanks.

On Fri, 7 Sep 2018, 19:05 Benedicte, notifications@github.com wrote:

Interesting; I'm running my tests on an average 4C/8T machine.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/StackExchange/StackExchange.Redis/issues/943#issuecomment-419520563,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AABDsHltSN6QJ9dn6UikYPBUwQzHu5zyks5uYrVagaJpZM4Wd9bo
.

All 18 comments

Are you deleting (or setting) someKey before the test? Or is it inconsistent? e.g. does the transaction fail or succeed?

The key is generated at random. The transaction succeeds.

Here is the script I just ran in LinqPad, and the issue repros.
c# ConnectionMultiplexer conn = ConnectionMultiplexer.Connect("localhost,abortConnect=false"); var db = conn.GetDatabase(); for (int i = 0; i < 40000; i++) { string key = "key~" + Guid.NewGuid(); HashEntry[] hashEntries = new HashEntry[] { new HashEntry("blah", DateTime.UtcNow.ToString("R")) }; ITransaction transaction = db.CreateTransaction(); ConditionResult keyNotExists = transaction.AddCondition(Condition.KeyNotExists(key)); Task hashSetTask = transaction.HashSetAsync(key, hashEntries); Task<bool> expireTask = transaction.KeyExpireAsync(key, TimeSpan.FromSeconds(30)); bool committed = await transaction.ExecuteAsync(); if (committed) { if (!hashSetTask.IsCompleted) "Oh no!".Dump(); if (!expireTask.IsCompleted) "Oh no2!".Dump(); } }

The 40000 key-value pairs get created (I view them using RedisDesktopManager), and they do expire (get deleted) after 30s.

I forgot to mention I'm using Redis 3.2.100 (00000000/0) 64 bit.

Seeing the exact same thing here, we have a bunch of tests that now never complete with 2.0.495 (whereas 1.2.6 worked perfectly fine), with code very similiar to @BenedicteLeo's. The interesting thing is that, on my dev machine (which is a 4C/8T machine), I can't reproduce the issue. But it reproduces 100% of the time on our slow 1C/1T build agents. We are using Redis 4.0.11.

Interesting; I'm running my tests on an average 4C/8T machine.

This is very high on my priority list to investigate, thanks.

On Fri, 7 Sep 2018, 19:05 Benedicte, notifications@github.com wrote:

Interesting; I'm running my tests on an average 4C/8T machine.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/StackExchange/StackExchange.Redis/issues/943#issuecomment-419520563,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AABDsHltSN6QJ9dn6UikYPBUwQzHu5zyks5uYrVagaJpZM4Wd9bo
.

Right; I've been looking at this, and I think this is a minor expectation failure rather than a "OMG the lib is on fire" problem.

There is a slight behaviour difference between 1.* and 2.* here - you aren't imagining things; the difference is:

  • in 1.* it attempted to set the task result eagerly, but there were scenarios where this could cause thread-stealing (very very bad; and no, the "don't allow sync continuations" flag doesn't help)
  • in 2.* it sets the task results via a worker

the difference, then, is that when under load, the result can be "seriously, I'm just about to be completed, but right now I'm not". The current solution then (without any lib changes), is to either await the result (which will often be already complete), or if you're in sync code... meh you can probably just use .Result if you know that the transaction has completed - the inner tasks will be literally about to complete.

For example, if I do this to the test provided above, and add counters, then I get:

09:03:29.865: hash hit: 39991, miss: 9; expire hit: 39993, miss: 7

(edit: where "hit" means "was already completed", and "miss" means "wasn't")

with the main code change being:

c# if (committed) { if (hashSetTask.IsCompleted) hashHit++; else hashMiss++; if (expireTask.IsCompleted) expireHit++; else expireMiss++; await hashSetTask; await expireTask; }

so... it is pretty rare, and await (or .Result) works; I'll take a look to see if we can perhaps reverse the orders, so that the transaction is the last thing added to the worker - however, there is inherently a race scenario here, and in theory it is possible that even if we queue them in the "right" order, very occasionally they might get completed in the other order. Because: threads.

Unfortunately, it is painfully hard to avoid this kind of problem because of the design of Task<T> and the risk of thread-stealing.

Upon further reading, it looks like we're already processing them in the "correct" (inner-first) order, so the 16 misses above are exactly the number of times the completion race happened such that the outer task (the transaction itself) overtook one of the inner tasks.

Right now, I'm thinking the best approach here might be to presume that most transaction tasks won't have pre-emptively had continuations added; specifically, that this is the exception rather than the rule:

c# Task<bool> expireTask = transaction.KeyExpireAsync(key, TimeSpan.FromSeconds(30)); expireTask.ContinueWith( /* some continuation here */); // <==== this line bool committed = await transaction.ExecuteAsync();

If we are right about that being rare, then because we're no longer targeting NET45, we could use - purely for the inner tasks in a transaction - TaskCreationOptions.RunContinuationsAsynchronously - and set the result directly rather than via the worker (however, all other tasks would continue to use the new dedicated thread-pool, not the default async pool). The result of this would be:

  • the inner tasks are always completed eagerly (meaning: before the transaction itself completes)
  • if you have scheduled a continuation on an inner task of a transaction, it'll uses the default pool rather than the custom pool, so when heavily contested there might be pauses; this is probably fine

Frankly, I expect that when using continuations, most people are moving to await rather than ContinueWith, in which case they would only be await-ing after the transaction has completed, which will then be a sync-await. In fact, with the above change it would now always be a sync await, rather than almost always.

I'll spike the change, and see how it works.

fix looks viable:

09:34:09.699: hash hit: 40000, miss: 0; expire hit: 40000, miss: 0

adding PR (#950) for review

merged; closing as "fixed next build" (I'll try to get this out today)

should be available in build 505 onwards

Thank you so much for looking into this promptly.

Thanks @mgravell for the in-depth explanation. I've been playing around with this, and am still seeing an issue with 505 that I can't quite figure out.

We have some code that looks like this:

var redisSuccess = await _redis.ExecuteAsync(
    async db =>
    {
        var redisValue = RedisSE.ConvertValueToRedis(value);
        var tran = db.CreateTransaction();
        tran.AddCondition(Condition.HashLengthLessThan(key, maxCount));
        tran.HashSetAsync(key, field, redisValue);
        var keySet = tran.StringSetAsync($"{key}:{field}", redisValue, expiry);
        tran.KeyExpireAsync(key, expiry);
        var result = tran.HashLengthAsync(key, CommandFlags.DemandMaster);
        var success = await tran.ExecuteAsync(CommandFlags.DemandMaster);
        if (success) redisCount = redisCount = (int)await result;
        if (!success && !keySet.IsCanceled)
        {
            await keySet;
            await db.HashDeleteAsync(key, field, CommandFlags.DemandMaster);
        }
        return success;
    }, key, nameof(TryAddAsync));

and in one particular test, we have 50 tasks being spun up, each calling the above 3 times.

With 1.2.6, this test ran fine on our build machines, which are again 1C/1T smallish VM's on Azure. The average runtime was about 1s. But with 2.0.505, I'm getting the following error, every time:

StackExchange.Redis.RedisTimeoutException : Timeout awaiting response (8594ms elapsed, timeout is 5000ms), inst: 0, qs: 15, in: 0, serverEndpoint: Unspecified/credis1.ci.nudge.land:6380, mgr: 7 of 10 available, clientName: buildagent3, IOCP: (Busy=1,Free=999,Min=2,Max=1000), WORKER: (Busy=153,Free=32614,Min=2,Max=32767), v: 2.0.505.18761 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

The same test completes in under 10ms on my local dev box, which has much more resources available to it. I'm not sure where the problem is here.

Well that's waiting a response from the server. There are 15 messages
awaiting replies, and nothing in the input socket. Did the server stall at
all? Anyhing in "slowlog" on the server?

On Thu, 13 Sep 2018, 16:50 Michael Scrivo, notifications@github.com wrote:

Thanks @mgravell https://github.com/mgravell for the in-depth
explanation. I've been playing around with this, and am still seeing an
issue with 505 that I can't quite figure out.

We have some code that looks like this:

var redisSuccess = await _redis.ExecuteAsync(
async db =>
{
var redisValue = RedisSE.ConvertValueToRedis(value);
var tran = db.CreateTransaction();
tran.AddCondition(Condition.HashLengthLessThan(key, maxCount));
tran.HashSetAsync(key, field, redisValue);
var keySet = tran.StringSetAsync($"{key}:{field}", redisValue, expiry);
tran.KeyExpireAsync(key, expiry);
var result = tran.HashLengthAsync(key, CommandFlags.DemandMaster);
var success = await tran.ExecuteAsync(CommandFlags.DemandMaster);
if (success) redisCount = redisCount = (int)await result;
if (!success && !keySet.IsCanceled)
{
await keySet;
await db.HashDeleteAsync(key, field, CommandFlags.DemandMaster);
}
return success;
}, key, nameof(TryAddAsync));

and in one particular test, we have 50 tasks being spun up, each calling
the above 3 times.

With 1.2.6, this test ran fine on our build machines, which are again
1C/1T smallish VM's on Azure. But with 2.0.505, I'm getting the following
error, every time:

StackExchange.Redis.RedisTimeoutException : Timeout awaiting response (8594ms elapsed, timeout is 5000ms), inst: 0, qs: 15, in: 0, serverEndpoint: Unspecified/credis1.ci.nudge.land:6380, mgr: 7 of 10 available, clientName: buildagent3, IOCP: (Busy=1,Free=999,Min=2,Max=1000), WORKER: (Busy=153,Free=32614,Min=2,Max=32767), v: 2.0.505.18761 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

The same test completes in under 10ms on my local dev box, which has much
more resources available to it. I'm not sure where the problem is here.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/StackExchange/StackExchange.Redis/issues/943#issuecomment-421056202,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AABDsIGaIkxphYbpmM0_dEmU8N5rdGboks5uan7SgaJpZM4Wd9bo
.

slowlog is not showing any transaction commands AFAICT, just a bunch of Scans. Is there a way I can list the transactions commands the lib is running so I can compare 1.x to 2.x? I do agree that the server is busy and can't process them fast enough, I just can't rationalize why it seemingly wasn't an issue with 1.x.

nevermind, figure out how using the monitor command .. will report back with anything interesting.

I just can't rationalize why it seemingly wasn't an issue with 1.x.

That's an easy one (from the perspective of being familiar with the 2.x changes - I don't mean this disparagingly):

1.x doesn't notice async timeouts. 2.x: does.

What does that mean exactly? Was the 1.x code doing a sort of fire and forget?

I've attached monitor logs of the 1.x vs. 2.x transactions from this test. The main thing I notice is that the transactions commands when running on 2.x don't start until approx 20s after I start monitoring, whereas they start showing up almost immediately with the 1.x code. You can also see that the 1.x logs have way more transactions completed than the 2.x one, in a much smaller time period.

1x.log
2x.log

What does that mean exactly? Was the 1.x code doing a sort of fire and forget?

No, not F+F - it just didn't have he machinery in place to observe timeout conditions on the async path. Timeouts are initiated by the client, not the server (there's no server-based timeout in redis). It is pretty easy to check for timeouts on the sync path (you basically do a wait-with-timeout), but checking for timeouts in async/await code - efficiently - is actually pretty hard (doing a Task.Delay and Task.WhenAny is workable but very far from efficient). So: the pieces to implement this are new in 2.x

Was this page helpful?
0 / 5 - 0 ratings