The bug is around the ETag being set on the ClearStateAsync call, and it's a bit tricky to reproduce (the code in the provider is a bit convoluted). The key point is to re-use the grain instance while it's still in memory.
Reproduction steps:
```C#
await grain.Write("First write is successful");
var isCleaned = await grain.CleanState();
await grain.Write("Second write - doomed to be lost");
var secondData = await grain.Read(); // This reads because the grain is in memory =) but there is no state saved
```
What actually happens is an interesting thing (this is why we were missing this bug for a while in our custom copy of BlobStorageProvider):
1. Empty state is read, `grainState.Etag == null`
2. First payload is written, `grainState.Etag == <BlobEtag>`
3. First payload is cleared, yet `grainState.Etag == <BlobEtag>` still. HERE is where our ETag goes stale (see the sketch after this list).
4. Second payload is attempted. **Write FAILS** because of the incorrect ETag (anyone can check the blob storage - there will be NO blob with the `Second write - doomed to be lost` payload).
This happens because of [AzureBlobStorage.cs: L248](https://github.com/dotnet/orleans/blob/5ede8d848a469adcb36846f607aff80b1cda205a/src/OrleansAzureUtils/Providers/Storage/AzureBlobStorage.cs#L248)
5. Consequent READ succeeds because the grain activation is still in memory (this is how the bug eluded us for so long).
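In essence, the fix is for ClearStateAsync to forget the cached ETag once the blob is gone. Here is a minimal sketch of that invariant, reusing the provider's own `container` and `GetBlobName` helpers - an illustration only, not the actual patch in #2669:
```C#
public async Task ClearStateAsync(string grainType, GrainReference grainId, IGrainState grainState)
{
    var blob = container.GetBlockBlobReference(GetBlobName(grainType, grainId));

    // Delete only if our cached ETag still matches the blob.
    await blob.DeleteIfExistsAsync(
        DeleteSnapshotsOption.None,
        AccessCondition.GenerateIfMatchCondition(grainState.ETag),
        null,
        null).ConfigureAwait(false);

    // Without this reset, the next WriteStateAsync sends the stale ETag
    // as an If-Match header and gets 412 ConditionNotMet.
    grainState.ETag = null;
}
```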
What's also interesting is the code in the next write, which fails to write BUT fixes the state. Here is the current method, annotated with interesting comments:
```C#
public async Task WriteStateAsync(string grainType, GrainReference grainId, IGrainState grainState)
{
    var blobName = GetBlobName(grainType, grainId);
    try
    {
        if (this.Log.IsVerbose3) this.Log.Verbose3((int)AzureProviderErrorCode.AzureBlobProvider_Storage_Writing, "Writing: GrainType={0} Grainid={1} ETag={2} to BlobName={3} in Container={4}", grainType, grainId, grainState.ETag, blobName, container.Name);

        var json = JsonConvert.SerializeObject(grainState.State, jsonSettings);
        var blob = container.GetBlockBlobReference(blobName);
        blob.Properties.ContentType = "application/json";

        var containerNotFound = false;
        try
        {
            // This call is guaranteed to fail because grainState.ETag is non-null after ClearStateAsync.
            await blob.UploadTextAsync(
                json,
                Encoding.UTF8,
                AccessCondition.GenerateIfMatchCondition(grainState.ETag),
                null,
                null).ConfigureAwait(false);
        }
        catch (StorageException exception) // We will hit this handler.
        {
            var errorCode = exception.RequestInformation.ExtendedErrorInformation.ErrorCode;
            // This will be set to false, as the error code is 412 ConditionNotMet (wrong ETag).
            containerNotFound = errorCode == BlobErrorCodeStrings.ContainerNotFound;
            // Current state is not durable and will be lost if the grain is deactivated before another write call.
        }

        // This `if` will be skipped, but depending on the Azure Storage library version this may behave
        // differently (which also explains our old random log entries about "Creating container" when
        // a valid container existed, see https://github.com/Azure/azure-storage-net/issues/271 :D).
        if (containerNotFound)
        {
            // If the container does not exist, create it, and make another attempt.
            if (this.Log.IsVerbose3) this.Log.Verbose3((int)AzureProviderErrorCode.AzureBlobProvider_ContainerNotFound, "Creating container: GrainType={0} Grainid={1} ETag={2} to BlobName={3} in Container={4}", grainType, grainId, grainState.ETag, blobName, container.Name);
            await container.CreateIfNotExistsAsync().ConfigureAwait(false);
            await blob.UploadTextAsync(
                json,
                Encoding.UTF8,
                AccessCondition.GenerateIfMatchCondition(grainState.ETag),
                null,
                null).ConfigureAwait(false);
        }

        // Because the blob doesn't exist, the ETag here will be correctly set to null (as with an empty
        // grain), so the consequent 3rd write WILL succeed. This realigns the in-memory state with the
        // blob state: no data and a null ETag, so the grain activation is "valid" again.
        grainState.ETag = blob.Properties.ETag;

        if (this.Log.IsVerbose3) this.Log.Verbose3((int)AzureProviderErrorCode.AzureBlobProvider_Storage_DataRead, "Written: GrainType={0} Grainid={1} ETag={2} to BlobName={3} in Container={4}", grainType, grainId, grainState.ETag, blobName, container.Name);
    }
    catch (Exception ex)
    {
        Log.Error((int)AzureProviderErrorCode.AzureBlobProvider_WriteError,
            string.Format("Error writing: GrainType={0} Grainid={1} ETag={2} to BlobName={3} in Container={4} Exception={5}", grainType, grainId, grainState.ETag, blobName, container.Name, ex.Message),
            ex);
        throw;
    }
}
```
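A regression test along these lines would have caught this. The grain interface matches the sample grain below; the xUnit shape and an already-initialized GrainClient against a test silo with the AzureBlobStorage provider are assumptions:
```C#
[Fact]
public async Task Write_After_Clear_Should_Succeed()
{
    var grain = GrainClient.GrainFactory.GetGrain<IDebugGrain>(Guid.NewGuid().ToString());

    await grain.Write("first");
    await grain.CleanState();

    // Before the fix this throws: the provider sends the stale ETag as If-Match
    // and the blob service answers 412 ConditionNotMet.
    await grain.Write("second");

    // Caveat: reading through the same activation passes even with the bug (the
    // state lives in memory), so a thorough test should also deactivate the grain
    // or inspect the blob to prove the write was durable.
    Assert.Equal("second", await grain.Read());
}
```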
Sample grain:
```C#
public interface IDebugGrain : IGrainWithStringKey
{
    Task<string> Read();
    Task<bool> Write(string input);
    Task<bool> CleanState();
}

[Serializable]
public class DebugGrainState
{
    public string Data { get; set; }
}

[StorageProvider(ProviderName = "AzureBlobStorage")]
public class DebugGrain : Grain<DebugGrainState>, IDebugGrain
{
    public DebugGrain() { }

    public DebugGrain(IGrainIdentity identity, IGrainRuntime runtime, DebugGrainState state, IStorage storage, IServiceLocator serviceLocator)
        : base(identity, runtime, state, storage, serviceLocator) { }

    public Task<string> Read()
    {
        // Return stored data.
        return Task.FromResult(State.Data);
    }

    public async Task<bool> Write(string input)
    {
        // Save data to state.
        State.Data = input;
        await WriteStateAsync();
        return true;
    }

    public async Task<bool> CleanState()
    {
        // Delete the grain state.
        await ClearStateAsync();
        return true;
    }
}
```
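To verify the data loss from step 4 outside Orleans, something like this can dump the blob directly. It uses the same WindowsAzure.Storage client as the provider; the connection string, container name, and blob name are assumptions you'd adjust to your deployment's GetBlobName scheme:
```C#
using System;
using Microsoft.WindowsAzure.Storage;

class BlobCheck
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse("UseDevelopmentStorage=true"); // assumption: storage emulator
        var container = account.CreateCloudBlobClient().GetContainerReference("grainstate"); // assumed container name
        var blob = container.GetBlockBlobReference("DebugGrain-debug-grain-1"); // assumed blob name

        // After the repro, the "Second write - doomed to be lost" payload never shows up here.
        Console.WriteLine(blob.Exists()
            ? blob.DownloadText()
            : "No blob - state was cleared and the second write was lost.");
    }
}
```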
PS: @richorama owes me a beer and some replacement hair for what was pulled out investigating this ;)
Awesome work! Here you go: 🍺
Impressive deep dive!
I wonder if there is a similar bug in the Azure Table Storage provider?
:) I'll check it over the weekend, but you know - this kind of bug is only found when you're bitten by it (and we were; good that it was on a feature-in-dev in a test env =) )
Ouch! Will check on AWS as well... Since the code is very similar, I may have the same problem o.O
Good work @centur
Checked the AzureTableStorage provider - it's a good one. At first glance it seems to assign a similarly dodgy record.Etag value here, but it actually ends up with a correct null because it's cleared here. GrainStateTableDataManager handles deletion correctly.
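For contrast, here is a hypothetical illustration of the pattern described above - the deletion helper nulls the ETag on the record it was handed, and the provider copies it back. Names and shapes are assumptions, not the real GrainStateTableDataManager code:
```C#
public class TableRecord
{
    public string ETag { get; set; }
}

// Assumed helper standing in for the real data manager: issues the table
// Delete operation, then clears the ETag on the record it was given.
static Task DeleteRecordAsync(TableRecord record)
{
    record.ETag = null;
    return Task.CompletedTask;
}

public async Task ClearStateAsync(IGrainState grainState)
{
    var record = new TableRecord { ETag = grainState.ETag };
    await DeleteRecordAsync(record);
    grainState.ETag = record.ETag; // ends up null, matching the now-empty store
}
```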
@centur, thanks for following through. Is this one resolved via #2669 now?
Yep, unless we want to keep it open for @galvesribeiro's AWS check.
We can always open a separate one for AWS.
Resolved via #2669.
No worries, I'll open a PR in case I find the same issue on AWS.
This is why we need a test kit to run all the storage providers to a minimum set of tests, as we discussed in https://github.com/dotnet/orleans/issues/2011