Orleans: Serialization difference between direct grain call and SimpleMessageStreamProvider

Created on 14 Jul 2016 · 16 Comments · Source: dotnet/orleans

Update - Completely changed the content of the issue after extra investigation:
(subject was: Registered custom serializers and Streams message serialization )

We see strange behaviour from SimpleMessageStream in a cluster (2 silos; with 1 silo everything is fine).

Some context:

  • We have DTOs which we are using in Grain interface methods as input parameters.

    • Some of them have JSON.NET's JObject as a property.

    • All DTOs are marked as [Serializable].

    • JObject is not marked as [Serializable], and we can't change it.

  • We have a custom serializer registered for the Json.NET types. It is almost identical (now actually identical, plus some logging) to JsonSerializationExample2 from SerializationTests.JsonTypes, which is marked with the [RegisterSerializer] attribute.

The problem is that when we pass our DTO with a JObject property directly to a grain method, e.g. await grainProxy.Perform(ourDtoWithJObject), everything works fine and the custom serializer is invoked.
When we do something very similar, but send another DTO with a JObject property via SimpleMessageStream, we get a serialization problem and an exception bubbles up from the depths of the Orleans code (see the lines marked with <---!!! comments in the stack trace):

Orleans.Runtime.OrleansException: System.Runtime.Serialization.SerializationException: Type 'Newtonsoft.Json.Linq.JObject' in Assembly 'Newtonsoft.Json, Version=9.0.0.0, Culture=neutral, PublicKeyToken=30ad4fe6b2a6aeed' is not marked as serializable.
   at System.Runtime.Serialization.FormatterServices.InternalGetSerializableMembers(RuntimeType type)
   at System.Runtime.Serialization.FormatterServices.GetSerializableMembers(Type type, StreamingContext context)
   at System.Runtime.Serialization.Formatters.Binary.WriteObjectInfo.InitMemberInfo()
   at System.Runtime.Serialization.Formatters.Binary.WriteObjectInfo.InitSerialize(Object obj, ISurrogateSelector surrogateSelector, StreamingContext context, SerObjectInfoInit serObjectInfoInit, IFormatterConverter converter, ObjectWriter objectWriter, SerializationBinder binder)
   at System.Runtime.Serialization.Formatters.Binary.ObjectWriter.Write(WriteObjectInfo objectInfo, NameInfo memberNameInfo, NameInfo typeNameInfo)
   at System.Runtime.Serialization.Formatters.Binary.ObjectWriter.Serialize(Object graph, Header[] inHeaders, __BinaryWriter serWriter, Boolean fCheck)
   at System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Serialize(Stream serializationStream, Object graph, Header[] headers, Boolean fCheck)
   at Orleans.Serialization.BinaryFormatterSerializer.Serialize(Object item, BinaryTokenStreamWriter writer, Type expectedType)
   at Orleans.Serialization.SerializationManager.FallbackSerializer(Object raw, BinaryTokenStreamWriter stream, Type t)
   at Orleans.Serialization.SerializationManager.SerializeInner(Object obj, BinaryTokenStreamWriter stream, Type expected)
   at Orleans.Serialization.BuiltInTypes.SerializeImmutable[T](Object original, BinaryTokenStreamWriter stream, Type expected)
   at Orleans.Serialization.SerializationManager.SerializeInner(Object obj, BinaryTokenStreamWriter stream, Type expected)
   at Orleans.Serialization.BuiltInTypes.SerializeInvokeMethodRequest(Object obj, BinaryTokenStreamWriter stream, Type expected)
   at Orleans.Serialization.SerializationManager.SerializeInner(Object obj, BinaryTokenStreamWriter stream, Type expected)
   at Orleans.Serialization.SerializationManager.Serialize(Object raw, BinaryTokenStreamWriter stream)
   at Orleans.Runtime.Message.Serialize_Impl(Int32& headerLengthOut, Int32& bodyLengthOut)
   at Orleans.Messaging.OutgoingMessageSender.Process(Message msg)

Server stack trace: 
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Orleans.Providers.Streams.SimpleMessageStream.SimpleMessageStreamProducerExtension.StreamConsumerExtensionCollection.<DeliverToRemote>d__7.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Orleans.Providers.Streams.SimpleMessageStream.SimpleMessageStreamProducer`1.<OnNextAsync>d__19.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.GetResult()
   at DBCloud.ActorCollection.BaseGrain`1.<raise>d__18.MoveNext()<---!!! This method sends an event to the Stream
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.GetResult()
   at DBCloud.ActorCollection.Annotations.AnnotationCommandsStackGrain.<SetLastSnapshot>d__6.MoveNext() <---!!! this method receives DTO, doing some required work and raises an event with JObject
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Orleans.PublicOrleansTaskExtensions.<BoxAwait>d__8`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Orleans.Runtime.InsideRuntimeClient.<Invoke>d__57.MoveNext()

Exception rethrown at [0]: 
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at DBCloud.ActorCollection.Annotations.DrawboardEngineGrain.<ProcessUpdates>d__10.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Orleans.PublicOrleansTaskExtensions.<BoxAwait>d__8`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Orleans.Runtime.InsideRuntimeClient.<Invoke>d__57.MoveNext()

Exception rethrown at [1]: 
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at DBCloud.ActorCollection.Annotations.PageAnnotationPingEventListener.<ProcessCommand>d__3.MoveNext()

It seems that SimpleMessageStreamProvider serializes messages differently from a direct grain invocation, and that it does not use the custom serializer registered for a particular type.

All 16 comments

I thought that the SMSProvider default for immutability might affect this, but switching our SMS provider to
{ "OptimizeForImmutableData", "false" } didn't solve the problem.

Try defining the grain method Perform to accept object rather than the DTO type, and pass the DTO at runtime. I suspect it will fail the same way as the SMS stream, and that will help narrow the issue down.

Thanks @gabikliot . Going to try this today.

Is there any way I can imitate a cluster pipeline with a single instance? E.g. to force all messages to go through full serialization and copying, down to a network call.
Right now, to experiment with any code change in our current codebase, I have to deploy to Azure with 2 instances per worker role; the entire build and deploy CI pipeline takes ~15 mins.

Of course. All the unit tests are doing that.

@centur

I thought that the SMSProvider default for immutability might affect this, but switching our SMS provider to
{ "OptimizeForImmutableData", "false" } didn't solve the problem.

Just to make sure. You are building latest code, right? Because OptimizeForImmutableData was added only recently.

@sergeybykov yes, 1.2.3, not off the master branch.
I actually want to disable this optimization... Anyway, I'm going to find a reproducible case. After 3 days of hunting for this in our code I have a good understanding of what may cause it; I just need some time to build a repro. It's WIP but moving slowly, as I need to sort out the business-valued bits first.

I can't reproduce it in a local cluster deployment (non-Azure, based on the Tutorial.Minimal sample), where it works fine :( but it's consistently reproducible when we run in a cluster on Azure. I'm not sure how to replicate the exact behaviour with the emulator, as setting the number of instances to 3 in the local emulator doesn't trigger the problem either.

@centur when @gabikliot stated

All the unit tests are doing that.

I believe he was referring to our test harness which allows one to spin up multiple silos in different app domains on a single machine. For instance tester/StreamingTests/SampleStreamingTests.cs. In that test harness two silos are spun up, so any serialization issues would show up in a test like that. If you build a similar test case using your objects, it should break the same as it does in azure. A repro like this should make it much easier to track down the issue.

Regarding the serialization tests posted, I didn't dig deeply into these, but it looks like you were testing the serialization of the json objects, but not the DTOs containing the json object, so you're not testing the serialization path that's actually being run in the service. I am not intimately familiar with the serialization logic, but it seems possible to me that the serializer selection may be different for objects embedded inside objects that use a different serializer. The DTO uses the default serializer, while the object within it uses the json serializer. I'm not suggesting that this won't work as much as pointing out that the tests written don't cover this scenario.

Another test to run, if you don't mind, is to add a GenerateSerializerAttribute for the DTO type you're sending. If a serializer is not being generated for the DTO object, it may be falling back to the .NET binary serializer, which won't look up serializers for embedded types.

@jason-bragg Thanks for the hints. I actually spun up 2 silos using the app-domain sample from Tutorial.Minimal and tried to reproduce the problem with some quickly made-up objects. I also tried to spin up multiple instances via the Azure Emulator (which presumably starts separate threads) and it didn't trigger (OR I have wrong assumptions about why it should trigger...).
I'm going to continue on this anyway, as it's a partial blocker for us, and I will definitely try a dedicated unit test and GenerateSerializerAttribute.
And I didn't know about this serializer behaviour. I assumed that every time an Orleans serializer (including the binary one) stumbles over a DTO or nested class, it'll go back and look up registered serializers first. That may not be the case here, and the DTO is always being handled with the binary serializer...

@jason-bragg I'm looking into the samples and can't understand: how does that sample guarantee that the StreamProducer will be on a different silo from the stream consumer? Unless that is guaranteed, it's not really testing end-to-end with serialization code between silos, right?

Upd: nvm, guys in the chat helped me with this

"I assumed that every time Orleans serializer ( including binary one) stumbles over DTO or nested class- it'll go back and lookup for registered serializers first"

The Orleans binary serializer does (should) do this, but the .NET binary serializer does not. Orleans generates serializers for any types used in grain interfaces, but using types with streams won't trigger the generation of a serializer. This leads me to suspect that the serializable DTOs are being serialized using the .NET serializer, not the Orleans serializer. Using the attribute to force codegen to generate Orleans serializers for the DTO will ensure that the Orleans binary serializer is being used, which should handle the nested type correctly.
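
A minimal sketch of what forcing serializer generation might look like, assuming Orleans 1.x and assuming that GenerateSerializerAttribute lives in Orleans.CodeGeneration and is applied at assembly level with the DTO type as its argument (worth checking against the Orleans version in use). The namespace, class, and property names are hypothetical and only stand in for a stream event that carries a JObject:

```csharp
using System;
using Newtonsoft.Json.Linq;
using Orleans.CodeGeneration;

// Assumption: in Orleans 1.x the attribute targets the assembly and names the type
// for which the code generator should emit an Orleans serializer.
[assembly: GenerateSerializer(typeof(MyApp.Contracts.SnapshotChangedEvent))]

namespace MyApp.Contracts
{
    // Hypothetical event type that is only ever sent through a stream,
    // so it never appears in a grain interface signature.
    [Serializable]
    public class SnapshotChangedEvent
    {
        public Guid DocumentId { get; set; }

        // JObject is not [Serializable]; it relies on the registered custom serializer,
        // which only gets a chance to run if the enclosing type uses an Orleans serializer.
        public JObject Payload { get; set; }
    }
}
```

With a generated serializer for the enclosing type, each field is expected to go back through the Orleans serializer lookup, which is where a serializer registered for JObject would be found.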

@jason-bragg Thanks a lot, you're right: the Orleans serialization code for the classes in question was missing. Once I added GenerateSerializerAttribute, everything started to work.

Is there a clean way to disable .NET binary serialization completely? I want my code to break when running unit tests and in the local emulator, so I can see all these problems before they hit the runtime.

Great! Glad to hear you're unblocked. Sorry it took so long to track this down.

I don't know of any way to disable the .NET serializer. :/ There are some elaborate ways around it, but no simple way to disable it.

As for catching serialization problems before they go to production, the way we deal with this is by having our tests use at least two silos (as you've seen in the tests discussed earlier) and by having test environments have at least two silos. This is not perfect as serialization errors will show up as intermittent errors, but has been sufficient for our purposes. I wish I had a more reliable suggestion..

I think at this time it would be too time-consuming to rewrite our current tests to run inside 2 silos and add logic to guarantee cross-silo activations of communicating grains. Also, I'm not sure it's worth running all tests in silos: our suite already takes longer than 5 mins, which slows down our dev iteration loop. I've added runtime codegen to one of the tests I wrote with 2 silos; it makes a single test run ~1m longer due to the 2 silos and code generation, which is not what we want to add to our everyday routine, really.

So far we have written a reflection-based unit test that searches for all our DTOs, walks up and down the inheritance chain, and fails if something is missing the [Serializable] attribute. We are also running a cluster on a dev environment in Azure and writing end-to-end tests in Cucumber to test the actual dev system behaviour, not a partially mocked environment.
A weird solution for an issue we don't even want to have, but this is the best we could come up with.
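
A reflection check along these lines could look roughly like the sketch below. It is not the poster's actual test: the class and method names are hypothetical, the sample types exist only to demonstrate the check, and it only walks up the inheritance chain (it does not inspect derived types or the types of properties such as JObject).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Hypothetical sample types, present only to demonstrate the check.
public class UnmarkedBase { }

[Serializable]
public class GoodDto { }

[Serializable]
public class BadDto : UnmarkedBase { }   // the base class is missing [Serializable]

public static class SerializableAudit
{
    // Returns every type in the inheritance chain of 'dto' (excluding System.Object)
    // that is not marked with [Serializable].
    public static List<Type> MissingSerializable(Type dto)
    {
        var missing = new List<Type>();
        for (var t = dto; t != null && t != typeof(object); t = t.BaseType)
        {
            if (!t.IsDefined(typeof(SerializableAttribute), false))
                missing.Add(t);
        }
        return missing;
    }

    // Scans an assembly for classes matching 'isDto' and collects all offending types.
    public static List<Type> Audit(Assembly assembly, Func<Type, bool> isDto)
    {
        return assembly.GetTypes()
            .Where(t => t.IsClass && isDto(t))
            .SelectMany(t => MissingSerializable(t))
            .Distinct()
            .ToList();
    }
}
```

In a unit test, one would assert that Audit returns an empty list for the DTO assembly, with isDto being whatever convention identifies DTOs in the codebase (a namespace, a base class, a naming suffix).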

If you comment out the serializer generation attribute for a DTO that was breaking and then run a serialization test against it, similar to the one you ran to test the JSON serializer, the test should fail. If this is the case, you can uncomment the attribute and use the mentioned reflection logic to test all your DTOs in a similar way.
