Orleans: Silo cannot join cluster after upgrade to 3.1

Created on 3 Mar 2020 · 14 Comments · Source: dotnet/orleans

Hi,

I have upgraded my solution to 3.1 and now I see a lot of errors like this:

category: "Orleans.Runtime.GrainDirectory.GrainDirectoryHandoffManager"
message: "ProcessSiloAddEvent(S10.0.2.40:11111:320596133) failed, will be retried: 
Exc level 0: System.NullReferenceException: Object reference not set to an instance of an object.
   at Orleans.Runtime.GrainDirectory.GrainDirectoryHandoffManager.ProcessAddedSiloAsync(SiloAddress addedSilo, List`1 splitPartListSingle, List`1 splitPartListMulti)
   at Orleans.Runtime.GrainDirectory.GrainDirectoryHandoffManager.ExecutePendingOperations()."
timestamp: "2020-03-03T12:57:33Z"
logLevel: "Warning"

followed by

category: "Orleans.Runtime.GrainDirectory.GrainDirectoryHandoffManager"
message: "ProcessSiloAddEvent(S10.0.2.184:11111:320936250) failed, will be retried: 
Exc level 0: System.NotImplementedException: InterfaceId: 0xF4EEC268, MethodId: 0x445A8F20
   at Orleans.Runtime.GrainDirectory.GrainDirectoryHandoffManager.ProcessAddedSiloAsync(SiloAddress addedSilo, List`1 splitPartListSingle, List`1 splitPartListMulti)
   at Orleans.Runtime.GrainDirectory.GrainDirectoryHandoffManager.ExecutePendingOperations()."

All 14 comments

Edit:

0xF4EEC268 seems to be IRemoteGrainDirectory, and the method is AcceptSplitPartition.

I don't know why, but the id has changed:

3.0.2

case 1146785568:
    await ((IRemoteGrainDirectory)grain).AcceptSplitPartition((List<ActivationAddress>)arguments[0], (List<ActivationAddress>)arguments[1]);
    return null;

// IRemoteGrainDirectory.cs
Task AcceptSplitPartition(List<ActivationAddress> singleActivations, List<ActivationAddress> multiActivations);

3.1

case -1389756233:
    await casted.AcceptSplitPartition((List<ActivationAddress>)arguments[0], (List<ActivationAddress>)arguments[1]);
    return null;



// IRemoteGrainDirectory.cs
Task AcceptSplitPartition(List<ActivationAddress> singleActivations, List<ActivationAddress> multiActivations);
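
The log prints the method id in hex (MethodId: 0x445A8F20), while the generated invoker switches on a signed 32-bit integer (case 1146785568), so it helps to convert between the two forms when matching them up. A minimal sketch of that conversion; the MethodIdFormat class is a hypothetical helper written for this comparison, not an Orleans API:

```csharp
using System;

// Hypothetical helper (not part of Orleans) for comparing method ids from
// log messages with the `case` labels in generated invoker code.
public static class MethodIdFormat
{
    // Log form ("0x445A8F20") -> signed value used in the generated switch.
    public static int FromHex(string hex)
    {
        string digits = hex.StartsWith("0x", StringComparison.OrdinalIgnoreCase)
            ? hex.Substring(2)
            : hex;
        // Parse as unsigned, then reinterpret the same 32 bits as a signed int.
        return unchecked((int)Convert.ToUInt32(digits, 16));
    }

    // Signed value from the generated switch -> log form.
    public static string ToHex(int methodId)
    {
        // "X8" prints the two's-complement bit pattern for negative values.
        return "0x" + methodId.ToString("X8");
    }
}
```

MethodIdFormat.FromHex("0x445A8F20") gives 1146785568, which is the case label in the 3.0.2 code above. MethodIdFormat.ToHex(-1389756233) gives 0xAD2A00B7, a different id for the same method.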

When I upgrade to 3.1.1 I also get this exception:

message: "Unsupported type 'Orleans.GrainDirectory.MultiClusterRegistrationStrategy' encountered. Perhaps you need to mark it [Serializable] or define a custom serializer for it?"
stackTrace: "   at Orleans.Serialization.SerializationManager.DeserializeInner[TContext,TReader](SerializationManager sm, Type expected, TContext context, TReader reader)
   at Orleans.Serialization.BuiltInTypes.DeserializeDictionary[K,V](Type expected, IDeserializationContext context)
   at Orleans.Serialization.SerializationManager.DeserializeInner[TContext,TReader](SerializationManager sm, Type expected, TContext context, TReader reader)
   at OrleansGeneratedCode9E8BDEFD.OrleansCodeGenOrleans_Runtime_GrainInterfaceMapSerializer.Deserializer(Type expected, IDeserializationContext context)
   at Orleans.Serialization.SerializationManager.DeserializeInner[TContext,TReader](SerializationManager sm, Type expected, TContext context, TReader reader)
   at Orleans.Serialization.BuiltInTypes.DeserializeOrleansResponse(Type expected, IDeserializationContext context)
   at Orleans.Serialization.SerializationManager.DeserializeInner[TContext,TReader](SerializationManager sm, Type expected, TContext context, TReader reader)
   at Orleans.Runtime.Messaging.MessageSerializer.OrleansSerializer`1.Deserialize(ReadOnlySequence`1 input, T& value)
   at Orleans.Runtime.Messaging.MessageSerializer.TryRead(ReadOnlySequence`1& input, Message& message)
   at Orleans.Runtime.Messaging.Connection.ProcessIncoming()
   at Orleans.Internal.OrleansTaskExtentions.<ToTypedTask>g__ConvertAsync|4_0[T](Task`1 asyncTask)
   at Orleans.Runtime.Scheduler.AsyncClosureWorkItem`1.Execute()
   at Orleans.Runtime.TypeManager.GetTargetSiloGrainInterfaceMap(SiloAddress siloAddress)"

It seems that 3.1.x is not compatible with 3.0 anymore :(

I'm not sure what the compatibility guarantees are supposed to be, but that methodId message brings to light a significant issue.

The new code generation does not always generate the same methodIds as the old one, despite it trying its best to do so. (Even worse, the new code gen does not always generate the same methodIds as itself, as explained below.) The methodIds come from a hash of a reformatted method signature.

The old code generation when run on the Desktop Framework would generate the signature to be hashed as: AcceptSplitPartition(System.Collections.Generic.List`1[[Orleans.Runtime.ActivationAddress,Orleans.Core.Abstractions]],System.Collections.Generic.List`1[[Orleans.Runtime.ActivationAddress,Orleans.Core.Abstractions]]). The important part here is representing List<...> as System.Collections.Generic.List`1[[...]].

The new code generation, when targeting netcoreapp3.1, will represent List<...> as System.Collections.Generic.List`1[[...]],System.Collections (that is, with the assembly name System.Collections appended).

The reason is that the old code generation uses the runtime location of List<>, which is in mscorlib, so it omits the assembly name (since it is the "system assembly" where the primitives live). The new code gen will use the reference assembly where the type lives at compile time. For netcoreapp3.1, List<> lives in the System.Collections.dll reference assembly. Since that is not where the primitives live, it will include the assembly name.

The new code gen, when targeting netstandard2.0, will generate System.Collections.Generic.List`1[[...]] like the old code gen. However, this is not always the case. For example, the old code gen would represent HashSet<> as System.Collections.Generic.HashSet`1[[...]],System.Core, while the new code gen targeting netstandard2.0 would omit the assembly name.

Thus not only is the new code gen not fully compatible with the old one, but changing which framework you are targeting when compiling a grain interface assembly can result in incompatible methodIds. That is really unfortunate, and undesirable.
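
To make the failure mode concrete, here is a rough sketch of how the same method can end up with two different strings to hash. The names SignatureSketch, FormatTypeName and StandInHash are hypothetical, the "System.Runtime" system assembly mentioned below for netcoreapp3.1 is an assumption, and the hash is only a stand-in (SHA-256 folded to 32 bits), since the actual Orleans hash function is not shown in this thread. The resulting numbers will therefore not match real Orleans method ids.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical model of the formatting rule described above; not Orleans code.
public static class SignatureSketch
{
    // Appends the assembly name unless the type lives in the "system assembly"
    // (the assembly where the primitives live).
    public static string FormatTypeName(string fullName, string assembly, string systemAssembly)
    {
        return assembly == systemAssembly ? fullName : fullName + "," + assembly;
    }

    // Builds the string to hash for
    // AcceptSplitPartition(List<ActivationAddress>, List<ActivationAddress>).
    public static string FormatAcceptSplitPartition(string listAssembly, string systemAssembly)
    {
        string list = FormatTypeName(
            "System.Collections.Generic.List`1[[Orleans.Runtime.ActivationAddress,Orleans.Core.Abstractions]]",
            listAssembly,
            systemAssembly);
        return "AcceptSplitPartition(" + list + "," + list + ")";
    }

    // Stand-in hash (assumption): SHA-256 of the UTF-8 bytes, XOR-folded to an int.
    public static int StandInHash(string signature)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(signature));
            int folded = 0;
            for (int i = 0; i < digest.Length; i += 4)
            {
                folded ^= BitConverter.ToInt32(digest, i);
            }
            return folded;
        }
    }
}
```

FormatAcceptSplitPartition("mscorlib", "mscorlib") models the old code gen on the Desktop Framework and yields the signature quoted earlier. FormatAcceptSplitPartition("System.Collections", "System.Runtime") models the new code gen on netcoreapp3.1 and appends ,System.Collections to each parameter. The hashed input differs, so any hash of it will almost certainly differ too.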

Thank you very much @KevinCathcart. I was also looking down the same path, because .NET Core 3.1 support was added in Orleans 3.1. So my reference has actually moved from the .NET Standard 2.0 assembly to the .NET Core 3.1 assembly.

But the issue you describe is a bigger problem, because it does not allow mixed setups and makes migrations much harder.

I'm slightly more confused/concerned by the error you encountered trying to use 3.1.1. Tracking that down, I actually proved that it cannot happen when the code tagged as 3.1.1 is communicating with 3.0.0. However, it would happen if the latest development code was being used instead, which is indeed the case: the 3.1.1 NuGet packages do not contain the code they are supposed to contain.

it would happen if the latest development code was being used instead

Do you mean the code from the nightly MyGet feed/master?

@sergeybykov:

If I were to compile master and try to connect a client using it to a 3.1.0 silo, it would fail with that exception, because the serializer for GrainInterfaceMap is incompatible due to changes in master (specifically #6305). That is totally reasonable and expected, as master can have breaking changes vs. a previous release. (I'd assume the MyGet nightlies act the same.)

However, if I update a client project to use <PackageReference Include="Microsoft.Orleans.[whatever]" Version="3.1.1">, pulling from the main NuGet feed, the same error occurs.

If I compile the 3.1.1 tag and reference it as a project, it does not fail to connect to other 3.1.0 silos.

Furthermore, ILSpy shows a bunch of changes in the 3.1.1 Orleans.Core NuGet package that should not be there, and that don't show up when building 3.1.1 from the git tag.

If I were to compile master and try to connect a client using it to a 3.1.0 silo, it would fail with that exception,

That's expected, as we made a number of breaking changes in master towards 4.0. We plan to work on backward compatibility with 3.x later, after we make all major changes there.

An Update:
Sergey is working on getting a fixed 3.1.2 package out. That fixes the new problems in the 3.1.1 package.

The methodIds potentially varying by target framework remains to be addressed. I'm not sure what a good approach for addressing that one would even be. It is kind of inherent to the new code gen approach. If assemblies were not used in the type names at all, the problem would go away, but obviously that is not backwards compatible at all.

Yes, we messed up by building 3.1.1 from master instead of the 3.1.1 branch. The 3.1.1 NuGet packages have been unlisted. 3.1.2 will be published ASAP.

Thank you, @SebastianStehle and @KevinCathcart, for pointing this out!

An ugly approach would be to get a mapping table from somewhere. If I understand it correctly we are only talking about Framework types.
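
One way to read the mapping-table idea: pin each framework type to the assembly name the old code generator saw at run time, so the formatted name no longer depends on the target framework. The sketch below is only an illustration of that idea, not a description of any actual fix. LegacyAssemblyMap and Format are hypothetical names, the two table entries are taken from the examples earlier in this thread, and generic arguments are left out to keep it short.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of a framework-type mapping table; not Orleans code.
public static class LegacyAssemblyMap
{
    // The "system assembly" as the old code gen saw it on the Desktop Framework.
    private const string SystemAssembly = "mscorlib";

    // Open generic type name -> assembly the old code gen observed at run time.
    private static readonly Dictionary<string, string> Table =
        new Dictionary<string, string>
        {
            { "System.Collections.Generic.List`1", "mscorlib" },
            { "System.Collections.Generic.HashSet`1", "System.Core" },
        };

    // Formats a type name the way the old code gen would have, regardless of
    // which reference assembly the type was found in at compile time.
    public static string Format(string typeName, string compileTimeAssembly)
    {
        string assembly;
        if (!Table.TryGetValue(typeName, out assembly))
        {
            // Not a known framework type: keep the compile-time assembly.
            assembly = compileTimeAssembly;
        }
        return assembly == SystemAssembly ? typeName : typeName + "," + assembly;
    }
}
```

With this table, Format("System.Collections.Generic.List`1", "System.Collections") drops the assembly name, and Format("System.Collections.Generic.HashSet`1", "netstandard") appends ,System.Core, so both target frameworks would produce the old strings. The obvious cost is that the table has to cover every framework type that can appear in a grain method signature.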

BTW, 3.1.2 is on nuget.org.

Resolved via #6394 and #6396.
