Orleans: Cassandra provider for Orleans

Created on 16 Jun 2018 · 25 comments · Source: dotnet/orleans

After much slacking, I finally got my Cassandra integration module in shape and put it on GitHub. The repo is here, and I would love it if someone would take a look at the source. We have been running a 1.5-compatible version of this source for a few months now, without errors so far, though it hasn't been under a lot of load. I'm hoping to eventually get the source into OrleansContrib.

All 25 comments

This is awesome! I think it's better to put it to OrleansContrib right away. That should make it visible to more people for review and use.

Thanks! So (forgive the noob question, I'm an avid Microsoft fan and only ever used TFS up until now) how does one put a project on OrleansContrib?

No worries. Somebody will create a repo there for you and grant you permissions to it. Paging @richorama and @galvesribeiro, who've done this for many projects already.

Hey @Arshia001, I have made you an 'owner' in the OrleansContrib org. You should now be able to transfer the ownership of your repo over to OrleansContrib (you should see this in the danger zone of your repo settings). You'll still have full control over it.

Ping me an email if you have any problems ([email protected])

It's done. Thank you.

Thank you, @Arshia001!

But now we have 2 implementations of Clustering and Persistence providers for Cassandra?

https://github.com/OrleansContrib/Orleans.Persistence.Cassandra and
https://github.com/OrleansContrib/Orleans.Clustering.Cassandra by @denisivan0v
https://github.com/OrleansContrib/OrleansCassandraUtils

Does it make sense to merge them somehow?

It's really great to have two implementations; that's much better than nothing :)

I took a quick look at the implementation by @Arshia001 and found that we use different Cassandra driver APIs at the moment.

I'm really looking forward to having full and solid support for Cassandra and will be happy to take the best parts from both implementations. @Arshia001, can we set up a call to discuss it?

@denisivan0v I'm open to any and all discussions. Naturally, I also looked through yours, and here's a quick comparison (I'll use "yours" and "mine" throughout this, because there's no simpler way to distinguish the two. I'll also place my own reasoning within brackets to keep it separate from the comparison itself):

  1. Yours seems to have performance counter support. Mine does not.
  2. You use JSON, I use binary (but also allow custom serialization providers, so one could technically UTF8-encode JSON). [I think it'd be best if both options (text, JSON or otherwise, and binary) were allowed, both with the ability to add custom implementations and both with a few default serializers out of the box].
  3. Yours allows disabling ETags I think? Under what circumstances is this beneficial?
  4. I use a DB table for storing queries and consistency levels. Yours are hard-coded. I also believe we're both using the DataStax driver, but I use vanilla text queries while you use additional driver API. [I don't have anything against OR mappers in a system with (tens of) thousands of queries, but we only need about twenty or so for persistence, clustering and reminders combined, so the overhead can be avoided in this case.]
  5. I think yours allows multiple clusters to coexist in the same keyspace. Mine doesn't. [I never figured out why someone would be unable to create a new keyspace for each cluster (this is the cloud we're talking about, not some medium trust shared hosting).]
  6. I have no idea which one performs better, we should probably profile the two.

That's about it I think. As for merging the two, I don't really believe it's possible, since we use such radically different approaches. I think the best course of action is to choose one, and expand it to include features from the other.

@denisivan0v any thoughts?

@Arshia001, sorry for the delay, and many thanks for the detailed comparison. Here are a few comments:

  1. At the moment it is enough for us to have JSON serialization only, but it's very easy to add an external dependency for the serializer. I'm going to implement that in the near future.
  2. That's correct. Lightweight transactions are needed to implement ETag-based concurrency, and they are not very efficient in Cassandra. However, if your data is already linearized (for example, when consuming data from a queue), no transactions are needed since there is no concurrency. In these cases, the use of ETags can be avoided in favour of performance.
  3. I'm using the mapping features since they're much more refactoring-friendly. Strongly typed code can be reused as well, I guess (I need to take a look at the Reminders provider API).
  4. I just followed the implementations for other types of storage. Also, I'm looking forward to implementing cross-DC deployments.
  5. Agreed.

Can I ask you to give my providers a try in your environment? You can get them here: https://www.myget.org/feed/Packages/orleans-cassandra.

The project that I'm working on is going to prod in the early fall, so all missing features in providers for Cassandra will be implemented in the near future, and we will be highly focused on performance.

JSON is extremely wasteful in both space and performance. It's best to avoid it where possible.

I don't know why you think ETags are slow; you just need an IF clause in your query. I don't think the performance impact is noticeable compared to serialization. Your current implementation defaults to no ETag for all types, which will cause errors later on. You should at least assume all types need ETags by default, IMO.
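For context on the "IF in your query" point, ETag-style optimistic concurrency in Cassandra is usually expressed as a conditional write, i.e. a lightweight transaction. A minimal sketch, with a hypothetical table and column names not taken from either provider:

```cql
-- Hypothetical grain-state table with an etag column.
-- The IF clause turns the write into a lightweight transaction (Paxos-backed),
-- which is why it costs more than a plain write.
UPDATE grain_state
SET payload = ?, etag = ?
WHERE grain_id = ?
IF etag = ?;
```

The write is applied only when the stored etag matches the one the caller previously read, which is exactly the concurrency check being debated above.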

Mapping equals overhead, which is hardly necessary in this case.

Cross DC deployments don't need the cluster ID field. A cluster can span many datacenters.

I'll give it a try when I have time, but I don't know when that'll be.

JSON seems to be the recommended approach for storing state, since the data will be human readable. Maybe it's easier to change the state object when JSON is used too?

I remember reading somewhere that the team decided JSON wasn't a suitable option... @sergeybykov can we have your opinion?

I just did a little test. When used to serialize a relatively big grain (containing 1,000,000 records, each with a GUID and a ulong), JSON is ~43% slower and the resulting data is ~172% larger. For JSON serialization, I'm using Newtonsoft.Json. For binary, I'm using a modified version of Bond, which allows me to specify a custom serialization routine for any type. I also used each serializer once as warm-up before measuring the time. The results are as follows:

  • Binary: takes 1.027 seconds, results in a 24,000,009-byte blob. Note the original data is 24,000,000 bytes, which means this technique creates an overhead of only 9 bytes.
  • JSON: takes 1.469 seconds and results in a 65,481,689-character string, which means an additional 41,481,689 bytes (roughly 4.6 million times the 9-byte overhead of binary) are required to serialize to JSON. I'm using relatively short member names (3-5 bytes), so this number could get even larger.

I also tried deserializing the same data (which is originally a skip list) with these results:

  • Binary: deserializes the data in 0.562 seconds.
  • JSON: takes 2.261 seconds simply to deserialize the string. It also takes an additional 1.482 seconds to get the data into skip list format. It would probably be a little faster if it could deserialize directly into the skip list, but it'd certainly still take more than the 2.261 seconds of the first step. So that's between 302% and 566% slower.
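To make the size gap concrete, here is a minimal stdlib-only Python sketch in the same spirit; the record shape mirrors the GUID + ulong state above, scaled down to 10,000 records. This is an illustration only, not the Bond-based code the numbers above were measured with:

```python
import json
import struct
import uuid

# 10,000 records of (GUID, unsigned 64-bit integer), mirroring the grain
# state described above but scaled down.
records = [(uuid.uuid4(), i) for i in range(10_000)]

# Binary: 16 bytes for the GUID + 8 bytes for the ulong = 24 bytes per record.
binary = b"".join(struct.pack("<16sQ", g.bytes, n) for g, n in records)

# JSON: the same data as an array of objects with short member names.
text = json.dumps([{"id": str(g), "n": n} for g, n in records])

print(len(binary))                 # exactly 24 * 10,000 bytes
print(len(text.encode("utf-8")))   # several times larger
```

The exact JSON size depends on member-name length and number formatting, but the fixed 36-character GUID string alone already more than doubles the per-record cost compared to the 16-byte binary form.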

Version tolerance is an important feature you should consider, which is a benefit of JSON.

@richorama version tolerance is also a feature of some binary serializers.

@Arshia001 actually, as far as I can tell, the Orleans team now actively discourages using binary serialization and recommends JSON. It's already been mentioned above that this allows people to read that data, but the additional benefit of that is the ability to analyse and maintain (e.g. patch) that data where needed.

To me it looks like the Orleans team recommends something that is versionable.

I think that no matter what is done, a refactoring (changing a class name, namespace, or the like) can make the serialized data backwards-incompatible as far as tooling is concerned, and then a human-authored transformation has to be inserted into the pipeline. This can be whatever custom code is run to make the transformation succeed. It could work so that the transformation function takes the cluster ID, grain ID, grain type and data as parameters, and the developer uses that information to perform the transformation while the system still has data that needs transforming (in-storage transformations could be done as well).

The (de)serialization could also work so that one could plug in arbitrary (de)serializers and choose them with arbitrary parameters, as in the case of transformations. It would be helpful if one could, as with transformations, also change the serialization format.
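A minimal sketch of what such a pluggable pipeline could look like, written in Python for brevity (all names here, such as StoragePipeline and the transform signature, are hypothetical illustrations of the idea, not an existing Orleans API):

```python
import json
from typing import Callable, Dict, List

# A transform receives (cluster_id, grain_id, grain_type, data) and returns
# the migrated data, exactly as suggested above.
Transform = Callable[[str, str, str, bytes], bytes]

class StoragePipeline:
    """Hypothetical sketch: named (de)serializers plus in-flight transforms."""

    def __init__(self) -> None:
        self._serialize: Dict[str, Callable[[object], bytes]] = {}
        self._deserialize: Dict[str, Callable[[bytes], object]] = {}
        self._transforms: List[Transform] = []

    def register(self, name: str,
                 serialize: Callable[[object], bytes],
                 deserialize: Callable[[bytes], object]) -> None:
        self._serialize[name] = serialize
        self._deserialize[name] = deserialize

    def add_transform(self, fn: Transform) -> None:
        self._transforms.append(fn)

    def load(self, fmt: str, cluster_id: str, grain_id: str,
             grain_type: str, data: bytes) -> object:
        # Run migrations first, then hand the result to the chosen deserializer.
        for fn in self._transforms:
            data = fn(cluster_id, grain_id, grain_type, data)
        return self._deserialize[fmt](data)

pipeline = StoragePipeline()
pipeline.register("json",
                  lambda obj: json.dumps(obj).encode("utf-8"),
                  lambda raw: json.loads(raw))

# Example migration: a state member was renamed during a refactoring.
def rename_cnt(cluster_id: str, grain_id: str,
               grain_type: str, data: bytes) -> bytes:
    obj = json.loads(data)
    if grain_type == "CounterGrain" and "cnt" in obj:
        obj["count"] = obj.pop("cnt")
    return json.dumps(obj).encode("utf-8")

pipeline.add_transform(rename_cnt)

state = pipeline.load("json", "cluster-0", "grain-1", "CounterGrain",
                      b'{"cnt": 5}')
print(state)  # {'count': 5}
```

The transform hook is exactly the shape proposed above: it sees the cluster ID, grain ID, grain type and raw data, so old payloads can be migrated lazily as they are loaded.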

There is some prior work on this in the ADO.NET provider. See for instance https://github.com/dotnet/orleans/blob/master/src/AdoNet/Orleans.Persistence.AdoNet/Storage/Provider/OrleansStorageDefaultJsonDeserializer.cs, which implements a canonical interface to load in a JSON deserializer. This predates the 2.0 DI system, but the idea was to allow the user to wrap any (de)serializer and the system would happily use it. The ADO.NET provider currently supports JSON, XML and binary, each stored in its respective special field type when available (in relational storage there's some extra to be gained from using "native types"). It has the other mentioned features too, and even a test for a change of serialization format. A bit crude, but the prior art is there if there's interest in working towards common ground on this. :)

Again, version tolerant binary is possible and already available. If viewing the stored data is a requirement, one can implement a viewer utility. My storage module already supports custom serializers, and I'm using it together with Bond in production. I can put the code somewhere and we can all have a look once I'm back home.

@Arshia001 It might make sense to come up with a system that looks the same in both providers. My point wasn't solely about version-tolerant binary, but about plugging in any (de)serializer one thinks is called for, and, if necessary, even on a per-grain basis (I'm thinking of avoiding all sorts of format-transformation overhead).

@veikkoeeva Per-grain storage selection is already supported in Orleans; I don't know how beneficial it'd be to also support it at the storage provider level. As for plugging in serializers, my storage module already does that, and it also supports Orleans' default serializer as a fallback.

Anyway, here's the serializer I was talking about. It's version-tolerant, though a bit of manual work is required. You just assign new IDs to new data members and remove old ones. The rest is handled by Bond. I'm using it in production, and it works really well, but the source is a mess. You've been warned XD
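To illustrate the "assign new IDs" workflow: in a Bond schema every field carries an explicit ordinal, and version tolerance comes from adding new fields under fresh ordinals and never reusing a removed one. A hypothetical schema fragment (names and IDs are illustrative, not from the actual serializer):

```
namespace Example;

struct CounterState
{
    0: uint64 count;
    // 1 used to be "string name"; it was removed, and its ID is never reused.
    2: string label;   // added in a later version under a fresh ID
}
```

Old payloads still deserialize because fields are matched by ordinal, not by name; unknown ordinals are skipped and missing ones take their defaults.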

@Arshia001 Now that we have https://github.com/OrleansContrib/OrleansCassandraUtils as well as https://github.com/OrleansContrib/Orleans.Clustering.Cassandra and https://github.com/OrleansContrib/Orleans.Persistence.Cassandra in OrleansContrib, should we close this issue? IIRC there was a discussion about merging these three repos into one, but that's a topic for a separate issue I think.

@sergeybykov Yes, I agree. I'm still open to discussions and ready to help with integrating the modules too.

@Arshia001 Thanks for confirming.
