Runtime: [Api Proposal] Add CollectionsMarshal ref accessors for Dict

Created on 4 Aug 2018 · 44Comments · Source: dotnet/runtime

They aren't "safe" operations as the backing store can change if items are added, removed, etc and the changes aren't tracked

namespace System.Runtime.InteropServices
{
    public partial static class CollectionsMarshal
    {
        public ref TValue ValueRef<TKey, TValue>(Dictionary<TKey, TValue> dict, TKey key);
    }
}

However it is useful to get a reference to modify struct TValue entries without a full copy and replace update.

/cc @jkotas

/cc @stephentoub is it valid for ConcurrentDictionary?

api-needs-work api-suggestion area-System.Collections

Source

benaadams

❤12 👍1

Most helpful comment

struct LargeStruct
{
   // ...
   int Value;
}

1 hash lookup, no copies to update a struct

dict.ValueRef(key).Value++; // hash lookup

2 hash lookup, 2 copies to update a struct

var s = dict[key]; // hash lookup, struct copy
s.Value++;
dict[key] = s; // hash lookup, struct copy

benaadams on 27 Aug 2019

👍7

All 44 comments

Same comment https://github.com/dotnet/corefx/issues/31597#issuecomment-410452767 applies here.

jkotas on 4 Aug 2018

Updated

benaadams on 4 Aug 2018

@benaadams

public ref TValue ValueRef(Dictionary TKey key);

Did you mean to give the dictionary a parameter name and have two different parameters? Like

public ref TValue ValueRef<TKey, TValue>(Dictionary<TKey, TValue> dictionary, TKey key);

vcsjones on 7 Aug 2018

👍1

@benaadams does it make sense to create 1 issue for: dotnet/runtime#27061 and this guy, to propose adding CollectionsMarshal class and having some rationale and usage, to bring it over to API review?

Marking as 3.0 for now to not loose attention and then we can decide if we bring them individually or together as 1 issue to API Review.

safern on 12 Mar 2019

Not required for 3.0 release

danmosemsft on 30 May 2019

@benaadams same question as on the other issue for CollectionMarshal: what are compelling scenarios for these APIs?

terrajobst on 27 Aug 2019

struct LargeStruct
{
   // ...
   int Value;
}

1 hash lookup, no copies to update a struct

dict.ValueRef(key).Value++; // hash lookup

2 hash lookup, 2 copies to update a struct

var s = dict[key]; // hash lookup, struct copy
s.Value++;
dict[key] = s; // hash lookup, struct copy

benaadams on 27 Aug 2019

👍7

So if we were to expose an API that would allow you to cache the hash key, would this address your need? IOW, between struct copies and hash lookups, what hurts more?

terrajobst on 29 Aug 2019

The hashing isn't particularly expensive (e.g. if its an int key there is no cost); its then taking the mod of that and traversing the arrays while doing the lookup (i.e. the search).

The struct copies are expensive as if you want to modify one property of the struct its disproportionately expensive.

I do accept its inherently unsafe if you mix it with other operations on the list or dictionary; and then try to keep using the ref, so it would be nice to add something like a [CallerMustBeUnsafe] attribute https://github.com/dotnet/roslyn/issues/8663 so it can only be done in an unsafe block

benaadams on 29 Aug 2019

👍2

The hashing isn't particularly expensive (e.g. if its an int key there is no cost); its then taking the mod of that and traversing the arrays while doing the lookup (i.e. the search).

The struct copies are expensive as if you want to modify one property of the struct its disproportionately expensive.

That makes sense.

it would be nice to add something like a [CallerMustBeUnsafe] attribute

Other people might feel differently but the whole point is that this should be a good enough speed bump. Putting this on a type in System.Runtime.InteropServices that isn't an extension method is sufficient to me. We could prefix the APIs with Unsafe too, but I'm not sure we're helping much with this.

The larger question is: where do we draw the line with escape hatches like this? In a sense, we're chasing the 1% cases here, but I'm a bit scared that we're creating a mess the more APIs like this we're exposing throughout the BCL. And I don't have a good handle on this where we hold the bar.

Do you have any real workload where this would show up? Of course, one can always fabricate a micro benchmark; I don't doubt there are cases where this API would speed things up. The point is: how much does this matter in real code?

terrajobst on 30 Aug 2019

Looking at these apis more holistically:

For the dictionary you can achieve a similar; but safer, effect by wrapping the struct in a class then using the class handle. Its safer because the reference will remain valid across dictionary operations; and then a "live" reference can be more easily passed around and stored.

For the list List<T> AsSpan https://github.com/dotnet/corefx/issues/31597 would enable the same ability (working from the returned Span<T> which returns ref); but also be a more useful api.

For the dictionary using a class indirection doesn't cause too much issue because the data layout isn't important to the calling code (as how its internally stored in dictionary can't be consumed).

For list; that its a straight in-order array is useful, as then a consumer if they had access to it, can do things like vectorize the processing of the data, which a class indirection would break.

So in preference I'd drop these apis and add the List<T> AsSpan https://github.com/dotnet/corefx/issues/31597 one as it enables the same scenario as this api, plus other opportunities like vectoization (which this doesn't).

e.g. using the other api to achieve the same

AsSpan(list)[index].Value++; // though likely would cache the span if it was in a loop

ItemRef(list, index).Value++

benaadams on 31 Aug 2019

@benaadams, could you update this proposal to just be for Dictionary?

Seeing as we already reviewed/approved the List<T> changes here: https://github.com/dotnet/corefx/issues/31597 (and implemented https://source.dot.net/#System.Private.CoreLib/CollectionsMarshal.cs,18)

tannergooding on 16 Jan 2020

Done

benaadams on 16 Jan 2020

👍1

Also update the proposed API list in the top post?

jkotas on 16 Jan 2020

Dropped List and ConcurrentDictionary (as getting a live ref for ConcurrentDictionary is probably asking for trouble as other threads could invalid the ref)

benaadams on 16 Jan 2020

If we approve this API we'd also need to approve Unsafe.IsNullRef at the same time.

GrabYourPitchforks on 20 Feb 2020

@benaadams What's the expected behavior when the key is not found? Return a null ref, throw an exception, insert default(T) into the collection, or something else? Would a caller need to distinguish between "not found" and "I just created a dummy placeholder for you" scenarios?

GrabYourPitchforks on 20 Feb 2020

How about a method similar to the specialized SpanAction string allocator?

Dictionary.AddOrUpdate(TKey key,
TState state,
SpanAction factory);

TState is provided to help in no allocation scenarios.
The factory accepts a Span of size 1 item, the state and a bool saying if the key is found or not.
Both found/not found cases can benefit from being implemented as SpanAction in case the value type TValue has mutable fields itself - you can then update only the fields you have to without copying the whole struct.

Then use it like:

var dict = new Dictionary();

dict.AddOrUpdate(56, state, (value, s, b) => b ? ++value[0] : value[0] = state.Value);

MaximGurschi on 17 Sep 2020

@MaximGurschi I think we'd rather have a _ref T_ parameter than a single-element _Span\

GrabYourPitchforks on 17 Sep 2020

@GrabYourPitchforks Yes definitely the ref T is nicer, haven't used that so much (only recently got used to Span). Anyway - i would say leave the effects of the dictionary as they were after the action is invoked. I don't think it needs to guarantee atomicity.

For example HashSet.UnionWith will take as many items as possible from the IEnumerable even if that throws during a yield. So there is precedent in the framework for mutators that possibly do not leave the updated collection in a strictly speaking consistent state.

For the Dictionary the caller has enough information to handle partial updates - he just needs to grab a copy of the data.

MaximGurschi on 17 Sep 2020

The difference is that HashSet<T>.UnionWith is logically a set of individual adds, each of which is performed atomically. The hashset instance state is consistent with the fact that all adds up until the point of failure have been fully committed. The failure doesn't result in a "partial add".

For dictionary, consider this sequence of events:

Caller invokes AddOrUpdate.
Dictionary sees that the key doesn't exist so it quickly adds a dummy entry for that key to the dictionary. (This is required to get a _ref_ to where the value would go.)
Dictionary invokes callback, passing a ref to the slot carved out in step (2) above.
Callback throws an exception.

What internal state should the dictionary have at this point? Does the dictionary try to undo (2)? Do we logically say that AddOrUpdate consists of two operations: Add(key, default(TValue)) (always atomic), followed by an update (which may result in a partial write)?

GrabYourPitchforks on 17 Sep 2020

Wouldnt I potentially switch to DictionarySlim if I wanted these semantics?

danmosemsft on 17 Sep 2020

@GrabYourPitchforks Not important to this topic but the HashSet was just an example of something that possibly a large part of developers would overlook and expect "all or nothing" and actually have a bug in their code especially if the set had values before the call.

I have thought about the potential of a slot being left empty before. But this is where imho it didn't feel like such a big deal in the sense that the caller can be made aware that the key was left present, possibly with an empty slot. The documentation is important here. Ideally there would not need to be any more booking structures required inside the dictionary. The types performance characteristics should not degrade.

The mere fact that the callback was invoked can be sufficient to indicate that (even if it throws while execution) the slot is present/added.
Alternatively we have APIs in .NET that attempt to do work, fail but indicate that some partial state update has actually happened. An example is Monitor.Enter(obj,ref bool) where the bool can be set even if there is an exception. Thus the caller can decide for example to remove the key on an error or keep it. Adjusted API for illustration purposes only:
Dictionary.AddOrUpdate(TKey key,
TState state,
SpanAction factory, ref bool slotAdded);
The third approach, is deviating from the original proposal slightly, where in case the operation is actually and Add, the API can generate the slot on the stack, pass a ref to it, then once the callback is complete, add the slot and do the copy. The deviation here of course results into the entire value type being copied but that is not a big deal.

@danmosemsft That type looks great but lacks the ability to distinguish what actually happened (add or update) - is it official in the sense I can pull it down from a nuget package? Also it seems more dangerous for the reasons mentioned before like keeping a reference to the slot and the accidentally resizing the Dictionary. The SpanAction keeps that within the scope of the method so removes that risk.

MaximGurschi on 18 Sep 2020

@MaximGurschi yes you can pull it down in the Microsoft.Experimental.Collections package. It should be solid - the experimental part is mostly because we’re not committed to the API. As you point out, it has potentially unsafe properties. Which may be OK so long as those that use it are aware. Feedback on it is welcome maybe in an issue in the corefxlab repo

danmosemsft on 18 Sep 2020

Related: #36942.

@danmosemsft As stated in that issue, DictionarySlim doesn't support other cases that people may care about.

Joe4evr on 18 Sep 2020

@danmosemsft Ah looks like it is pre-release. I think i will use the approach of wrapping the value in a reference type. DictionarySlim is a nice start, but the safety concerns around dangling refs and the inability to distinguish add from update (for arbitrary TValues) is a deal-breaker.

MaximGurschi on 18 Sep 2020

@MaximGurschi Wanted to clarify - none of the problems I brought up are dealbreakers. As you had suggested, there are potentially viable workarounds for all of them. The reason I had brought them up is that these are the types of questions that will be brought up in API review, and we'd collectively need to agree on behaviors before the API can be marked approved.

GrabYourPitchforks on 21 Sep 2020

👍1

Do you have any _real_ workload where this would show up? Of course, one can always fabricate a micro benchmark; I don't doubt there are cases where this API would speed things up. The point is: how much does this matter in real code?

Perhaps it would be better to modify Dictionary<,> to return ref for the indexer. I think this would certainly meet beginner expectations who based on StackOverflow questions seem to be puzzled by why modifying the indexer return value doesn't affect the dictionary value.
I think it would also be clearer for code that uses the value as a count of some sort - e.g. implementing a SortedBag type based on SortedDictionary<T,int> or counting word occurrences with a Dictionary<string,int> where Perl's ref return from indexing makes using ++ seem very natural.
I am not sure the speed improvement is significant for these types of cases, though @damageboy had a use case in a comment on the related issue https://github.com/dotnet/runtime/issues/22121 :

I'm using this dictionary in a very specialized use case (receiving and parsing packets with single digit microsecond latency).

I have multiple dictionaries to keep updated, and switching the dictionaries to ref return (a.k.a my bastardized SuperDictionary) saved my a great deal of that latency when compared to the normal Dictionary.

NetMage on 8 Oct 2020

👍1

@GrabYourPitchforks Are there any other considerations that need to be addressed before we mark this ready for review?

eiriktsarpalis on 30 Oct 2020

@eiriktsarpalis Until the question "what happens if the key isn't found?" is answered, I don't think this is ready for review. The scenario is incomplete. I'll mark it needs-work.

GrabYourPitchforks on 31 Oct 2020

Until the question "what happens if the key isn't found?" is answered, I don't think this is ready for review.

My €0,02: For the scenario I'd use this in, I don't want to create a new entry since that just seems like it would bloat the dictionary. If that is a compelling scenario, then there may be a need for two different methods here, but I'd like the path of finding out the key didn't exist, meaning the reference to the value isn't valid, so I'll simply throw a KeyNotFoundException.

Joe4evr on 31 Oct 2020

Until the question "what happens if the key isn't found?"

Return Unsafe.NullRef<T>(); its an unsafe api, that should thus be valid?

benaadams on 31 Oct 2020

👍3

In the linked issue https://github.com/dotnet/runtime/issues/15059 I had proposed an alternative: always inserting default(TValue) into the dictionary if it doesn't already exist, and outing a Boolean that tells you whether the value already existed or whether it was freshly added.

public static class CollectionsMarshal
{
    public static ref TValue GetValueOrAddDefault<TKey, TValue>(Dictionary<TKey, TValue> @this, [NotNull] TKey key, out bool exists);
}

The upside is that it doesn't require us to make Unsafe.IsNullRef<T>(ref T value) public, and it is more useful in the likely common case of "I'm just going to put a value into the dictionary if it didn't already exist."

GrabYourPitchforks on 31 Oct 2020

👍1

The upside is that it doesn't require us to make Unsafe.IsNullRef(ref T value) public

We already made it public: https://docs.microsoft.com/en-us/dotnet/api/system.runtime.compilerservices.unsafe.isnullref?view=net-5.0 😄

it is more useful in the likely common case of "I'm just going to put a value into the dictionary if it didn't already exist."

I agree this might be a common case, but its also not a case we directly support on Dictionary today.
You can user the indexer, which throws KeyNotFoundException, Add which throws if a key already exists, TryAdd which does nothing if the key does exist, and TryGetValue which returns false if it does not exist.

The most similar thing would be to just throw KeyNotFoundException for the ValueRef API.
You could then have an additional TryGetValueRef API which returns NullRef if false or a GetValueOrAddDefault which always adds a new value.

tannergooding on 31 Oct 2020

Doh. I was checking the wrong refs. In that case, a family of three related methods might be useful.

GrabYourPitchforks on 31 Oct 2020

The most similar thing would be to just throw KeyNotFoundException for the ValueRef API.

I'd be fine with that, for one. Curious if Ben is as well.

So we're looking at 3 different methods now?
Method | Behavior on non-existent key
----------|-------
(Get)ValueRef | throws KeyNotFoundException
TryGetValueRef | returns NullRef<TValue>
GetValueOrAddDefault(Ref) | Creates a new entry in the dictionary and returns that location

I put some tweaks on the names there, since I would argue that these methods shouldn't skimp on established prefixes/suffixes.

Joe4evr on 31 Oct 2020

I think a ref TValue GetValueOrAddDefault(TKey, out bool); method would probably provide the most value. I can see how the other two variants could be useful in certain cases, however insofar as the motivation is to work around access patterns that currently require duplicate lookups, my impression is that they would be much less commonly used.

eiriktsarpalis on 2 Nov 2020

I believe that the second motivation is to avoid copies of large structs. I think we should deal with both the double lookups and copying of the large structs at the same time, so that we do not have to ever look at this again.

jkotas on 2 Nov 2020

👍4

I am fine with the three methods suggested in @Joe4evr's latest proposal.

GrabYourPitchforks on 3 Nov 2020

👍1

So, something like this?

public static class CollectionsMarshal
{
    /// <summary>
    ///   Gets a reference to the value associated with the specified key, or throws a KeyNotFoundException
    /// </summary>
    public static ref TValue GetValueRef<TKey, TValue>(Dictionary<TKey, TValue> dictionary, [NotNull] TKey key);

    /// <summary>
    ///   Gets a reference to the value associated with the specified key, or returns Unsafe.NullRef<TValue>
    /// </summary>
    public static ref TValue TryGetValueRef<TKey, TValue>(Dictionary<TKey, TValue> dictionary, [NotNull] TKey key, out bool exists);

    /// <summary>
    ///   Gets a reference to the value associated with the specified key, or inserts it with value default(TValue).
    /// </summary>
    public static ref TValue GetValueRefOrAddDefault<TKey, TValue>(Dictionary<TKey, TValue> dictionary, [NotNull] TKey key, out bool exists);
}

eiriktsarpalis on 5 Nov 2020

👍1

I believe public static ref TValue TryGetValueRef<TKey, TValue>(Dictionary<TKey, TValue> dictionary, [NotNull] TKey key, out bool exists); would just be public static ref TValue TryGetValueRef<TKey, TValue>(Dictionary<TKey, TValue> dictionary, [NotNull] TKey key); the user would check Unsafe.IsNullRef(ref result) for does not exist.

tannergooding on 5 Nov 2020

👍1

As for public static ref TValue GetValueRefOrAddDefault<TKey, TValue>(Dictionary<TKey, TValue> dictionary, [NotNull] TKey key, out bool exists);, does the out bool exists make sense? Is it important for users to know if the value already existed in this scenario?

tannergooding on 5 Nov 2020

As for public static ref TValue GetValueRefOrAddDefault<TKey, TValue>(Dictionary<TKey, TValue> dictionary, [NotNull] TKey key, out bool exists);, does the out bool exists make sense? Is it important for users to know if the value already existed in this scenario?

One use case I can think of would be something like this:

if (!dictionary.GetOrAdd(key, out var value, valueFactory))
{
    await value.ExpensiveOneTimeInitializationAsync();
}

await value.SomePocessRequiringInitializedStateAsync();

In this case, this GetOrAdd extension method would return whether the value existed, using the out bool exists parameter.

FiniteReality on 6 Nov 2020

Is it important for users to know if the value already existed in this scenario?

Yes, because it's just added a default entry so you need to know that it requires full initialization rather than just an update to some fields

benaadams on 6 Nov 2020

👍3

Was this page helpful?

0 / 5 - 0 ratings

Related issues

PEVerify Source?

matty-hall · 3Comments

'Type' does not contain a definition for 'IsEnum'

aggieben · 3Comments

Workaround for apps dependent on DTC in respect to porting to NetCore 3.0

iCodeWebApps · 3Comments

RC2 netstandard1.5 'Object' is defined in an assembly that is not referenced.

jchannon · 3Comments

JIT should recognize a popular type check pattern

omariom · 3Comments