HashSet has all kinds of useful set functionality (difference, symmetric_difference, intersection, union, is_disjoint, is_subset, is_superset), but these functions are incompatible with HashMap keys.
It would be nice to enable the use of these functions between HashMaps and HashSets, either by allowing semi-cheap conversions from HashMap::keys() to a HashSet or by enabling those functions to take either maps or sets.
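For concreteness, here is a minimal sketch (with toy keys and values) of the workaround this forces today: every key gets re-hashed just to build throwaway HashSets before any set operation can run.

```rust
use std::collections::{HashMap, HashSet};

fn main() {
    let a = HashMap::from([("x", 1), ("y", 2)]);
    let b = HashMap::from([("y", 10), ("z", 3)]);

    // The usual workaround today: collect borrowed keys into HashSets,
    // which re-hashes every key, then use HashSet's set operations.
    let a_keys: HashSet<_> = a.keys().collect();
    let b_keys: HashSet<_> = b.keys().collect();
    for k in a_keys.difference(&b_keys) {
        println!("key only in a: {k}"); // prints: key only in a: x
    }
}
```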
How about:
```rust
fn std::collections::btree_map::BTreeMap::keys_as_set(&self) -> BTreeSet<'a, K> { .. }
fn std::collections::hash_map::HashMap::keys_as_set(&self) -> HashSet<'a, K> { .. }
```

or possibly:

```rust
fn std::collections::btree_map::Keys::as_btree_set(&self) -> BTreeSet<'a, K> { .. }
fn std::collections::hash_map::Keys::as_hash_set(&self) -> HashSet<'a, K> { .. }
```

It might not be that trivial implementation-wise, because {BTree,Hash}Set internally use their Map counterparts:
... but I think it should be possible (EDIT: nevermind, after looking into implementing this, it doesn't appear to be so straightforward).
I've been thinking about this some more, and I think it would be useful to have all the set operations supported between Map<K, V1> and Map<K, V2>. The difference iterator item type is (&K, &V1). The intersection iterator item type is (&K, &V1, &V2). The symmetric_difference and union iterator item type is &K, or possibly (&K, Occurrence<V1, V2>) with
```rust
enum Occurrence<'a, V1, V2> {
    InSelf(&'a V1),
    InOther(&'a V2),
    InBoth(&'a V1, &'a V2),
}
```
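For illustration, here is a rough sketch of how such a union-style iterator could be built as a free function over today's HashMap (Occurrence and union_with are hypothetical names, not existing API); note that every get/contains_key call re-hashes the key with the other map's hasher.

```rust
use std::collections::HashMap;
use std::hash::Hash;

// Same shape as the enum above, with the lifetime written out.
enum Occurrence<'a, V1, V2> {
    InSelf(&'a V1),
    InOther(&'a V2),
    InBoth(&'a V1, &'a V2),
}

// Hypothetical free-function version of a map `union`; each lookup in the
// other map re-hashes the key rather than reusing any stored hash.
fn union_with<'a, K: Hash + Eq, V1, V2>(
    a: &'a HashMap<K, V1>,
    b: &'a HashMap<K, V2>,
) -> impl Iterator<Item = (&'a K, Occurrence<'a, V1, V2>)> {
    let from_a = a.iter().map(move |(k, v1)| match b.get(k) {
        Some(v2) => (k, Occurrence::InBoth(v1, v2)),
        None => (k, Occurrence::InSelf(v1)),
    });
    let only_in_b = b
        .iter()
        .filter(move |(k, _)| !a.contains_key(*k))
        .map(|(k, v2)| (k, Occurrence::InOther(v2)));
    from_a.chain(only_in_b)
}

fn main() {
    let a = HashMap::from([("x", 1)]);
    let b = HashMap::from([("x", 10u64), ("y", 20)]);
    for (k, occ) in union_with(&a, &b) {
        match occ {
            Occurrence::InSelf(v1) => println!("{k}: only in a ({v1})"),
            Occurrence::InOther(v2) => println!("{k}: only in b ({v2})"),
            Occurrence::InBoth(v1, v2) => println!("{k}: in both ({v1}, {v2})"),
        }
    }
}
```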
Bump, I think this would be quite useful!
A lot of the advice online seems to be to first collect the keys of the HashMap into a HashSet, but it would be far more efficient to operate on the map directly rather than re-hashing everything, so I would very much like these functions.
> A lot of the advice online seems to be to first collect the keys of the HashMap into a HashSet, but it would be far more efficient to operate on the map directly rather than re-hashing everything, so I would very much like these functions.

I agree.
Increasing their compatibility becomes problematic due to existing differences. We'd ideally define `type HashSet<K, S = RandomState> = HashMap<K, (), S>;`, except that HashSet::insert returns true on fresh insertion, while HashMap::insert returns Some(old_value) on non-fresh insertion (and None on a fresh one), so the two signal opposite cases. We could add analogs of these methods to HashMap where reasonable, so that code bases that require interoperation or consistency could ban HashSet entirely.
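To make that asymmetry concrete, here is a minimal illustration of the two return conventions:

```rust
use std::collections::{HashMap, HashSet};

fn main() {
    let mut set = HashSet::new();
    assert!(set.insert("a"));  // true: "a" was newly inserted
    assert!(!set.insert("a")); // false: "a" was already present

    let mut map = HashMap::new();
    assert_eq!(map.insert("a", 1), None);    // None: "a" was newly inserted
    assert_eq!(map.insert("a", 2), Some(1)); // Some(old): "a" was already present
}
```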
I'll note that is_* methods would be improper grammar here, so you are really asking for are_* methods.
Although difference and symmetric_difference make sense, I suppose intersection, union, and the are_* methods all require a closure that merges values, although union could simply be non-commutative. Instead, you could just add two iterator methods:
- `pub fn entries(&mut self) -> impl Iterator<Item = Entry<'_, K, V>>`, which iterates over the entries and from which users implement difference, intersection, and union, and
- `pub fn entries_both<V1, S1: BuildHasher>(&mut self, other: &mut HashMap<K, V1, S1>) -> impl Iterator<Item = (Entry<'_, K, V>, Entry<'_, K, V1>)>`, from which users implement symmetric_difference.

All code should assume distinct HashMaps use distinct RandomStates, which makes double hashing unavoidable. Avoiding double hashing requires either type-level non-const values (i.e. full dependent types), or else building a caching Hasher using interior mutability and then sharing S: BuildHashers.
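For what it's worth, sharing a BuildHasher across two maps is already expressible on stable with RandomState and HashMap::with_hasher; the sketch below only shows the construction, since the stable API gives no way to reuse the stored hashes afterwards.

```rust
use std::collections::hash_map::RandomState;
use std::collections::HashMap;

fn main() {
    // Two maps built from clones of the same RandomState hash any given key
    // identically, which is the precondition for skipping the second hash.
    let s = RandomState::new();
    let mut a: HashMap<&str, i32, _> = HashMap::with_hasher(s.clone());
    let mut b: HashMap<&str, u64, _> = HashMap::with_hasher(s);
    a.insert("x", 1);
    b.insert("x", 2);
    // The stable public API doesn't expose or compare the stored hashes,
    // though, so cross-map operations still go through contains_key / get.
    assert!(a.contains_key("x") && b.contains_key("x"));
}
```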
As should be obvious, you increase DoS attack risks by reusing S: BuildHashers. We've had one past proposal that HashMap: Clone should rehash everything, after which we'd add some HashMap::insecure_clone; that concern applies here too. At a minimum, if methods clone the hasher without re-randomizing it, then they should be marked as insecure in the documentation.
> All code should assume distinct HashMaps use distinct RandomStates, which makes double hashing unavoidable.
Wow, thanks for pointing this out. The entire idea of this feature is predicated on not having to rehash things. I assumed the functions on HashSet were efficient, but you point out they can't be, and I just looked at the code for HashSet::difference and HashSet::union and they are indeed just “dumb” implementations in terms of repeated calls to HashSet::contains.
I suppose we need some notion of Hasher equality to implement these operations efficiently when possible.
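That same “repeated contains” strategy can at least be applied to the maps directly, skipping the intermediate HashSets. Here is a rough sketch under that approach (map_difference is a made-up free function, not existing API); each lookup still re-hashes the key with the other map's hasher.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Entries of `a` whose keys are absent from `b`, computed by probing `b`
/// directly instead of first collecting both key sets into HashSets.
fn map_difference<'a, K: Hash + Eq, V1, V2>(
    a: &'a HashMap<K, V1>,
    b: &'a HashMap<K, V2>,
) -> impl Iterator<Item = (&'a K, &'a V1)> {
    a.iter().filter(move |(k, _)| !b.contains_key(*k))
}

fn main() {
    let a = HashMap::from([("x", 1), ("y", 2)]);
    let b = HashMap::from([("y", 10u64)]);
    for (k, v) in map_difference(&a, &b) {
        println!("{k}: {v}"); // prints: x: 1
    }
}
```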
Even if the hash function is the same, depending on the implementation the hash may not actually be stored, or only part of the hash may be stored, so even in that case you may end up re-hashing the keys.
And you can't just assume the same element would be in the same bucket in the other hashmap, since the hashmaps may be different sizes, or collisions may have been resolved differently.
> Even if the hash function is the same, depending on the implementation the hash may not actually be stored, or only part of the hash may be stored, so even in that case you may end up re-hashing the keys.
> And you can't just assume the same element would be in the same bucket in the other hashmap, since the hashmaps may be different sizes, or collisions may have been resolved differently.
These are just implementation details. Without hash equality you can never be efficient; with it, you can be as efficient as the implementation allows. It might not be as efficient as the operations on two sorted lists of (hash, K, V), but that's no reason not to do better.
I once "mostly" worked out a low-level cuckoo table interface that employs "detached hashing", so you can compute the "index" for a key once, and then manipulate cuckoo tables however you like. Inserts cause reorgs still, but only cuckoo table style.
In fact, tables store only values not keys, so keys of different types make sense in the same table, although this makes security worse, so you'd usually want some static key type enforcement layer on top somehow.
I'd never considered using the same hasher across multiple tables, because cuckoo tables are more vulnerable to bad hashers than Robin Hood hashing, but that could become an attractive property if security does not enter your picture.
It'd require some tuning too: in theory, cuckoo tables with a strong hasher handle fuller tables better than Robin Hood hashing. I forget if this analysis considers cache locality though, so I'm not sure in practice.
I'll try to remember to post back here if I find time to write it up. I could discuss this somewhere if someone else is interested too; for example, I can point out the "2.5" literature around cuckoo tables and discuss the interface. And stuff like https://twitter.com/danrobinson/status/1272267659313176578 helps if you want non-2^k sized tables, which you do when you want to send a cuckoo table in a wire protocol.
I assume in theory you could also just compute the hash as H(data||random) and clone the hash state just before hashing the random value. This would allow you to efficiently compute the hash for an element for multiple hashers. But the current API doesn't really support this.
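A sketch of that state-forking idea in terms of the Hasher trait alone; hash_with_salts is hypothetical, only works for hashers that happen to implement Clone, and nothing in HashMap's BuildHasher plumbing lets you feed such pre-forked state into a table.

```rust
use std::hash::{Hash, Hasher};

/// Computes H(data || salt) for several salts while hashing `data` only once,
/// by forking the hasher state after the data has been absorbed.
fn hash_with_salts<H: Hasher + Clone, T: Hash + ?Sized>(base: H, data: &T, salts: &[u64]) -> Vec<u64> {
    let mut prefix = base;
    data.hash(&mut prefix);
    salts
        .iter()
        .map(|salt| {
            let mut h = prefix.clone();
            salt.hash(&mut h);
            h.finish()
        })
        .collect()
}
```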