Rust: Collisions in type_id

Created on 9 Nov 2013  ·  53 Comments  ·  Source: rust-lang/rust

The implementation of type_id from #10182 uses SipHash on various parameters depending on the type. The output size of SipHash is only 64 bits, however, making it feasible to find collisions via a birthday attack. I believe the code below demonstrates a collision in the type_id value of two different ty_structs:

use std::hash;
use std::hash::Streaming;

// I believe that this is pretty much the same thing as hash_crate_independent() in ty.rs
// for a ty_struct on a 64-bit platform
fn hash_struct(local_hash: &str, node: u64) -> u64 {
    let mut state = hash::default_state();
    state.input([18]);
    state.input(local_hash.as_bytes());
    do node.iter_bytes(true) |bytes| { state.input(bytes); true };
    state.result_u64()
}

fn main() {
    // This represents two structures with different node values from a crate with a 
    // local_hash value of "local" that end up getting the same hash and thus, 
    // I think, the same type_id
    assert!(hash_struct("local", 0x9e02c8943c336302) == hash_struct("local", 0x366a806b1d5f1b2));
}
A-typesystem C-bug I-unsound 💥 P-low T-lang

Most helpful comment

use std::any::Any;

fn main() {
    let weird : [([u8; 188250],[u8; 1381155],[u8; 558782]); 0] = [];
    let whoops = Any::downcast_ref::<[([u8; 1990233],[u8; 798602],[u8; 2074279]); 1]>(&weird);
    println!("{}",whoops.unwrap()[0].0[333333]);
}

Actually a soundness issue. playground: http://is.gd/TwBayX

All 53 comments

I'm not entirely sure how feasible it is for a program to have 0x366a806b1d5f1b2 node ids (roughly 2.5 × 10^17), but this is still concerning.

We could in theory have very cheap inequality among types, and then have an expensive equality check: something that walks the respective TyDesc structures in parallel to make sure that they're the same. We could also bump up the hash size by using something like SHA-1/SHA-2 and have the type_id() intrinsic return [u8, ..N] to reduce the possibility of a collision.

Either way, I don't think that this is a super-pressing issue for now, but I'm nominating to discuss whether we want to get this done for 1.0. This could in theory have serious implications depending on how frequently Any is used.

Ah, it was already nominated!

Why not compare an interned version of the type data string? (i.e. what is currently passed as data to be hashed, possibly SHA-256 hashed first)

The linker can be used for interning by emitting a common symbol with the type data string as name and taking its address, and otherwise the same thing can be done manually in a global constructor.

This way it's always a pointer comparison, and there are no collisions.

I don't know how node id values are generated, but assuming that they are generated sequentially, this particular collision is not realistic. However, it's not hard to find collisions for more realistic node id values by picking particular values for the crate hashes:

assert!(hash_struct("a2c55ca1a1f68", 4080) == hash_struct("138b8278caab5", 2804));

The key thing to consider isn't the number of node id values, though: it's the total number of type id values. Some quick (hopefully correct) math shows that there is a 0.01% chance of a collision once there are around 60 million type id values. That's still a pretty large number of type id values for a somewhat low probability of a collision, though. So, it's unclear to me how big a deal this is for the Rust 1.0 timeframe. It all depends on what the acceptable probability of a collision is.
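
For reference, that figure follows from the standard birthday approximation, which a few lines of Rust can check:

fn main() {
    // Birthday approximation: among n ids drawn uniformly from d buckets,
    // P(collision) ≈ n^2 / (2d). Solve for n at p = 0.01% with d = 2^64.
    let d: f64 = 2f64.powi(64);
    let p: f64 = 1e-4;
    let n = (2.0 * d * p).sqrt();
    println!("{n:.0}"); // ≈ 60,700,000 type id values
}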

When I saw that @alexcrichton proposed using a hash, my first reaction was "collision!" but then I thought "...but exceedingly unlikely to occur in practice". I think this is not a matter of imminent destruction but if we can leverage the linker or some other scheme to avoid this danger, we should -- and perhaps we should just go ahead and mark the current scheme as deprecated and just plan on finding a replacement scheme.

A cryptographic hash designed for this purpose (larger output) would be enough, although a larger output would be more expensive to compare (four u64 comparisons for SHA-256).

We don't need to deal with this right now. P-low.

How relevant is this issue today? I think that it's all the same, but am not sure.

It's 64-bit so collisions are _likely_ with enough types (consider recursive type metaprogramming) and it doesn't have any check to bail out if one occurs. Bailing out is not a very good solution anyway, because it pretty much means that there's no way to compile the program, beyond using a different random seed and hoping for the best. It's a crappy situation.

Note that "hoping for the best" by iteratively changing the seed might work with overwhelmingly large probability after very few iterations.

use std::any::Any;

fn main() {
    let weird : [([u8; 188250],[u8; 1381155],[u8; 558782]); 0] = [];
    let whoops = Any::downcast_ref::<[([u8; 1990233],[u8; 798602],[u8; 2074279]); 1]>(&weird);
    println!("{}",whoops.unwrap()[0].0[333333]);
}

Actually a soundness issue. playground: http://is.gd/TwBayX

I'd like the lang team to devote a little time to this now that we are post 1.0. Nominating

OK, lang team discussed it, and our conclusion was that:

  1. This issue ought to be fixed, it's silly not to.
  2. This is an implementation detail that we could change whenever we want (right?)
  3. Nonetheless, we probably ought to open an RFC or at least a discuss thread, with a proposal to do better, since probably people will have some clever ideas.
  4. Probably the runtime overhead of the virtual call in the Any trait is way more than a strcmp anyhow for all realistic types.

I was wondering about a design where we do something like:

  • generate a static string representing the full type; in static builds, at least, this will be interned by the linker;
  • generate a hash of that string.

Then compare the string pointers for equality (to give a fast equality check). If that fails, compare the hashes for inequality (to give a fast inequality check). If THAT fails, compare the strings for content (to handle dynamic linking).
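
A rough sketch of that comparison order, with a hypothetical FatTypeId standing in for whatever the real representation would be (this is not the actual std::any::TypeId):

// Hypothetical per-type descriptor: a static name string plus a precomputed hash.
#[derive(Clone, Copy)]
struct FatTypeId {
    name: &'static str,
    hash: u64,
}

impl PartialEq for FatTypeId {
    fn eq(&self, other: &Self) -> bool {
        // 1. Fast equality: same interned string (same address and length).
        if self.name.as_ptr() == other.name.as_ptr() && self.name.len() == other.name.len() {
            return true;
        }
        // 2. Fast inequality: different hashes can never be the same type.
        if self.hash != other.hash {
            return false;
        }
        // 3. Slow path (e.g. across dynamic-library boundaries): compare the contents.
        self.name == other.name
    }
}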

Although re-reading the thread I see @bill-myers may have had an even more clever solution.

@nikomatsakis putting the hash of the data at the start is a good idea, to increase the probability that we catch unequal things quickly. It seems to me like @bill-myers' approach composes fine with that strategy.

I doubt the "problem" is limited to Any. You can probably confuse the compiler just as effectively by colliding hashes for symbol mangling, or many other things. What is the objective here? Since Rust is not a sandbox language, I don't think "protect memory from malicious programmers" should be one of our goals (we should document the types of undefined behavior that can be hit in safe code, and fix the ones that are _possible_ to hit by accident; if someone is determined to break the type system, they can already write an unsafe block, or use std::process to launch a subprocess that ptraces its parent and corrupts memory).

@nikomatsakis Should this be marked as I-unsound? I've done so for now, since that seems to be the general conclusion a couple of times by different people, but please unmark if I'm wrong.

You can probably confuse the compiler just as effectively by colliding hashes for symbol mangling, or many other things.

Would any of these result in incorrect runtime behavior, or just bizarre compiler errors?

EDIT: The comment I'm replying to is over 2 years old.

Would any of these result in incorrect runtime behavior, or just bizarre compiler errors?

It could be unsound, if it fooled Any into an incorrect downcast.

@nikomatsakis Right, I meant sorear's other examples (name mangling and such). The fact that this can cause a soundness bug separates it from other mishashes which would just result in a compiler error.

The eventbus crate uses a lot of Any, as it encourages you to use many different event types and emulated inheritance. This could become an issue...

What if, instead of making TypeId bigger, we add another implementation detail that works like eventbus?

Calls to Any, TypeId, etc. are converted into a set of static usizes that get initialized at runtime.

https://internals.rust-lang.org/t/static-generics/8734?u=soni

We can then use an expensive check (string lookup/comparison) once and a cheap check every other time.
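
If I understand the idea correctly, it amounts to interning type names into small integers at runtime. A minimal sketch under that assumption (all names here are made up; a real version would be emitted by the compiler or a macro):

use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

// Hypothetical global table interning type-name strings to small runtime ids.
static INTERNED: Mutex<Option<HashMap<&'static str, usize>>> = Mutex::new(None);

// The expensive path (string hashing/comparison) runs once per type; afterwards the
// cached id makes every later check a cheap integer comparison.
fn runtime_type_id(cell: &AtomicUsize, type_name: &'static str) -> usize {
    match cell.load(Ordering::Relaxed) {
        0 => {
            let mut guard = INTERNED.lock().unwrap();
            let table = guard.get_or_insert_with(HashMap::new);
            let next = table.len() + 1; // ids start at 1 so 0 can mean "not yet initialized"
            let id = *table.entry(type_name).or_insert(next);
            cell.store(id, Ordering::Relaxed);
            id
        }
        id => id,
    }
}

// One such static would be emitted per type that participates in Any/TypeId.
static ID_OF_MY_EVENT: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let a = runtime_type_id(&ID_OF_MY_EVENT, "mycrate::MyEvent");
    let b = runtime_type_id(&ID_OF_MY_EVENT, "mycrate::MyEvent");
    assert_eq!(a, b); // cheap check every time after the first call
}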

Why not use a pointer comparison of an inline symbol? VTables should work well, if we remove the thing where vtables can be distinct across TUs.

(in C++)

#include <cstddef>

// Minimal stand-in for the erased reference: a data pointer plus a vtable pointer.
struct AnyRef {
  void *ptr;
  const void *vtable;
};

template <typename T>
struct VtableForAny {
  std::size_t size;
  std::size_t align;

  void (*destroy)(void *);
};

template <typename T>
inline VtableForAny<T> VTABLE_FOR_ANY = {
  sizeof(T),
  alignof(T),
  [](void *p) { static_cast<T *>(p)->~T(); },
};

struct Foo {
};

template <typename T>
T *downcast(AnyRef any) {
  // Two types are the same exactly when they share the same inline vtable symbol.
  if (any.vtable == &VTABLE_FOR_ANY<T>) {
    return static_cast<T *>(any.ptr);
  } else {
    return nullptr;
  }
}

We could do a similar translation for Rust, and this is guaranteed to work by LLVM.

if we remove the thing where vtables can be distinct across TUs.

I think that would not be easy. It's not just TUs; if you create a trait object in a const then it will have a vtable generated by miri, and I am not sure if that will get deduplicated with the one generated for run-time use.

@ubsan

if we remove the thing where vtables can be distinct across TUs.

We can't, for the same reason generic statics couldn't be unique across TUs if we added them (dynamic linking on Windows, possibly other platforms).

@rkruppe ah, yeah, forgot about dynamic linking. It's likely we'd have to do a strcmp then, in the case where addresses are distinct, so you'd likely get a fast path and a slow path.

@ralfjung it's absolutely possible and, in fact, quite easy. It just requires a simple LLVM attribute.

What strings do you want to compare? Lacking a global namespace, the only way I know of to get a "type name" that won't be the same across different crates is using the crate disambiguator, which is a hash, which is therefore susceptible to the same sort of collision. Or can we arrange for such collisions to always result in a linker error, even when the crates are only linked together dynamically and indirectly?

@rkruppe the name of the type, I think? That should be globally unique.

crate_name::foo::bar::Baz

maybe crate_name:v0.1.0::foo::bar::Baz

Crate names are not globally unique, and neither are (crate name, version) tuples.

@rkruppe huh! then I guess whatever information we pass to the hash function :P

The disambiguator is passed in using the -C metadata argument; rustc doesn't know where it comes from. Arguably that places it outside the domain of Rust, and thus collisions in it aren't our soundness problem, but let's consider Rust workflows more broadly. Typically it's computed as a hash by Cargo, and AFAIK that hashing includes a whole ton of things that are impractical to put into the binary.

You can work around the generic statics issue the same way eventbus does it currently.

Language support would be an improvement over using a macro, though.

So, first, for fixing the collisions:

struct TypeId(&'static [&'static [&'static str]]);

Why this specific layout? Three reasons:

  1. We want an identifier that's consistent across builds, so a string.
  2. We want an identifier that's relatively lightweight, so we want to reuse substrings, or a slice of strings.
  3. We want to handle generics, which contain identifiers, so a slice of slices of strings.

So, for example, we can end up with some of the following TypeIds:

TypeId(&[&["rustc-1.30", "cargo-1.30", "eventbus-0.4.0", "Handlers<T>"], &["rustc-1.30", "cargo-1.30", "ircclient-0.1.0", "IrcLineEvent"]])
TypeId(&[&["rustc-1.30", "cargo-1.30", "std-1.30", "result", "Result<T, E>"], &["rustc-1.30", "cargo-1.30", "eventbus-0.4.0", "Handlers<T>"], &["rustc-1.30", "cargo-1.30", "ircclient-0.1.0", "IrcMessageEvent"], &["rustc-1.30", "cargo-1.30", "eventbus-0.4.0", "MyError"]])

Note that these are only examples and the real thing probably wouldn't look like this, but this shows the basic idea.

It's still possible to intentionally create collisions, of course, but that's different from accidental collisions (what we're trying to prevent here).

Also note that these are still opaque, so the contents need not be unambiguous to humans, only to the compiler (or the debugger, if we want that...).
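
A minimal sketch of how comparison on such a representation might work (illustrative only; StructuralTypeId is not a real type):

// One inner slice of path segments per type involved (the outer type plus the
// types it is instantiated with), as in the examples above.
struct StructuralTypeId(&'static [&'static [&'static str]]);

impl PartialEq for StructuralTypeId {
    fn eq(&self, other: &Self) -> bool {
        // Fast path: the same interned slice means the same type.
        if self.0.as_ptr() == other.0.as_ptr() && self.0.len() == other.0.len() {
            return true;
        }
        // Slow path: structural comparison of the strings, which cannot collide
        // accidentally because it compares the actual identifiers.
        self.0 == other.0
    }
}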

@rkruppe With your comment about the metadata string, did you mean that it's "safe" to use it as a disambiguator between different builds of the same crate (be it versions or feature-sets), or that we _shouldn't_ use it (because with non-cargo workflows it may not be passed)?

Storing the type "name" and comparing that (potentially with a short hash for a fail-fast option) and trusting the -C metadata argument to be sufficient for disambiguation sounds like it could be the best option for reducing the chance of collisions.

I didn't really mean to recommend any course of action, just list constraints. One of them is that the metadata string is itself a hash in the typical workflow and thus could have collisions.

But thinking about the options you gave, it seems to me that if we are OK with a negligible chance of collisions, we can achieve that with a single hash (in principle; I don't know how large the collision risk actually is today). Well, assuming collisions in the crate disambiguator can't be detected -- I am not sure whether there is perhaps some clever use of the linker to detect them.

I have tricks to improve performance, but they need to come after this

(am I even being understood? I have no idea if ppl can understand me but they often seem to ignore me)

What about keeping a "cache" of calculated TypeIds somewhere? That cache would store a TypeId/type pair for each type for which TypeId::of::<T> has been called. If a collision is detected, another TypeId is calculated using a hash-table collision-resolution algorithm. So, TypeIds would be indices into a hash table with 2^64 buckets. With this approach, we can continue to use a 64-bit integer as the TypeId (or maybe it could even be reduced to 32 bits). However, this means that if the cache is deleted, all crates would have to be compiled again.

We could also forbid TypeId zero, so size_of::<Option<TypeId>>() == size_of::<TypeId>().
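
A sketch of what such a registry could look like (purely illustrative; in practice this table would have to live in some shared, persistent cache, which is the awkward part):

use std::collections::HashMap;

// Assigns each distinct type description a 64-bit id, probing to the next id on
// collision, like open addressing in a table with 2^64 buckets.
struct TypeIdRegistry {
    assigned: HashMap<u64, String>,
}

impl TypeIdRegistry {
    fn assign(&mut self, type_desc: &str, initial_hash: u64) -> u64 {
        let mut id = initial_hash.max(1); // keep 0 free so Option<TypeId> can stay the same size
        loop {
            match self.assigned.get(&id) {
                // Already registered: reuse the previously assigned id.
                Some(existing) if existing.as_str() == type_desc => return id,
                // Collision with a different type: probe the next bucket.
                Some(_) => id = id.wrapping_add(1).max(1),
                // Free slot: claim it.
                None => break,
            }
        }
        self.assigned.insert(id, type_desc.to_owned());
        id
    }
}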

A... well, not resolution, but partial mitigation might be achieved just by calculating the TypeId of all types in a compilation run and making sure they're unique. This doesn't solve the problem but can at least make it noisy for an end-user instead of possibly silently failing. Doesn't help the networked/IPC use case like eventbus much, but it's something.

eventbus is not networked/IPC. The idea was to use multiple DLLs/shared objects.

Why doesn't type_id use a perfect hash function like rust-phf?

In computer science, a perfect hash function for a set S is a hash function that maps distinct elements in S to a set of integers, with no collisions.

Why doesn't type_id use a perfect hash function like rust-phf?

Because it already has to decide on type ids during codegen of one crate, at which point type ids from upstream crates are not yet known.

Because it already has to decide on type ids during codegen of one crate, at which point type ids from upstream crates are not yet known.

Perhaps type_id should have some bits for a crate id, and some bits for the per-crate PHF. This way you ensure there are no collisions.

Depending how nuts you want to go with it, you could use the package checksum (truncated to 128-bits) as e.g. a SipHash key, with the crate name and the rest of the type information as the SipHash input.

No idea how hard it'd be to make rustc aware of package checksums if it isn't already, but it should suffice to eliminate collisions so long as the names of crates within the same package are unique.

The output size of SipHash is only 64-bits

Note that there's a variant of SipHash with a 128-bit output as well (although I'm guessing an implementation thereof is not available in-tree)
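
A rough sketch combining those two suggestions, assuming the third-party siphasher crate's 128-bit API (sip128::SipHasher24::new_with_keys and Hasher128::finish128); the function and its parameters are hypothetical:

use std::hash::Hasher;

use siphasher::sip128::{Hasher128, SipHasher24};

// Hypothetical: derive a 128-bit type id from a package checksum (used as the
// SipHash key) plus the crate name and a string describing the type.
fn type_id_128(package_checksum: [u8; 16], crate_name: &str, type_desc: &str) -> u128 {
    // Split the truncated checksum into the two 64-bit halves of the SipHash key.
    let k0 = u64::from_le_bytes(package_checksum[..8].try_into().unwrap());
    let k1 = u64::from_le_bytes(package_checksum[8..].try_into().unwrap());

    let mut hasher = SipHasher24::new_with_keys(k0, k1);
    hasher.write(crate_name.as_bytes());
    hasher.write(&[0]); // separator, so ("ab", "c") and ("a", "bc") hash differently
    hasher.write(type_desc.as_bytes());
    hasher.finish128().as_u128()
}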

Depending how nuts you want to go with it, you could use the package checksum (truncated to 128-bits) as e.g. a SipHash key, with the crate name and the rest of the type information as the SipHash input.

No idea how hard it'd be to make rustc aware of package checksums if it isn't already, but it should suffice to eliminate collisions so long as the names of crates within the same package are unique.

The output size of SipHash is only 64-bits

Note that there's a variant of SipHash with a 128-bit output as well (although I'm guessing an implementation thereof is not available in-tree)

But using more than 64 bits would change the size of the TypeId struct, which is worse. I like the idea of making the crate id bits part of TypeId.
If we change the size of TypeId to 128 bits, then we can make the first 64 bits a crate id and the second 64 bits a crate-local type id, which, in that case, can be built by a perfect hash function.

I hit this bug when I tried to create a type map (HashMap<TypeId, _>) to store global data for some types. Since static variables with generic parameters from outer functions are not allowed (this pattern was used in the previous version of the app, written in C#), a type map was the only option. The types were auto-generated from existing Protobuf files and there were a few hundred of them. Thankfully, I was able to catch the collision in a test that put all the types inside the map and checked that they exist.

And I think this behavior (potential collisions) should be documented in TypeId and Any docs.

Probably the best approach to your problem would have been to generate a separate static for each type, rather than generating a type map. A type map only really makes sense when the set of types in it can only be determined at runtime.

Do you know how many types were generated as part of this process? You say "few hundred" but I assume you mean protobuf files; the chance of a collision in a map with only a few hundred types should be very low.

Probably the best approach to your problem would have been to generate a separate static for each type

That's what I had in C#, and at first I tried something like this in Rust:

fn get_data<T>() -> &'static mut Data<T> {
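    // does not compile: generic parameters from the outer function can't be used here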
    static mut DATA: Data<T> = Data::new();
    unsafe { &mut DATA }
}

but it doesn't work. A type map was a nice solution without additional codegen, but in the end I did just what you said: forked the protobuf compiler and added a separate static for each type in the generated files.

Do you know how many types were generated as part of this process?

Probably around 400. In the protobuf files there were way more, but in the type map I used only a small subset of them.

the chance of a collision in a map with only a few hundred types should be very low.

Indeed. I hit it accidentally and didn't even expect that.

You can't have generic statics but you can generate a static of a different type for each type, and then associate the static with the type using a static method on a trait.
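
A minimal sketch of that pattern, reusing the Data<T> name from the snippet above; the HasGlobalData trait and FooMessage type are made up, and the per-type impl is what a macro (or the forked protobuf codegen) would emit:

use std::sync::Mutex;

struct Data<T> {
    items: Vec<T>,
}

impl<T> Data<T> {
    const fn new() -> Self {
        Data { items: Vec::new() }
    }
}

// One associated function per concrete type stands in for the unsupported generic static.
trait HasGlobalData: Sized {
    fn global_data() -> &'static Mutex<Data<Self>>;
}

// Emitted once per generated type:
struct FooMessage {
    payload: String,
}

static FOO_DATA: Mutex<Data<FooMessage>> = Mutex::new(Data::new());

impl HasGlobalData for FooMessage {
    fn global_data() -> &'static Mutex<Data<Self>> {
        &FOO_DATA
    }
}

fn main() {
    let data = FooMessage::global_data();
    data.lock().unwrap().items.push(FooMessage { payload: "hello".into() });
}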

So I believe the upper limit on a 64-bit hash would give a probability of this occurring of less than 1 in 8.3 million (512 out of 2^32). Obviously SipHash is not designed to be collision resistant and the probability is greater than this, but I can't find any good number as to how much greater. If the collision resistance is so low that a user has a reasonable chance of encountering a collision among fewer than 500 TypeIds, we should prioritize a resolution to this higher than we have, mainly because it's an awful footgun as you've mentioned.

Does anyone have better numbers on the lack of collision resistance in SipHash than the birthday bound?

From the SipHash paper, the major concerns around collision resistance are the small output size and correspondingly small birthday bound as you noted:

https://www.131002.net/siphash/siphash.pdf

Security is also limited by the output size (64 bits). In particular, when SipHash is used as a MAC, an attacker who blindly tries 2^s tags will succeed with probability 2^(s−64)

The paper further notes:

We comment that SipHash is not meant to be, and (obviously) is not, collision resistant.

(this is due to the small output size)

@Y0ba Wooooah, you hit the... what, 1 in 10 million chance of this happening. Congrats! 😁

@tarcieri Yeah, but I suspected that the collision resistance may be substantially lower (because it's not a priority), since a collision in fewer than 500 hashes seems surprising otherwise. The paper doesn't discuss collision resistance at any real length, since even a perfect 64-bit hash would not be sufficiently resistant and it's not a concern. Do you know of any further information?

If I'm reading the birthday table correctly, 500 hashes colliding within a 64-bit hash space should have a probability around 10^-14, so that suggests something else is wrong.

SipHash is allegedly a secure PRF and therefore has uniformly random outputs, so I doubt it's something wrong with the algorithm.
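
A quick check of that estimate:

fn main() {
    // Birthday approximation: P(collision) ≈ n^2 / (2d) for n ids in d buckets.
    let n: f64 = 500.0;
    let d: f64 = 2f64.powi(64);
    println!("{:e}", n * n / (2.0 * d)); // ≈ 6.8e-15, i.e. on the order of 10^-14
}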
