Currently Crystal ignores existence of HashFlood: hash function for base types is trivial "identity", and for strings it is very simple and easy to forge, and no any seeding performed.
Hash functions should be at least not-trivially forged when random seed used.
Design with separated hasher allows switch between different hash algorithms: default "safer" for public facing applications, and "faster" for non-tainted data.
As proposed by @RX14
class Object
def hash
Hashing::StdHasher.build do |hasher|
hash(hasher)
end
end
def hash(hasher)
hasher << object_id
end
end
class MyClass
property foo : Foo
property bar : Bar
def hash(hasher)
hasher << @foo << @bar
end
end
Extracted from #4557 discussion because it may be discussed and implemented before.
/cc @RX14, @funny-falcon
Looks like we need
module Hashing
alias Type = Int32
module Build
def build : Type
builder = new
yield builder
builder.calculate
end
def seed
555
end
end
class StdHasher
extend Build
property val
def initialize
@val = 0
end
def <<(value) : self
self.val += value
self
end
def calculate : Type
val + self.class.seed
end
end
end
hash = Hashing::StdHasher.build do |hasher|
hasher << 12 << 24
end
p hash
Any suggestions?
Looks a lot like I thought about.
There is also need for random seed. I understood how to generate it with SecureRandom, but I'm in doubdt how to share it between StdHasher and Int32Hasher (special case for Hash(Int32, V)). Can you give me an advice?
I suppose that You may implement seed in another module like a Build. Extend classes with it and do self.class.seed.
By the way - hashers are reference classes in example above now.
Here is draft of Hashing::StdHasher https://gist.github.com/funny-falcon/52eb622d689a2eae3e2d7b4e4a77d5ac
I strongly miss #1517 for convenience. Had to do wrapper struct holding pointer to implementation to achieve stack allocation + pass-by-pointer + convenient method calling. Compiler is not smart enough to allocate class-object on a stack.
why not
class String
def hash(hasher)
hasher << to_slice
end
end
also looks like Object.hash() with build already does *.hash with build, so need no duplicated code.
```
why not
Ah, you are right.
also looks like Object.hash() with build already does *.hash with build, so need no duplicated code.
Sorry, I doesn't understand what you mean :-( This were example that compiles with unmodified master.
Of course I will modify original Object#hash() in pull request, if you mean that.
But still Hash may initiate builder by itself (for example, if per-hash-table seed will be accepted).
Yes, I mean that.
Please feel free to clone crystal repo, make crystal and use bin/crystal.
It allows to do anything.
btw, I can do a lot with standard Crystal: https://play.crystal-lang.org/#/r/27bo
Yes, I use bin/crystal.
Opening issues with code and no further explanations is only asking volonteers to search and seek for what the hell you are trying to solve. Please at least explain the issue, or nobody from the core team will care to take a look.
Explanation:
Currently Crystal ignores existence of HashFlood: hash function for base types is trivial "identity", and for strings it is very simple and easy to forge, and no any seeding performed.
Hash functions should be at least not-trivially forged when random seed used.
Design with def hash(hasher); hasher << @a << @b; end allows switch between different hash algorithms: default "safer" for public facing applications, and "faster" for non-tainted data.
I'm working on an issue, and will do pull request next week, if no one complains against.
Thanks for taking the time to explain. I agree hash codes as currently implemented are simple but basic and problematic (leading to attack vectors).
allows switch between different hash algorithms, one safer, one faster.
No. There should be a single hash solution, good for all use cases (safety first, then the fastest), implemented just like today: a generic Object#hash implementation, which can be overriden to better fits each struct/class data.
@ysbaddaden, in this case overriding hash would be difficult, as user will have to think about safety and seeding (or just use boilerplate h = Hasher.new; h<<field1...; h.finalize). Design with hash(hasher) looks more convenient even if there will be only one way to hash (and overriding hash directly will still be possible).
After all, hash(hasher) is very reusable : one may use even SHA3 without changing a line, just supplying other hasher.
Though, there is a one pity point: Hash#hash currently key/insertion/iteration order agnostic. It will be quite difficult to preserve this property with hash(hasher). Either one have to sort by keys before hashing, or declare that hashes with different iteration order are different.
#hash is something nobody should have to care about, even less deal with: the stdlib must do all the job, and do it so well that nobody else than stdlib maintainers have to worry about it. Overiding is for stdlib structs like Float64 or Time may provide specific solutions, if need be.
@funny-falcon while that might be useful, I think it's a bad idea to overly abstract this and implement some kind of class hierarchy so that you can use multiple hashing algorithms. You should just implement a single Hasher class which implements one good generic hash. If people want to use multiple hashing algorithms they can do so through duck typing.
@ysbaddaden everyone who defines custom equality also has to define a custom hash, and I think that's pretty common. We even have a macro def_equals_and_hash to make it simple for people to override the equality.
@RX14,
I think it's a bad idea to overly abstract this and implement some kind of class hierarchy so that you can use multiple hashing algorithms.
No one talks about class hierarchy.
If people want to use multiple hashing algorithms they can do so through duck typing.
With def hash(h); h << @a << @b ; end it is quite easy to pass non-default hasher with duck-typing.
Non-default hasher just have to implement following contract:
class NonDefaultHasher
def <<(v : Nil)
end
def <<(v : Int::Primitive)
end
def <<(v : Bytes)
end
def <<(v)
v.hash(self)
end
# initialize and calculate are omitted, cause they are not part of contract.
end
Without such contract use of duck-typing it is quite hard.
@funny-falcon yes you're exactly correct. I thought someone had proposed something complex but I was wrong. My bad.
Most helpful comment
Thanks for taking the time to explain. I agree hash codes as currently implemented are simple but basic and problematic (leading to attack vectors).
No. There should be a single hash solution, good for all use cases (safety first, then the fastest), implemented just like today: a generic
Object#hashimplementation, which can be overriden to better fits each struct/class data.