Crystal: Discussion: Hashing namespace with Hasher classes

Created on 16 Jun 2017  路  18Comments  路  Source: crystal-lang/crystal

Currently Crystal ignores existence of HashFlood: hash function for base types is trivial "identity", and for strings it is very simple and easy to forge, and no any seeding performed.

Hash functions should be at least not-trivially forged when random seed used.

Design with separated hasher allows switch between different hash algorithms: default "safer" for public facing applications, and "faster" for non-tainted data.

As proposed by @RX14

class Object
  def hash
    Hashing::StdHasher.build do |hasher|
      hash(hasher)
    end
  end

  def hash(hasher)
    hasher << object_id
  end
end

class MyClass
  property foo : Foo
  property bar : Bar

  def hash(hasher)
    hasher << @foo << @bar
  end
end

Extracted from #4557 discussion because it may be discussed and implemented before.

/cc @RX14, @funny-falcon

Most helpful comment

Thanks for taking the time to explain. I agree hash codes as currently implemented are simple but basic and problematic (leading to attack vectors).

allows switch between different hash algorithms, one safer, one faster.

No. There should be a single hash solution, good for all use cases (safety first, then the fastest), implemented just like today: a generic Object#hash implementation, which can be overriden to better fits each struct/class data.

All 18 comments

Looks like we need

module Hashing
  alias Type = Int32

  module Build
    def build : Type
      builder = new
      yield builder
      builder.calculate
    end

    def seed
      555
    end
  end

  class StdHasher
    extend Build

    property val

    def initialize
      @val = 0
    end

    def <<(value) : self
      self.val += value
      self
    end

    def calculate : Type
      val + self.class.seed
    end
  end

end

hash = Hashing::StdHasher.build do |hasher|
  hasher << 12 << 24
end

p hash

Any suggestions?

Looks a lot like I thought about.

There is also need for random seed. I understood how to generate it with SecureRandom, but I'm in doubdt how to share it between StdHasher and Int32Hasher (special case for Hash(Int32, V)). Can you give me an advice?

I suppose that You may implement seed in another module like a Build. Extend classes with it and do self.class.seed.

By the way - hashers are reference classes in example above now.

Here is draft of Hashing::StdHasher https://gist.github.com/funny-falcon/52eb622d689a2eae3e2d7b4e4a77d5ac

I strongly miss #1517 for convenience. Had to do wrapper struct holding pointer to implementation to achieve stack allocation + pass-by-pointer + convenient method calling. Compiler is not smart enough to allocate class-object on a stack.

why not

class String
  def hash(hasher)
    hasher << to_slice
  end
end

also looks like Object.hash() with build already does *.hash with build, so need no duplicated code.
```

why not

Ah, you are right.

also looks like Object.hash() with build already does *.hash with build, so need no duplicated code.

Sorry, I doesn't understand what you mean :-( This were example that compiles with unmodified master.
Of course I will modify original Object#hash() in pull request, if you mean that.
But still Hash may initiate builder by itself (for example, if per-hash-table seed will be accepted).

Yes, I mean that.

Please feel free to clone crystal repo, make crystal and use bin/crystal.

It allows to do anything.

btw, I can do a lot with standard Crystal: https://play.crystal-lang.org/#/r/27bo

Yes, I use bin/crystal.

Opening issues with code and no further explanations is only asking volonteers to search and seek for what the hell you are trying to solve. Please at least explain the issue, or nobody from the core team will care to take a look.

Explanation:
Currently Crystal ignores existence of HashFlood: hash function for base types is trivial "identity", and for strings it is very simple and easy to forge, and no any seeding performed.

Hash functions should be at least not-trivially forged when random seed used.

Design with def hash(hasher); hasher << @a << @b; end allows switch between different hash algorithms: default "safer" for public facing applications, and "faster" for non-tainted data.

I'm working on an issue, and will do pull request next week, if no one complains against.

Thanks for taking the time to explain. I agree hash codes as currently implemented are simple but basic and problematic (leading to attack vectors).

allows switch between different hash algorithms, one safer, one faster.

No. There should be a single hash solution, good for all use cases (safety first, then the fastest), implemented just like today: a generic Object#hash implementation, which can be overriden to better fits each struct/class data.

@ysbaddaden, in this case overriding hash would be difficult, as user will have to think about safety and seeding (or just use boilerplate h = Hasher.new; h<<field1...; h.finalize). Design with hash(hasher) looks more convenient even if there will be only one way to hash (and overriding hash directly will still be possible).

After all, hash(hasher) is very reusable : one may use even SHA3 without changing a line, just supplying other hasher.

Though, there is a one pity point: Hash#hash currently key/insertion/iteration order agnostic. It will be quite difficult to preserve this property with hash(hasher). Either one have to sort by keys before hashing, or declare that hashes with different iteration order are different.

#hash is something nobody should have to care about, even less deal with: the stdlib must do all the job, and do it so well that nobody else than stdlib maintainers have to worry about it. Overiding is for stdlib structs like Float64 or Time may provide specific solutions, if need be.

@funny-falcon while that might be useful, I think it's a bad idea to overly abstract this and implement some kind of class hierarchy so that you can use multiple hashing algorithms. You should just implement a single Hasher class which implements one good generic hash. If people want to use multiple hashing algorithms they can do so through duck typing.

@ysbaddaden everyone who defines custom equality also has to define a custom hash, and I think that's pretty common. We even have a macro def_equals_and_hash to make it simple for people to override the equality.

@RX14,

I think it's a bad idea to overly abstract this and implement some kind of class hierarchy so that you can use multiple hashing algorithms.

No one talks about class hierarchy.

If people want to use multiple hashing algorithms they can do so through duck typing.

With def hash(h); h << @a << @b ; end it is quite easy to pass non-default hasher with duck-typing.
Non-default hasher just have to implement following contract:

class NonDefaultHasher
   def <<(v : Nil)
   end
   def <<(v : Int::Primitive)
   end
   def <<(v : Bytes)
   end
   def <<(v)
     v.hash(self)
   end
   # initialize and calculate are omitted, cause they are not part of contract.
end

Without such contract use of duck-typing it is quite hard.

@funny-falcon yes you're exactly correct. I thought someone had proposed something complex but I was wrong. My bad.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

ArthurZ picture ArthurZ  路  3Comments

will picture will  路  3Comments

asterite picture asterite  路  3Comments

lgphp picture lgphp  路  3Comments

asterite picture asterite  路  3Comments