Timescaledb: Use PG's own hash function instead of murmur3?

Created on 23 Mar 2017 · 8 Comments · Source: timescale/timescaledb

I wrote some code for a PG extension and needed a good, fast hash function, so I included the murmur3 source files in the extension. Later I realized that the hash* functions accessible via #include "access/hash.h" were perfectly good for my needs, and that I didn't have to include all that murmur3 source code in the extension. What's more, the PG hash functions already deal with CPU-architecture differences.

Do you have a special reason that requires murmur3 to be your hash function to create partition keys? The PG hash functions are based on Bob Jenkins's hash function (called lookup3 I think).

https://doxygen.postgresql.org/hashfunc_8c_source.html#l00177

If these hash functions are appropriate for your needs, you could use hashvarlena to get the int4 hash, then do the modulo arithmetic.
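The "hash, then modulo" step suggested above might look roughly like the following sketch. It is written in Python rather than extension C so that it runs on its own. `zlib.crc32` is only a stand-in for a PG hash function and does not produce the values PG would; the names `to_int32`, `standin_hash`, and `partition_for` are hypothetical. The one point it tries to illustrate is that an int4 hash is signed, so the sign bit has to be dealt with (here by masking it off, which is one option among several) before taking the modulo.

```python
import zlib


def to_int32(u: int) -> int:
    """Reinterpret the low 32 bits of u as a signed 32-bit integer (an int4)."""
    u &= 0xFFFFFFFF
    return u - 0x100000000 if u >= 0x80000000 else u


def standin_hash(key: bytes) -> int:
    """Stand-in for a PG hash function: returns a signed 32-bit value.

    Assumption: crc32 is used only so the sketch is runnable; it is NOT
    the hash that PostgreSQL computes.
    """
    return to_int32(zlib.crc32(key))


def partition_for(hash_value: int, num_partitions: int) -> int:
    """Map a signed 32-bit hash to a partition index in [0, num_partitions)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # Clear the sign bit so the result is never negative, then take the modulo.
    return (hash_value & 0x7FFFFFFF) % num_partitions


if __name__ == "__main__":
    for key in (b"device-1", b"device-2", b"device-3"):
        h = standin_hash(key)
        print(key, h, partition_for(h, 4))
```

In C the same masking matters more than it does here, because `%` on a negative `int32` yields a negative remainder.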

All 8 comments

TimescaleDB actually allows you to set your own partitioning function when creating a hypertable, so you could use the PG hash functions, or any other function of your own choice. The reason we are using murmur3 is mostly historical because we adopted it for internal use at Timescale. Murmur3 is a good general-purpose hash function with implementations widely available outside of PG. This is important because other services that interact with your PG server might want to perform the same partitioning, consistent with your hypertable.
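To make the point about external services concrete, here is a minimal sketch of how a client outside of PG could compute a Murmur3-based partition for a key. It implements the public 32-bit x86 variant of Murmur3 in pure Python. The seed, the sign-bit mask, the modulo step, and the helper name `partition_for_key` are assumptions for illustration, not a description of what TimescaleDB actually does; a real client would have to mirror whatever the hypertable's partitioning function computes.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Murmur3 (x86, 32-bit variant); returns an unsigned 32-bit hash."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    nblocks = len(data) // 4

    # Body: mix each complete 4-byte little-endian block into the state.
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: mix in the remaining 1 to 3 bytes, if any.
    tail = data[4 * nblocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data) & MASK32
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def partition_for_key(key: str, num_partitions: int, seed: int = 0) -> int:
    """Hypothetical helper: map a text key to a partition index."""
    h = murmur3_32(key.encode("utf-8"), seed)
    # Assumption: drop the top bit and take the modulo; the server-side
    # function may combine the hash and partition count differently.
    return (h & 0x7FFFFFFF) % num_partitions


if __name__ == "__main__":
    for key in ("device-1", "device-2", "device-3"):
        print(key, partition_for_key(key, 4))
```

Because the algorithm is small and in the public domain, the same computation is easy to reproduce in whatever language the other service is written in.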

That said, I agree that it might make sense to remove this dependency for the open source project and instead rely on something internal to PG. We're happy to receive PRs.

That leads me to another question: how does this open-source repository relate to any proprietary code that may be part of a commercial/enterprise release of TimescaleDB? I ask because I'm more comfortable contributing to an open source project if I know, at least roughly, its intended future.

  • Is open-source timescaledb intended to be the "core functionality", and an enterprise version adds additional features in a modular fashion, without modifying the functionality of the open-source core?
  • Will the commercial version diverge from the open-source version (or has it already), such that a bugfix made to one version must be manually re-implemented in the other?
  • Will the commercial version also be packaged as a PG extension? Or will it involve more than an extension install?

I also wanted to mention that PGXN offers the hashlib extension, which packages murmur3 and many other common hash algorithms. That might be a more graceful way to use murmur3 in TimescaleDB applications, rather than include the murmur3 source code.

https://github.com/markokr/pghashlib

I think I can say with some confidence that our intention is for the open source version to be fully functional on its own, offering great value. Apart from that, it is a bit too early for us to commit to a particular strategy or make hard promises. Currently, we aren't even offering a commercial version. I can assure you, however, that we will try hard to avoid ending up in a situation where we feel it necessary to remove functionality from the open source version for commercial reasons, like what happened to clustering in InfluxDB, to take one example. We have no intention for a commercial version to diverge in a way that doesn't make it a strict superset of the open-source one. Ideally, the added value of a commercial version would be mostly in tooling and administrative functionality that is entirely optional but would save customers a great deal of time and frustration. But right now it is simply too early to say.

Regarding hashlib, we actually used to have that as a dependency. But we preferred for our extension to be self-contained as hashlib is not distributed with PostgreSQL. This is for convenience. Why do you think including the source code is less graceful? I mean, we're doing the same thing as hashlib, i.e., including the Murmur3 source code that is in the public domain. So, I don't see why having hashlib as a dependency would be an improvement.

Sorry to be unclear: I meant that if TSDB switched to using the built-in PG hash function by default, but you wanted murmur3 to be used for certain projects and applications, it would be easy to pull in hashlib to obtain murmur3.

I'd submit a PR to switch to built-in PG hash if you think it's worth considering a switch to PG hash as the default hash function for TSDB.

And thanks very much for sketching the TSDB roadmap for me!

@robin900 we have a CLA and contributor process in place now, so if you are still interested in doing a PR to replace the hash function, we'd be happy to review it.

@cevian Closing the issue.
