Conan: deterministic short_paths

Created on 20 Nov 2018  路  21Comments  路  Source: conan-io/conan

(the issue template is shamelessly stolen from https://github.com/concourse/concourse)

What challenge are you facing?

We're building a big C++ application in our CI system that uses many Conan packages. Build times can be as long as 1 hour so we use compilation caches (ccache/clcache) to speed up the build. This use case is also mentioned in #3502.

Some of our Conan packages depend on the short_paths option, without this they don't build on Windows, at all. The problem is that the generated short paths aren't deterministic and this makes it impossible to re-use the compilation cache database on a different worker. Only worker-local cache is viable.

Our workers are short lived, their virtual machines are destroyed and recreated daily. Using a worker-local compilation cache is sub-optimal because its lifetime is one day, hence the system needs multiple builds daily to warm up.

What would make this better?

A deterministic short path would allow the use of a global compilation cache greatly speeding up our builds.

The following proposal builds upon the idea dicussed in #3502. Currently the conans.util.windows.path_shortener generates a redirect pointing to (using the default values):

C:\.conan\<random_name>\1

Imagine, instead of this, having something like

C:\.conan\<deterministic_short_path>\1

where the

deterministic_short_path = hash(name, version, user, channel, package_id)

In other words, all the path components appearing in the Conan data directory (available from conan_reference and package_reference in conans.paths.py) will be hashed to a single, deterministic value.

To keep the path short we would truncate the output of a typical hash function such as MD5 or SHA. So the short path would look like:

C:\.conan\f5b96df\1

This would also solve issues like #1881

Are you interested in implementing this yourself?

If you think this proposal is feasible we would be very happy to implement this feature.

low medium feature

Most helpful comment

@memsharded, @lasote, @jgsogo we (@wagdav and myself, and surely also @piponazo) wanted to tell you that we admire how the Conan project treats contributors and users. You are doing a fantastic work.

All 21 comments

The only problem with this approach, is the possibility of a collision.

Known as the "birthday problem", the probability for 7 digits is here: https://github.com/source-foundry/font-v/issues/2#issuecomment-327189455

That will be not very likely, but still very possible. Might need to go for 8 digits. That is also 2 more than current random length, which also might eventually break some package that was really in the limit.

I think it could make sense to take the risk, but need to discuss it.

Thanks for the offer to implement it yourself! Will contact shortly if approved.

@memsharded thanks for your quick response.

The collisions are possible, however generated hashes don't have to be unique globally. The collisions need to be sufficiently rare on a particular host.

Picking the right length is important, though. Using the simple square approximation from this answer

p ~= (n^2)/(2m)

Where n is the number of items and m is the number of possibilities for each item. The number of possibilities for a hex string is 16^c where c is the number of characters.

  • n = 100, c=4 p ~= 0.08
  • n = 1000, c = 6 gives p ~= 0.03
  • n = 5000, c=7 gives p ~= 0.05
  • n = 10000, c=8 gives p ~=0.01

I guess the question is how many packages a typical host has in its database (assuming that the short_path feature is always on).

(co-author of proposal here)
... and, since these hashes would have meaning only locally, we could even consider a user-customizable parameter for the hash length if really needed. This would not solve the problem of the path being too long in certain unlucky cases, but it would solve the problem of the collisions if the local host has a huge number of packages.

Yes, that means, that with a local cache of up to 1000 packages, if using 8 bits, the probability of collision would be around 0.0001 (0.01%)

Discussing with @marco-m we could make the redirect collision-free if we implemented this as follows:

  • compute the hash
  • try to create the redirect using the first N characters
  • if this path already exists under c:\.conan\ use an extra digit and try to create the redirect using the first N+1 characters

Not sure it is enough @wagdav

There is no guarantee that they will be evaluated in the same order

Creating of short paths:

  • Pkg/1@user/channel => hash f12345
  • Pkg2/1@user/channel => (collides) => hash f123456

If for some reason, not unlikely, a different run later that tries to use the same cache runs Pkg2/1@user/channel first, then won't it map to f12345?

Is there something I am missing?

I think that the approach could work, @memsharded. In your scenario, if Pkg2/1@user/channel is created first, then it could be already in the cache (will use the previously used path stored in the CONAN_LINK file) or created again and look for a new tmp directory not existing in the cache.

Something like this in the path_shortener function should work:

    full_hash = hash_function(path)  # Use path as input, all the settings/options/ids should already be there
    min_length = 4  # Any number will work
    max_length = 8
    assert len(full_hash) > max_length  # Should be easy enough for a hash function

    redirect = os.path.join(short_home, full_hash[:min_length])
    while os.path.exists(redirect) and min_length != max_length:
        min_length += 1
        redirect = os.path.join(short_home, full_hash[:min_length])

    # If everything exists, then fallback to previous behaviour
    if min_length == max_length:
        redirect = tempfile.mkdtemp(dir=short_home, prefix="")

Main issue here: if you remove a package from the Conan cache, will the linked folder be removed too? If not, all the proposed redirect will eventually fail for packages that has been created and removed several times.

After sharing this issue with the team, we are willing to accept this PR. The implementation should follow the idea above:

  • hash_function: any sha should be enough
  • let's get a deterministic path from 6 to 15 chars (fallback to mkdtemp). Although short paths should be erased when the path in the conan cache is deleted, we keep the _temp-dir_ fallback just in case there is a bug for any package and the short paths are not deleted.
  • add a _real_path.txt_ file inside the shortened path directory with the full path to the Conan cache it is linked from. This way we will be able to track the issue mentioned above.

Thanks very much for offering your help @wagdav !

@memsharded, @lasote, @jgsogo we (@wagdav and myself, and surely also @piponazo) wanted to tell you that we admire how the Conan project treats contributors and users. You are doing a fantastic work.

hash_function: any sha should be enough

this might be the case, but I would strongly advise against using sha-1. Even this might not be a security relevant thing in this use case (debatable), insecure hashing algorithm should just go away altogether. I would say it is counterproductive from a "big picture" point of view to introduce broken hashing algorithm in new code no matter the usage.

(Also I'm pretty sure that I have heard that several e.g. government institutions are forbidden from using any software which contains broken hashing algorithm no matter the usage. But that was a while ago please don't pin me down on this.)

I don't see any security concern here at all, so sha1 should be perfectly fine. We are taking a known, public path, and mapping it to another path. If someone has access to the system, then the smallest problem is this hash. Furthermore, we are just going to keep a very few 6-7-8 characters of it...

Furthermore other hashing algorithms are already being used in Conan (md5 for file checksums, sha1 for https deduplication checks). There are plans to upgrade them, but at the moment I see no concern for using a sha1 at this moment. Even git is sha1 based :) :) :)

Hi @wagdav , @marco-m did you have the chance to work on this? Would you need some help or guidance? I am tentatively assigning @wagdav this issue, but tell otherwise and we will plan for it. Thanks!

I don't see any security concern here at all

Yes, but what I tried to say it no matter in what environment you are using it I don't think it is good to have a dependency on a broken hash algorithm.

There are plans to upgrade them, but at the moment I see no concern for using a sha1 at this moment. Even git is sha1 based :) :) :)

Even more when you have plans to upgrade, why introducing SHA-1 now once more?
And git is actively working to get rid of of SHA-1 since SHAttered and settled for SHA-256.

Maybe asking this in the other direction: is there a downside of using e.g. SHA-256? (even though, yes it gets cut of anyway and doesn't seem to be security relevant in this specific use case).

Fair point, yes, lets do sha2 then.

I was not arguing against it, I was arguing against the fact that using other algorithm here would imply some kind of vulnerability or insecurity in conan. Thanks!

@memsharded Thanks for the follow-up! We didn't forget about this issue! We are planning to work on this next week! I'll let you know if we're blocked!

@memsharded FYI we have to postpone to beginning of next year.

Hi, we want to release this for Conan 1.12 at the end of the month. Would you be able to do it? We will do it otherwise

Yes! We'll be submitting a PR next week! :crossed_fingers:

Hello, I was thinking, what about the documentation ? :-)

Hello, I was thinking, what about the documentation ? :-)

The Conan doc repo documents the user facing behavior of the short path feature (how to enable/configure etc). This didn't change with this PR so I didn't modify anything.

If anything, it would be nice to add a line there that says that the short_paths version "tries" to use a hash of the long one to achieve some degree of repeatibility, but without global guarantees.

Was this page helpful?
0 / 5 - 0 ratings