Cargo: (Option to) Fingerprint by file contents instead of mtime

Created on 9 Jan 2019 · 15 comments · Source: rust-lang/cargo

Describe the problem you are trying to solve
The primary problem I have is that when building my code on Travis, the actual code in my workspace rebuilds every time, even though much of it hasn't changed and I have target-directory caching on. The reason is that Travis makes a fresh clone of my git repo, which doesn't preserve mtimes. This can add about 5 minutes to every Travis run. My project mixes Rust and non-Rust code, so those 5 minutes are added even to runs where no Rust code was touched. I started futzing with mtimes manually, but that seems fragile and doesn't address the root of the problem.

Additionally, edit-undo loops cause re-compilation locally, which is a little annoying.

Describe the solution you'd like
Add a new LocalFingerprint::ContentBased(Digest, PathBuf) variant to https://github.com/rust-lang/cargo/blob/b84e625c4401be3893a4d6d44d5dbac82b33f1c4/src/cargo/core/compiler/fingerprint.rs#L204-L209 which reads the content of the PathBuf, passes it through a SipHasher, and mixes that into any aggregate fingerprints. Use this instead of LocalFingerprint::MtimeBased.
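A minimal sketch of what that variant and its digest function could look like. The enum here only mirrors the issue's suggestion, not cargo's actual internals, and `DefaultHasher` (std's SipHash-based hasher) stands in for `SipHasher`, which is deprecated in std:

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::{Path, PathBuf};

// Hypothetical shape of the proposed variant; names follow the issue's
// suggestion and are not cargo's real fingerprint types.
#[allow(dead_code)]
enum LocalFingerprint {
    MtimeBased(u64, PathBuf),   // placeholder for the existing variant
    ContentBased(u64, PathBuf), // digest of the file's bytes
}

/// Hash a file's contents with std's SipHash-based DefaultHasher.
/// The resulting digest would be mixed into the aggregate fingerprint.
fn content_digest(path: &Path) -> io::Result<u64> {
    let bytes = fs::read(path)?;
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    Ok(hasher.finish())
}
```

Because the digest depends only on the bytes, it is stable across fresh checkouts, Docker layers, and anything else that rewrites mtimes without changing content.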

Notes
This will probably slow down no-op builds slightly (and in some circumstances, such as large build-script inputs over NFS, significantly), so it may need to sit behind a flag (perhaps --fingerprint-strategy={mtime,content}).

This would probably also make more shared caching (that people talk about a lot, most recently at https://internals.rust-lang.org/t/idea-cargo-global-binary-cache/9002) easier.

I'd be happy to implement this if the PR is likely to be accepted :)

This would probably also fix #4425 and #3076

A-rebuild-detection C-feature-request

Most helpful comment

I think this also has a chance of speeding up some dev workflows around switching branches. Specifically, I often check out branches temporarily to look at PRs (without building them). Sometimes I also run git switch master && git pull --rebase && git switch my-feature-branch && git rebase master. I think both of these workflows lead to busted mtimes for a bunch of files which haven't actually changed between compilations. For sprawling workspaces (like the rust-analyzer one), I hypothesize that this leads to some unneeded crate rebuilds.

All 15 comments

Funnily enough this is also highly related to recent discussions on https://github.com/rust-lang/cargo/issues/2426 and https://github.com/rust-lang/cargo/pull/6484 for entirely different reasons (filesystems with second-level mtime granularity and modifications/builds all happen in the same second).

cc @Eh2406, @matklad, this might be a good tracking issue for this!

At the moment I would start by removing the mtime based system and replacing it with a hash based one. (except for the corner cases where we have to use mtime, as we don't know which files to hash until after build. cc #5918) Then I would see how big a perf impact this has in reality. If it is small then grand, if not then we need to have a hybrid system that uses mtime to decide whether to hash.

I was thinking of trying to impl this soon, but would much prefer to help you do it @illicitonion. I seem to recall someone had an abandoned branch with a start... was it @ehuss?

Do we want to use SipHasher or a different hashing scheme that is designed for fingerprinting? I only ask because of this:
"Although the SipHash algorithm is considered to be generally strong, it is not intended for cryptographic purposes. As such, all cryptographic uses of this implementation are strongly discouraged."

We're not really too interested in cryptographic hashing here for its security properties, so I think picking any reasonable fast algorithm should be fine (and I think SipHasher is reasonably fast as well). I would personally prefer we not wholesale switch away from mtimes just yet, but having this as at least a runtime option to choose from is a great start!

Would be interesting to compare it to using FxHash.

I'm not sure what I did with my old branch, so I'm just going off memory. My original attempt tried to only hash the contents if the mtime was near the start of compile time. However, since it is not known which files to hash until after compilation, it didn't work out too well. I don't know how to solve that problem. I guess one option is to pre-hash every file in the package directory and maybe hash anything that was included from outside after compilation? But that sounds like it could be very problematic. Or maybe hash protection is just not available for the very first compile (or in other words the second compile is not guaranteed to be correct). Maybe another option is to run --emit=dep-info first, which seems to be pretty fast, but there might be some races still.

I don't know of any easy answers here.

I just came across a long blog post on the troubles with using mtime in build tools. It discusses using only hashes, and points out that Go switched to that, but recommends a hybrid approach. Which is surprisingly similar to what we are doing with the fingerprint hash as a database.

That blog post [1] is an excellent reference! It describes some of the problems in using mtimes and hashes in build systems and describes the approach used by the tool redo, roughly:

  • Storing cheap metadata (mtime, size, inode number, owner uid and gid, a sequence number (for outputs only)) for each input and output file.
  • Rebuild whenever any of that metadata changes.

I found it helpful to consider the sequence number for an output as a logical time that we control, so it should avoid problems with low resolution, mtime going backward, builds being missed.

This wouldn't stop the false-positive builds we see on CI systems that don't preserve mtimes, but to avoid these we could check hashes of source files when their metadata changes. One of the problems claimed in [1] is that hashing large build output files (to confirm downstream should rebuild) is expensive, but we could benchmark this to examine the trade offs. Maybe we could get away with hashing source files only?
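The redo-style approach above can be sketched as a "stamp" of cheap metadata recorded per file, with a rebuild triggered whenever the stamp changes. The field choice here is illustrative (mtime, size, and inode on Unix), not cargo's actual fingerprint layout:

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::SystemTime;

// Hypothetical cheap-metadata stamp, per the redo approach described above.
#[derive(PartialEq, Debug)]
struct Stamp {
    mtime: SystemTime,
    size: u64,
    #[cfg(unix)]
    inode: u64,
}

fn stamp(path: &Path) -> io::Result<Stamp> {
    let meta = fs::metadata(path)?;
    #[cfg(unix)]
    use std::os::unix::fs::MetadataExt;
    Ok(Stamp {
        mtime: meta.modified()?,
        size: meta.len(),
        #[cfg(unix)]
        inode: meta.ino(),
    })
}

// Rebuild iff any recorded metadata differs (or the file is unreadable).
fn needs_rebuild(path: &Path, recorded: &Stamp) -> bool {
    stamp(path).map(|s| s != *recorded).unwrap_or(true)
}
```

The appeal of this scheme is that a change to any one field (size, inode, mtime) is enough to trigger a rebuild, so low mtime resolution alone no longer causes missed rebuilds.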

[1]: "mtime comparison considered harmful"

Please also have a look at chash. This is a mechanism for a meaningful fingerprint (based on the AST of the program). Maybe this can be adapted for cargo as well.

I was just hit by this bug. Docker doesn't refresh timestamps of already existing files, and I had hacked around the need to cache dependency builds by doing echo 'fn main() {}' > src/main.rs. Pretty painful to debug, and the touch I added as a workaround looks clunky.

Steps to reproduce

  1. Create the following files:

Dockerfile

FROM rust
COPY project/Cargo.toml project/Cargo.toml
RUN cd project && mkdir src && echo 'fn main() { println!("hello1"); }' > src/main.rs && cargo build --release
COPY project/src project/src
RUN cd project && cargo run --release

project/Cargo.toml

[package]
name = "hello"
version = "0.1.0"
authors = ["Jacek Wielemborek"]
edition = "2018"

[dependencies]
flate2 = "*"

project/src/main.rs

fn main() {
    println!("Hello2");
}
  2. Run docker build .

Expected:

Hello2 is printed

Actual result:

hello1 is printed

Workaround

Before building, do touch project/src/main.rs


Has anyone got a PR or partial WIP PR on this yet? This would be huge to have the option to do this for docker caching which as rust builds get bigger becomes more important for enterprises. Would love to be able to test this out in nightly under a -Z flag.

If there was progress it would have been linked here. As I recall I thought it could be done straightforwardly, but the fact that Eric's branch did not work suggests I was Dunning-Krugering myself. My memory of the approach was to change the code that reads the mtime to return a new type (mtime, Option<hash>). If the -Z flag is set, fill in the hash. Thread that type all the way through. When we get to the compare step, check the -Z flag for which to use. If you are offering to try, that is where I would start.
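A rough sketch of that suggested shape: read both values up front, fill in the hash only when the flag is set, and decide at compare time which to use. Type and function names here are hypothetical, not cargo's:

```rust
use std::time::SystemTime;

// Hypothetical stamp carrying both values; `hash` is populated only when
// the content-fingerprint -Z flag is set.
struct FileStamp {
    mtime: SystemTime,
    hash: Option<u64>,
}

// At the compare step, prefer the content hash when the flag asked for it
// and both sides have one; otherwise fall back to the mtime comparison.
fn is_dirty(old: &FileStamp, new: &FileStamp, use_content: bool) -> bool {
    match (use_content, old.hash, new.hash) {
        (true, Some(a), Some(b)) => a != b,
        _ => old.mtime != new.mtime,
    }
}
```

Threading one type through keeps the flag check confined to a single comparison site instead of scattering it across the fingerprint code.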

Was just checking there's no PR out there to build on. If the mtimes don't match, we could cheaply check whether the file size is the same before doing an expensive content-hash check - that could cut the perf costs down a bit.
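That size-first short-circuit could look like the sketch below (function names are illustrative; `DefaultHasher` stands in for whatever non-cryptographic hash is chosen):

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::Path;

fn hash_file(path: &Path) -> io::Result<u64> {
    let mut h = DefaultHasher::new();
    fs::read(path)?.hash(&mut h);
    Ok(h.finish())
}

/// Compare the cheap size field first; only read and hash the file's
/// contents when the sizes match.
fn content_changed(path: &Path, old_size: u64, old_hash: u64) -> io::Result<bool> {
    let meta = fs::metadata(path)?;
    if meta.len() != old_size {
        return Ok(true); // size differs, no need to hash at all
    }
    Ok(hash_file(path)? != old_hash)
}
```

Since a size mismatch is a stat call rather than a full read, most genuine edits would be caught without touching the file's bytes.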

I believe this problem was already extensively studied in the past, so there's no need to reinvent the solution and just pick a good existing one. For example, git's index is known to store mtime+ctime+device+inode+hash.

ctime, device and inode will thwart trying to do clever things like copying the target dir and putting it back but on a different machine/dir. When mtime check fails it's probably best to rely on intrinsic properties of the content (size/hash). For a non-cryptographic hash, redox's seahash seems pretty quick.
