Cache: Can't store using hashFiles if files don't exist yet

Created on 19 Nov 2019  路  15Comments  路  Source: actions/cache

For example, in a Rust project, the target directory is almost never committed as it includes build artifacts. I'd like to cache it between builds, but using

name: Cache target dir
uses: actions/cache@v1
with:
  path: target
  key: ${{ runner.os }}-cargo-target-${{ hashFiles('target') }} # also tried variants of `target/**`...

triggers a failure that prevents the build from continuing:

##[error]The template is not valid. hashFiles('/home/runner/work/.../.../target') failed. Search pattern '/home/runner/work/.../.../target' doesn't match any file under '/home/runner/work/.../...'

which does make sense! After checking out, there is not yet a target directory. In fact, that's what I aim to _create_ (maybe) when I restore from the cache. So _storing_ would work with this template, but _restoring_ would not, since the key can't be computed.

If I supply restore-keys, though, I still get the same error. It seems to me like something (maybe not in this codebase) is just dying early because the template isn't fully computable, instead of trying to fall back on the restore-keys first.

Most helpful comment

@rye perhaps we could update hashFiles with an additional parameter that you could set that would have it not fail if it didn't find any files. I suppose the other option is to simply warn rather than fail, however, that would mean in your case that you would perpetually have a warning on your jobs that you could likely never get rid of.

All 15 comments

So my desired behavior looks like this:

  • The runner tries to resolve the key, but can't because the directory doesn't exist. Instead of dying, it replaces the hashFiles expression with nothing. (So, a key like ${{ runner.os }}-blah-${{ hashFiles('non-existent-file') }} gets turned into ${{ runner.os }}-blah-)

    • If a cache matches _that_ key (the one without the result from hashFiles), that key is used, and its key is recorded as the key that was restored. This way, when the storage step runs, the originally-specified key _can_ be fully computed, and only if it differs from the restored key would it trigger another cache store. (If the files still don't exist at the end of a build, that's when another error could be thrown)

In short, this is what a potential

  • Try to restore from key key: foo-${{ hashFiles('non-existent-dir') }}, fail; search for foo-, find a hit with "actual" key foo-deadbeef0123456789. Use that instead, unpacking the contents of the cache, (which could include non-existent-dir, perhaps) and proceed.
  • At the end of a successful build, recompute the key, this time with non-existent-dir present so you can successfully evaluate the key to foo-[something]. If the result matches, then you restored the same cache that currently exists on the runner, so nothing has changed during the build. If it doesn't match, something has changed, so store it.

I hope that helps clarify what I'm looking for.

Hi @rye, in the Rust-Cargo example configuration, it caches target using a hash of the Cargo.lock file(s):

- name: Cache cargo build
  uses: actions/cache@v1
  with:
    path: target
    key: ${{ runner.os }}-cargo-build-target-${{ hashFiles('**/Cargo.lock') }}

Would this handle your use case? This would cache target and get updated whenever the project's dependencies change. If not, could you please clarify why? Thanks!

Hi @dhadka, thanks for getting back to me on this.

First, I'll note that Cargo.lock is not the only determiner of the contents of the target directory. Different nightly Rust compiler versions could result in different build intermediates, too, if something changes internally鈥擨'd rather let cargo determine if it needs to change anything from a cached target dir than rely on Cargo.lock to describe the state of everything.

But my main reason for wanting this change is that I use other tools that are outside of my dependency manifest, for example cargo-tarpaulin, which takes 10 minutes to build for each from-scratch installation. To compile, I add env CARGO_TARGET_DIR=target to the invocation of cargo install, which makes it share the target directory. Compilation, it turns out, is the most expensive part of using such tools, which is the main reason why I want to cache them. I certainly _could_ add cargo-tarpaulin to my dev-dependencies, but given that it's a standalone tool I'd rather keep it outside of my manifest.

I hope that helps explain why I feel a change is needed! Thanks again.

I think the core of this discussion really boils down to helping the user understand they have specified a pattern in hashFiles that does not match anything or not. In the current model if I typeo something and no files are found then the user will know that and can try to debug. If we move to the model where the files don't need to exist then the user will have no idea until they get a few runs in and realize they are not getting updated caches.

I do agree that the current error form makes things stricter, but I don't think the strictness is helpful鈥攁 hashFiles error (due to files not existing) seems to be causing a _parsing_ error, i.e. restore_keys is not even used at all; the cache restore just completely fails. My use case would be satisfied if restore_keys was used in such cases.

Of course, I could leave off the hashFiles in my key, but then I would basically have a persistent cache that never got updated if things changed.

@rye perhaps we could update hashFiles with an additional parameter that you could set that would have it not fail if it didn't find any files. I suppose the other option is to simply warn rather than fail, however, that would mean in your case that you would perpetually have a warning on your jobs that you could likely never get rid of.

@chrispat yeah, that first idea sounds like a good solution to me, if that works best.

I am still wondering if it is possible to just handle the case where hashed files don't exist by falling back on restore-keys (and possibly proceeding to fail if those keys aren't specified or can't be restored)? I'd be happy to make those changes, but I don't know where I could change that behavior.

What this would mean for the user is:

uses: actions/cache@v1
with:
  path: target
  key: cargo-deps-${{ hashFiles('target') }}
  restore-keys:
    - cargo-deps-

In the event that hashFiles cannot find its parameter, it tries to find something in restore-keys that does exist. Here it finds the most recent cache, cargo-deps-abcdef, and restores it, proceeding to continue through the build. If the path does not change from its restored value, cargo-deps-abcdef, (i.e. at the end of the build, if the key is recomputed as cargo-deps- abcdef) then nothing has changed and we do not need to store again. However, if something in the specified path changed, then a new cache is produced, cargo-deps-012345, which would then be restored later.

In short, I would still expect a failure of hashFiles to cause the cache restoration to try to fall back on whatever is specified in restore_keys鈥攊f that change is made, no additional parameters would be needed, and the existing API would remain exactly the same. Indeed, I don't think anything relating to caching should stop a build from continuing by default, but that is configurable through a parameter.

+1 for @rye's idea.

Regarding @chrispat comment,

I think the core of this discussion really boils down to helping the user understand they have specified a pattern in hashFiles that does not match anything or not.

I vote for giving the developer the benefit of the doubt that they are intelligent and chose the correct file path, even if that file path might not yet exist in their repo but would eventually show up in their build process.

Regarding the argument that hashFiles should fail if it doesn't find at least one file in the project I think could be argued against by saying it doesn't guard against finding _more_ files than the developer anticipated. If the hashFiles function allows for unbounded number of files to be found, zero matches seems reasonable to me too.

If the overall build workflow isn't working as the user anticipated then the developer would go debug it anyways. A message in the log at runtime by the cache action to say it found a hit or not should suffice for the developer to figure out what's going on to realize they have a typo or the file(s) they expected to be there never showed up, etc.

Thanks

Just want to +1 this feature, my CI process downloads a .jar file that I only want to download once, so I have this:

    env:
      FIREBASE_EMULATORS_PATH: ${{ github.workspace }}/emulator-cache

// ...

    - name: 'Cache Emulators'
      uses: actions/cache@v1
      with:
        path: ${{ env.FIREBASE_EMULATORS_PATH }}
        key: ${{ runner.os }}-firebase-emulators-${{ hashFiles('emulator-cache/**') }}
      continue-on-error: true

This doesn't work because there's nothing in the emulator-cache directory when I start so hashFiles() fails. I could probably work around this by creating a dummy file there in a previous run step but I would really prefer not to.

We have been looking at this problem from a few different angles and while it would be easy to make hashFiles just return an empty string if it does not find any files that won't fully solve the probelm. The action stores the key that is generated when it is run and uses that to save the cache at the end so in this case the key would be generated with no files and even if those files existed at the end of the run they would not be considered as part of the key.

I think the only way to make this work well would be to have independent restore and save actions which would lead to even more configuration.

The action stores the key that is generated when it is run and uses that to save the cache at the end [...]

Yeah, I think that's definitely a big part of the problem. The other part is that hashFiles will fail the build (inelegantly) if it can't find the files. Something about both of these parts of the runner's behavior would need to change to extend the existing syntax so that the key can be recomputed for comparison at the end of the build, which wouldn't necessarily require breaking backwards compatibility or adding any configuration.

Indeed, having independent save and restore actions would be an alternative to making hashFiles handle more situations; for what it's worth, at least one other major provider that I'm aware of (CircleCI) has save_cache and restore_cache steps which are used to interact with their caching system, which I do think is nice because it lets you control _when_ your storage is performed.

The way this action currently works seems really suboptimal. It only works if your Cargo.lock basically never changes. But having a Cargo.lock is advised against for libraries as you want to test your library against a wide range of semver compatible versions as this is how your library is going to be used in the end. So most Rust projects don't have a Cargo.lock for that reason.

You could generate one on the fly and use that as the hash, but the chances that nothing changed are very slim. So in the end you likely don't ever get a cache hit and end up recompiling everything from scratch.

So what we ended up doing is scrapping the whole hashing mechanism and just using the number of the week in the year as the cache key, as that keeps a reasonably recent cache around and overwrites it with a new one every week (actually every second week). This is so far the best workaround I've seen for Rust projects (or similar) where you don't really have a lockfile.

135 is a proposal aiming to solve this.

@CryZe with proximate caching, one problem is that the cargo registry gets bloated over time as old crate archives get dragged from old cached states to newer ones.

This example demonstrates how to bootstrap Cargo.lock from a smaller cache of the registry index that does usually get restored from secondary matches because the remote changes so often. Depending on how often the dependencies get refreshed, this lockfile may be good for meeting a match in the main cache for some time.

I have proposed new examples to examples.md in #325 that show how to bootstrap Cargo.lock in a library project. This would not solve the wider issue that is discussed here, though.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

Cerberus picture Cerberus  路  5Comments

aleks-ivanov picture aleks-ivanov  路  5Comments

evandrocoan picture evandrocoan  路  3Comments

dhadka picture dhadka  路  5Comments

jwt27 picture jwt27  路  3Comments