Currently broken restore is not failing step, just sending a warning.
Example output
Run actions/cache@v1
[warning]connect ETIMEDOUT 13.107.42.16:443
Hi @remik, did the restore step fail once with this error, or is it happening often/always?
It is happening quite often 1/30 times and the errors are different.
Another example:
/bin/tar: Unexpected EOF in archive
[warning]Tar failed with error: The process '/bin/tar' failed with exit code 2
In both cases step was marked as a success instead of fail.
Is this in a public or private repo? Could you please share the org/repo name? If you don't want to post it here, please e-mail me at dhadka (at) github (dot) com.
Regarding the behavior of the step passing even though it hit an error, the cache action is best effort. The idea is that we don't want a caching issue to cause builds to fail, and instead the workflow falls back to retrieving the dependencies from their original source. This makes sense for transient issues, such as a networking issue, but might not make sense for a configuration issue. However, we don't distinguish those at the moment.
This is private repo: HealthByRo/roman_api
Hi @remik, sorry for the delay in getting back to you. I'm not seeing any specific errors on the backend servers that indicate there's an issue, but I would recommend the following two changes:
Please make sure you're using @v1.1.0 of the action. If you are using actions/cache@v1, you are automatically using the correct version. Specifically, this version breaks the upload into smaller chunks, so it's less likely you'll see a timeout while uploading a large cache. This was released one day before this issue was opened, so you may have seen improvements already.
Enable debugging mode and attach the logs here the next time the action fails. This should provide us some more details on the unexpected EOF in archive error. It sounds like the runner isn't downloading the complete tar.
uses: actions/cache@v1 and in the last days it is less often.Sorry, please try this link: https://help.github.com/en/actions/automating-your-workflow-with-github-actions/managing-a-workflow-run#enabling-debug-logging
I can confirm the occurrence of the same symptoms. Here some data collected in the last 14 days:
1.
[warning]connect ETIMEDOUT 13.107.42.16:443
e.g. in https://github.com/apache/camel-quarkus/pull/962/checks?check_run_id=528742770 https://github.com/apache/camel-quarkus/runs/500168788?check_suite_focus=true
2.
```gzip: stdin: unexpected end of file
/bin/tar: Unexpected EOF in archive
/bin/tar: Unexpected EOF in archive
/bin/tar: Error is not recoverable: exiting now
[warning]Tar failed with error: The process '/bin/tar' failed with exit code 2
e.g. in https://github.com/apache/camel-quarkus/runs/516962959?check_suite_focus=true https://github.com/apache/camel-quarkus/runs/510521863?check_suite_focus=true
Some observations:
* We have one initial job called `build` that basically runs `mvn install` which is supposed to build the modules in the source tree and store the built artifacts and their dependency artifacts under `~/.m2/repository`. `~/.m2/repository` is cached using the current git commit's sha1 as a part of the key. Then there are 10+ subjobs for running integration tests which are all pulling the very same cache key. It is strange that typically only one of them fails due to errors shown above. The fact that all those other subjobs can retrieve the chache seems to suggest that the cache is stored properly and that the problem occurs only during the retieval.
* When comparing the logs of successful and failing sibling subjobs for the case 2, I can see the following:
* A successful one:
Cache Size: ~487 MB (510241472 B)
/bin/tar -xz -f /home/runner/work/_temp/c8aff06d-e921-4cd8-bc56-6338f299e350/cache.tgz -C /home/runner/.m2/repository
Cache restored from key: maven-b9b6740744606c49e15f4f84440c8d0b37e833bf
* A failing one:
Cache Size: ~67 MB (69894144 B)
/bin/tar -xz -f /home/runner/work/_temp/77c41999-ed7e-406d-8741-901e6fa0fbf2/cache.tgz -C /home/runner/.m2/repository
gzip: stdin: unexpected end of file
/bin/tar: Unexpected EOF in archive
/bin/tar: Unexpected EOF in archive
/bin/tar: Error is not recoverable: exiting now
[warning]Tar failed with error: The process '/bin/tar' failed with exit code 2
```
Note the difference in size: the correct 487 MB vs. apparently wrong 67 MB. Why is one subjob seeing an incomplete file?
I agree that the above errors should make the step fail.
@ppalaga Thank you for bringing this to our attention. We are aware of flakiness caused by network issues between the Actions runner and the cloud storage provider used by the cache, and are looking at ways to improve reliability.
Please note, however, that the cache action is "best effort" and assumes that if it fails, the subsequent steps can recreate the cached content. For this reason, a failure to restore the cache does not fail the job. This works well for project dependencies, which can just be downloaded from their package manager (maven, npm, etc.) if the cache action fails to restore (either by a cache miss or other error). Caches are also subject to eviction, so they can disappear potentially at any time.
Looking at the workflow file used by the project, it looks like it is storing the build artifacts in the cache, and restoring them for use by each of the test runs. For this kind of workflow, it likely makes sense to use workflow artifacts instead. Workflow artifacts are persistent longer and are designed to be more reliable than a cache.
The readme on this action mentions skipping steps based on cache hits: https://github.com/actions/cache#Skipping-steps-based-on-cache-hit
But when the cache restore fails with the timeout mentioned by the OP, output.cache_hit is true. We had a recent failure, and the workflow continued expecting a successful cache hit (rather than recreating the cached content as it would with a cache miss)
2020-04-13T19:58:00.5904633Z ##[group]Run actions/[email protected]
2020-04-13T19:58:00.5904828Z with:
2020-04-13T19:58:00.5904952Z path: .
2020-04-13T19:58:00.5905237Z key: some-key
2020-04-13T19:58:00.5905368Z ##[endgroup]
2020-04-13T19:58:32.4250633Z [warning]connect ETIMEDOUT 13.107.42.16:443
I agree that failing the step isn't necessarily the best choice for this action, but it would be really helpful if the output was accurate so that workflows could recreate the content that didn't come back from the cache
I also want to report that we experienced this, and even though I understand that flakiness can happen from time to time, it would help with debugging build failures if GitHub Actions reported the step as a failure. Here is a job that we had where the cache restore failed on an "Unexpected EOF" but it's not obvious to check for that since the console output is hidden on the cache restore step behind a green checkmark: https://github.com/chipsalliance/rocket-chip/runs/677133344
I'd personally be fine with having to mark the cache restore step with continue-on-error: true to allow failure, but it's misleading to report it as a success when it actually failed.
@richardxia Thanks for your input! One recommendation for your particular setup since it's pointing to an unreleased version of the cache action...we recently merged a few changes to improve reliability and also detect/prevent the "Unexpected EOF" error, so updating the commit id to a newer version should help with flakiness.
Thanks, we'll give it a try!
Unfortunately, it looks like the latest version of the cache action runs into an error when trying to call zstd, where zstd states that it has been given "Incorrect parameters". This might be due to our job running on Ubuntu 16.04? I can file a separate ticket for that issue.
Most helpful comment
The readme on this action mentions skipping steps based on cache hits: https://github.com/actions/cache#Skipping-steps-based-on-cache-hit
But when the cache restore fails with the timeout mentioned by the OP,
output.cache_hitistrue. We had a recent failure, and the workflow continued expecting a successful cache hit (rather than recreating the cached content as it would with a cache miss)I agree that failing the step isn't necessarily the best choice for this action, but it would be really helpful if the output was accurate so that workflows could recreate the content that didn't come back from the cache