Flux: GitHub Docker Registry lists inaccessible `docker-base-layer` tag

Created on 14 Oct 2019 · 24 Comments · Source: fluxcd/flux

Describe the bug
Unable to fetch image metadata on docker.pkg.github.com.

To Reproduce
Steps to reproduce the behaviour:

  1. Turn on automation for a workload: fluxcd.io/automated: "true"
  2. Use an image from docker.pkg.github.com

Expected behavior
Expected images to be listed on fluxctl list-images

WORKLOAD                                     CONTAINER                IMAGE                     CREATED
name:deployment/name                         container-name                                       image data not available
                                                                      '-> (untagged)            ?

Logs

caller=images.go:159 component=sync-loop err="fetching image metadata for docker.pkg.github.com/owner/repository/image_name: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
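
The parse error in this log is consistent with a client that expects a JSON error body but receives plain text: `404` on its own is a complete JSON number, so decoding only fails at the `p` in `page`. A minimal Python sketch of that failure mode follows. Flux itself is written in Go, and `parse_registry_error` is a hypothetical helper for illustration, not Flux code.

```python
import json


def parse_registry_error(body: str):
    """Return the decoded JSON error body, or None if the body is not JSON.

    Hypothetical helper, for illustration only.
    """
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        # "404 page not found\n" begins with the valid JSON number 404,
        # so the decoder only fails on the text that follows it.
        return None


# Plain-text 404 page, as seen in the log above.
print(parse_registry_error("404 page not found\n"))
# A JSON error body of the kind a registry client expects.
print(parse_registry_error('{"errors": [{"code": "MANIFEST_UNKNOWN"}]}'))
```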

Additional context
Add any other context about the problem here, e.g

  • Flux version: 1.15.0
  • Helm Operator version:

    • Chart: flux-0.15.0

    • App version: 1.15.0

  • Kubernetes version: 1.14.7
  • Git provider: GitHub
  • Container registry provider: GitHub

Flux is able to:

  • Pull images
  • Read / write to the private repository

Labels: blocked, bug

Most helpful comment

After a short investigation it seems that we are trying to fetch image metadata from docker.pkg.github.com/owner/repository/image_name/manifests/<tag> while it should be docker.pkg.github.com/v2/owner/repository/image_name/manifests/<tag>.

Example: 404 error as reported in log
https://docker.pkg.github.com/stashed/stash/stash/manifests/v0.9.0-rc.0-25-g1aa27c95_linux_amd64

Example: authentication request when /v2/ is added
https://docker.pkg.github.com/v2/stashed/stash/stash/manifests/v0.9.0-rc.0-25-g1aa27c95_linux_amd64

This makes me think that we do not make (correct) use of the API version check as defined in the OCI Distribution spec and Docker Distribution spec.


Another thing worth knowing about GitHub's Docker Registry is that authentication is always required, so an image pull secret must always be present.
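
For reference, the two URL shapes compared in this comment differ only in the `/v2/` prefix that the Docker Registry HTTP API V2 places before the repository name. The sketch below builds the prefixed form. `manifest_url` is a hypothetical helper, and it assumes the image is written as `host/name`; it says nothing about how Flux builds its requests.

```python
def manifest_url(image: str, reference: str) -> str:
    """Build a Registry HTTP API V2 manifest URL for an image.

    `image` is assumed to have the form host/name, for example
    docker.pkg.github.com/owner/repository/image_name.
    """
    host, _, name = image.partition("/")
    if not name:
        raise ValueError(f"expected host/name, got {image!r}")
    # The API version prefix sits between the host and the repository name.
    return f"https://{host}/v2/{name}/manifests/{reference}"


print(manifest_url("docker.pkg.github.com/stashed/stash/stash",
                   "v0.9.0-rc.0-25-g1aa27c95_linux_amd64"))
```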

All 24 comments

I have the exact same issue.

ts=2019-10-28T12:50:50.2737575Z caller=images.go:17 component=sync-loop msg="polling for new images for automated workloads"
ts=2019-10-28T12:50:50.3553961Z caller=images.go:159 component=sync-loop err="fetching image metadata for docker.pkg.github.com/owner/repository/image_name: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""

Would this also explain why I'm seeing problems with glob filter automation on Docker Hub images? Right now we have an automated release with glob pattern 0.* that fails because of this:

ts=2019-11-04T08:32:20.906257839Z caller=images.go:106 component=sync-loop workload=mod-sirius:helmrelease/data-collector container=chart-image repo=docker.io/statisticsnorway/data-collector pattern=glob:0.* current=docker.io/statisticsnorway/data-collector warning="image with zero created timestamp" current="docker.io/statisticsnorway/data-collector (0001-01-01 00:00:00 +0000 UTC)" latest="docker.io/statisticsnorway/data-collector:0.16 (2019-11-04 08:31:30.2717874 +0000 UTC)" action="skip container"

When I check fluxctl list-images I see

mod-sirius:helmrelease/data-collector                    chart-image                                                                                 
                                                                                                   |   0.16                                          04 Nov 19 08:31 UTC
                                                                                                   |   0.15                                          30 Oct 19 08:10 UTC
                                                                                                   |   latest                                        30 Oct 19 08:10 UTC
                                                                                                   |   0.14                                          28 Oct 19 23:07 UTC
                                                                                                   |   0.13                                          28 Oct 19 13:33 UTC
                                                                                                   |   0.12                                          28 Oct 19 11:11 UTC
                                                                                                   |   0.11                                          28 Oct 19 08:55 UTC
                                                                                                   |   0.10                                          28 Oct 19 07:34 UTC
                                                                                                   |   0.9                                           27 Oct 19 17:21 UTC
                                                                                                   |   0.8                                           27 Oct 19 12:16 UTC
                                                                                                   '-> (untagged)                                    ?

So it seems that because the image is considered untagged, the automated glob pattern release will never work: the running image will always be treated as having a zero created timestamp. We're moving this project to semver to hopefully avoid this problem, but I'm writing this in case others have the same problem. And please correct me if I'm wrong here.

@kozejonaz no, this issue is likely because you untagged an image that is still running in the cluster, so Flux is now unable to determine whether the running image is newer or older than any of the tags it receives for the image. Manually moving it forward should enable automation again; this is something Flux cannot (safely) decide for you.

@hiddeco Thanks for the quick reply. If manual means another Git commit to the HelmRelease (which is what we're using here), that is already how I deployed a newer version. The failure is still there. Or did you mean something else?

@kozejonaz hope this illustrates the problem better:

  1. Flux deploys the `HelmRelease` on version `0.8.0`.
  2. In the Docker registry, version `0.8.0` is _somehow_ untagged.
  3. Flux periodically updates its cache of image tags; it notices the removal of `0.8.0` and prunes it from the cache.
  4. A new image version is tagged and pushed as `0.9.0`. Because the metadata for `0.8.0` has been pruned, Flux can no longer tell when `0.8.0` was pushed, so it cannot decide whether `0.9.0` is newer or older than what it is currently running and won't update the tag in the resource.
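
The scenario above can be sketched as a comparison that has nothing to compare against. This is a simplified Python illustration, not Flux's actual logic: `next_action` and its return strings are hypothetical, with "skip container" borrowed from the log line quoted earlier in the thread.

```python
from datetime import datetime, timezone
from typing import Optional

# Stand-in for Go's zero time.Time, which the log prints as
# "0001-01-01 00:00:00 +0000 UTC".
ZERO_TIME = datetime(1, 1, 1, tzinfo=timezone.utc)


def next_action(current_created: Optional[datetime], latest_created: datetime) -> str:
    """Decide what to do with a container, given image creation times.

    Hypothetical simplification of the decision described above.
    """
    if current_created is None or current_created == ZERO_TIME:
        # Without a creation time for the running image, "newer" is undefined.
        return "skip container"
    if latest_created > current_created:
        return "update"
    return "up to date"


latest = datetime(2019, 11, 4, 8, 31, 30, tzinfo=timezone.utc)
print(next_action(ZERO_TIME, latest))
print(next_action(datetime(2019, 10, 30, 8, 10, tzinfo=timezone.utc), latest))
```

The first call models the pruned `0.8.0` case and the second models a running image whose metadata is still known.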

@hiddeco Thanks for the detailed response. Yeah, I can see that this situation would cause some weirdness. However, from what I could understand from the devs pushing these images, these tags have been there the whole time, and nothing has been untagged. I will of course test this a bit more to be sure. Hopefully you are correct. I'll get back to you if I find anything else. Thanks again.

> This makes me think that we do not make (correct) use of the API version check as defined in the OCI Distribution spec and Docker Distribution spec.

You can log the requests made to the registry by using the flag --registry-trace. That should tell you whether the API is being used correctly -- I would hope so, since Flux uses docker/distribution to interact with image registries.

> Flux uses docker/distribution to interact with image registries.

This may actually be a clue, as our version of this dependency is tied to Fons's (stale) fork, see: https://github.com/fluxcd/flux/blob/master/go.mod#L6

Am in the process of building an environment to make use of the --registry-trace flag, will report back if I find anything interesting.

Need to fix the billing settings of my GitHub account before I can continue, but this is what I was able to get from the logs thus far:

ts=2020-02-11T00:33:04.916537427Z caller=warming.go:198 component=warmer info="refreshing image" image=docker.pkg.github.com/hiddeco/podinfo/podinfo tag_count=2 to_update=2 of_which_refresh=0 of_which_missing=2
ts=2020-02-11T00:33:04.916967353Z caller=repocachemanager.go:246 component=warmer canonical_name=docker.pkg.github.com/hiddeco/podinfo/podinfo auth="{map[docker.pkg.github.com:<registry creds for [email protected], from default:secret/github-registry>]}" trace="refreshing manifest" ref=docker.pkg.github.com/hiddeco/podinfo/podinfo:docker-base-layer previous_refresh=1h0m0s
ts=2020-02-11T00:33:04.917054329Z caller=repocachemanager.go:246 component=warmer canonical_name=docker.pkg.github.com/hiddeco/podinfo/podinfo auth="{map[docker.pkg.github.com:<registry creds for [email protected], from default:secret/github-registry>]}" trace="refreshing manifest" ref=docker.pkg.github.com/hiddeco/podinfo/podinfo:something previous_refresh=1h0m0s
ts=2020-02-11T00:33:05.136764481Z caller=client_factory.go:42 component=registry url=https://docker.pkg.github.com/v2/hiddeco/podinfo/podinfo/manifests/something status="200 OK"
ts=2020-02-11T00:33:05.367020627Z caller=client_factory.go:42 component=registry url=https://docker.pkg.github.com/v2/hiddeco/podinfo/podinfo/manifests/docker-base-layer status="404 Not Found"
ts=2020-02-11T00:33:05.368172457Z caller=repocachemanager.go:223 component=warmer canonical_name=docker.pkg.github.com/hiddeco/podinfo/podinfo auth="{map[docker.pkg.github.com:<registry creds for [email protected], from default:secret/github-registry>]}" warn="manifest for tag docker-base-layer missing in repository docker.pkg.github.com/hiddeco/podinfo/podinfo" impact="flux will fail to auto-release workloads with matching images, ask the repository administrator to fix the inconsistency"
ts=2020-02-11T00:33:05.483915961Z caller=client_factory.go:42 component=registry url=https://docker.pkg.github.com/v2/hiddeco/podinfo/podinfo/blobs/sha256:ec2f218e3268a10cb18cf7f83035d261d84a960baacdda5acbfd51ac7bb121c1 status="403 Forbidden"
ts=2020-02-11T00:33:05.485283554Z caller=repocachemanager.go:226 component=warmer canonical_name=docker.pkg.github.com/hiddeco/podinfo/podinfo auth="{map[docker.pkg.github.com:<registry creds for [email protected], from default:secret/github-registry>]}" err="denied: Encountered a billing-related error. Please verify the billing status for this account." ref=docker.pkg.github.com/hiddeco/podinfo/podinfo:docker-base-layer
ts=2020-02-11T00:33:05.48555372Z caller=warming.go:206 component=warmer updated=docker.pkg.github.com/hiddeco/podinfo/podinfo successful=0 attempted=2

@endrec thanks for digging this up, looks like this is an issue on GitHub's side and there is not much we can do about it until they hide this 'internal tag' in their registry API responses.

Googling docker-base-layer returns an endless list of GitHub pages that now respond with an HTTP 404, so it looks like they have already hidden the tag from the GitHub UI. Given this, I have given unsolicited advice on the post to also exclude the tag from registry API responses, which would resolve this issue.

@hiddeco Would it be useful and feasible to add a flag similar to --registry-exclude-images to exclude tags? Or just supporting tags on the existing flag?
It would solve the issue with any future repository providers who might decide to abuse tags.

That sounds like a good idea, and allows us to give an alternative other than _tell the registry's administrator to fix their corrupted tag_

@endrec I agree with Fons that this would be a great alternative, can you make a new enhancement issue for this, and maybe include a proposal of what the values would look like? Do we for example want to target images _and_ tags, or do we also want to be able to blacklist a certain tag for _all images_?

My advice to not list the docker-base-layer tag in the /v2/<name>/tags/list endpoint of the GitHub registry has been forwarded to the responsible team, and the person responsible for community communications will get back to me soon about their opinion.

Some images have many failing tags (see https://github.com/fluxcd/flux/issues/1701#issuecomment-585269757 ), so we need to make sure that use case is covered too.

So I just ran into this in my testing. I just want to make sure I'm understanding the current state... are we just currently out of luck as far as getting automated image updates working with the github package registry?

And as a followup, could you maybe simply filter out these tags from consideration altogether, and only work with the set of tags that actually have manifests?

I think I answered my own question here... I think the only reason I ran into this was because I didn't actually get the filter annotations correctly applied for my helm release, and it's using the glob: * pattern, which is not my intention. Am assuming once I get the annotations right, I'll be gtg...

> Am assuming once I get the annotations right, I'll be gtg...

I think so; fluxd will try to scan all tags, but if you are filtering out the problematic images it won't care if it couldn't get data for them.
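
A rough way to picture this: a tag with missing metadata only matters if the tag filter selects it. The Python sketch below uses the two tags from the trace earlier in the thread (`something` returned 200, `docker-base-layer` returned 404). `blocking_tags` is a hypothetical helper, and the assumption that only matching tags can block automation is taken from the comment above and the `impact=` log message, not from Flux's source.

```python
from fnmatch import fnmatchcase
from typing import Dict, List


def blocking_tags(manifest_available: Dict[str, bool], pattern: str) -> List[str]:
    """Return tags that match `pattern` but have no usable manifest.

    `manifest_available` maps tag name to whether its manifest could be fetched.
    Hypothetical helper, for illustration only.
    """
    return sorted(
        tag for tag, available in manifest_available.items()
        if not available and fnmatchcase(tag, pattern)
    )


# Tag states as reported in the registry trace above.
tags = {"something": True, "docker-base-layer": False}

print(blocking_tags(tags, "*"))      # the catch-all pattern selects the broken tag
print(blocking_tags(tags, "some*"))  # a narrower pattern never sees it
```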

Yeah, confirmed. Once I got the filter annotations right, this worked fine.

I still am a little perplexed as to why the scanning process is so spammy, especially when flux is initially installed and hasn't even cloned the repo yet (it scans all tags for all images that happen to be installed in my cluster at the time I install flux, which have nothing to do with the repo I gave to flux to manage), but I found a way to turn that off as well. (see my comments on #2845 and #2780). The default behavior here has been an interesting bottleneck for me.

Assuming there's nothing sensitive, would you mind sharing your filter annotations? We are seeing this same issue. We don't want to pass over entire images, just the problematic base layer tags. We've tried

exclude-images: *:docker-base-layer

but that was not successful.

well for us, we don't want flux scanning the world when it's only managing a small number of specific images against a specific branch in a specific repo, so we basically blacklist the big ones and then whitelist the ones we actually care about:

  --set registry.excludeImage="docker.io/*\,index.docker.io/*\,quay.io/*\,k8s.gcr.io/*" \
  --set registry.includeImage="our.specific.image(s).that.we.care.about/*"
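
As a rough mental model of how such exclude and include lists might combine, here is a small Python sketch. The semantics are an assumption (any exclude match wins, and a non-empty include list acts as an allowlist) and should be checked against the Flux documentation for your version. `should_scan` and `registry.example.com` are hypothetical names.

```python
from fnmatch import fnmatchcase
from typing import Sequence


def should_scan(image: str, exclude: Sequence[str], include: Sequence[str]) -> bool:
    """Decide whether an image repository would be scanned.

    Assumed semantics: an exclude match always wins, and a non-empty
    include list restricts scanning to images that match it.
    """
    if any(fnmatchcase(image, pattern) for pattern in exclude):
        return False
    if include:
        return any(fnmatchcase(image, pattern) for pattern in include)
    return True


# Exclude patterns from the command above; the include pattern is a placeholder.
exclude = ["docker.io/*", "index.docker.io/*", "quay.io/*", "k8s.gcr.io/*"]
include = ["registry.example.com/*"]

print(should_scan("docker.io/library/nginx", exclude, include))
print(should_scan("registry.example.com/team/app", exclude, include))
```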

https://github.com/fluxcd/flux/issues/2516#issuecomment-630212546
I also did this, but it is not a great solution if you also care about images that Flux gives warnings for and that are out of your control.

So Flux cannot work with the GitHub registry at the moment?
