Go: cmd/go: allow proxies to supply only some modules

Created on 11 Jul 2018 · 35 Comments · Source: golang/go

This is a proposal for extending the vgo download API to add a mechanism to allow proxies to redirect vgo to a VCS. This functionality would be useful if a proxy has only a subset of packages in a go.mod file.

Currently, if someone adds two modules "a" v1.0.0 and "b" v1.0.0 to their go.mod file and they then run GOPROXY=myprox.com vgo install, everything works as expected if the proxy has both modules at the given versions. If not, the command fails.

It would be helpful to allow the proxy to tell vgo to fetch one or both of the modules from the VCS if it doesn't have them in its cache. This would be useful for proxy implementations that will not or cannot cache the module in their own storage, or that can cache it but don't yet have that module or version. The Athens project is a present-day use case for the latter: it fills its caches asynchronously.

One implementation possibility is adding a $GOPROXY/a/@v/v1.0.0?go-get=1 network request that expects the same output as already-existing ?go-get=1 requests define. This request could be made before starting the download protocol as it currently exists. The new mechanism would allow the proxy to choose to do one of the following for the given module and version identifier:

  1. Send vgo to the standard download API via the meta tag
  2. Send vgo to the VCS via the meta tag
  3. Return a 404

Labels: FrozenDueToAge, NeedsInvestigation, early-in-cycle, modules

All 35 comments

Update after slack discussion with @myitcv and @zeebo:

  1. we could certainly do cache fills synchronously
  2. there could be some "gotchas" with hosting the proxy, like timeouts when doing a cache fill for a large module (e.g. github.com/kubernetes/kubernetes)
  3. assuming the proxy synchronously fills caches, the user experience would be about the same as if vgo get downloaded from VCS
  4. if the proxy has a partial failure and cannot serve code, GOPROXY=myprox.com vgo get will always fail with the current behavior

Another concern that was brought up is that with GOPROXY set, the current behavior of GOPROXY=myprox.com vgo get github.com/kubernetes/kubernetes doesn't let the proxy redirect vgo to another location (e.g. a CDN); the proxy still has to serve the metadata and the zipped source code itself.

The proxy can redirect vgo to another URL as long as that URL implements the download protocol. That's because the net/http client handles redirect status codes internally (up to 10 times by default).
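The redirect behavior described above can be sketched with two local test servers. This is only an illustration: the handler bodies, the function name fetchViaRedirect, and the module path example.com/a are made up for the example, and the sketch relies on nothing more than the default net/http client following the redirect on its own.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetchViaRedirect starts a stand-in "CDN" and a stand-in "proxy" that
// redirects every request to the CDN, then fetches a path through the proxy.
// Both servers are hypothetical; they only echo the requested path.
func fetchViaRedirect(path string) (string, error) {
	cdn := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "served by cdn: "+r.URL.Path)
	}))
	defer cdn.Close()

	proxy := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// The proxy does not serve the content itself; it points at the CDN.
		http.Redirect(w, r, cdn.URL+r.URL.Path, http.StatusFound)
	}))
	defer proxy.Close()

	// The default client follows the 302 without any extra code here.
	resp, err := http.Get(proxy.URL + path)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	body, err := fetchViaRedirect("/example.com/a/@v/v1.0.0.info")
	fmt.Println(body, err)
}
```

The caller only ever names the proxy URL; the CDN is reached purely through the redirect.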

The big question here is: can the GOPROXY tell vgo (during a build) to use a VCS source instead of the proxy itself? That is a little different from a simple redirect to another URL that implements the download protocol.

@rsc How do you envision the flow of the Proxy telling Vgo to switch to VCS?

I'm trying to follow the vgo code and it seems that on an initial build, the initial contact with the proxy's download protocol can be any of these endpoints: /@latest or /list or /@v/{certainRevision}.info.

This means that the proxy would need to return a consistent code (maybe a 404) on each of these endpoints to signal vgo to reconstruct the modfetch.Repo interface from a vcs source instead of the *proxyRepo one. Vgo would then have to back up a few steps and retry again.

Vgo can potentially always hit /list as the first entry point to the download protocol, and if the list is empty then switch to vcs. Or it can always hit @latest first and if it returns 404 then switch to vcs.

Another solution is that the download protocol could implement a @probe call with a couple of parameters (module path and revision), so that the proxy can tell vgo early on to go to the VCS source for this particular module.

For efficiency, vgo can potentially send the entire list of modules it wants to probe to the proxy.

I haven't fully understood the last 2 days worth of changes to vgo so apologies if I'm a bit off.

I don't think it makes sense for a proxy to tell the go command "go to this VCS instead". We're trying to migrate to proxy by default and while VCS will probably always be with us, I'd rather not mix the two.

I do think it would probably be OK to let GOPROXY be preference list and to also allow some setting like GOPROXY=direct as an explicit name for what the default behavior is. So you could say GOPROXY=https://myproxy/,direct and just let myproxy return a 404 for the things it doesn't know about. Then the proxy isn't in charge of the actual redirect; it's only in charge of "it's not me".
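A minimal sketch of what treating GOPROXY as a preference list could look like on the parsing side. The function name parseProxyList is hypothetical and this is not how cmd/go is actually implemented; it only shows the comma-separated value from the comment above being turned into an ordered list, with "direct" kept as a plain keyword.

```go
package main

import (
	"fmt"
	"strings"
)

// parseProxyList splits a comma-separated GOPROXY value into an ordered
// preference list. The keyword "direct" is passed through unchanged and is
// meant to stand for "fetch from the origin VCS". Illustrative only.
func parseProxyList(goproxy string) []string {
	var out []string
	for _, entry := range strings.Split(goproxy, ",") {
		entry = strings.TrimSpace(entry)
		if entry == "" {
			continue // tolerate stray commas
		}
		out = append(out, entry)
	}
	return out
}

func main() {
	for i, p := range parseProxyList("https://myproxy/,direct") {
		fmt.Printf("%d: %s\n", i+1, p)
	}
}
```

With this shape the proxy never names a fallback itself; the order comes entirely from the user's setting.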

@rsc that's about what I had in mind. The proxy shouldn't say "go to this vcs", it can just say "I don't have this module"

Does that mean the /@probe endpoint makes sense? Since that means Go can just ask the proxy whether it can work with a specific module before it even asks for @latest or /list.

I've altered the Go code if you'd like to look at a reference of what I'm suggesting: https://github.com/marwan-at-work/go/commit/3767be88ba3740d85853a28c2b1715f365d3b3dd

@marwan-at-work I'd rather not have newProxyRepo make any network calls. It turns out to be important to delay those as long as possible. I'd rather have the existing GET paths return 404s and then have the methods be able to return some kind of recognizable "not found error" (maybe satisfying os.IsNotExist is enough) and then something at a higher level will try the next repo method down the list. I think we should wait until Go 1.12 regardless.

@rsc That makes sense since cmd/go checks the cache before making network calls. I'm happy to take on this task if you'd like me to as I'll try to make it work for Athens in the near future.

Either way, I'm happy to know the Proxy can dynamically delegate modules fetching back to Go.

Thanks!

@rsc I have another pass at making this work. This time, we won't hit the network until necessary, and we won't need a @probe endpoint either. I'm hoping to see if this change is not too invasive for 1.11

The idea is that a *proxyRepo can take an alternative Repo interface that it can switch to in case of 404 (or other future codes). Similar to how *cachingRepo works.

Feel free to take a look if you get the chance https://github.com/marwan-at-work/go/commit/5117c8c267e58db3ef1b8cc3531f2fffebe2e9c3

I see other ways of doing this, such as having a top level Repo interface that accepts a slice of Repos and just tries one at a time in order: (cache, proxy, vcs, etc)

So my solution above may of course still not be how you'd like to solve this problem, but I would love to hear your thoughts.

Thanks :)

Since this issue discusses a change to GOPROXY to allow the list of proxies or direct method -

A slightly different case I am thinking of is when the proxy server I use by default is temporarily unreachable or unavailable (possibly in the middle of fetching all dependencies) and we can't even get 404s. Rerunning go get with GOPROXY=direct after noticing the network failure is an option, but I would be happier if I could specify a set of proxy servers, or even a 'direct' fetch option, for fallback.

But @bcmills raised a concern about leaking private package paths in the event of the private proxy failure if we just fall back to the next (direct) blindly.

@hyangah I'm concerned with blindly falling back to git for two reasons:

  1. The proxy won't be able to block certain modules from being downloaded (maybe a company doesn't want their employees to use a certain open source package)
  2. If a proxy is down, maybe a build shouldn't happen at all so that the user is aware of what's happening. And of course, the user can switch to turning off the GOPROXY environment explicitly.

@marwan-at-work The code freeze date is today. Do you plan to mail in the change for review as described in the contribution guidelines? https://golang.org/doc/contribute.html#sending_a_change_github

Change https://golang.org/cl/147177 mentions this issue: cmd/go: fallback to VCS if GOPROXY 404s

https://golang.org/cl/147177 was mailed before the freeze, so I think this will make the cut.

Change https://golang.org/cl/148377 mentions this issue: cmd/go: allow comma separated GOPROXY URLs.

This didn't make 1.12 after all: needs a bit more design work to indicate (and implement) when it's ok to fall back to the origin vs. failing outright.

@hyangah notes an interesting interaction: when we are resolving the module for a given package, the go command today starts by querying for a module at the full package path, then progressively shorter prefixes. We ideally want to do those queries in parallel.

That means that the search space has (at least) two dimensions: one across proxies, and another across paths. Probably we should exhaust all of the paths for the first proxy before we try the next one in the list, and only fall back if it returned a module or a 404 code for each path. That way, the first proxy has the opportunity to reject paths (e.g. due to licensing or vetting policy) before the go command attempts to fetch from a public mirror or the origin server.
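The two dimensions can be made concrete with a short sketch. The helper candidateModulePaths and the proxy URLs are hypothetical; the code only enumerates the package path and its progressively shorter prefixes, then walks them proxy by proxy in the order suggested above (all paths for the first proxy before moving to the next). It does not perform any queries or parallelism.

```go
package main

import (
	"fmt"
	"strings"
)

// candidateModulePaths returns pkgPath followed by progressively shorter
// prefixes, cutting one trailing path element at a time.
func candidateModulePaths(pkgPath string) []string {
	var out []string
	for p := pkgPath; p != ""; {
		out = append(out, p)
		i := strings.LastIndex(p, "/")
		if i < 0 {
			break
		}
		p = p[:i]
	}
	return out
}

func main() {
	// Placeholder proxy names, not real services.
	proxies := []string{"https://proxy1.example", "https://proxy2.example"}
	paths := candidateModulePaths("example.com/org/repo/pkg/sub")

	// Proxy-major order: exhaust every path on one proxy before the next.
	for _, proxy := range proxies {
		for _, p := range paths {
			fmt.Printf("%s -> %s\n", proxy, p)
		}
	}
}
```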

Speaking with @hyangah yesterday, I'm concerned about the approach discussed here.

The scenario we discussed offline yesterday was a company that wants to provide a proxy that serves private modules.

In this scenario the company should run an internal (intermediary) proxy that contains the information about the private modules. This proxy should be configured to have an upstream proxy that it uses for public requests.

If the company wants to have high availability it can run multiple internal proxies.

The client should only be configured to use the internal proxy(ies).

The company can also provide a whitelist / blacklist for the internal proxy for what upstream packages are allowed/disallowed.

@spf13

  1. The use case for an internal company multi-proxy setup is to be able to switch not from one proxy to another, but from one Proxy to VCS. Where a Proxy is not trusted to have credentials but the machine where the Go command is run has VCS credentials. For example, when the Go Module Index comes out but the company is not yet ready to roll their own internal proxy for private modules yet, they should be able to do GOPROXY=moduleIndex,direct go build and trust that both public and private modules are provided.

  2. On another note, I'm also wondering how having multiple mirrors such as the Module Index (and potentially other companies) will play out. If the user can't do GOPROXY=mirror1,mirror2 go build, how would Go be able to get a module from one of many mirrors? Will every user need to have their own proxy implementation that fans out to different proxies? It seems much easier to just do the comma-separated command from the client side.

  3. With the Module Index being built to be the default public mirror, I imagine all other proxy implementations must be aware of it? Meaning, if a module does not exist, we need to redirect the user to the public proxy as a worst case scenario.

Thanks!

@marwan-at-work

  1. It will leak the company's private modules, packages and repo paths to the public Go Module Index, so it's best to avoid. I think it's better for the Go command itself to be configured to route the requests to the right proxies if an internal proxy is hard.

  2. This is a reasonable use case I can think of, but in this case, I think the primary purpose is the redundancy. In this case, 404 HTTP error code-based chaining doesn't seem right. (What if mirror1 is down and can't respond with 404?)

@hyangah

  1. The only thing that will leak is the module path and nothing else. I'm not sure how bad this is, but I can understand that it should be avoided.

  2. My thinking was that if your "first" proxy is down, it's best to stop the build. If your thinking is that we can have multiple proxies for redundancy reasons but not care about what the returned code is, then this is potentially bad because then a proxy will not be able to block a build for security reasons: for example if the first proxy did not allow anyone to download "github.com/malicious/package", it would return a 400 bad request, but then Go will just move on to the next proxy which might not have the same security rule.

I am a little confused about all this discussion. I thought we were going to do, for GOPROXY=proxy1,proxy2,proxy3:

  • Try proxy1. If proxy1 says 200, we're done. If proxy1 says anything but 404, we're also done (error). Otherwise, proxy1 said 404, so continue down the list.
  • Try proxy2. Anything but 404? Done.
  • Try proxy3.

It's an ordered list, not a parallel lookup. By saying GOPROXY=proxy1,proxy2,proxy3 you are _directing_ the go command to send every import path to proxy1. If you should be splitting half your traffic to proxy1 and half to proxy2 and can't send the proxy2 paths to proxy1, then yes, you need a new proxy0 to split the traffic. But that is (1) fine and (2) not the envisioned use case.

The envisioned use case is some company has their own modules on an internal static file server that can pretend to be a Go proxy (because we made static file servers able to do that), and people use GOPROXY=,proxy1. Or another use case is people preferring GoCenter, but that's an incomplete module mirror, so it needs to be backed by a fallback, like "direct" or a more complete mirror. Again, if you care about not leaking paths for proxy2 to proxy1, you wouldn't do this. But there are other cases where you would.

Does anyone object to implementing the above semantics for GOPROXY=proxy1,proxy2,proxy3? If so, please explain why. Thanks.

@rsc I'm not sure the above discussions were about splitting traffic or concurrently pinging the "proxy list". AFAIK, it was about whether we want to have GOPROXY be able to provide multiple URLs or provide only one highly available URL that takes care of proxying to other proxies if it needs to.

I'm happy either way, but the CL from above does what you suggested

What would be the best workaround at the moment? I have a few private modules defined in go.mod, and those are in my GOPROXY. When I try go mod download, it fails for modules that are not in GOPROXY. I can't make sure all the modules and versions are in GOPROXY at all times.

@RohitRox one workaround is to have that proxy run go mod download for anything it doesn't have and serve it back to the client. This way, it can provide all the modules and wouldn't 404.

@marwan-at-work That will be a chore for developers :|

I've found this to be a big problem today when trying to set up a GOPROXY at my company.

There are many instances where there may be a mix of public and private dependencies. The problem we have is that not all of the repos on our internal GitHub instance can be made publicly accessible. If a project has any dependency that is "private", we cannot use GOPROXY at all.

Tools like Athens and JFrog Artifactory can be used to store private Go modules and in addition, GoCenter can be used to fetch public Go modules.

@jorng, for Go 1.13 we expect to add a GONOPROXY environment variable that will let you set GOPROXY to a public proxy but avoid the proxy for modules matching a given pattern.
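A rough sketch of how a "skip the proxy for matching modules" check could work. The comment above only says modules "matching a given pattern", so the details here are assumptions: the patterns are taken to be a comma-separated list of path.Match-style globs compared against the leading elements of the module path, and bypassProxy and the example hostnames are hypothetical.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// bypassProxy reports whether modPath matches any of the comma-separated
// glob patterns. Each pattern is compared against as many leading path
// elements of modPath as the pattern itself has. Assumed semantics only.
func bypassProxy(patterns, modPath string) bool {
	elems := strings.Split(modPath, "/")
	for _, pat := range strings.Split(patterns, ",") {
		pat = strings.TrimSpace(pat)
		if pat == "" {
			continue
		}
		n := strings.Count(pat, "/") + 1
		if len(elems) < n {
			continue
		}
		prefix := strings.Join(elems[:n], "/")
		if ok, _ := path.Match(pat, prefix); ok {
			return true
		}
	}
	return false
}

func main() {
	// Placeholder patterns and module paths.
	patterns := "*.corp.example.com,github.com/myorg"
	for _, m := range []string{
		"git.corp.example.com/team/lib",
		"github.com/myorg/tool",
		"github.com/other/tool",
	} {
		fmt.Println(m, "bypass proxy:", bypassProxy(patterns, m))
	}
}
```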

@rsc: That will be very helpful, at least to work around the issue.

I think the GOAUTH stuff may be the best option, once implemented. I'm imagining setting up a custom proxy that can handle authentication (perhaps using our internal SSO) and gate access appropriately.

Change https://golang.org/cl/173441 mentions this issue: cmd/go: add support for GOPROXY list

@jorng Do you have any help with gos? https://github.com/storyicon/gos

Change https://golang.org/cl/183845 mentions this issue: cmd/go/internal/modfetch: halt proxy fallback if the proxy returns a non-404/410 response for @latest

Any chance https://golang.org/cl/173441 can be backported into 1.12.x?

@mikecook The change to the GOPROXY behavior is too significant for a minor release. Minor releases are meant only for security fixes, serious issues with no workarounds, and documentation fixes. See https://golang.org/wiki/MinorReleases for more information.

You can get the new behavior by updating to Go 1.13.

No: there were a lot of interrelated changes in the fetch paths.

Besides, we don't generally backport features (only critical bug-fixes, which this was not).
