Steps to reproduce the issue:
docker context inspectDescribe the results you received:
It consistently takes more than 1 second to run this command, often more than 2 seconds.
Describe the results you expected:
Should take 100 ms or less. The default context, pointing to Docker Desktop engine, is the only context on the machine and it should be really straightforward to "inspect" it.
Additional information you deem important (e.g. issue happens only occasionally):
I took a trace (available upon request) and noticed that the majority of time is spent waiting on results from NetUserGetInfo Windows API. On my machine this API is called for a default Docker Desktop context inspection, which has no user information. Ultimately this accounts for more than 2 seconds of latency (for that run of the command)
May be related to https://github.com/docker/cli/issues/1739
Output of docker version:
Client: Docker Engine - Community
Version: 19.03.8
API version: 1.40
Go version: go1.12.17
Git commit: afacb8b
Built: Wed Mar 11 01:23:10 2020
OS/Arch: windows/amd64
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 19.03.8
API version: 1.40 (minimum version 1.12)
Go version: go1.12.17
Git commit: afacb8b
Built: Wed Mar 11 01:29:16 2020
OS/Arch: linux/amd64
Experimental: true
containerd:
Version: v1.2.13
GitCommit: 7ad184331fa3e55e52b890ea95e65ba581ae3429
runc:
Version: 1.0.0-rc10
GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
docker-init:
Version: 0.18.0
GitCommit: fec3683
Output of docker info:
Client:
Debug Mode: false
Plugins:
app: Docker Application (Docker Inc., v0.8.0)
buildx: Build with BuildKit (Docker Inc., v0.3.1-tp-docker)
Server:
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 25
Server Version: 19.03.8
Storage Driver: overlay2
Backing Filesystem: <unknown>
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 4.19.76-linuxkit
Operating System: Docker Desktop
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.846GiB
Name: docker-desktop
ID: KLND:UUTX:OIZ2:IUWS:KHMX:VVTK:SBMI:GHS3:VCM4:WPUY:4TJM:SXMY
Docker Root Dir: /var/lib/docker
Debug Mode: true
File Descriptors: 33
Goroutines: 49
System Time: 2020-04-06T15:18:37.131345023Z
EventsListeners: 3
Registry: https://index.docker.io/v1/
Labels:
Experimental: true
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine
Additional environment details (AWS, VirtualBox, physical, etc.):
It may not be limited to just docker context inspect calls. On my system, even doing something as simple as docker help takes 7-8 seconds.
Yep, I have confirmed that docker help also somehow ends up calling NetUserGetInfo... 馃槙
For some context, my machine is on VPN, a good 3000 miles away from the corporate network. So, any network calls that cross that gap are gonna end up being pretty slow--I'm sure that my observed 7-8 seconds is well above average.
I looked up the global averages in our telemetry, docker context inspect takes roughly 600 ms on Mac/Linux and roughly 1300ms on Windows. More here: https://github.com/microsoft/vscode-docker/issues/1804#issuecomment-607358691
@silvin-lubecki @gtardif looks like the issue is still not completely fixed. I have built the binary from the master branch and am observing a lot of variance between the runs. The first run of the context inspect command is almost always is very slow. The following commands were executed from a freshly open Windows command window:
C:\code>\tools\timecmd.bat \scratch\docker-windows-amd64.exe context inspect
[
{
"Name": "default",
"Metadata": {
"StackOrchestrator": "swarm"
},
"Endpoints": {
"docker": {
"Host": "npipe:////./pipe/docker_engine",
"SkipTLSVerify": false
}
},
"TLSMaterial": {},
"Storage": {
"MetadataPath": "\u003cIN MEMORY\u003e",
"TLSPath": "\u003cIN MEMORY\u003e"
}
}
]
command took 0:0:2.41 (2.41s total)
C:\code>\tools\timecmd.bat \scratch\docker-windows-amd64.exe context inspect
[
{
"Name": "default",
"Metadata": {
"StackOrchestrator": "swarm"
},
"Endpoints": {
"docker": {
"Host": "npipe:////./pipe/docker_engine",
"SkipTLSVerify": false
}
},
"TLSMaterial": {},
"Storage": {
"MetadataPath": "\u003cIN MEMORY\u003e",
"TLSPath": "\u003cIN MEMORY\u003e"
}
}
]
command took 0:0:0.10 (0.10s total)
It is enough to wait about 30 seconds between docker context inspect invocations to observe the slow execution again. I bet the Windows user info is still being retrieved over the network, but it is cached for a short while by Windows, that is why quick successful invocations are fast.
The docker.exe binary that comes with Docker Desktop for Windows 2.2.0.5 behaves the same way BTW.
@karolz-ms curious; where do those MetadataPaths (<IN MEMORY>) come from?
FWIW, with https://github.com/docker/cli/pull/2424 applied, I tried reproducing in a Linux container (I was curious if the difference was between docker context ls, which I tried before, and docker context inspect), on Linux I didn't reproduce the problem though;
docker context create faulty --docker host=tcp://42.42.42.41:4242
docker context create local --docker host=unix:///var/run/docker.sock
docker context use faulty
time docker context ls
NAME DESCRIPTION DOCKER ENDPOINT KUBERNETES ENDPOINT ORCHESTRATOR
default Current DOCKER_HOST based configuration unix:///var/run/docker.sock swarm
faulty * tcp://42.42.42.41:4242
local unix:///var/run/docker.sock
real 0m 0.33s
user 0m 0.04s
sys 0m 0.04s
time docker context inspect local
[
{
"Name": "local",
"Metadata": {},
"Endpoints": {
"docker": {
"Host": "unix:///var/run/docker.sock",
"SkipTLSVerify": false
}
},
"TLSMaterial": {},
"Storage": {
"MetadataPath": "/root/.docker/contexts/meta/25bf8e1a2393f1108d37029b3df5593236c755742ec93465bbafa9b290bddcf6",
"TLSPath": "/root/.docker/contexts/tls/25bf8e1a2393f1108d37029b3df5593236c755742ec93465bbafa9b290bddcf6"
}
}
]
real 0m 0.36s
user 0m 0.05s
sys 0m 0.03s
time docker context inspect faulty
[
{
"Name": "faulty",
"Metadata": {},
"Endpoints": {
"docker": {
"Host": "tcp://42.42.42.41:4242",
"SkipTLSVerify": false
}
},
"TLSMaterial": {},
"Storage": {
"MetadataPath": "/root/.docker/contexts/meta/9ec1600e60d2aba399d84db8163ec37fc67cd60096cb32a66f0a52280797ee95",
"TLSPath": "/root/.docker/contexts/tls/9ec1600e60d2aba399d84db8163ec37fc67cd60096cb32a66f0a52280797ee95"
}
}
]
real 0m 0.34s
user 0m 0.03s
sys 0m 0.05s
Oh, never mind; looks like the <IN MEMORY> is a default that's set here; https://github.com/docker/cli/blob/0f337f1dfe574eb12eab8bb102a24f714cc79d86/cli/command/defaultcontextstore.go#L212
Same result when using / inspecting the default
docker context use default
time docker context inspect
[
{
"Name": "default",
"Metadata": {
"StackOrchestrator": "swarm"
},
"Endpoints": {
"docker": {
"Host": "unix:///var/run/docker.sock",
"SkipTLSVerify": false
}
},
"TLSMaterial": {},
"Storage": {
"MetadataPath": "\u003cIN MEMORY\u003e",
"TLSPath": "\u003cIN MEMORY\u003e"
}
}
]
real 0m 0.29s
user 0m 0.04s
sys 0m 0.05s
I can still consistently get long times on Windows with the current build. docker context inspect took an average of 3028 ms over 100 runs; docker info took an average of 7297 ms over 50 runs.
@thaJeztah @gtardif any news on this issue?
You know, we are a Windows shop here 鈽猴笍 so let us know if we can help in any way; run some experiments or whatever.
The NetGetUserInfo call probably comes for getting the HomeDir of the user... we recently changed the way we get the homedir to the golang base library implementation.
@thaJeztah can we try to revert this PR in some branch to see if it fixes the issue ?
@simonferquel we went through a couple of changes there; https://github.com/docker/cli/pull/2101, which was later (partially?) reverted in (https://github.com/docker/cli/pull/2111); could you check which is the problematic one?
I also have https://github.com/docker/cli/pull/2161 pending review, which should at least not call it by default
Hmm, maybe we should ask for Microsoft guys here what is the correct (and most efficient) way to get the home dir on Windows using win32 APIs and plug homedir.Get onto that for Windows ?
The most reliable way is to call SHGetKnownFolderPath with FOLDERID_Profile. USERPROFILE environment variable should point to the same thing, but might be set to something else by a user looking for trouble.
Somebody overwriting that environment variable is asking for so much trouble that I think it's an acceptable risk to use it. :smile:
So config.go seems to be using _both_ "our" (github.com/docker/docker/pkg/homedir) and Golang's os.UserHomeDir();
https://github.com/docker/cli/blob/d05846d2cb1917d06bd72b551b5a37552cee53e9/cli/config/config.go#L30-L32
The github.com/docker/docker/pkg/homedir package for windows looks to do just that (get USERPROFILE env var, and nothing else); https://github.com/docker/cli/blob/e7f720b0a6cbc4611194426f91e2392962faacc3/vendor/github.com/docker/docker/pkg/homedir/homedir_windows.go#L7-L18
os.UserHomeDir() also seems to do the same 馃 https://github.com/golang/go/blob/go1.13.10/src/os/file.go#L462-L472
func UserHomeDir() (string, error) {
env, enverr := "HOME", "$HOME"
switch runtime.GOOS {
case "windows":
env, enverr = "USERPROFILE", "%userprofile%"
case "plan9":
env, enverr = "home", "$home"
}
if v := Getenv(env); v != "" {
return v, nil
}
Hmm so do we have tooling to see the go stacktrace calling the problematic win32 API?
I will try to find out and report back here
@thaJeztah @simonferquel I was able to find the issue: it is in the klog logging package that you guys are vendoring. This package calls os/user.Current() upon initialization, which on Windows does all sorts of domain lookups.
These lookups causes problems on some versions of Windows OS, but much more importantly (and this is what this bug is about), this way of obtaining current user information for logging purposes is just too damn slow for a CLI utility, and probably not really necessary either.
Hope this helps!
@gtardif FYI
How about pulling this fix?
Great find; It's vendored as a dependency of the docker stack deploy to kubernetes, so we matched the version that's used with the version of kubernetes we vendor; let me check if vendoring a more current version would be problematic
I see the fix was back ported to v1.x; https://github.com/kubernetes/klog/pull/124, but is not yet in a release; https://github.com/kubernetes/klog/compare/v1.0.0...release-1.x
I may have one more suggestion for that fix; think it would make sense to lazily look up the username, given that on Linux looking up the user may also make network connections for external users (if cgo is enabled and the osusergo build-tag is not set); see https://github.com/golang/go/issues/23265 and https://github.com/golang/go/blob/go1.13.10/src/os/user/user.go#L8-L15
@dims any chance on tagging a v1.0.1 release?
Actually perhaps it would be good to have a change in Go itself, and add support for the osusergo build-tag for Windows, similar to https://github.com/golang/go/commit/62f0127d8104d8266d9a3fb5a87e2f09ec8b6f5b implemented for Unix/Linux
@karolz-ms perhaps there's interest from Microsoft's side to contribute such a change to Golang? (looking at the PR you linked, it looks like user.Current() is not supported on Nano Server, so possibly other Golang projects on Windows hit the same issue
opened https://github.com/kubernetes/klog/pull/143 to perform userName lookup lazily instead of on init()
opened https://github.com/docker/cli/pull/2515 and https://github.com/docker/cli/pull/2516 to (temporarily) vendor tip of release-1.x from k8s.io/klog
@thaJeztah Ack on 143, let me try to get some reviews on that first.
@thaJeztah thanks for quick reply, much appreciated
Actually perhaps it would be good to have a change in Go itself, and add support for the osusergo build-tag for Windows, similar to golang/go@62f0127 implemented for Unix/Linux
@karolz-ms perhaps there's interest from Microsoft's side to contribute such a change to Golang? (looking at the PR you linked, it looks like user.Current() is not supported on Nano Server, so possibly other Golang projects on Windows hit the same issue
This sounds like a good idea, although such work would be outside of my area of expertise. I will ask around, but do you happen to know about any Microsoft people who did contribute to Golang runtime in the past?
do you happen to know about any Microsoft people who did contribute to Golang runtime in the past?
Hmmm, no, not from the top of my head. I made a contribution myself once, but their flow is not a regular "github" flow, so I need to re-read the instructions again 馃槄. In this case, it'd probably be good to look if a reasonable "fallback" exists (that's generic enough), which probably should be done by someone with a better understanding of the Windows mechanisms than I have
This should be fixed through https://github.com/docker/cli/pull/2516
There are definitely substantial improvements in the latest build from master in my testing. I know that my results before were atypically slow so take the % improvements with a grain of salt.
docker context inspect averaged 111 ms over 50 calls (down 96% from 3028 ms before).docker context ls averaged 102 ms over 50 calls (down 95% from 2210 ms before).docker -v averaged 124 ms over 50 calls (down 95% from 2255 ms before).docker info is still a little sluggish at 6525 ms over 50 calls (down 11% from 7297 ms before). However, the VSCode Docker extension only calls this in very rare circumstances, so this isn't much of a concern for us.Globally, the average for docker context inspect on Windows looked to be around 1300 ms, so this marks a 91% drop from that. That's amazing!
Thanks everyone! This is a huge help for us!
Thanks @bwateratmsft!
docker infois still a little sluggish at 6525 ms over 50 calls (down 11% from 7297 ms before)
The /info endpoint has become a bit of a "kitchen-sink" of any information related to the daemon. While it contains a lot of useful information, unfortunately it also means the endpoint itself has become way more heavy than it should be (especially if you're only looking for a subset of the info)
We're aware of this, but (admittedly) haven't come round to design a better approach (I was thinking of having an optional query parameter to limit the returned information to specific parts, but ideally the struct that's returned would also be more structured).
Is there specific information that VSCode uses from the info output? (Perhaps there's alternatives).
Globally, the average for docker context inspect on Windows looked to be around 1300 ms, so this marks a 91% drop from that. That's amazing!
Thanks! Erm, so it turns out there's a regression (https://github.com/docker/cli/issues/2533) related to the first change (https://github.com/docker/cli/pull/2424), I _think_ that fix was not the "main contributor" to the docker context inspect issue on Windows, but it may mean that for 19.03 we may have to do a partial revert (but continue looking at re-adding the initial version negotiation for the next release)
That makes sense. I looked at where we call info, turns out we use Dockerode for that, not the CLI. There's three places, all user-initiated:
info endpoint is definitely not the choke point, since we have to copy vsdbg into the container.Only the first of those three has any significant usage. We're querying the Docker server OSType, indirectly to find out the container OS, which we use to decide whether we should launch cmd or bash.
Curiously, Dockerode takes only ~10-20 ms to call that endpoint. Any ideas on the discrepancy?
Most helpful comment
@thaJeztah @simonferquel I was able to find the issue: it is in the
kloglogging package that you guys are vendoring. This package callsos/user.Current()upon initialization, which on Windows does all sorts of domain lookups.These lookups causes problems on some versions of Windows OS, but much more importantly (and this is what this bug is about), this way of obtaining current user information for logging purposes is just too damn slow for a CLI utility, and probably not really necessary either.
Hope this helps!
@gtardif FYI