Cli: Docker Desktop on Windows: 'docker context inspect' is very slow

Created on 6 Apr 2020  路  30Comments  路  Source: docker/cli

Steps to reproduce the issue:

  1. Install Docker Desktop on Windows (tested on Windows 10 Enterpirse version 1909 build 18363.720, Docker Desktop 2.2.0.4)
  2. Start Docker Desktop.
  3. Open command prompt and do docker context inspect

Describe the results you received:
It consistently takes more than 1 second to run this command, often more than 2 seconds.

Describe the results you expected:
Should take 100 ms or less. The default context, pointing to Docker Desktop engine, is the only context on the machine and it should be really straightforward to "inspect" it.

Additional information you deem important (e.g. issue happens only occasionally):

I took a trace (available upon request) and noticed that the majority of time is spent waiting on results from NetUserGetInfo Windows API. On my machine this API is called for a default Docker Desktop context inspection, which has no user information. Ultimately this accounts for more than 2 seconds of latency (for that run of the command)

May be related to https://github.com/docker/cli/issues/1739

Output of docker version:

Client: Docker Engine - Community
 Version:           19.03.8
 API version:       1.40
 Go version:        go1.12.17
 Git commit:        afacb8b
 Built:             Wed Mar 11 01:23:10 2020
 OS/Arch:           windows/amd64
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          19.03.8
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.17
  Git commit:       afacb8b
  Built:            Wed Mar 11 01:29:16 2020
  OS/Arch:          linux/amd64
  Experimental:     true
 containerd:
  Version:          v1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

Output of docker info:

Client:
 Debug Mode: false
 Plugins:
  app: Docker Application (Docker Inc., v0.8.0)
  buildx: Build with BuildKit (Docker Inc., v0.3.1-tp-docker)

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 25
 Server Version: 19.03.8
 Storage Driver: overlay2
  Backing Filesystem: <unknown>
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.19.76-linuxkit
 Operating System: Docker Desktop
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 3.846GiB
 Name: docker-desktop
 ID: KLND:UUTX:OIZ2:IUWS:KHMX:VVTK:SBMI:GHS3:VCM4:WPUY:4TJM:SXMY
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 33
  Goroutines: 49
  System Time: 2020-04-06T15:18:37.131345023Z
  EventsListeners: 3
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine

Additional environment details (AWS, VirtualBox, physical, etc.):

arecontext arekubernetes platforwindows

Most helpful comment

@thaJeztah @simonferquel I was able to find the issue: it is in the klog logging package that you guys are vendoring. This package calls os/user.Current() upon initialization, which on Windows does all sorts of domain lookups.

These lookups causes problems on some versions of Windows OS, but much more importantly (and this is what this bug is about), this way of obtaining current user information for logging purposes is just too damn slow for a CLI utility, and probably not really necessary either.

Hope this helps!
@gtardif FYI

All 30 comments

It may not be limited to just docker context inspect calls. On my system, even doing something as simple as docker help takes 7-8 seconds.

Yep, I have confirmed that docker help also somehow ends up calling NetUserGetInfo... 馃槙

For some context, my machine is on VPN, a good 3000 miles away from the corporate network. So, any network calls that cross that gap are gonna end up being pretty slow--I'm sure that my observed 7-8 seconds is well above average.

I looked up the global averages in our telemetry, docker context inspect takes roughly 600 ms on Mac/Linux and roughly 1300ms on Windows. More here: https://github.com/microsoft/vscode-docker/issues/1804#issuecomment-607358691

@silvin-lubecki @gtardif looks like the issue is still not completely fixed. I have built the binary from the master branch and am observing a lot of variance between the runs. The first run of the context inspect command is almost always is very slow. The following commands were executed from a freshly open Windows command window:

C:\code>\tools\timecmd.bat \scratch\docker-windows-amd64.exe context inspect
[
    {
        "Name": "default",
        "Metadata": {
            "StackOrchestrator": "swarm"
        },
        "Endpoints": {
            "docker": {
                "Host": "npipe:////./pipe/docker_engine",
                "SkipTLSVerify": false
            }
        },
        "TLSMaterial": {},
        "Storage": {
            "MetadataPath": "\u003cIN MEMORY\u003e",
            "TLSPath": "\u003cIN MEMORY\u003e"
        }
    }
]
command took 0:0:2.41 (2.41s total)
C:\code>\tools\timecmd.bat \scratch\docker-windows-amd64.exe context inspect
[
    {
        "Name": "default",
        "Metadata": {
            "StackOrchestrator": "swarm"
        },
        "Endpoints": {
            "docker": {
                "Host": "npipe:////./pipe/docker_engine",
                "SkipTLSVerify": false
            }
        },
        "TLSMaterial": {},
        "Storage": {
            "MetadataPath": "\u003cIN MEMORY\u003e",
            "TLSPath": "\u003cIN MEMORY\u003e"
        }
    }
]
command took 0:0:0.10 (0.10s total)

It is enough to wait about 30 seconds between docker context inspect invocations to observe the slow execution again. I bet the Windows user info is still being retrieved over the network, but it is cached for a short while by Windows, that is why quick successful invocations are fast.

The docker.exe binary that comes with Docker Desktop for Windows 2.2.0.5 behaves the same way BTW.

@karolz-ms curious; where do those MetadataPaths (<IN MEMORY>) come from?

FWIW, with https://github.com/docker/cli/pull/2424 applied, I tried reproducing in a Linux container (I was curious if the difference was between docker context ls, which I tried before, and docker context inspect), on Linux I didn't reproduce the problem though;

docker context create faulty --docker host=tcp://42.42.42.41:4242
docker context create local --docker host=unix:///var/run/docker.sock
docker context use faulty

time docker context ls
NAME                DESCRIPTION                               DOCKER ENDPOINT               KUBERNETES ENDPOINT   ORCHESTRATOR
default             Current DOCKER_HOST based configuration   unix:///var/run/docker.sock                         swarm
faulty *                                                      tcp://42.42.42.41:4242
local                                                         unix:///var/run/docker.sock
real    0m 0.33s
user    0m 0.04s
sys 0m 0.04s


time docker context inspect local
[
    {
        "Name": "local",
        "Metadata": {},
        "Endpoints": {
            "docker": {
                "Host": "unix:///var/run/docker.sock",
                "SkipTLSVerify": false
            }
        },
        "TLSMaterial": {},
        "Storage": {
            "MetadataPath": "/root/.docker/contexts/meta/25bf8e1a2393f1108d37029b3df5593236c755742ec93465bbafa9b290bddcf6",
            "TLSPath": "/root/.docker/contexts/tls/25bf8e1a2393f1108d37029b3df5593236c755742ec93465bbafa9b290bddcf6"
        }
    }
]
real    0m 0.36s
user    0m 0.05s
sys 0m 0.03s

time docker context inspect faulty
[
    {
        "Name": "faulty",
        "Metadata": {},
        "Endpoints": {
            "docker": {
                "Host": "tcp://42.42.42.41:4242",
                "SkipTLSVerify": false
            }
        },
        "TLSMaterial": {},
        "Storage": {
            "MetadataPath": "/root/.docker/contexts/meta/9ec1600e60d2aba399d84db8163ec37fc67cd60096cb32a66f0a52280797ee95",
            "TLSPath": "/root/.docker/contexts/tls/9ec1600e60d2aba399d84db8163ec37fc67cd60096cb32a66f0a52280797ee95"
        }
    }
]
real    0m 0.34s
user    0m 0.03s
sys 0m 0.05s

Oh, never mind; looks like the <IN MEMORY> is a default that's set here; https://github.com/docker/cli/blob/0f337f1dfe574eb12eab8bb102a24f714cc79d86/cli/command/defaultcontextstore.go#L212

Same result when using / inspecting the default

docker context use default
time docker context inspect
[
    {
        "Name": "default",
        "Metadata": {
            "StackOrchestrator": "swarm"
        },
        "Endpoints": {
            "docker": {
                "Host": "unix:///var/run/docker.sock",
                "SkipTLSVerify": false
            }
        },
        "TLSMaterial": {},
        "Storage": {
            "MetadataPath": "\u003cIN MEMORY\u003e",
            "TLSPath": "\u003cIN MEMORY\u003e"
        }
    }
]
real    0m 0.29s
user    0m 0.04s
sys 0m 0.05s

I can still consistently get long times on Windows with the current build. docker context inspect took an average of 3028 ms over 100 runs; docker info took an average of 7297 ms over 50 runs.

@thaJeztah @gtardif any news on this issue?

You know, we are a Windows shop here 鈽猴笍 so let us know if we can help in any way; run some experiments or whatever.

The NetGetUserInfo call probably comes for getting the HomeDir of the user... we recently changed the way we get the homedir to the golang base library implementation.
@thaJeztah can we try to revert this PR in some branch to see if it fixes the issue ?

@simonferquel we went through a couple of changes there; https://github.com/docker/cli/pull/2101, which was later (partially?) reverted in (https://github.com/docker/cli/pull/2111); could you check which is the problematic one?

I also have https://github.com/docker/cli/pull/2161 pending review, which should at least not call it by default

Hmm, maybe we should ask for Microsoft guys here what is the correct (and most efficient) way to get the home dir on Windows using win32 APIs and plug homedir.Get onto that for Windows ?

The most reliable way is to call SHGetKnownFolderPath with FOLDERID_Profile. USERPROFILE environment variable should point to the same thing, but might be set to something else by a user looking for trouble.

Somebody overwriting that environment variable is asking for so much trouble that I think it's an acceptable risk to use it. :smile:

So config.go seems to be using _both_ "our" (github.com/docker/docker/pkg/homedir) and Golang's os.UserHomeDir();

https://github.com/docker/cli/blob/d05846d2cb1917d06bd72b551b5a37552cee53e9/cli/config/config.go#L30-L32

https://github.com/docker/cli/blob/d05846d2cb1917d06bd72b551b5a37552cee53e9/cli/config/config.go#L108-L112

The github.com/docker/docker/pkg/homedir package for windows looks to do just that (get USERPROFILE env var, and nothing else); https://github.com/docker/cli/blob/e7f720b0a6cbc4611194426f91e2392962faacc3/vendor/github.com/docker/docker/pkg/homedir/homedir_windows.go#L7-L18

os.UserHomeDir() also seems to do the same 馃 https://github.com/golang/go/blob/go1.13.10/src/os/file.go#L462-L472

func UserHomeDir() (string, error) {
    env, enverr := "HOME", "$HOME"
    switch runtime.GOOS {
    case "windows":
        env, enverr = "USERPROFILE", "%userprofile%"
    case "plan9":
        env, enverr = "home", "$home"
    }
    if v := Getenv(env); v != "" {
        return v, nil
    }

Hmm so do we have tooling to see the go stacktrace calling the problematic win32 API?

I will try to find out and report back here

@thaJeztah @simonferquel I was able to find the issue: it is in the klog logging package that you guys are vendoring. This package calls os/user.Current() upon initialization, which on Windows does all sorts of domain lookups.

These lookups causes problems on some versions of Windows OS, but much more importantly (and this is what this bug is about), this way of obtaining current user information for logging purposes is just too damn slow for a CLI utility, and probably not really necessary either.

Hope this helps!
@gtardif FYI

How about pulling this fix?

Great find; It's vendored as a dependency of the docker stack deploy to kubernetes, so we matched the version that's used with the version of kubernetes we vendor; let me check if vendoring a more current version would be problematic

I see the fix was back ported to v1.x; https://github.com/kubernetes/klog/pull/124, but is not yet in a release; https://github.com/kubernetes/klog/compare/v1.0.0...release-1.x

I may have one more suggestion for that fix; think it would make sense to lazily look up the username, given that on Linux looking up the user may also make network connections for external users (if cgo is enabled and the osusergo build-tag is not set); see https://github.com/golang/go/issues/23265 and https://github.com/golang/go/blob/go1.13.10/src/os/user/user.go#L8-L15

@dims any chance on tagging a v1.0.1 release?

Actually perhaps it would be good to have a change in Go itself, and add support for the osusergo build-tag for Windows, similar to https://github.com/golang/go/commit/62f0127d8104d8266d9a3fb5a87e2f09ec8b6f5b implemented for Unix/Linux

@karolz-ms perhaps there's interest from Microsoft's side to contribute such a change to Golang? (looking at the PR you linked, it looks like user.Current() is not supported on Nano Server, so possibly other Golang projects on Windows hit the same issue

opened https://github.com/kubernetes/klog/pull/143 to perform userName lookup lazily instead of on init()

opened https://github.com/docker/cli/pull/2515 and https://github.com/docker/cli/pull/2516 to (temporarily) vendor tip of release-1.x from k8s.io/klog

@thaJeztah Ack on 143, let me try to get some reviews on that first.

@thaJeztah thanks for quick reply, much appreciated

Actually perhaps it would be good to have a change in Go itself, and add support for the osusergo build-tag for Windows, similar to golang/go@62f0127 implemented for Unix/Linux

@karolz-ms perhaps there's interest from Microsoft's side to contribute such a change to Golang? (looking at the PR you linked, it looks like user.Current() is not supported on Nano Server, so possibly other Golang projects on Windows hit the same issue

This sounds like a good idea, although such work would be outside of my area of expertise. I will ask around, but do you happen to know about any Microsoft people who did contribute to Golang runtime in the past?

do you happen to know about any Microsoft people who did contribute to Golang runtime in the past?

Hmmm, no, not from the top of my head. I made a contribution myself once, but their flow is not a regular "github" flow, so I need to re-read the instructions again 馃槄. In this case, it'd probably be good to look if a reasonable "fallback" exists (that's generic enough), which probably should be done by someone with a better understanding of the Windows mechanisms than I have

This should be fixed through https://github.com/docker/cli/pull/2516

There are definitely substantial improvements in the latest build from master in my testing. I know that my results before were atypically slow so take the % improvements with a grain of salt.

  • docker context inspect averaged 111 ms over 50 calls (down 96% from 3028 ms before).
  • docker context ls averaged 102 ms over 50 calls (down 95% from 2210 ms before).
  • docker -v averaged 124 ms over 50 calls (down 95% from 2255 ms before).
  • docker info is still a little sluggish at 6525 ms over 50 calls (down 11% from 7297 ms before). However, the VSCode Docker extension only calls this in very rare circumstances, so this isn't much of a concern for us.

Globally, the average for docker context inspect on Windows looked to be around 1300 ms, so this marks a 91% drop from that. That's amazing!

Thanks everyone! This is a huge help for us!

Thanks @bwateratmsft!

docker info is still a little sluggish at 6525 ms over 50 calls (down 11% from 7297 ms before)

The /info endpoint has become a bit of a "kitchen-sink" of any information related to the daemon. While it contains a lot of useful information, unfortunately it also means the endpoint itself has become way more heavy than it should be (especially if you're only looking for a subset of the info)

We're aware of this, but (admittedly) haven't come round to design a better approach (I was thinking of having an optional query parameter to limit the returned information to specific parts, but ideally the struct that's returned would also be more structured).

Is there specific information that VSCode uses from the info output? (Perhaps there's alternatives).

Globally, the average for docker context inspect on Windows looked to be around 1300 ms, so this marks a 91% drop from that. That's amazing!

Thanks! Erm, so it turns out there's a regression (https://github.com/docker/cli/issues/2533) related to the first change (https://github.com/docker/cli/pull/2424), I _think_ that fix was not the "main contributor" to the docker context inspect issue on Windows, but it may mean that for 19.03 we may have to do a partial revert (but continue looking at re-adding the initial version negotiation for the next release)

That makes sense. I looked at where we call info, turns out we use Dockerode for that, not the CLI. There's three places, all user-initiated:

  1. Attaching the shell to a container
  2. Running the Azure CLI image
  3. Debugging .NET Core via attach configuration. In this case, the info endpoint is definitely not the choke point, since we have to copy vsdbg into the container.

Only the first of those three has any significant usage. We're querying the Docker server OSType, indirectly to find out the container OS, which we use to decide whether we should launch cmd or bash.

Curiously, Dockerode takes only ~10-20 ms to call that endpoint. Any ideas on the discrepancy?

Was this page helpful?
0 / 5 - 0 ratings