Kind: ipv6 port-forward has a consistent 36s delay on each request

Created on 4 May 2020 · 18 comments · Source: kubernetes-sigs/kind

When deploying an ipv6 kind cluster, all requests through kubectl port-forward are delayed by exactly 36s. This can be reproduced with a variety of backend pods.
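For example (a hypothetical timing check, not part of the original report; the forwarded local port 8080 is a placeholder), a single request through the forwarded port can be timed like this, and per the report it takes roughly 36s on affected IPv6 clusters instead of milliseconds:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Hypothetical check: assumes something like
// `kubectl port-forward <pod> 8080:80` is already running.
func main() {
	start := time.Now()
	resp, err := http.Get("http://localhost:8080/")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	// Per the report above, this consistently takes ~36s on an IPv6 kind cluster.
	fmt.Printf("got %s in %v\n", resp.Status, time.Since(start))
}
```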

containerd logs during request:

May 04 19:42:36 ipv6-control-plane containerd[185]: time="2020-05-04T19:42:36.748236106Z" level=info msg="Portforward for \"379868fecdbb9611e49f15c3306c2cd25f353d1f32d89f7d9f1350c3c3466769\" port []"
May 04 19:42:36 ipv6-control-plane containerd[185]: time="2020-05-04T19:42:36.748292691Z" level=info msg="Portforward for \"379868fecdbb9611e49f15c3306c2cd25f353d1f32d89f7d9f1350c3c3466769\" returns URL \"http://127.0.0.1:42031/portforward/6cArg5VH\""
May 04 19:42:48 ipv6-control-plane containerd[185]: time="2020-05-04T19:42:48.996645446Z" level=info msg="Executing port forwarding command \"/usr/bin/socat - TCP4:localhost:80\" in network namespace \"/var/run/netns/cni-6db781e9-1cc9-2d16-0b8a-5480d0dcd1d1\""
<36s delay>
May 04 19:43:25 ipv6-control-plane containerd[185]: time="2020-05-04T19:43:25.038951740Z" level=info msg="Finish port forwarding for \"379868fecdbb9611e49f15c3306c2cd25f353d1f32d89f7d9f1350c3c3466769\" port 80"

kubelet seems to have no relevant logs
api-server seems to have no relevant logs

Pod <-> Pod traffic and calling the /proxy endpoint are all fast - just port-forward is slow

$ kind version
kind v0.8.0 go1.14.2 linux/amd64
$ k version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T11:56:40Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-30T20:19:45Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Labels: kind/external, lifecycle/active, priority/important-longterm

All 18 comments

@aojea maybe something with the portmap plugin? you somewhat recently PRed changes to this upstream right?

... seems odd that the port forward logs are referencing ipv4 here

... also bringing me back to this issue again https://github.com/containerd/cri/issues/730

/assign
will investigate tomorrow, seems easy to reproduce

@BenTheElder you are right, socat is being invoked with a hard-coded TCP4 address: https://github.com/containerd/cri/blob/master/pkg/server/sandbox_portforward_unix.go#L71.

The only reason this works at all is that socat can bridge between IPv4 and IPv6.

Also, the delay is really a containerd thing, _I think_; check:
https://github.com/containerd/cri/blob/master/pkg/server/sandbox_portforward_unix.go#L81
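A minimal sketch of why the hard-coded TCP4 matters (illustrative only, not containerd code; the port is a placeholder): inside an IPv6-only pod, forcing the IPv4 stack the way `socat - TCP4:localhost:80` does only works if something still answers on 127.0.0.1, whereas a plain "tcp" dial lets Go try whichever localhost address family actually has a listener:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// Illustrative comparison: "tcp4" forces the IPv4 stack (what the socat
// TCP4 address does), "tcp" lets Go try both 127.0.0.1 and ::1.
func main() {
	for _, network := range []string{"tcp4", "tcp"} {
		start := time.Now()
		conn, err := net.DialTimeout(network, "localhost:80", 5*time.Second)
		if err != nil {
			fmt.Printf("%-4s dial failed after %v: %v\n", network, time.Since(start), err)
			continue
		}
		fmt.Printf("%-4s connected to %s in %v\n", network, conn.RemoteAddr(), time.Since(start))
		conn.Close()
	}
}
```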

I can't reproduce this, @howardjohn. My observation is that the time between the events

May 05 13:09:19 kind-worker containerd[122]: time="2020-05-05T13:09:19.852531604Z" level=info msg="Executing port forwarding command \"/usr/bin/socat - TCP4:localhost:8001\" in network namespace \"/var/run/netns/cni-c9e71e3b-ea1e-7a62-10e4-bacbe9c81650\""

and

May 05 13:09:29 kind-worker containerd[122]: time="2020-05-05T13:09:29.904271707Z" level=info msg="Finish port forwarding for \"76d034f48d0320155b90dbff907bc5261416b20a88914f02f2d360393b393a7f\" port 8001"

is the time that the port-forward is actually in use. You can reproduce this with iperf, for example, since it lets you control how long data flows through the pipe:

  1. Run an iperf server in a pod:
kubectl run iperf --command --image=us.gcr.io/k8s-artifacts-prod/e2e-test-images/agnhost:2.13 --port=8001 -- /bin/sh -c "/usr/local/bin/iperf -s -p 8001"
  2. Forward the port:
kubectl port-forward deployment/iperf 8888:8001
  3. Run traffic through the forwarded port, in this case limited to 5 seconds:
iperf -c localhost -p 8888 -t 5

and we can see the following events in the containerd log on the node that is running the iperf pod:

May 05 13:15:17 kind-worker containerd[122]: time="2020-05-05T13:15:17.604221527Z" level=info msg="Executing port forwarding command \"/usr/bin/socat - TCP4:localhost:8001\" in network namespace \"/var/run/netns/cni-c9e71e3b-ea1e-7a62-10e4-bacbe9c81650\""
May 05 13:15:22 kind-worker containerd[122]: time="2020-05-05T13:15:22.649776547Z" level=info msg="Finish port forwarding for \"76d034f48d0320155b90dbff907bc5261416b20a88914f02f2d360393b393a7f\" port 8001"

if we run it for 17 seconds:

iperf -c localhost -p 8888 -t 17

the gap between the events is 17 seconds:

May 05 13:19:18 kind-worker containerd[122]: time="2020-05-05T13:19:18.908618796Z" level=info msg="Executing port forwarding command \"/usr/bin/socat - TCP4:localhost:8001\" in network namespace \"/var/run/netns/cni-c9e71e3b-ea1e-7a62-10e4-bacbe9c81650\""
May 05 13:19:35 kind-worker containerd[122]: time="2020-05-05T13:19:35.955743777Z" level=info msg="Finish port forwarding for \"76d034f48d0320155b90dbff907bc5261416b20a88914f02f2d360393b393a7f\" port 8001"

so, 36 seconds suggests something is using the pipe for that long. Just curious, what are you using to check the port-forward functionality?

However, there are several things wrong that we should fix in containerd/cri, like the socat and TCP4 usage; thanks for pointing them out here.
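For context, one address-family-agnostic direction (a simplified sketch under my own assumptions, not the actual containerd/cri change, which also has to enter the pod's network namespace): dial localhost with plain "tcp" instead of shelling out to socat with TCP4, and copy bytes both ways:

```go
package portforward

import (
	"fmt"
	"io"
	"net"
)

// forwardPort copies data between the port-forward stream and the target
// port, letting net.Dial pick IPv4 or IPv6 for "localhost". Simplified
// sketch only: a real CRI implementation also runs this inside the pod's
// network namespace and handles stream half-closes.
func forwardPort(stream io.ReadWriteCloser, port int32) error {
	conn, err := net.Dial("tcp", fmt.Sprintf("localhost:%d", port))
	if err != nil {
		return fmt.Errorf("dialing port %d: %w", port, err)
	}

	errCh := make(chan error, 2)
	go func() { _, e := io.Copy(conn, stream); errCh <- e }()
	go func() { _, e := io.Copy(stream, conn); errCh <- e }()

	// When either direction finishes, close both ends so the other copy
	// unblocks, then return the first error (if any).
	err = <-errCh
	conn.Close()
	stream.Close()
	<-errCh
	return err
}
```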

I tried 3 different HTTP-based applications: 2 Istio debug endpoints and httpbin. All 3 are in different languages, etc., but took the same 36s. I'll see if I can repro on another machine and with iperf.

I can reproduce on another machine, but that isn't saying much as both my machines are similar (gLinux).

With the iperf setup above:

$ iperf -c localhost -p 8888 -t 8
Handling connection for 8888
------------------------------------------------------------
Client connecting to localhost, TCP port 8888
TCP window size: 2.76 MByte (default)
------------------------------------------------------------
[  3] local 127.0.0.1 port 43664 connected with 127.0.0.1 port 8888
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 8.6 sec  44.0 MBytes  43.1 Mbits/sec

^ exits in 8 seconds

containerd logs show a 36s gap:

May 05 23:32:10 ipv6-control-plane containerd[185]: time="2020-05-05T23:32:10.111281653Z" level=info msg="Executing port forwarding command \"/usr/bin/socat - TCP4:localhost:8001\" in network namespace \"/var/run/netns/cni-6d4e45bf-b34a-1b5d-de61-ce0d9b69448e\""
May 05 23:32:46 ipv6-control-plane containerd[185]: time="2020-05-05T23:32:46.269239610Z" level=info msg="Finish port forwarding for \"af6f90a18ba77ec9b1b4cc09152fc0c4e6b9134a3bab31f63c18cfd815f9f220\" port 8001\""

/lifecycle active

/kind external

https://github.com/containerd/cri/pull/1470 just merged, but it will be a bit before this is in containerd I think. Once it's into containerd we can pull new binaries into kind and see how that goes.

https://github.com/containerd/containerd/commit/65df60b3c9cf08c1421f9d67fe23ca24f56a36fa this is now in containerd @ HEAD, will see about pulling it in.

triggered another nightly build, pulling it in via https://github.com/kubernetes-sigs/kind/pull/1599

I think this should be fixed if using a node image built with HEAD

we had to roll back containerd due to https://github.com/kubernetes-sigs/kind/issues/1634

blocked on https://github.com/kubernetes-sigs/kind/issues/1637; we're making progress on the issue.

The first issue in #1637 was fixed, but then we discovered that the containerd upgrade was also the cause of flakiness in our CI. We think we've tracked that down to an upstream timeout change, reproduced in non-kind-based CI as well, but there's no ETA on getting it fixed.

Once that's done we can look at rolling forward again to pick up the fix for this issue. Until then we're stuck at containerd <= 1.3.4 (we're fairly certain the bug was introduced shortly after that release, and the flake is due to errors deleting containers ...)

this should be fixed now.
