Kubeadm: HA: improvements to the load balancer documentation

Created on 25 Jul 2019 · 33 comments · Source: kubernetes/kubeadm

i'm starting to see more reports of people having issues with setting up LB in a cloud provider, see:
https://github.com/kubernetes/website/issues/14258
https://stackoverflow.com/questions/56768956/how-to-use-kubeadm-init-configuration-parameter-controlplaneendpoint/57121454#57121454
i also saw a report on the VMware internal slack the other day.

the common problems are:

  • confusion about L4 vs L7 load balancers, L4 should be sufficient.
  • the LB's api-server health checks failing because the LB is not configured with SSL/TLS
  • possibly related - using an older version of kubeadm that does not have the config-map retry logic

ideally we should document more of these LB aspects / best practices in our HA doc, even if LBs are out of scope for kubeadm:
https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/
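(As a point of reference, the guide boils down to pointing kubeadm at the LB address rather than at any single api-server. A minimal sketch follows; the DNS name, port and join credentials are placeholders, not values from the guide.)

# on the first control plane node
kubeadm init --control-plane-endpoint "lb.example.com:6443" --upload-certs

# on the remaining control plane nodes, using the join command printed by init
kubeadm join lb.example.com:6443 --control-plane --certificate-key <key> \
    --token <token> --discovery-token-ca-cert-hash sha256:<hash>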

some of the comments here need discussion:
https://stackoverflow.com/a/57121454

1) It fails on the first master node where "kubeadm init" is executed because it tries to communicate with itself through the load balancer.
2) On all the other master nodes where "kubeadm join" is executed, there's a 1/N chance of failure when the load balancer selects the node itself and not any of the (N-1) nodes that are already in the cluster.

for 1, the primary CP node should be doing exactly that, as there are no other nodes yet.
2, on the other hand, is odd: when a new CP node is joining, it should start serving on the LB only after it has finished the control plane join process.

cc @ereslibre @aojea @fabriziopandini @RockyFu267
@kubernetes/sig-cluster-lifecycle

@rcythr related to your PR:
https://github.com/kubernetes/website/pull/15372

area/HA  help wanted  kind/design  kind/documentation  lifecycle/frozen  priority/backlog


All 33 comments

I've seen several people discussing this on slack over the last few days. The problem seems to primarily be lack of hairpin routing on Azure.

Here is the relevant azure documentation line about the limitation of their LB, and a workaround.

"Unlike public Load Balancers which provide outbound connections when transitioning from private IP addresses inside the virtual network to public IP addresses, internal Load Balancers do not translate outbound originated connections to the frontend of an internal Load Balancer as both are in private IP address space. This avoids potential for SNAT port exhaustion inside unique internal IP address space where translation is not required. The side effect is that if an outbound flow from a VM in the backend pool attempts a flow to frontend of the internal Load Balancer in which pool it resides and is mapped back to itself, both legs of the flow don't match and the flow will fail. If the flow did not map back to the same VM in the backend pool which created the flow to the frontend, the flow will succeed...There are several common workarounds for reliably achieving this scenario (originating flows from a backend pool to the backend pools respective internal Load Balancer frontend) which include either insertion of a proxy layer behind the internal Load Balancer or using DSR style rules. Customers can combine an internal Load Balancer with any 3rd party proxy or substitute internal Application Gateway for proxy scenarios limited to HTTP/HTTPS. While you could use a public Load Balancer to mitigate, the resulting scenario is prone to SNAT exhaustion and should be avoided unless carefully managed." (https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-overview).

So the current advice for HA of using the LB as the control plane endpoint is broken for azure.

It's not the most elegant solution, but I wonder if a piece of networking duct tape (i.e. an iptables rule) could help people avoid setting up their own load balancer on Azure.
Edit: after reading the stack overflow I see the poster solved it by doing exactly this.

I'll have to look a bit more at Microsoft's workaround suggestion.

It looks like this isn't just a problem with Azure. From AWS "Internal load balancers do not support hairpinning or loopback. When you register targets by instance ID, the source IP addresses of clients are preserved. If an instance is a client of an internal load balancer that it's registered with by instance ID, the connection succeeds only if the request is routed to a different instance. Otherwise, the source and destination IP addresses are the same and the connection times out." (https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-troubleshooting.html)

/assign
/lifecycle active

we have both azure and aws experts in k8s land, so we might be able to get more feedback on this.

To clarify, AWS EC2 NLBs do support hairpin connections provided that the targets are registered by IP address, as opposed to by instance ID.

The documentation quoted above doesn’t say that explicitly, but I can confirm that it works.

For AWS:

  • AWS Classic ELBs work regardless of Public/Private networking.
  • NLBs have the limitation mentioned above by @seh.
  • ALBs should be avoided, since the admin kubeconfig relies on client TLS cert authentication, which an L7 proxy terminating TLS would break

@justaugustus should be able to provide some more detail on Azure based on his work with the Azure provider for Cluster API

To clarify, AWS EC2 NLBs do support hairpin connections provided that the targets are registered by IP address, as opposed to by instance ID.

The documentation quoted above doesn’t say that explicitly, but I can confirm that it works.

It is important to note that autoscaling groups register targets by instance ID. We transitioned from NLBs to ELBs because of this.

They do if you tell them to. We use ASGs, but we run an initialization procedure that registers the instance by its IP address with a discoverable target group, and unregisters it as the machine shuts down (sometimes inconvenient when rebooting).
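(For illustration only, not the exact procedure described above: registering and deregistering an instance by IP with the AWS CLI might look roughly like this, assuming a target group with target-type "ip" and a known ARN.)

# hypothetical ARN; the target group must be discovered or passed in out of band
TG_ARN="arn:aws:elasticloadbalancing:eu-central-1:123456789012:targetgroup/apiserver/abc123"
MY_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)

# on boot: register this instance by IP so NLB hairpinning works
aws elbv2 register-targets --target-group-arn "$TG_ARN" --targets Id="$MY_IP"

# on shutdown: remove the IP target again
aws elbv2 deregister-targets --target-group-arn "$TG_ARN" --targets Id="$MY_IP"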

It's not the most elegant solution, but I wonder if a piece of networking duct tape (i.e. an iptables rule) could help people avoid setting up their own load balancer on Azure.
Edit: after reading the stack overflow I see the poster solved it by doing exactly this.

I'll have to look a bit more at Microsoft's workaround suggestion.

Hi! I'm the poster from Stackoverflow. We abandoned the iptables hack and went for HAProxy. We are still working on it though. It complicates everything, and this is our first time having to configure such a piece of software (how do we add and remove nodes? autoscaling? VRRP?). For us it feels like we are in "Kubernetes the hard way" land, even if Kubeadm tries its best to help :)

We are also wondering if going for Azure Application Gateway (L7) would work?

Thanks for paying attention to this guys!

confusion about L4 vs L7 load balancers, L4 should be sufficient.

An L4 load balancer should be sufficient in most cases, but since there is no heartbeat or activity sent at all during kubectl logs -f and kubectl exec -it, the LB will close the connection after a default or configured timeout (usually the client or server timeout, if using haproxy). This is how it looks; I just tried it using kind:

~ > kubectl exec -it nginx-554b9c67f9-r9ls6 bash
root@nginx-554b9c67f9-r9ls6:/# ⏎
~ >
~ > kubectl logs -f nginx-554b9c67f9-r9ls6
...
127.0.0.1 - - [26/Jul/2019:21:20:28 +0000] "GET /favicon.ico HTTP/1.1" 404 555 "http://localhost:8000/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36" "-"
F0726 22:21:48.325924    1750 helpers.go:114] error: unexpected EOF
goroutine 1 [running]:
k8s.io/kubernetes/vendor/k8s.io/klog.stacks(0x2ce2301, 0x3, 0xc00057e100, 0x44)
    vendor/k8s.io/klog/klog.go:900 +0xb1
k8s.io/kubernetes/vendor/k8s.io/klog.(*loggingT).output(0x2ce2360, 0xc000000003, 0xc0003f0000, 0x2b05bc4, 0xa, 0x72, 0x0)
    vendor/k8s.io/klog/klog.go:815 +0xe6
k8s.io/kubernetes/vendor/k8s.io/klog.(*loggingT).printDepth(0x2ce2360, 0x3, 0x2, 0xc0007d17c8, 0x1, 0x1)
    vendor/k8s.io/klog/klog.go:718 +0x12b
k8s.io/kubernetes/vendor/k8s.io/klog.FatalDepth(...)
    vendor/k8s.io/klog/klog.go:1295
k8s.io/kubernetes/pkg/kubectl/cmd/util.fatal(0xc00073c020, 0x15, 0x1)
    pkg/kubectl/cmd/util/helpers.go:92 +0x1d2
k8s.io/kubernetes/pkg/kubectl/cmd/util.checkErr(0x1baef60, 0xc0000a2060, 0x19e5c98)
    pkg/kubectl/cmd/util/helpers.go:171 +0x90f
k8s.io/kubernetes/pkg/kubectl/cmd/util.CheckErr(...)
    pkg/kubectl/cmd/util/helpers.go:114
k8s.io/kubernetes/pkg/kubectl/cmd/logs.NewCmdLogs.func2(0xc000710a00, 0xc00026a960, 0x1, 0x3)
    pkg/kubectl/cmd/logs/logs.go:147 +0x1da
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute(0xc000710a00, 0xc00026a930, 0x3, 0x3, 0xc000710a00, 0xc00026a930)
    vendor/github.com/spf13/cobra/command.go:760 +0x2ae
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc00044e780, 0x19e5e80, 0xc0000c6000, 0x5)
    vendor/github.com/spf13/cobra/command.go:846 +0x2ec
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute(...)
    vendor/github.com/spf13/cobra/command.go:794
main.main()
    cmd/kubectl/kubectl.go:50 +0x1eb

goroutine 18 [chan receive]:
k8s.io/kubernetes/vendor/k8s.io/klog.(*loggingT).flushDaemon(0x2ce2360)
    vendor/k8s.io/klog/klog.go:1035 +0x8b
created by k8s.io/kubernetes/vendor/k8s.io/klog.init.0
    vendor/k8s.io/klog/klog.go:404 +0x6c

goroutine 5 [syscall]:
os/signal.signal_recv(0x0)
    GOROOT/src/runtime/sigqueue.go:139 +0x9c
os/signal.loop()
    GOROOT/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.init.0
    GOROOT/src/os/signal/signal_unix.go:29 +0x41

goroutine 6 [select]:
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x19e7da0, 0x12a05f200, 0x0, 0x1, 0xc0000407e0)
    staging/src/k8s.io/apimachinery/pkg/util/wait/wait.go:164 +0x181
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0x19e7da0, 0x12a05f200, 0xc0000407e0)
    staging/src/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by k8s.io/kubernetes/pkg/kubectl/util/logs.InitLogs
    pkg/kubectl/util/logs/logs.go:51 +0x96
~ >

To reproduce, the only thing needed is to create an HA cluster with kind and call kubectl logs -f or kubectl exec on a pod.
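(For reference, a rough way to reproduce this with kind is sketched below; the exact config schema may vary between kind versions, and the pod name is a placeholder.)

cat <<EOF > kind-ha.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
EOF
kind create cluster --config kind-ha.yaml

# kind fronts multiple control-plane nodes with a haproxy container, so an
# idle streaming request goes through that LB and eventually hits its timeout:
kubectl logs -f <some-pod>   # <some-pod> is a placeholder; pick any pod that logs nothing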

I don't have a good solution if there's no activity at all. At SUSE we fixed this in the past by using HAProxy as an L7 load balancer, so we could set specific per-API-endpoint timeouts (thus, exec and logs would have no timeouts), but it certainly isn't the best approach, as this solution comes with maintenance and documentation overhead, since the LB would be terminating the TLS connections.

I think this is worth some investigation to understand what could be done and/or documented regarding the load balancer.

I don't have a good solution if there's no activity at all. At SUSE we fixed this in the past by using HAProxy as an L7 load balancer, so we could set specific per-API-endpoint timeouts (thus, exec and logs would have no timeouts), but it certainly isn't the best approach, as this solution comes with maintenance and documentation overhead, since the LB would be terminating the TLS connections.

who is closing the connection? you can always set keepalives to keep the connection up.
or is the problem that it balances the connection to another backend? then we should use sticky sessions.

who is closing the connection?

HAProxy is closing it.

or is the problem that it balances the connection to another backend? then we should sticky sessions

There's no need, you can reach any apiserver safely, as far as I can tell.

have you considered the "option clitcpka" and "option srvtcpka" on haproxy?

have you considered the "option clitcpka" and "option srvtcpka" on haproxy?

Really worth taking a look (and maybe adding them as defaults in the haproxy created by kind?). If this works as expected we could also document it in kubeadm, for users who want to use haproxy.

I will have a look and report back, thanks @aojea!

I will have a look and report back, thanks @aojea!

No luck; tcpka, or clitcpka plus srvtcpka, still lead to the same behavior: haproxy closes the connection after the default 50s configured by kind. I think we should investigate this a little deeper. Increasing timeouts is always an option, but there's never a timeout that will fit everyone.
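(For anyone who just wants to push the timeouts out as a stopgap, a rough haproxy sketch follows; the values and backend addresses are arbitrary examples, not recommendations, and kind manages its own haproxy config, so this is only illustrative.)

frontend kube-apiserver
    bind *:6443
    mode tcp
    option tcplog
    timeout client 1h        # idle "logs -f" / "exec" streams survive up to 1h
    default_backend kube-apiserver

backend kube-apiserver
    mode tcp
    option tcp-check
    balance roundrobin
    timeout server 1h
    server cp0 172.18.0.2:6443 check   # example control plane addresses
    server cp1 172.18.0.3:6443 check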

nice explanation about haproxy here https://stackoverflow.com/a/32635324/7794348

It's not the most elegant solution, but I wonder if a piece of networking duct tape (i.e. an iptables rule) could help people avoid setting up their own load balancer on Azure.
Edit: after reading the stack overflow I see the poster solved it by doing exactly this.
I'll have to look a bit more at Microsoft's workaround suggestion.

Hi! I'm the poster from Stackoverflow. We abandoned the iptables hack and went for HAProxy. We are still working on it though. It complicates everything, and this is our first time having to configure such a piece of software (how do we add and remove nodes? autoscaling? VRRP?). For us it feels like we are in "Kubernetes the hard way" land, even if Kubeadm tries its best to help :)

We are also wondering if going for Azure Application Gateway (L7) would work?

Thanks for paying attention to this guys!

For this Azure Application Gateway (L7) topic,
we'd like to see whether the application gateway (with an internal frontend private IP) can be used as a control-plane endpoint instead of HAProxy.

1) HAProxy approach
We tried this 3-layer setup to make an HA multi-master control plane work on Azure.

  Azure internal L4 loadbalancer (HAProxy virtual IP) <- control-plane endpoint
                      |
                      |
        HAProxy vm scaleset (2 nodes)  (L4 load balancer)
                      |
                      |
         k8s master vm scaleset (3 nodes)

Azure doesn't seem to support a VRRP floating IP, so we had to create redundant HAProxy nodes under an L4 load balancer to get a virtual IP.

This works, but it complicates our infrastructure.

2) Azure application gateway (L7) approach ?

If we could use an Azure Application Gateway (i.e. an internal L7 load balancer) as the control-plane endpoint, it would make things simpler, like this:

        Azure internal L7 application gateway (auto-scale)  <- control-plane endpoint
                      |
                      |
         k8s master vm scaleset (3 nodes)

But it seems a bit complicated to set up SSL certificates and keys for end-to-end api-server SSL communication.
Are there any references on how to set up SSL certificates for an internal L7 load balancer, so that api-server HTTPS requests to the k8s masters work through the L7 load balancer?

thanks for the comments and discussion on this,
please have a look at this pending PR, so that we can consolidate the kubeadm HA LB information that users should know about.

https://github.com/kubernetes/website/pull/15411

WRT Azure LBs it seems that the cloud provider needs to make some adjustments to make it easier for users.

why doesn't kubectl logs / the logs API have ping/pong / keepalive, or perform a reconnect?

why doesn't kubectl logs / the logs API have ping/pong / keepalive, or perform a reconnect?

I'm preparing a patch for this exactly (sending pings on a regular configurable basis). I will post the proposed PR here.

adding help-wanted label.

Completely forgot that I was going to link my PR on this issue: https://github.com/kubernetes/kubernetes/pull/81179. It has already been closed because this will be implemented directly in golang (possibly https://github.com/golang/net/pull/55). Whenever that happens I will check if all use cases are covered.

The golang solution is definitely the right place to handle the generic use case, at least for sending occasional HTTP/2 ping frames over the wire to the server end. On my PR I stumbled across some issues, mostly regarding how golang hides the real transport being used behind the RoundTrip interface, and how the http/2 logic is bundled as a private implementation inside net/http.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/lifecycle frozen

so this issue ended up accumulating a collection of different (or related) LB problems including lack of hairpin in Azure (maybe this works nowadays), issues in haproxy, improvements in golang with respect to sending ping frames for http2 (this seems to have merged), and potential improvement of exposing the ping interval in client-go (https://github.com/kubernetes/kubernetes/pull/81179#issuecomment-538054720 seems viable still).

also we still are not covering the following exactly in our docs:

confusion about L4 vs L7 load balancers, L4 should be sufficient.
the LB's api-server health checks failing because the LB is not configured with SSL/TLS
possibly related - using an older version of kubeadm that does not have the config-map retry logic

but we do have a dedicated LB doc here now:
https://github.com/kubernetes/kubeadm/blob/master/docs/ha-considerations.md

so basically with the combination of our k/website docs:
https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/

we are telling the users to use the guide we give them for VIP/LB, or they are on their own when setting up an LB in a cloud provider.

i don't think there is anything substantially actionable for the kubeadm team in this ticket at this point. but if you think there is something, let's log separate concise tickets.

or maybe just send PRs for the new doc.

thanks for the discussion.
/close

@neolit123: Closing this issue.


@neolit123 I'm running into this exact issue (the hairpin routing one) trying to set up private clusters with Cluster API on Azure: https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/974#issuecomment-713200730

Has anyone in this thread found a workaround to get kubeadm init to work when using an Azure load balancer? If not, should I open a new issue to track this?

Slack thread: https://kubernetes.slack.com/archives/C2P1JHS2E/p1603230818143600

@CecileRobertMichon I chose "the very hacky iptables duct tape" (with a heavy heart) but have had no issues in the last couple of months (AWS/NLB)

@pschulten can you please outline your iptables routing hack in a comment here?

@CecileRobertMichon
we have LB related documentation here:
https://github.com/kubernetes/kubeadm/blob/master/docs/ha-considerations.md

we could add a section in there with solutions for the hairpin problem, but this feels like something that someone who experienced the problem should contribute, and not the kubeadm maintainers, since none of the active kubeadm maintainers have experienced it. so happy to reopen this ticket and assign you or someone else, but one of the reasons it was closed was that nobody stepped up to write the docs...

@mbert you might have an opinion about this too.

sure, it's just something someone mentioned in a related issue.
the systemd service unit:

[Unit]
Description=Routes the IP of the NLB in the same subnet to loopback address (because internal NLB is unable to do hairpinning)
DefaultDependencies=no
After=docker.service

[Service]
Type=oneshot
ExecStart=/opt/bin/internal-nlb-hack.sh

[Install]
WantedBy=multi-user.target

impl (nlb_name is injected from the outside):

#!/bin/bash
# Look up the private IP of the internal NLB's network interface in this
# instance's availability zone, then DNAT traffic destined for it to the
# loopback address (because the internal NLB cannot hairpin back to the
# instance that originated the connection).
nlbname=${nlb_name}
az=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
# Resolve the NLB's ARN suffix by name, then find the ENI it owns in this AZ.
nlbid=$(docker run --rm -e AWS_REGION=eu-central-1 -e AWS_DEFAULT_REGION=eu-central-1 mesosphere/aws-cli elbv2 describe-load-balancers --names "$nlbname" --query 'LoadBalancers[*].LoadBalancerArn' --output text | sed 's#.*/\(.*\)$#\1#')
nlbip=$(docker run --rm -e AWS_REGION=eu-central-1 -e AWS_DEFAULT_REGION=eu-central-1 mesosphere/aws-cli ec2 describe-network-interfaces --filters Name=description,Values="*$nlbid" Name=availability-zone,Values=$az --query 'NetworkInterfaces[*].PrivateIpAddresses[*].PrivateIpAddress' --output text)

if [ -z "$nlbid" ] || [ -z "$nlbip" ]; then
  printf "Unable to find fronting NLB ip for AZ: %s" "$az" | systemd-cat --priority=err
  exit 1
fi

# Redirect this node's own traffic to the NLB frontend IP back to localhost,
# where the local kube-apiserver is listening.
printf "iptables -t nat -A OUTPUT -p all -d %s -j DNAT --to-destination 127.0.0.1" "$nlbip" | systemd-cat --priority=warning
iptables -t nat -A OUTPUT -p all -d "$nlbip" -j DNAT --to-destination 127.0.0.1
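(Not part of the script above: a hypothetical sanity check once the rule is in place, assuming the NLB DNS name resolves to that frontend IP, could be the following.)

# confirm the DNAT rule is installed, then hit the LB address from the node itself
iptables -t nat -L OUTPUT -n | grep DNAT
curl -k https://<nlb-dns-name>:6443/healthz   # placeholder DNS name; expect "ok" if anonymous /healthz access is allowed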

thanks!

@neolit123 I'm hesitant to document "solutions" at this point since all the possibilities are hacky workarounds. Ideally kubeadm would let us optionally either a) use the local api endpoint for the API Server check or b) skip that check altogether (I prefer option a).

I'm thinking of using the iptables workaround for now to unblock CAPZ but I'm happy to contribute a proposal/implementation if the above is something the maintainers would consider. I was mostly trying to see if this was already possible, but looks like it's not?

Creating a NAT rule for the external IP is indeed probably the best way to resolve the issue. I'm not sure there is a good way to automate it for all the various combinations that would need to be supported, though.

For example:

  • Is the host system using iptables, ebtables, etc?
  • How to discover what the external IP should be for Azure LB, AWS NLB, etc?

I do think external orchestration systems (such as Cluster API) could automate these bits because they would know more about the systems being orchestrated.

It probably would help to add some generic documentation to the docs around this.

All that said, I do agree with @CecileRobertMichon that it makes sense to allow kubeadm to use the local endpoints, especially since that would open up the possibility of having a workflow where having the LB configured is no longer a pre-requisite, and can be added as a post-installation step.
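As a purely hypothetical illustration of what such generic documentation could sketch (the frontend IP below is a placeholder; discovering it is exactly the cloud-specific part):

# hairpin workaround: send this node's own traffic for the internal LB frontend
# back to the local apiserver instead of out through the LB
LB_FRONTEND_IP="10.0.0.100"   # placeholder internal LB frontend IP
iptables -t nat -A OUTPUT -p tcp -d "$LB_FRONTEND_IP" --dport 6443 \
    -j DNAT --to-destination 127.0.0.1:6443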

I'm hesitant to document "solutions" at this point since all the possibilities are hacky workarounds. Ideally kubeadm would let us optionally either a) use the local api endpoint for the API Server check or b) skip that check altogether (I prefer option a).

as mentioned on slack and on https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/974#issuecomment-713641626 i do not disagree with the idea.
