Longhorn: [Question] run longhorn workloads on dedicated nodes only (taints, tolerations, affinity, nodeSelector)

Created on 23 Jul 2020 · 23 Comments · Source: longhorn/longhorn

hi!

We run Longhorn, installed via the Rancher catalog, on our cluster. We would like to run Longhorn workloads (instance manager, engines, etc.) only on dedicated nodes, not on every node. The reason is that we may add more nodes to the cluster and then remove those nodes or run experiments on them, which results in Longhorn failures (the engine gets stuck in the deploying phase if we shut down a node that used to have Longhorn running). So far I can't find a way to restrict Longhorn to certain nodes.

Please advise,
Anton

question

All 23 comments

Longhorn needs to be running on a node to provide storage on that node, so normally it's not recommended to limit the deployment to a subset of the nodes.

However, can you elaborate on the issue of the engine getting stuck in the deploying phase? Longhorn is designed to scale well with the cluster, so if there is a problem with that part, we want to understand why it happened and potentially fix it.

So the scenario was (not exact, but I will be running more tests soon and will likely have better steps to reproduce):

  1. Longhorn is installed via Rancher. Everything is stable.
  2. Add node(s) to the cluster. Longhorn gets installed on them.
  3. Taint the nodes to isolate them for heavy jobs (these do not require Longhorn, but it's already installed).
  4. Nodes get randomly rebooted/killed due to resource exhaustion or kernel bugs.
  5. The Longhorn engine image becomes stuck in 'deploying' status, and we have intermittent issues on other nodes.
  6. Longhorn is stabilised by reinstalling the deployment.

So our idea is to be able to tell Longhorn which nodes it should run on. Then, when we spin up nodes that don't need it, we don't have to reinstall it to make it respect taints that were set after Longhorn was already installed.

  1. Using taints with effect NoExecute will evict the existing Longhorn workloads on the node immediately. Rebooting or shutting down the tainted nodes then won't affect Longhorn.
  2. Rebooting/killing the nodes on which the Longhorn workloads are running does lead to the Longhorn engine image going into 'deploying' status. In this case, waiting for the Longhorn workloads to come back is enough and you don't need to reinstall Longhorn. If the Longhorn workload recovery somehow gets stuck, it may be a bug, and you can file an issue for it.
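For reference, a minimal sketch of what the NoExecute taint from point 1 looks like on the Node object. The node name and the taint key/value are hypothetical placeholders, not anything from this thread, and in practice the taint is usually added with `kubectl taint` rather than by editing the Node manifest.

```yaml
# Fragment of a Node object showing a NoExecute taint (illustrative only).
apiVersion: v1
kind: Node
metadata:
  name: heavy-jobs-node-1      # hypothetical node name
spec:
  taints:
    - key: dedicated           # hypothetical taint key
      value: heavy-jobs        # hypothetical taint value
      effect: NoExecute        # evicts running pods that do not tolerate the taint
```

Pods without a matching toleration, including the Longhorn workloads unless Longhorn is configured to tolerate this taint, are evicted from the node once the taint is in place.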

Another use case for supporting taints and tolerations is a cluster with windows and linux nodes where you want to use longhorn only on the linux nodes.

It doesn't work with Rancher when Windows nodes are enabled. I bet people on Longhorn don't talk to people on the Rancher team lol

I just discovered there is a setting called "taintToleration".
If you set it like taintToleration: "cattle.io/os=linux:NoSchedule"
it will work if you deploy from the Helm chart directly, but it doesn't work from the Rancher App catalog.
I also taint the Windows nodes. (I really think Rancher tainting Linux nodes is very annoying.)
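A sketch of how that setting might look in a Helm values file. The taint string is the one quoted in the comment above; placing it under a `defaultSettings` block is an assumption about the chart layout and should be checked against the chart version in use.

```yaml
# values.yaml fragment for the Longhorn Helm chart (sketch, not verified
# against a specific chart version).
defaultSettings:
  # Format is key=value:Effect, copied from the comment above.
  taintToleration: "cattle.io/os=linux:NoSchedule"
```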

@maxisam The taint toleration should work with the Rancher App too, once you set the default setting option?

@yasker I did. But somehow it doesn't work. And I can't find a way to make it select only Linux nodes, so I have to taint the Windows nodes. In the end, every node is tainted lol. I hope Rancher provides an option to taint either Linux or Windows nodes.

There is a known issue with the Longhorn toleration setting: it is applied only after the Longhorn workloads are up in the cluster. In other words, the toleration setting doesn't work and the Longhorn workloads cannot be deployed if all nodes are tainted before Longhorn launches.

@shuo-wu I don't think that will work. I believe what you describe is what I did the first time. I didn't taint the Windows nodes, and I tried to fix it by setting toleration and affinity after deployment. But the toleration setting didn't stick. For a moment it seemed to work, but it didn't in the end. I remember the UI part did work. I really think it should be an easy fix from Rancher's side: just don't taint it.

@maxisam

  1. Longhorn doesn't support Windows now.
  2. The default setting doesn't work for the Longhorn upgrade. That's a default setting for the upcoming Longhorn system. Hence modifying the Longhorn toleration setting via Rancher App won't stick when there is already an (old) Longhorn system in the cluster.

@shuo-wu I'm totally aware that Longhorn doesn't support Windows. I just need it to run on the Linux nodes in a hybrid (Windows/Linux) cluster.

About 2: I think I didn't make it clear enough. I changed the settings on the deployment/daemonsets, scaled them down to 0 and scaled them back up. I made 2 changes, affinity and toleration. The affinity part works but the toleration part doesn't.
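As an illustration of the two changes described above, a pod template fragment combining node affinity and a toleration could look like the sketch below. It uses the standard `kubernetes.io/os` node label and the `cattle.io/os=linux:NoSchedule` taint mentioned earlier in the thread; it is not the exact edit made here, and manual changes to Longhorn-managed workloads may not persist.

```yaml
# Pod template fragment (e.g. inside a DaemonSet or Deployment spec).
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/os   # well-known node label
                    operator: In
                    values:
                      - linux               # schedule only on Linux nodes
      tolerations:
        - key: cattle.io/os                 # taint key quoted earlier in the thread
          operator: Equal
          value: linux
          effect: NoSchedule
```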

Got it. Sorry for misunderstanding your comment. The 2nd issue will be tracked here: https://github.com/longhorn/longhorn/issues/1833

Longhorn needs to be running on a node to provide storage on that node, so normally it's not recommended to limit the deployment to a subset of the nodes.

However, can you elaborate on the issue of the engine getting stuck in the deploying phase? Longhorn is designed to scale well with the cluster, so if there is a problem with that part, we want to understand why it happened and potentially fix it.

Hi folks! I need more info about @yasker's answer above.
I was thinking of putting Longhorn on 3 dedicated nodes (they will exclusively handle volumes and replicas, the dashboard, etc.) so that all _other worker nodes_ could have their pods using PVs hosted on these 3 Longhorn nodes. According to @yasker's answer, this isn't possible?

I have a cluster with Longhorn on all nodes (3), but I also have a bad application that creates massive CPU spikes. When this happens every process suffers from CPU starvation, and Longhorn in particular 'gets lost': it stops scheduling, then starts rebuilding replicas (lots of disk pressure), and worst of all, many pods get their PVCs in read-only mode. Longhorn recovers by itself, but I have to restart all affected workloads.
That's why I was thinking about isolating Longhorn on dedicated nodes.

Sorry about this lengthy question. Any thoughts are appreciated.
Regards,
Fabio Carvalho

@FCarvMobil You need to set resource limits on your "bad apps".

Hi @maxisam. Those pods already have k8s resources set. The issue here is that they run with some parallelism, and I can't set the resources too low or they will take too long to complete.
Thanks for replying.

@FCarvMobil

You can have three dedicated nodes to provide storage, which means running replicas and so on. But Longhorn needs to run on every node to provide connectivity to the volume. So if any workload on a node needs access to a Longhorn volume, Longhorn needs to run there.

Based on what you described, you need to shield Longhorn from this situation. You can try setting a higher GuaranteedEngineCPU to see if it helps; it translates into a CPU request for the key Longhorn pods. Note that changing the value will restart all the volumes, so scale down the workloads and detach the volumes first.
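A sketch of raising that setting through Helm values. The `defaultSettings.guaranteedEngineCPU` key name is an assumption about the chart layout, and the number is purely illustrative, not a recommendation. As noted earlier in the thread, default settings only apply to a fresh Longhorn install; on an existing system the setting has to be changed in Longhorn itself.

```yaml
# values.yaml fragment for the Longhorn Helm chart (sketch).
defaultSettings:
  # Assumed key name for the GuaranteedEngineCPU setting.
  # 0.5 is an illustrative value only; size it for your own nodes.
  guaranteedEngineCPU: 0.5
```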

Hi @yasker, thanks for replying.

Yes, the bottom line is to protect Longhorn from this situation. I'm already using the GuaranteedEngineCPU setting at the default suggested value (according to the docs it should fit), but I'll try increasing it (and I'm aware that I need to bring down the workloads first).

Sorry if I missed something about the architecture in the docs. If I should run Longhorn on all nodes to provide connectivity to volumes, does this imply having replicas scheduled _on each node_? If not, how can I control which nodes allow or forbid replica scheduling?
[UPDATE] I went through the docs again (this time with a refreshed mind, I think): is it this? https://longhorn.io/docs/1.0.0/references/settings/#kubernetes-taint-toleration

Again, thanks!

The default value is a bit conservative, since if we set it too high the instance manager may fail to start in the user's environment.

We're also working on #1691, which should give us some guidelines on how to set GuaranteedEngineCPU according to how many volumes will be used on the node.

Also, Longhorn manager needs to run on every node, but you can choose which nodes provide the storage (i.e. have replicas created). Any node without a disk set in Longhorn won't be used for replica scheduling. You can do that on the node page in the UI. Or you can set an annotation on the Kubernetes node object to customize the default disk for the node. See https://longhorn.io/docs/1.0.2/advanced-resources/default-disk-and-node-config/

Taint toleration is indeed meant for dedicating nodes to Longhorn storage, but it's not required. See https://longhorn.io/docs/1.0.2/advanced-resources/deploy/taint-toleration/
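A sketch of the node annotation approach mentioned above, as recalled from the linked default-disk documentation. The label and annotation names should be verified against that page; the node name and disk path are hypothetical.

```yaml
# Node metadata fragment customizing Longhorn's default disk (sketch).
apiVersion: v1
kind: Node
metadata:
  name: storage-node-1                                # hypothetical node name
  labels:
    # Assumed label telling Longhorn to read the disk config annotation.
    node.longhorn.io/create-default-disk: "config"
  annotations:
    # Assumed annotation name; /mnt/longhorn is a hypothetical path.
    node.longhorn.io/default-disks-config: '[{"path":"/mnt/longhorn","allowScheduling":true}]'
```

Nodes that end up without any Longhorn disk are not used for replica scheduling, which matches the explanation above.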

I was thinking of putting Longhorn on 3 dedicated nodes (they will exclusively handle volumes and replicas, the dashboard, etc.) so that all other worker nodes could have their pods using PVs hosted on these 3 Longhorn nodes.

This would be great. But taints and tolerations don't seem to be the right way to do it for me. IMHO node selectors or affinity would be the right way.

Wait, it seems that https://github.com/longhorn/longhorn/issues/583 has fixed this by using labels.

Hi guys! Thank you all for replying.

I've seen that I can disable replica scheduling in the Dashboard, but the nodeSelector approach is preferable, so I don't have to manually configure anything in the Dashboard. I just add a new node with the appropriate label and it's done. #583 is a good reference!

I'll come back here with my findings. I'm still waiting for internal approval to build a test scenario in our cloud provider account.
Again, thanks for replying.
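A sketch of the label-driven setup this refers to, assuming the mechanism from #583 is exposed in the Helm chart as `defaultSettings.createDefaultDiskLabeledNodes` (an assumption to verify against the chart and docs for your version).

```yaml
# values.yaml fragment for the Longhorn Helm chart (sketch).
defaultSettings:
  # Assumed key: create a default disk only on nodes carrying the label
  #   node.longhorn.io/create-default-disk: "true"
  # so unlabeled nodes get no disk and therefore host no replicas.
  createDefaultDiskLabeledNodes: true
```

With this in place, adding a storage node would come down to applying that label to it, with no manual step in the Dashboard.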
