Rke: Optimize rke to handle creation of more than 1000 nodes

Created on 11 Oct 2018 · 4 Comments · Source: rancher/rke

Currently rke can be used with any number of nodes; however, running it on a large number of nodes raises some issues, including too many open socket files and a slow overall process.

The following is a list of optimizations that can be implemented to speed up the process and handle this number of nodes:

  • [x] Optimize the Docker dialer to connect to nodes concurrently. #966
  • [x] Cluster state deployer scale issue https://github.com/rancher/rke/issues/958
  • [x] Fix the sync for node labels and taints to handle a large number of nodes. https://github.com/rancher/rke/issues/957
  • [x] Control the number of threads used when creating k8s planes. #969
  • ~[ ] Add the ability to reconcile only added/deleted nodes instead of running rke on all nodes.~ Moved to #974

This should be part of the rke optimization and refactoring issue https://github.com/rancher/rancher/issues/15975

kind/bug

All 4 comments

The fixes for this issue are mostly performance tweaks to handle a large number of nodes. Several areas changed internally, so the common use cases should be verified with the latest build.

Notable changes in UX:

  • The tunnel up phase is much faster now.
  • The sync labels and taints phase is faster and more stable with a large number of nodes.
  • State save phase is much faster now.

@moelsayed Will these improvements also help rke remove? On a ~800 node cluster, it takes over 30 minutes.

My observation is that rke remove is slow on large clusters and runs its tasks sequentially. Filed another issue to address this: https://github.com/rancher/rke/issues/981

Tested yesterday with a master server build (b3249730) on an environment with 793 nodes. Total time for rke up to run was 13 minutes, 24 seconds. With rke v0.1.10, it took 35 minutes, 5 seconds on the same environment.
