Ray: [tune] set omp_num_threads = num_cpus per trial

Created on 15 Aug 2020 · 4 Comments · Source: ray-project/ray

What is the problem?

Having successfully set up a cluster with m5.4xlarge worker nodes, each having 16 CPUs, I noticed that my runs actually consume exactly one CPU on the worker nodes. Using resources_per_trial={"cpu": 16} does allocate one trial per worker as expected, but when I SSH into any single worker, only one core gets utilized.

Running the same calculation on the head node without Ray does use multiple cores, so I'm pretty sure the problem is that Ray somehow decided not to use all the available compute.

Reproduction

import time
from ray import tune
import numpy as np
import ray

ray.init("172.31.29.113:6379")
def run_me(config):
    s = 3000
    for iter in range(100):
        # computationally expensive step (utilizes more than a single core on the head node running sequentially)
        np.matmul(np.random.randn(s, s), np.random.randn(s, s))

        tune.report(hello="world", ray="tune")

analysis = tune.run(run_me, num_samples=40, resources_per_trial={"cpu": 16})

Config

Python 3.6, Ray 0.8.7, m5.4xlarge worker nodes on AWS.

  • [X] I have verified my script runs in a clean environment and reproduces the issue.
  • [X] I have verified the issue also occurs with the latest wheels.
Labels: P3, question, tune

All 4 comments

@urimerhav ah, looks like there was an unexpected env var propagation. Can you try:

```
from ray import tune
import ray

ray.init("172.31.29.113:6379")

def run_me(config):
    import os; os.environ["OMP_NUM_THREADS"] = 16
    import numpy as np
    s = 3000
    for iter in range(100):
        # computationally expensive step (utilizes more than a single core on the head node running sequentially)
        np.matmul(np.random.randn(s,s), np.random.randn(s,s))

        tune.report(hello="world", ray="tune")
```

Thanks a ton for the super quick reply! Still no dice. I changed one thing about your script, which I'm guessing is a typo: os.environ["OMP_NUM_THREADS"] = "16", since ints aren't allowed.

Utilization is still at exactly one core.

Just to make sure we're on the same page, here's what I currently have, and it's still capping CPU utilization, unfortunately:

import time
from ray import tune
import ray
ray.init("172.31.29.113:6379")
def run_me(config):
    import os
    os.environ["OMP_NUM_THREADS"] = "16"
    import numpy as np
    s = 3000
    for iter in range(100):

        a = np.matmul(np.random.randn(s,s),np.random.randn(s,s))

        tune.report(hello="world", ray="tune")

analysis = tune.run(run_me, num_samples=40, resources_per_trial={"cpu": 16})

No problem!

Got it; as a diagnostic step, can you do the following:
If you're starting ray by hand:

# when you start ray for each node of your cluster:
export OMP_NUM_THREADS=16
ray start --...

If you're using the cluster launcher, you'll want to set this in the yaml:

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; export OMP_NUM_THREADS=16; ray start --head --num-redis-shards=10 --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; export OMP_NUM_THREADS=16; ray start --address=$RAY_HEAD_IP:6379

And then run:
```
import time
from ray import tune
import ray

ray.init("172.31.29.113:6379")

def run_me(config):
    import numpy as np
    s = 3000
    for iter in range(100):
        a = np.matmul(np.random.randn(s,s), np.random.randn(s,s))

        tune.report(hello="world", ray="tune")

analysis = tune.run(run_me, num_samples=40, resources_per_trial={"cpu": 16})
```

Alright!

I tested this and it works! I did indeed use the cluster launcher, and setting the variable in the YAML did the trick.

Am I correct that I now need to set OMP_NUM_THREADS to match the number of cores on my workers? I can live with that, but I think it's a pretty important issue to resolve, as it's only a matter of time until the env variable and the number of CPUs go out of sync.

As a temporary workaround, if anyone stumbles on this thread, here's a solution that should be more stable, since it lets each worker node introspect how many cores it has.

In the worker node startup script, add this line:
export OMP_NUM_THREADS=$(cat /proc/cpuinfo | grep processor | wc -l)

This is tested and working on Ubuntu.
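
For anyone who wants to confirm the setting actually reached the trial processes, here's a minimal sketch (not from the original thread; the trainable name check_omp is made up for illustration) that prints the OMP_NUM_THREADS each trial sees next to that node's core count:

```
import os
from ray import tune
import ray

ray.init("172.31.29.113:6379")

def check_omp(config):
    # Each trial prints the env var it inherited from the worker startup
    # command; the output shows up in that trial's logs.
    print("OMP_NUM_THREADS =", os.environ.get("OMP_NUM_THREADS"),
          "| cores on this node:", os.cpu_count())
    tune.report(checked=1)

tune.run(check_omp, num_samples=4, resources_per_trial={"cpu": 16})
```

If the printed value matches the core count on every worker, the export in the startup commands propagated correctly.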
