Ray: No logging of SSH failures of Ray nodes?

Created on 18 Feb 2019 · 15 comments · Source: ray-project/ray

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): RHEL 6 HPC environment
  • Ray installed from (source or binary): pip
  • Ray version: 0.6.3
  • Python version: 3.7
  • Exact command to reproduce:

I've written a custom NodeProvider to add autoscaling support for Univa Grid Engine HPC environments. I'm having the hardest time, though, debugging errors around nodes stuck permanently in the "waiting-for-ssh" state. At one point I got a Ray head node properly qsub'd and running via ray up blarg.yml -y (which then goes on to spawn two initial workers as configured), but the workers are not able to complete that dummy SSH step back to the head for whatever reason. Unfortunately, I've run through all the logs that Ray and the job script spit out and have had no luck finding specifics on why the workers' SSH attempts failed. Is there a particular way to get the reason(s) behind the SSH failures exported to the log files?

Now that I've been messing with this a bit (e.g., creating new SSH keys), I'm having trouble even getting the qsub'd ray start to connect to the ray up process (the same "waiting-for-ssh" problem). I'm sure it's some configuration issue on my end, but because I get so little feedback on the failure mode, this has been painful to debug, to say the least, haha.

Command to start the head node that my NodeProvider qsubs from ray up:

ray start --block --num-cpus 1 --temp-dir /home/myusername/ray_cluster//clusters/pipeline/ray-pipeline-head --head --redis-port 31236 --autoscaling-config /home/myusername/ray_cluster//autoscaling_config.yml

Note that I can see that the head node job and the two worker jobs have been submitted and are running on the cluster.

Log outputs:

monitor.out

This is the output from the head node's NodeProvider (my custom debugging output). It's too long to paste in this issue, so it's here:

https://gist.github.com/zbarry/0282655b01bc51f247a7a8fa6477e9e9

Cluster / job log

[INFO] [1550510728] [Mon Feb 18 12:25:28 EST 2019] [16366] [18509843] Starting the workflow
/home/myusername/miniconda3/envs/ray/bin/python
/home/myusername/miniconda3/envs/ray/bin/ray
2019-02-18 12:25:32,108 INFO scripts.py:288 -- Using IP address <redacted> for this node.
2019-02-18 12:25:32,113 INFO node.py:278 -- Process STDOUT and STDERR is being redirected to /home/myusername/ray_cluster//clusters/pipeline/ray-pipeline-head/logs.
2019-02-18 12:25:32,231 INFO services.py:396 -- Waiting for redis server at 127.0.0.1:31236 to respond...
2019-02-18 12:25:32,348 INFO services.py:396 -- Waiting for redis server at 127.0.0.1:48029 to respond...
2019-02-18 12:25:32,349 INFO services.py:798 -- Starting Redis shard with 10.0 GB max memory.
2019-02-18 12:25:32,384 WARNING services.py:1206 -- Warning: Capping object memory store to 20.0GB. To increase this further, specify `object_store_memory` when calling ray.init() or ray start.
2019-02-18 12:25:32,385 INFO services.py:1360 -- Starting the Plasma object store with 20.0 GB memory using /dev/shm.
2019-02-18 12:25:32,433 INFO scripts.py:319 -- 
Started Ray on this node. You can add additional nodes to the cluster by calling

    ray start --redis-address redacted:31236

from the node you wish to add. You can connect a driver to the cluster from Python by running

    import ray
    ray.init(redis_address="redacted:31236")

If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run

    ray stop

debug_state.txt

NodeManager:
InitialConfigResources: {GPU,0.000000},{CPU,1.000000}
ClusterResources:
9314824bc109cd85b897aacc8d6b917b2bcc80d2: 
- total: {CPU,1.000000},{GPU,0.000000}
- avail: {CPU,1.000000},{GPU,0.000000}
499b4c9a4193e9825cc616cf2051285f576b6dc0: 
- total: {CPU,1.000000},{GPU,0.000000}
- avail: {CPU,1.000000},{GPU,0.000000}
8e0ad117daf498c80775ec6bf916f9ab098b0969: 
- total: {GPU,0.000000},{CPU,1.000000}
- avail: {GPU,0.000000},{CPU,1.000000}
ObjectManager:
- num local objects: 0
- num active wait requests: 0
- num unfulfilled push requests: 0
- num pull requests: 0
- num buffered profile events: 0
ObjectDirectory:
- num listeners: 0
- num eviction entries: 0
ObjectStoreNotificationManager:
- num adds processed: 0
- num removes processed: 0
BufferPool:
- get buffer state map size: 0
- create buffer state map size: 0
ConnectionPool:
- num message send connections: 0
- num transfer send connections: 0
- num avail message send connections: 0
- num avail transfer send connections: 0
- num message receive connections: 0
- num transfer receive connections: 0
AsyncGcsClient:
- TaskTable: num lookups: 0, num adds: 0
- ActorTable: num lookups: 0, num appends: 0
- TaskReconstructionLog: num lookups: 0, num appends: 0
- TaskLeaseTable: num lookups: 0, num adds: 0
- HeartbeatTable: num lookups: 0, num adds: 5901
- ErrorTable: num lookups: 0, num appends: 0
- ProfileTable: num lookups: 0, num appends: 0
- ClientTable: num lookups: 0, num appends: 1, cache size: 3, num removed: 0
- DriverTable: num lookups: 0, num appends: 0
WorkerPool:
- num workers: 1
- num drivers: 0
SchedulingQueue:
- num placeable tasks: 0
- num waiting tasks: 0
- num ready tasks: 0
- num running tasks: 0
- num infeasible tasks: 0
- num waiting for actor creation tasks: 0
- num tasks blocked: 0
ReconstructionPolicy:
- num reconstructing: 0
TaskDependencyManager:
- task dep map size: 0
- task req map size: 0
- req objects map size: 0
- local objects map size: 0
- pending tasks map size: 0
LineageCache:
- committed tasks: 0
- child map size: 0
- num subscribed tasks: 0
- lineage size: 0
ActorRegistry:
- num live actors: 0
- num reconstructing actors: 0
- num dead actors: 0
- max num handles: 0
RemoteConnections:
499b4c9a4193e9825cc616cf2051285f576b6dc0: 
- bytes read: 0
- bytes written: 0
- num async writes: 0
- num sync writes: 0
- writing: 0
- pending async bytes: 0
9314824bc109cd85b897aacc8d6b917b2bcc80d2: 
- bytes read: 0
- bytes written: 0
- num async writes: 0
- num sync writes: 0
- writing: 0
- pending async bytes: 0
DebugString() time ms: 1

monitor.err

2019-02-18 12:25:33,473 INFO autoscaler.py:351 -- StandardAutoscaler: {'cluster_name': 'pipeline', 'min_workers': 2, 'max_workers': 10, 'initial_workers': 2, 'docker': {'image': '', 'container_name': ''}, 'target_utilization_fraction': 0.8, 'idle_timeout_minutes': 10, 'provider': {'type': 'external', 'module': 'pipeline.hpc.UGENodeProvider'}, 'auth': {'ssh_user': 'myusername', 'ssh_private_key': '/home/myusername/.ssh/id_dsa'}, 'head_node': {}, 'worker_nodes': {}, 'file_mounts': {}, 'setup_commands': [], 'head_setup_commands': [], 'worker_setup_commands': [], 'head_start_ray_commands': [], 'worker_start_ray_commands': []}
/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/paramiko/ecdsakey.py:164: CryptographyDeprecationWarning: Support for unsafe construction of public numbers from encoded data will be removed in a future version. Please use EllipticCurvePublicKey.from_encoded_point
  self.ecdsa_curve.curve_class(), pointinfo
2019-02-18 12:25:34,188 INFO autoscaler.py:597 -- StandardAutoscaler: 0/2 target nodes (0 pending) (bringup=True)
2019-02-18 12:25:34,189 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=-1 Mean=-1 Max=-1, NumNodesConnected=0, NumNodesUsed=0.0, ResourceUsage=, TimeSinceLastHeartbeat=Min=-1 Mean=-1 Max=-1
2019-02-18 12:25:34,189 INFO autoscaler.py:588 -- StandardAutoscaler: Launching 2 new nodes
2019-02-18 12:25:34,798 INFO autoscaler.py:597 -- StandardAutoscaler: 0/2 target nodes (2 pending) (bringup=True)
2019-02-18 12:25:34,798 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=-1 Mean=-1 Max=-1, NumNodesConnected=0, NumNodesUsed=0.0, ResourceUsage=, TimeSinceLastHeartbeat=Min=-1 Mean=-1 Max=-1
2019-02-18 12:25:39,122 INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (bringup=True)
2019-02-18 12:25:39,122 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=4 Mean=4 Max=4, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=0.0/1.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:25:43,406 INFO autoscaler.py:512 -- StandardAutoscaler: ray-pipeline-worker_hbfty_8z: Runtime state is None, want 0c8a44b274b693301ee007458fceccd7c210c4b4
2019-02-18 12:25:44,613 INFO autoscaler.py:512 -- StandardAutoscaler: ray-pipeline-worker_lclafqgd: Runtime state is None, want 0c8a44b274b693301ee007458fceccd7c210c4b4
2019-02-18 12:25:44,614 INFO updater.py:126 -- NodeUpdater: ray-pipeline-worker_hbfty_8z: Updating to 0c8a44b274b693301ee007458fceccd7c210c4b4
2019-02-18 12:25:44,614 INFO updater.py:126 -- NodeUpdater: ray-pipeline-worker_lclafqgd: Updating to 0c8a44b274b693301ee007458fceccd7c210c4b4
2019-02-18 12:25:45,373 INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:25:45,373 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=10 Mean=10 Max=10, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=0.0/1.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:25:46,386 INFO updater.py:88 -- NodeUpdater: Waiting for IP of ray-pipeline-worker_lclafqgd...
2019-02-18 12:25:46,481 INFO updater.py:88 -- NodeUpdater: Waiting for IP of ray-pipeline-worker_hbfty_8z...
2019-02-18 12:25:50,393 INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:25:50,393 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=15 Mean=15 Max=15, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=0.0/1.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:25:55,470 INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:25:55,470 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=20 Mean=20 Max=20, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=0.0/1.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:25:57,619 INFO updater.py:88 -- NodeUpdater: Waiting for IP of ray-pipeline-worker_lclafqgd...
2019-02-18 12:25:57,657 INFO updater.py:88 -- NodeUpdater: Waiting for IP of ray-pipeline-worker_hbfty_8z...
2019-02-18 12:25:58,199 INFO log_timer.py:21 -- NodeUpdater: ray-pipeline-worker_lclafqgd: Got IP [LogTimer=12396ms]
2019-02-18 12:25:58,210 INFO updater.py:153 -- NodeUpdater: ray-pipeline-worker_lclafqgd: Waiting for SSH...
/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/paramiko/ecdsakey.py:164: CryptographyDeprecationWarning: Support for unsafe construction of public numbers from encoded data will be removed in a future version. Please use EllipticCurvePublicKey.from_encoded_point
  self.ecdsa_curve.curve_class(), pointinfo
2019-02-18 12:25:58,222 INFO log_timer.py:21 -- NodeUpdater: ray-pipeline-worker_hbfty_8z: Got IP [LogTimer=12404ms]
2019-02-18 12:25:58,251 INFO updater.py:153 -- NodeUpdater: ray-pipeline-worker_hbfty_8z: Waiting for SSH...
/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/paramiko/ecdsakey.py:164: CryptographyDeprecationWarning: Support for unsafe construction of public numbers from encoded data will be removed in a future version. Please use EllipticCurvePublicKey.from_encoded_point
  self.ecdsa_curve.curve_class(), pointinfo
Monitor: could not find ip for client 9314824bc109cd85b897aacc8d6b917b2bcc80d2
Monitor: could not find ip for client 499b4c9a4193e9825cc616cf2051285f576b6dc0

~~ snip (the two "could not find ip" lines above repeat dozens more times) ~~
/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/paramiko/ecdsakey.py:164: CryptographyDeprecationWarning: Support for unsafe construction of public numbers from encoded data will be removed in a future version. Please use EllipticCurvePublicKey.from_encoded_point
  self.ecdsa_curve.curve_class(), pointinfo
2019-02-18 12:26:00,507 INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:26:00,507 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=1 Mean=9 Max=25, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/paramiko/ecdsakey.py:164: CryptographyDeprecationWarning: Support for unsafe construction of public numbers from encoded data will be removed in a future version. Please use EllipticCurvePublicKey.from_encoded_point
  self.ecdsa_curve.curve_class(), pointinfo
2019-02-18 12:26:05,560 INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:26:05,560 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=6 Mean=14 Max=30, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:26:10,580 INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:26:10,580 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=11 Mean=19 Max=35, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/paramiko/ecdsakey.py:164: CryptographyDeprecationWarning: Support for unsafe construction of public numbers from encoded data will be removed in a future version. Please use EllipticCurvePublicKey.from_encoded_point
  self.ecdsa_curve.curve_class(), pointinfo
2019-02-18 12:26:15,686 INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:26:15,687 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=16 Mean=24 Max=40, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/paramiko/ecdsakey.py:164: CryptographyDeprecationWarning: Support for unsafe construction of public numbers from encoded data will be removed in a future version. Please use EllipticCurvePublicKey.from_encoded_point
  self.ecdsa_curve.curve_class(), pointinfo
2019-02-18 12:26:20,706 INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:26:20,706 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=21 Mean=29 Max=45, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0

~~ snip ~~

2019-02-18 12:30:51,015 INFO log_timer.py:21 -- NodeUpdater: ray-pipeline-worker_lclafqgd: Got SSH [LogTimer=292805ms]
2019-02-18 12:30:51,015 INFO log_timer.py:21 -- NodeUpdater: ray-pipeline-worker_lclafqgd: Applied config 0c8a44b274b693301ee007458fceccd7c210c4b4 [LogTimer=306401ms]
2019-02-18 12:30:51,015 ERROR updater.py:138 -- NodeUpdater: ray-pipeline-worker_lclafqgd: Error updating Unable to SSH to node
2019-02-18 12:30:51,086 INFO log_timer.py:21 -- NodeUpdater: ray-pipeline-worker_hbfty_8z: Got SSH [LogTimer=292835ms]
2019-02-18 12:30:51,086 INFO log_timer.py:21 -- NodeUpdater: ray-pipeline-worker_hbfty_8z: Applied config 0c8a44b274b693301ee007458fceccd7c210c4b4 [LogTimer=306472ms]
2019-02-18 12:30:51,086 ERROR updater.py:138 -- NodeUpdater: ray-pipeline-worker_hbfty_8z: Error updating Unable to SSH to node
Exception in thread Thread-48:
Traceback (most recent call last):
  File "/home/myusername/miniconda3/envs/ray/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 141, in run
    raise e
  File "/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 130, in run
    self.do_update()
  File "/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 188, in do_update
    assert ssh_ok, "Unable to SSH to node"
AssertionError: Unable to SSH to node

Exception in thread Thread-47:
Traceback (most recent call last):
  File "/home/myusername/miniconda3/envs/ray/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 141, in run
    raise e
  File "/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 130, in run
    self.do_update()
  File "/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 188, in do_update
    assert ssh_ok, "Unable to SSH to node"
AssertionError: Unable to SSH to node

2019-02-18 12:30:54,114 INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2019-02-18 12:30:54,114 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=295 Mean=303 Max=319, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=3 Max=5
2019-02-18 12:30:56,944 INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2019-02-18 12:30:56,944 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=297 Mean=305 Max=321, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:31:03,112 INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2019-02-18 12:31:03,112 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=304 Mean=312 Max=328, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:31:09,433 INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2019-02-18 12:31:09,434 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=310 Mean=318 Max=334, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0

~~ snip ~~

redis-shard_0.out

16540:C 18 Feb 2019 12:25:32.248 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
16540:C 18 Feb 2019 12:25:32.248 # Redis version=5.0.3, bits=64, commit=d2b6db3d, modified=0, pid=16540, just started
16540:C 18 Feb 2019 12:25:32.248 # Configuration loaded
16540:M 18 Feb 2019 12:25:32.249 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
16540:M 18 Feb 2019 12:25:32.249 # Server can't set maximum open files to 10032 because of OS error: Operation not permitted.
16540:M 18 Feb 2019 12:25:32.249 # Current maximum open files is 4096. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
16540:M 18 Feb 2019 12:25:32.249 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
16540:M 18 Feb 2019 12:25:32.249 # Server initialized
16540:M 18 Feb 2019 12:25:32.249 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.

redis.out

16536:C 18 Feb 2019 12:25:32.135 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
16536:C 18 Feb 2019 12:25:32.135 # Redis version=5.0.3, bits=64, commit=d2b6db3d, modified=0, pid=16536, just started
16536:C 18 Feb 2019 12:25:32.135 # Configuration loaded
16536:M 18 Feb 2019 12:25:32.136 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
16536:M 18 Feb 2019 12:25:32.136 # Server can't set maximum open files to 10032 because of OS error: Operation not permitted.
16536:M 18 Feb 2019 12:25:32.136 # Current maximum open files is 4096. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
16536:M 18 Feb 2019 12:25:32.136 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
16536:M 18 Feb 2019 12:25:32.136 # Server initialized
16536:M 18 Feb 2019 12:25:32.136 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.

All 15 comments

Forgot to mention that if I get a head node successfully started (with ray up exiting successfully), I can then manually ray start --redis-address theheadip:redisport followed by ray.init(redis_address='theheadip:redisport'); dostuff() and it works perfectly fine... it's just when qsub'ing the workers that they cannot connect to the head...

E: it appears that manually running ray start for a worker only works if you run it from the same physical (HPC) node as the Ray head node, which is pretty much in line with this being an SSH issue.

Are you able to manually run the SSH commands from the head to the worker nodes? Usually this means it's some network misconfiguration: either the ports are firewalled off from the head or the wrong IP address is being used.

Yeah, that's why it's so weird. Manually ssh'ing between nodes works perfectly fine (and the cluster is managed by a decent-sized group here; with the dozens of people using it for all sorts of applications, you'd think a strange situation like this would have been discovered a long time ago). The wrong IP is possible, I guess, though the workers themselves seem to be able to reach the head's Redis server. I bet what's happening is that the environment presented to a qsub'd job is missing something the SSH functionality here depends on, though I have basically no hard evidence for that assertion, haha.

I'm going to go through Ray piece-by-piece in the ssh-related stuff and add more logging to try to figure this out.
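
(A minimal sketch of the kind of extra logging I have in mind - assuming the updater's SSH wait really does go through paramiko, as the CryptographyDeprecationWarning lines in monitor.err suggest:)

import logging

import paramiko

# Guess on my part: if the autoscaler's SSH check uses paramiko (the
# CryptographyDeprecationWarning traces in monitor.err come from it),
# dumping paramiko's DEBUG log to a file may finally show why the
# "waiting-for-ssh" connection attempts fail.
paramiko.util.log_to_file("/tmp/paramiko_ssh_debug.log", level=logging.DEBUG)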

Hmm... one thing of concern that I don't think is causing my issue is the hard-coded Ray socket tmp directory:

https://github.com/ray-project/ray/blob/729d0b2825072eac998fa2f5d1d5782fc1b30752/python/ray/autoscaler/updater.py#L25

In an HPC environment with multiple people using Ray, if someone creates that directory, the permissions would be such that anyone else using /tmp on that physical node would not be able to access the folder, right? Is it absolutely necessary to hard-code it? Maybe this should be a separate issue.

@zbarry I made this change to speed up NodeUpdaters by re-using persistent connections. It's hardcoded primarily because it's an enormous speed improvement and not default SSH behaviour.

We could just as easily make a /tmp/username/ray_ssh_sockets or use something in the home folder.
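
Something along these lines, for instance (just a sketch of the idea, not Ray's actual code - the path is a placeholder):

import getpass
import os

# Per-user SSH control-socket directory, so two users sharing /tmp on the
# same physical HPC node don't collide on directory permissions.
def per_user_ssh_control_dir() -> str:
    path = os.path.join("/tmp", getpass.getuser(), "ray_ssh_sockets")
    os.makedirs(path, exist_ok=True)
    return path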

I hadn't considered that use case, apologies!

@ls-daniel - ahh got it. Do you think there would be collisions between Ray nodes if two of yours happen to be on the same physical node?

Still scratching my head over this, unfortunately.

It looks like I've gotten it to the point that the ray-node-status of my workers hits up-to-date instead of sitting forever at waiting-for-ssh.

monitor.out of head node:

    ||| retrieve_nodes(): node: Node(node_id='ray-pipeline-head', internal_ip='10.145.56.57', external_ip='10.145.56.57', tags={'ray-node-type': 'head', 'ray-launch-config': 'f3e609ac09196cce5508991098153c71801c1e6f', 'ray-node-name': 'ray-pipeline-head', 'ray-node-status': 'up-to-date', 'ray-runtime-config': '7ebe8f7bcf176040828eb34fe0d9dc4cb27b05e9'}, state='running', job_id='33293', is_head=True)


    ||| retrieve_nodes(): node: Node(node_id='ray-pipeline-worker_e9ghb8g_', internal_ip='10.145.56.82', external_ip='10.145.56.82', tags={'ray-node-name': 'ray-pipeline-worker_e9ghb8g_', 'ray-node-type': 'worker', 'ray-node-status': 'up-to-date', 'ray-launch-config': '09a878425125cf009cb9ce10f8be28b66308068d', 'ray-runtime-config': 'bd7a05b53aa43f3ce8e92deb596387b7c7161462'}, state='running', job_id='33294', is_head=False)


    ||| retrieve_nodes(): node: Node(node_id='ray-pipeline-worker_zii0ix9u', internal_ip='10.145.56.83', external_ip='10.145.56.83', tags={'ray-node-name': 'ray-pipeline-worker_zii0ix9u', 'ray-node-type': 'worker', 'ray-node-status': 'up-to-date', 'ray-launch-config': '09a878425125cf009cb9ce10f8be28b66308068d', 'ray-runtime-config': 'bd7a05b53aa43f3ce8e92deb596387b7c7161462'}, state='running', job_id='33295', is_head=False)


    ||| is_running(ray-pipeline-worker_e9ghb8g_): is running: True

(Node is just a simple dataclass for info):

from dataclasses import dataclass
from typing import Dict


@dataclass
class Node:
    """ Info storage class describing a node. """

    node_id: str  # arbitrary node name
    internal_ip: str  # IP the node can be found on
    external_ip: str  # IP the node can be found on - same as internal
    tags: Dict[str, str]  # Ray builtin (see `ray_tags` above) and custom tags
    state: str  # 'running', 'pending', 'terminated'
    job_id: str  # job ID given by the scheduler to a node
    is_head: bool = False  # True if this is the head node
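
(For reference, here's how my provider fills one of these in for the head node - values taken from the retrieve_nodes() output above, tags trimmed:)

head = Node(
    node_id="ray-pipeline-head",
    internal_ip="10.145.56.57",
    external_ip="10.145.56.57",
    tags={"ray-node-type": "head", "ray-node-status": "up-to-date"},
    state="running",
    job_id="33293",
    is_head=True,
)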

monitor.err progression of StandardAutoscaler logs, showing that the two clients have connected and appear registered & ready(?):

StandardAutoscaler [2019-02-25 17:41:32.148312]: 0/2 target nodes (0 pending) (bringup=True)
 - NodeIdleSeconds: Min=-1 Mean=-1 Max=-1
 - NumNodesConnected: 0
 - NumNodesUsed: 0.0
 - ResourceUsage: 
 - TimeSinceLastHeartbeat: Min=-1 Mean=-1 Max=-1
StandardAutoscaler [2019-02-25 17:41:32.959982]: 0/2 target nodes (2 pending) (bringup=True)
 - NodeIdleSeconds: Min=-1 Mean=-1 Max=-1
 - NumNodesConnected: 0
 - NumNodesUsed: 0.0
 - ResourceUsage: 
 - TimeSinceLastHeartbeat: Min=-1 Mean=-1 Max=-1
StandardAutoscaler [2019-02-25 17:41:36.919676]: 1/2 target nodes (2 pending) (bringup=True)
 - NodeIdleSeconds: Min=3 Mean=3 Max=3
 - NumNodesConnected: 1
 - NumNodesUsed: 0.0
 - ResourceUsage: 0.0/1.0 b'CPU', 0.0/0.0 b'GPU'
 - TimeSinceLastHeartbeat: Min=1 Mean=1 Max=1
StandardAutoscaler: ray-pipeline-worker_zii0ix9u has runtime state None, want bd7a05b53aa43f3ce8e92deb596387b7c7161462
Connected (version 2.0, client OpenSSH_7.4)
Authentication (publickey) successful!
NodeUpdater: Applied config bd7a05b53aa43f3ce8e92deb596387b7c7161462 to node ray-pipeline-worker_e9ghb8g_
NodeUpdater: Updating ray-pipeline-worker_zii0ix9u to bd7a05b53aa43f3ce8e92deb596387b7c7161462, logging to (console)
StandardAutoscaler [2019-02-25 17:42:00.505860]: 2/2 target nodes (0 pending) (2 updating)
 - NodeIdleSeconds: Min=0 Mean=14 Max=27
 - NumNodesConnected: 2
 - NumNodesUsed: 0.0
 - ResourceUsage: 0.0/2.0 b'CPU', 0.0/4.0 b'GPU'
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
StandardAutoscaler [2019-02-25 17:42:07.878001]: 2/2 target nodes (0 pending) (1 updating)
 - NodeIdleSeconds: Min=8 Mean=21 Max=34
 - NumNodesConnected: 2
 - NumNodesUsed: 0.0
 - ResourceUsage: 0.0/2.0 b'CPU', 0.0/4.0 b'GPU'
 - TimeSinceLastHeartbeat: Min=0 Mean=4 Max=8
StandardAutoscaler [2019-02-25 17:51:17.141634]: 2/2 target nodes (0 pending)
 - NodeIdleSeconds: Min=540 Mean=560 Max=583
 - NumNodesConnected: 3
 - NumNodesUsed: 0.0
 - ResourceUsage: 0.0/3.0 b'CPU', 0.0/8.0 b'GPU'
 - TimeSinceLastHeartbeat: Min=1 Mean=1 Max=1

I would think this means all's good, but apparently that is not the case (the IPs are different because I resubmitted the jobs):

The ray up output shows it waiting on Redis servers at two different ports for some reason:

Using IP address 10.145.56.59 for this node.
Process STDOUT and STDERR is being redirected to /home/barryza1/ray_cluster//clusters/pipeline/ray-pipeline-head/logs.
Waiting for redis server at 127.0.0.1:31236 to respond...
Waiting for redis server at 127.0.0.1:37112 to respond...
Starting Redis shard with 10.0 GB max memory.
Warning: Capping object memory store to 20.0GB. To increase this further, specify `object_store_memory` when calling ray.init() or ray start.
Starting the Plasma object store with 20.0 GB memory using /dev/shm.

======================================================================
View the web UI at http://localhost:8888/notebooks/ray_ui.ipynb?token=6b6044d37c2614d264f5d23c800b9c9d60f7d215c40f48b3
======================================================================


Started Ray on this node. You can add additional nodes to the cluster by calling

    ray start --redis-address 10.145.56.59:31236

from the node you wish to add. You can connect a driver to the cluster from Python by running

    import ray
    ray.init(redis_address="10.145.56.59:31236")

If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run

    ray stop

I tried ray.init on both of those ports, and I get this message on both of them (two running sounds weird...):

>>> ray.init(redis_address="10.145.56.59:37112")
Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/dtk-pipeline/ray/python/ray/worker.py", line 1508, in init
    redis_address, node_ip_address, redis_password=redis_password)
  File "/data/dtk-pipeline/ray/python/ray/worker.py", line 1237, in get_address_info_from_redis
    redis_address, node_ip_address, redis_password=redis_password)
  File "/data/dtk-pipeline/ray/python/ray/worker.py", line 1205, in get_address_info_from_redis_helper
    "Redis has started but no raylets have registered yet.")
Exception: Redis has started but no raylets have registered yet.
>>> ray.init(redis_address="10.145.56.59:31236")
Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/dtk-pipeline/ray/python/ray/worker.py", line 1508, in init
    redis_address, node_ip_address, redis_password=redis_password)
  File "/data/dtk-pipeline/ray/python/ray/worker.py", line 1237, in get_address_info_from_redis
    redis_address, node_ip_address, redis_password=redis_password)
  File "/data/dtk-pipeline/ray/python/ray/worker.py", line 1217, in get_address_info_from_redis_helper
    "Redis has started but no raylets have registered yet.")
Exception: Redis has started but no raylets have registered yet.

debug_state.txt for the head node:

NodeManager:
InitialConfigResources: {GPU,0.000000},{CPU,1.000000}
ClusterResources:
14c9fe370fc10f3565f3919695747c110f3e3e15: 
- total: {CPU,1.000000},{GPU,4.000000}
- avail: {CPU,1.000000},{GPU,4.000000}
b795a6f30cd449a5e8144e62d2ce9e7a49b14e07: 
- total: {CPU,1.000000},{GPU,4.000000}
- avail: {CPU,1.000000},{GPU,4.000000}
51ccce6da42c3a7df0fd2ec07af612ac15a37d7b: 
- total: {GPU,0.000000},{CPU,1.000000}
- avail: {GPU,0.000000},{CPU,1.000000}
ObjectManager:
- num local objects: 0
- num active wait requests: 0
- num unfulfilled push requests: 0
- num pull requests: 0
- num buffered profile events: 0
ObjectDirectory:
- num listeners: 0
- num eviction entries: 0
ObjectStoreNotificationManager:
- num adds processed: 0
- num removes processed: 0
BufferPool:
- get buffer state map size: 0
- create buffer state map size: 0
ConnectionPool:
- num message send connections: 0
- num transfer send connections: 0
- num avail message send connections: 0
- num avail transfer send connections: 0
- num message receive connections: 0
- num transfer receive connections: 0
AsyncGcsClient:
- TaskTable: num lookups: 0, num adds: 0
- ActorTable: num lookups: 0, num appends: 0
- TaskReconstructionLog: num lookups: 0, num appends: 0
- TaskLeaseTable: num lookups: 0, num adds: 0
- HeartbeatTable: num lookups: 0, num adds: 2901
- ErrorTable: num lookups: 0, num appends: 0
- ProfileTable: num lookups: 0, num appends: 0
- ClientTable: num lookups: 0, num appends: 1, cache size: 3, num removed: 0
- DriverTable: num lookups: 0, num appends: 0
WorkerPool:
- num workers: 1
- num drivers: 0
SchedulingQueue:
- num placeable tasks: 0
- num waiting tasks: 0
- num ready tasks: 0
- num running tasks: 0
- num infeasible tasks: 0
- num methods waiting for actor creation: 0
ReconstructionPolicy:
- num reconstructing: 0
TaskDependencyManager:
- task dep map size: 0
- task req map size: 0
- req objects map size: 0
- local objects map size: 0
- pending tasks map size: 0
LineageCache:
- committed tasks: 0
- child map size: 0
- num subscribed tasks: 0
- lineage size: 0
ActorRegistry:
- num live actors: 0
- num reconstructing actors: 0
- num dead actors: 0
- max num handles: 0
RemoteConnections:
14c9fe370fc10f3565f3919695747c110f3e3e15: 
- bytes read: 0
- bytes written: 0
- num async writes: 0
- num sync writes: 0
- writing: 0
- pending async bytes: 0
b795a6f30cd449a5e8144e62d2ce9e7a49b14e07: 
- bytes read: 0
- bytes written: 0
- num async writes: 0
- num sync writes: 0
- writing: 0
- pending async bytes: 0
DebugString() time ms: 1

Feel like I'm really close, but I have no idea what the next debugging step would be... There doesn't seem to be much more in the logs that would guide me. Any ideas?

Thinking about it, the most interesting thing popping up in the logs right now might be:

RemoteConnections:
14c9fe370fc10f3565f3919695747c110f3e3e15: 
- bytes read: 0
- bytes written: 0
- num async writes: 0
- num sync writes: 0
- writing: 0
- pending async bytes: 0
b795a6f30cd449a5e8144e62d2ce9e7a49b14e07: 
- bytes read: 0
- bytes written: 0
- num async writes: 0
- num sync writes: 0
- writing: 0
- pending async bytes: 0

I'm not sure what parts of Ray fill in these stats, but I guess there's your clear indication that the head can register & update these nodes but still can't communicate with them.

Not sure if this is the issue, but there are multiple ports that need to be open between the different machines.

For example, by default there are two Redis servers (one "primary" and one or more "non-primary" shards). The "object manager" and "node manager" on each machine also need their own port.

These ports can be specified in the ray start command with flags like --redis-port=6379, --redis-shard-ports=6380, --node-manager-port=12345, and --object-manager-port=12346. Note that the Redis ones are only relevant on the head node.
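
For illustration, pinning those ports might look something like this (the port numbers are just the example values above, and <head_ip> is a placeholder):

    # on the head node
    ray start --head --redis-port=6379 --redis-shard-ports=6380 --node-manager-port=12345 --object-manager-port=12346

    # on each worker node (the Redis flags are head-only)
    ray start --redis-address <head_ip>:6379 --node-manager-port=12345 --object-manager-port=12346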

For ray.init(), only the "primary" Redis address is relevant. That's the first one that is printed.

When you get the error message "Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?", can you check the logs on the same machine in /tmp/ray/session*/ (just the most recent session) and see if anything looks like it crashed?

Thanks for the advice. I think we've made a lot of progress on our end today. What we think is happening is actually a silent security (?) failure by design. It occurs during this check (get_address_info_from_redis_helper in worker.py):

    relevant_client = None
    for client_info in client_table:
        client_node_ip_address = client_info["NodeManagerAddress"]
        if (client_node_ip_address == node_ip_address or
            (client_node_ip_address == "127.0.0.1"
             and redis_ip_address == ray.services.get_node_ip_address())):
            relevant_client = client_info
            break
    if relevant_client is None:
        raise Exception(
            "Redis has started but no raylets have registered yet.")

If I run ray.init() from my workstation or from a cluster node that no Ray node is running on, that if statement never matches, whereas if I run a script containing ray.init() from any of the HPC nodes running either the Ray head node or a worker, the condition is met (and the computation proceeds correctly). Autoscaling and all that good stuff also happens in my UGE node provider, too, so that's really nice to see.

Now, I did the obvious dumb thing, which is to comment out that entire block and just replace it with:

    for client_info in client_table:
        relevant_client = client_info
        break

It didn't work (shocking), but what was interesting was the error message I got:

[{'ClientID': 'fbb5a9717637d374b38083006cba7715dc1b4b2a', 'IsInsertion': True, 'NodeManagerAddress': '10.145.56.49', 'NodeManagerPort': 34647, 'ObjectManagerPort': 40383, 'ObjectStoreSocketName': '/home/me/ray_cluster//clusters/pipeline/ray-pipeline-head/sockets/plasma_store', 'RayletSocketName': '/home/me/ray_cluster//clusters/pipeline/ray-pipeline-head/sockets/raylet', 'Resources': {'GPU': 0.0, 'CPU': 1.0}}, {'ClientID': 'c9ea48a358c79b97841edcbd881808d75b5c9dd3', 'IsInsertion': True, 'NodeManagerAddress': '10.145.56.40', 'NodeManagerPort': 35991, 'ObjectManagerPort': 42724, 'ObjectStoreSocketName': '/home/me/ray_cluster//clusters/pipeline/ray-pipeline-worker_4280p258/sockets/plasma_store', 'RayletSocketName': '/home/me/ray_cluster//clusters/pipeline/ray-pipeline-worker_4280p258/sockets/raylet', 'Resources': {'GPU': 0.0, 'CPU': 1.0}}, {'ClientID': '7556611c59b0dd82508cea5ca35a84e60819cec0', 'IsInsertion': True, 'NodeManagerAddress': '10.145.56.46', 'NodeManagerPort': 36860, 'ObjectManagerPort': 36938, 'ObjectStoreSocketName': '/home/me/ray_cluster//clusters/pipeline/ray-pipeline-worker_pva743dp/sockets/plasma_store', 'RayletSocketName': '/home/me/ray_cluster//clusters/pipeline/ray-pipeline-worker_pva743dp/sockets/raylet', 'Resources': {'GPU': 0.0, 'CPU': 1.0}}]
{'node_ip_address': '10.145.56.58', 'redis_address': '10.145.56.49:31236', 'object_store_address': '/home/me/ray_cluster//clusters/pipeline/ray-pipeline-head/sockets/plasma_store', 'raylet_socket_name': '/home/me/ray_cluster//clusters/pipeline/ray-pipeline-head/sockets/raylet', 'webui_url': None}
WARNING: Logging before InitGoogleLogging() is written to STDERR
E0304 16:15:53.664083 20065 io.cc:168] Connection to IPC socket failed for pathname /home/me/ray_cluster//clusters/pipeline/ray-pipeline-head/sockets/plasma_store, retrying 50 more times
E0304 16:15:53.764858 20065 io.cc:168] Connection to IPC socket failed for pathname /home/me/ray_cluster//clusters/pipeline/ray-pipeline-head/sockets/plasma_store, retrying 49 more times
E0304 16:15:53.865279 20065 io.cc:168] Connection to IPC socket failed for pathname /home/me/ray_cluster//clusters/pipeline/ray-pipeline-head/sockets/plasma_store, retrying 48 more times
E0304 16:15:53.965435 20065 io.cc:168] Connection to IPC socket failed for pathname /home/me/ray_cluster//clusters/pipeline/ray-pipeline-head/sockets/plasma_store, retrying 47 more times
E0304 16:15:54.065852 20065 io.cc:168] Connection to IPC socket failed for pathname /home/me/ray_cluster//clusters/pipeline/ray-pipeline-head/sockets/plasma_store, retrying 46 more times
E0304 16:15:54.165968 20065 io.cc:168] Connection to IPC socket failed for pathname /home/me/ray_cluster//clusters/pipeline/ray-pipeline-head/sockets/plasma_store, retrying 45 more times
E0304 16:15:54.266175 20065 io.cc:168] Connection to IPC socket failed for pathname /home/me/ray_cluster//clusters/pipeline/ray-pipeline-head/sockets/plasma_store, retrying 44 more times
E0304 16:15:54.366600 20065 io.cc:168] Connection to IPC socket failed for pathname /home/me/ray_cluster//clusters/pipeline/ray-pipeline-head/sockets/plasma_store, retrying 43 more times

This would seem to mirror https://github.com/modin-project/modin/issues/468 / https://github.com/ray-project/ray/issues/4193 . No idea if they are truly related. Note that our workstations' home directories are actually a mount of the HPC NFS (and this plasma store error also appears in the "running ray.init() on an HPC node where a ray node isn't running" scenario).

Is this some additional layer of security causing this, you think? There's a (to me) somewhat cryptic comment at the top of the get_address_info_from_redis_helper function:

# For this command to work, some other client (on the same machine as
# Redis) must have run "CONFIG SET protected-mode no".

Seems like this could be the reason (maybe?...).

Oh. This is because it's a shared-memory socket for Arrow, isn't it, meaning it could only possibly work if you connected to one on the same box (same memory) as your program?

That would explain why the check I tried commenting out exists...

Think I got it!

I believe that was exactly it. I came up with a goofy little solution for when you insist on calling ray.init from a computer that has no Ray nodes running on it: ray start a worker locally, point it at the head's Redis, and set --num-cpus to 0 so all your jobs get pushed to the real workers. :)

@ls-daniel - re: the socket thing above, do you think this could just become a ray start --socket-dir (or whatever) parameter?

Going to take a stab at the sockets thing above now that I can compile Ray (and run the unit tests), unless someone has a more appropriate idea for how to go about this. I think it might turn out to be essential to making Ray usable on a shared environment like HPC clusters.

This was caused by trying to run a driver from a computer that was not running a ray worker. It can be solved by simply starting a "dummy" worker on that computer - ray start --num-cpus 0 --redis-address yourheadnode:port
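
For concreteness, the workaround looks roughly like this (yourheadnode:port is a placeholder for the head node's primary Redis address):

# On the driver machine (shell), start a CPU-less local node first so the
# driver has a raylet/plasma store on the same box to attach to:
#
#     ray start --num-cpus 0 --redis-address yourheadnode:port
#
# Then, from Python on that same machine:
import ray

ray.init(redis_address="yourheadnode:port")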
