Ray: The worker node could not register node info successfully in VM

Created on 28 May 2020  ·  8 Comments  ·  Source: ray-project/ray


[autoscaler]

What is your question?

Similar to https://github.com/ray-project/ray/issues/8033.
When I enable debug mode, raylet.out on the worker node does not contain a log line like "Finished registering node info, status = OK, node id", which means the node did not register itself successfully.

But:

  1. I can access Redis from the worker with redis-cli and Python.
  2. The worker node receives ClientTableNotification from Redis.
  3. It cannot write its node info and heartbeat to the GCS successfully.
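The check described above (looking for the registration line in raylet.out) can be sketched as a small script. This is only an illustration: the marker string is copied from the report, its exact wording may differ between Ray versions, and the helper names `node_registered` and `check_raylet_log` are hypothetical, not part of Ray.

```python
from pathlib import Path

# Marker quoted in the report above; the exact wording is an assumption
# and may differ between Ray versions.
REGISTERED_MARKER = "Finished registering node info, status = OK"


def node_registered(log_text: str) -> bool:
    """Return True if any line of the raylet log contains the marker."""
    return any(REGISTERED_MARKER in line for line in log_text.splitlines())


def check_raylet_log(path: str) -> bool:
    """Read a raylet.out file (path supplied by the caller) and check it."""
    return node_registered(Path(path).read_text(errors="replace"))
```

Calling `check_raylet_log` with the path to the worker's raylet.out returns False when the registration line is missing, which matches the symptom reported here.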

Ray version and other system information (Python version, TensorFlow version, OS):
Ray: 0.8.5
OS: Ubuntu 18.04.2
Python 3.6.8

question

All 8 comments

Are you also using multi machines with Docker?

Not Docker, but similar technology: KVM and runc.

@ijrsvt Probably also related

@ijrsvt @rkooo567 any progress?

@songpengwei We have not been able to reproduce this--can you give us some more context about your cluster config, the config yaml and the output from raylet.out?

I ran into the same problem: worker nodes cannot be found after upgrading to 0.8.5.
In my case, the cause was the GCS server getting the wrong IP:

I0727 17:07:25.352876 52834 52834 global_state_accessor.cc:25] Redis server address = 10.251.231.121:6379, is test flag = 0
I0727 17:07:25.362498 52834 52834 redis_client.cc:146] RedisClient connected.
I0727 17:07:25.371081 52834 52834 redis_gcs_client.cc:88] RedisGcsClient Connected.
I0727 17:07:25.472491 52834 52834 service_based_gcs_client.cc:193] Reconnected to GCS server: 127.0.0.1:40531
I0727 17:07:25.472697 52834 52834 service_based_accessor.cc:91] Reestablishing subscription for job info.
I0727 17:07:25.472707 52834 52834 service_based_accessor.cc:421] Reestablishing subscription for actor info.
I0727 17:07:25.472715 52834 52834 service_based_accessor.cc:796] Reestablishing subscription for node info.
I0727 17:07:25.472728 52834 52834 service_based_accessor.cc:1068] Reestablishing subscription for task info.
I0727 17:07:25.472736 52834 52834 service_based_accessor.cc:1243] Reestablishing subscription for object locations.
I0727 17:07:25.472743 52834 52834 service_based_accessor.cc:1363] Reestablishing subscription for worker failures.
I0727 17:07:25.472752 52834 52834 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected.

The log shows "Reconnected to GCS server: 127.0.0.1:40531" (a loopback address), while the Redis server address is 10.251.231.121:6379.
If I manually set GcsServerAddress to 10.251.231.121:40531 in Redis, everything works.
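A rough way to spot this mismatch automatically is to scan the log for the two addresses and flag the case where the GCS server address is loopback but the Redis address is not. The sketch below assumes the log line wording shown above; the function name `find_gcs_address_mismatch` and the regular expressions are illustrative and not part of Ray.

```python
import ipaddress
import re

# Patterns based on the log lines quoted above (wording is an assumption
# and may change between Ray versions).
REDIS_RE = re.compile(r"Redis server address = ([\d.]+):(\d+)")
GCS_RE = re.compile(r"Reconnected to GCS server: ([\d.]+):(\d+)")


def find_gcs_address_mismatch(log_text):
    """Return (redis_ip, gcs_ip) if GCS is loopback but Redis is not, else None."""
    redis_match = REDIS_RE.search(log_text)
    gcs_match = GCS_RE.search(log_text)
    if not (redis_match and gcs_match):
        return None
    redis_ip, gcs_ip = redis_match.group(1), gcs_match.group(1)
    try:
        gcs_is_loopback = ipaddress.ip_address(gcs_ip).is_loopback
        redis_is_loopback = ipaddress.ip_address(redis_ip).is_loopback
    except ValueError:
        # The text matched the pattern but is not a valid IPv4 address.
        return None
    if gcs_is_loopback and not redis_is_loopback:
        return redis_ip, gcs_ip
    return None
```

A non-None result suggests that remote workers are being pointed at a GCS address they cannot reach, which is the situation described in this comment.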

Thank you @bbtfr! That solves my problem too.

@rkooo567, @ijrsvt This is probably fixed by #10004.
