[autoscaler]
Similar to https://github.com/ray-project/ray/issues/8033.
When I enable debug mode, I find that raylet.out on the worker node does not contain a log line like "Finished registering node info, status = OK, node id", which means the node did not register itself successfully.
But:
Ray version and other system information (Python version, TensorFlow version, OS):
Ray: 0.8.5
OS: Ubuntu 18.04.2
Python 3.6.8
Are you also using multiple machines with Docker?
Not Docker, but similar tech: KVM and runc.
@ijrsvt Probably also related
@ijrsvt @rkooo567 any progress?
@songpengwei We have not been able to reproduce this--can you give us some more context about your cluster config, the config yaml and the output from raylet.out?
I ran into the same problem: worker nodes cannot be found after upgrading to 0.8.5.
For me, it's caused by the GCS server getting the wrong IP:
I0727 17:07:25.352876 52834 52834 global_state_accessor.cc:25] Redis server address = 10.251.231.121:6379, is test flag = 0
I0727 17:07:25.362498 52834 52834 redis_client.cc:146] RedisClient connected.
I0727 17:07:25.371081 52834 52834 redis_gcs_client.cc:88] RedisGcsClient Connected.
I0727 17:07:25.472491 52834 52834 service_based_gcs_client.cc:193] Reconnected to GCS server: 127.0.0.1:40531
I0727 17:07:25.472697 52834 52834 service_based_accessor.cc:91] Reestablishing subscription for job info.
I0727 17:07:25.472707 52834 52834 service_based_accessor.cc:421] Reestablishing subscription for actor info.
I0727 17:07:25.472715 52834 52834 service_based_accessor.cc:796] Reestablishing subscription for node info.
I0727 17:07:25.472728 52834 52834 service_based_accessor.cc:1068] Reestablishing subscription for task info.
I0727 17:07:25.472736 52834 52834 service_based_accessor.cc:1243] Reestablishing subscription for object locations.
I0727 17:07:25.472743 52834 52834 service_based_accessor.cc:1363] Reestablishing subscription for worker failures.
I0727 17:07:25.472752 52834 52834 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected.
Note that the client reconnected to the GCS server at 127.0.0.1:40531, while the Redis server address is 10.251.231.121:6379.
If I manually set GcsServerAddress to 10.251.231.121:40531 in Redis, everything works.
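For anyone who wants to check whether they are hitting the same symptom, here is a minimal sketch that scans log text for the two lines quoted above and flags the case where the GCS server address is a loopback address while the Redis address is not. The function name `find_gcs_address_mismatch` and the regular expressions are my own assumptions based only on the log format shown in this thread; other Ray versions may log these lines differently.

```python
import ipaddress
import re

# Patterns are assumptions derived from the two log lines quoted in this thread.
REDIS_RE = re.compile(r"Redis server address = ([\d.]+):(\d+)")
GCS_RE = re.compile(r"GCS server: ([\d.]+):(\d+)")


def find_gcs_address_mismatch(log_text):
    """Return (redis_host, gcs_host, gcs_port) when the GCS server host is a
    loopback address but the Redis host is not; otherwise return None."""
    redis_match = REDIS_RE.search(log_text)
    gcs_match = GCS_RE.search(log_text)
    if redis_match is None or gcs_match is None:
        # One of the expected lines is missing, so nothing can be concluded.
        return None

    redis_host = redis_match.group(1)
    gcs_host = gcs_match.group(1)
    gcs_port = int(gcs_match.group(2))

    redis_is_loopback = ipaddress.ip_address(redis_host).is_loopback
    gcs_is_loopback = ipaddress.ip_address(gcs_host).is_loopback
    if gcs_is_loopback and not redis_is_loopback:
        return redis_host, gcs_host, gcs_port
    return None


# Two lines copied from the log excerpt above.
SAMPLE_LOG = (
    "I0727 17:07:25.352876 52834 52834 global_state_accessor.cc:25] "
    "Redis server address = 10.251.231.121:6379, is test flag = 0\n"
    "I0727 17:07:25.472491 52834 52834 service_based_gcs_client.cc:193] "
    "Reconnected to GCS server: 127.0.0.1:40531\n"
)

mismatch = find_gcs_address_mismatch(SAMPLE_LOG)
if mismatch is not None:
    redis_host, gcs_host, gcs_port = mismatch
    print(f"GCS server advertised {gcs_host}:{gcs_port}; "
          f"a remote worker would probably need {redis_host}:{gcs_port}")
```

If this reports a mismatch, the suggested address is only a guess that the GCS server runs on the same host as Redis, which was true in my setup.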
Thank you @bbtfr !!! It solves my problem too
@rkooo567, @ijrsvt Probably this is fixed by #10004