Ray: [core] Raylet does not start up properly on remote instance

Created on 26 Aug 2020 · 18 comments · Source: ray-project/ray

What is the problem?

I've started a 2-node cluster, but the remote node has to retry 10 times in order to start up.

rliaw@ray-gpu-docker-worker-97ecf362:/tmp/ray/session_latest/logs$ ls
log_monitor.10.err   plasma_store.11.out                                                                    raylet.11.out    reporter.1.out
log_monitor.10.out   plasma_store.12.err                                                                    raylet.12.err    reporter.2.err
log_monitor.11.err   plasma_store.12.out                                                                    raylet.12.out    reporter.2.out
log_monitor.11.out   plasma_store.1.err                                                                     raylet.1.err     reporter.3.err
log_monitor.12.err   plasma_store.1.out                                                                     raylet.1.out     reporter.3.out
log_monitor.12.out   plasma_store.2.err                                                                     raylet.2.err     reporter.4.err
log_monitor.1.err    plasma_store.2.out                                                                     raylet.2.out     reporter.4.out
log_monitor.1.out    plasma_store.3.err                                                                     raylet.3.err     reporter.5.err
log_monitor.2.err    plasma_store.3.out                                                                     raylet.3.out     reporter.5.out
log_monitor.2.out    plasma_store.4.err                                                                     raylet.4.err     reporter.6.err
log_monitor.3.err    plasma_store.4.out                                                                     raylet.4.out     reporter.6.out
log_monitor.3.out    plasma_store.5.err                                                                     raylet.5.err     reporter.7.err
log_monitor.4.err    plasma_store.5.out                                                                     raylet.5.out     reporter.7.out
log_monitor.4.out    plasma_store.6.err                                                                     raylet.6.err     reporter.8.err
log_monitor.5.err    plasma_store.6.out                                                                     raylet.6.out     reporter.8.out
log_monitor.5.out    plasma_store.7.err                                                                     raylet.7.err     reporter.9.err
log_monitor.6.err    plasma_store.7.out                                                                     raylet.7.out     reporter.9.out
log_monitor.6.out    plasma_store.8.err                                                                     raylet.8.err     reporter.err
log_monitor.7.err    plasma_store.8.out                                                                     raylet.8.out     reporter.out
log_monitor.7.out    plasma_store.9.err                                                                     raylet.9.err     worker-308de9ed75f814ffdea62d08393b3330a48e16b9-13529.err
log_monitor.8.err    plasma_store.9.out                                                                     raylet.9.out     worker-308de9ed75f814ffdea62d08393b3330a48e16b9-13529.out
log_monitor.8.out    plasma_store.err                                                                       raylet.err       worker-35d505598ecd15b326ea158ca3597109e8b84fec-13530.err
log_monitor.9.err    plasma_store.out                                                                       raylet.out       worker-35d505598ecd15b326ea158ca3597109e8b84fec-13530.out
log_monitor.9.out    python-core-worker-336a05fc3287e69c95972742c25fff720d52d688.20200826-200358.13529.log  reporter.10.err  worker-3877e3561edd1e1db1576dbbdd6f091412dc4346-0100-13528.err
log_monitor.err      python-core-worker-49f675a34a9fc5484085e1192f001ffc8e51720d.20200826-200358.13530.log  reporter.10.out  worker-3877e3561edd1e1db1576dbbdd6f091412dc4346-0100-13528.out
log_monitor.out      python-core-worker-b715d4ed0005beaa6a77909b20bb6d1fc1576ecc.20200826-200358.13528.log  reporter.11.err  worker-3877e3561edd1e1db1576dbbdd6f091412dc4346-13528.err
old                  python-core-worker-cf9cd0d5f987e7bddd7b1fd6ff0f95955c3d3779.20200826-200358.13527.log  reporter.11.out  worker-3877e3561edd1e1db1576dbbdd6f091412dc4346-13528.out
plasma_store.10.err  raylet.10.err                                                                          reporter.12.err  worker-4456bccd1793ce975b7f1144fb29665e1002f101-13527.err
plasma_store.10.out  raylet.10.out                                                                          reporter.12.out  worker-4456bccd1793ce975b7f1144fb29665e1002f101-13527.out
plasma_store.11.err  raylet.11.err                                                                          reporter.1.err

All of the raylet error files look like:

E0826 20:02:15.725277270   12954 server_chttp2.cc:40]        {"created":"@1598472135.725176780","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":394,"referenced_errors":[{"created":"@1598472135.725174514","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":341,"referenced_errors":[{"created":"@1598472135.725159248","description":"Unable to configure socket","fd":32,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":208,"referenced_errors":[{"created":"@1598472135.725152253","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":181,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1598472135.725173760","description":"Unable to configure socket","fd":32,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":208,"referenced_errors":[{"created":"@1598472135.725170986","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":181,"os_error":"Address already in use","syscall":"bind"}]}]}]}
*** Aborted at 1598472135 (unix time) try "date -d @1598472135" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGSEGV (@0x55d400000058) received by PID 12954 (TID 0x7ff6e633b7c0) from PID 88; stack trace: ***
    @     0x7ff6e58990e0 (unknown)
    @     0x55d44e85c782 grpc::ServerInterface::RegisteredAsyncRequest::IssueRequest()
    @     0x55d44e4fc199 ray::rpc::ObjectManagerService::WithAsyncMethod_Push<>::RequestPush()
    @     0x55d44e50bdfb ray::rpc::ServerCallFactoryImpl<>::CreateCall()
    @     0x55d44e785c69 ray::rpc::GrpcServer::Run()
    @     0x55d44e50045e ray::ObjectManager::StartRpcService()
    @     0x55d44e510f1c ray::ObjectManager::ObjectManager()
    @     0x55d44e466162 ray::raylet::Raylet::Raylet()
    @     0x55d44e43fc3d _ZZ4mainENKUlN3ray6StatusEN5boost8optionalISt13unordered_mapISsSsSt4hashISsESt8equal_toISsESaISt4pairIKSsSsEEEEEE_clES0_SD_
    @     0x55d44e440c41 _ZNSt17_Function_handlerIFvN3ray6StatusERKN5boost8optionalISt13unordered_mapISsSsSt4hashISsESt8equal_toISsESaISt4pairIKSsSsEEEEEEZ4mainEUlS1_SE_E_E9_M_invokeERKSt9_Any_dataS1_SG_
    @     0x55d44e5bd6ac _ZZN3ray3gcs28ServiceBasedNodeInfoAccessor22AsyncGetInternalConfigERKSt8functionIFvNS_6StatusERKN5boost8optionalISt13unordered_mapISsSsSt4hashISsESt8equal_toISsESaISt4pairIKSsSsEEEEEEEENKUlRKS3_RKNS_3rpc22GetInternalConfigReplyEE_clESO_SS_
    @     0x55d44e56f39f _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc22GetInternalConfigReplyEEZNS4_12GcsRpcClient17GetInternalConfigERKNS4_24GetInternalConfigRequestERKSt8functionIS8_EEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
    @     0x55d44e56f49d ray::rpc::ClientCallImpl<>::OnReplyReceived()
    @     0x55d44e49d690 _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
    @     0x55d44eaef54f boost::asio::detail::scheduler::do_run_one()
    @     0x55d44eaf0a51 boost::asio::detail::scheduler::run()
    @     0x55d44eaf1a82 boost::asio::io_context::run()
    @     0x55d44e421730 main
    @     0x7ff6e50ea2e1 __libc_start_main
    @     0x55d44e4328b1 (unknown)

Ray version and other system information (Python version, TensorFlow version, OS):

Latest. python=3.7.

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

# An unique identifier for the head node and workers of this cluster.
cluster_name: gpu-docker

# The minimum number of workers nodes to launch in addition to the head
# node. This number should be >= 0.
min_workers: 1

# The maximum number of workers nodes to launch in addition to the head
# node. This takes precedence over min_workers.
max_workers: 1

# The initial number of worker nodes to launch in addition to the head
# node. When the cluster is first brought up (or when it is refreshed with a
# subsequent `ray up`) this number of nodes will be started.
initial_workers: 0

# Whether or not to autoscale aggressively. If this is enabled, if at any point
#   we would start more workers, we start at least enough to bring us to
#   initial_workers.
autoscaling_mode: default

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
# docker:
#     image: "tensorflow/tensorflow:1.13.1-gpu-py3"
#     container_name: "ray-nvidia-docker-test" # e.g. ray_docker
#     run_options:
#       - --runtime=nvidia

    # # Example of running a GPU head with CPU workers
    # head_image: "tensorflow/tensorflow:1.13.1-gpu-py3"
    # head_run_options:
    #     - --runtime=nvidia

    # worker_image: "ubuntu:18.04"
    # worker_run_options: []


# The autoscaler will scale up the cluster to this target fraction of resource
# usage. For example, if a cluster of 10 nodes is 100% busy and
# target_utilization is 0.8, it would resize the cluster to 13. This fraction
# can be decreased to increase the aggressiveness of upscaling.
# This value must be less than 1.0 for scaling to happen.
target_utilization_fraction: 0.8

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: gcp
    region: us-central1
    availability_zone: us-central1-a
    project_id: ~~~~~~~~~~~~~~ # Globally unique project id

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below. This requires that you have added the key into the
# project wide meta-data.
#    ssh_private_key: /path/to/your/key.pem

# Provider-specific config for the head node, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as subnets and ssh-keys.
# For more documentation on available fields, see:
# https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
head_node:
    machineType: n1-standard-4
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 100
          # See https://cloud.google.com/compute/docs/images for more images
          sourceImage: projects/ml-images/global/images/c5-deeplearning-tf2-2-2-cu101-v20200701
    guestAccelerators:
      - acceleratorType: projects/~~~~~~~~~~~~~~/zones/us-central1-a/acceleratorTypes/nvidia-tesla-p4
        acceleratorCount: 1
    scheduling:
      - onHostMaintenance: TERMINATE
        preemptible: false
        automaticRestart: true
    metadata:
      - kind: compute#metadata
        items:
        - { "key": "install-nvidia-driver", "value": "True" }

    # Additional options can be found in in the compute docs at
    # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert

worker_nodes:
    machineType: n1-standard-4
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 100
          # See https://cloud.google.com/compute/docs/images for more images
          sourceImage: projects/ml-images/global/images/c5-deeplearning-tf2-2-2-cu101-v20200701
    guestAccelerators:
      - acceleratorType: projects/~~~~~~~~~~~~~~/zones/us-central1-a/acceleratorTypes/nvidia-tesla-p4
        acceleratorCount: 1
    scheduling:
      - onHostMaintenance: TERMINATE
        preemptible: false
        automaticRestart: true
    metadata:
      - kind: compute#metadata
        items:
        - { "key": "install-nvidia-driver", "value": "True" }


# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
    /home/ubuntu/train_data.txt: /Users/rliaw/dev/summit-tune-demo/train_data.txt
}


# initialization_commands:
#     # Wait until nvidia drivers are installed
#     - >-
#       timeout 300 bash -c "
#           command -v nvidia-smi && nvidia-smi
#           until [ \$? -eq 0 ]; do
#               command -v nvidia-smi && nvidia-smi
#           done"

# List of shell commands to run to set up nodes.
setup_commands:
    - source /opt/conda/bin/activate && pip install -U pip
    - source /opt/conda/bin/activate && pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.9.0.dev0-cp37-cp37m-manylinux1_x86_64.whl
    - source /opt/conda/bin/activate && pip install torch==1.4.0 torchvision==0.5.0
    - source /opt/conda/bin/activate && pip install transformers
    - source /opt/conda/bin/activate && pip install wandb

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
  - source /opt/conda/bin/activate && pip install google-api-python-client==1.7.8

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - source /opt/conda/bin/activate && ray stop
    - >-
      ulimit -n 65536;
      source /opt/conda/bin/activate && ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - source /opt/conda/bin/activate && ray stop
    - >-
      ulimit -n 65536;
      source /opt/conda/bin/activate && ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076
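
(For reference, the cluster is brought up with the standard launcher flow; the YAML file name below is an assumption.)

# Assumes the config above is saved as cluster_gpu_docker.yaml.
ray up -y cluster_gpu_docker.yaml        # launch the head node plus one worker
# Once the worker is up, SSH into it and look for the numbered raylet.N.err
# files under /tmp/ray/session_latest/logs, as shown in the listing above.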

If we cannot run your script, we cannot fix your issue.

  • [ ] I have verified my script runs in a clean environment and reproduces the issue.
  • [ ] I have verified the issue also occurs with the latest wheels.
P0 bug

All 18 comments

cc @ray-project/ray-core

More information on a similar issue.

Setup: autoscaler/Kubernetes
initial_workers: 2
min_workers: 2
max_workers: 4

Observation:

  1. The worker pods start running.
  2. But no Ray-related processes are running on them.
    Running ps aux | grep ray inside a worker node (via kubectl exec) gives only:
    root 188 0.0 0.0 5196 740 pts/0 S+ 16:14 0:00 grep --color=auto ray

So it seems the workers never run worker_start_ray_commands (or ray start) successfully? I was able to reproduce this issue in my minikube environment too, which rules out anything wrong with our K8s provider (Google).

The manual workaround that I am using at the moment:

  1. kubectl -n ray get pods and copy the worker pod id.
  2. kubectl exec -it <worker-id> -- /bin/bash
  3. ray start --num-cpus=1 --address=$RAY_HEAD_SERVICE_HOST:6379 --object-manager-port=8076. To get the value of $RAY_HEAD_SERVICE_HOST you can either printenv inside the worker or run kubectl -n ray get svc and copy the IP of the ray-head svc.

Do steps 1-3 on every worker; a scripted version is sketched below.
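
A rough sketch of that script (it assumes the ray namespace and the ray-head service name used above; adjust as needed):

# Look up the head service IP, then run `ray start` in every worker pod.
HEAD_IP=$(kubectl -n ray get svc ray-head -o jsonpath='{.spec.clusterIP}')
for pod in $(kubectl -n ray get pods -o name | grep ray-worker); do
    kubectl -n ray exec "$pod" -- \
        ray start --num-cpus=1 --address="$HEAD_IP:6379" --object-manager-port=8076
done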

To see what error the StandardAutoscaler runs into, I killed one worker, triggering the autoscaler to re-spawn a worker to maintain the min_workers count. Here is the log from the head node.

2020-08-26 12:03:15,238 INFO autoscaler.py:467 -- Cluster status: 2/2 target nodes (0 pending) (2 updating) (bringup=True)
 - MostDelayedHeartbeats: {'172.17.0.6': 0.17289352416992188}
 - NodeIdleSeconds: Min=227 Mean=227 Max=227
 - NumNodesConnected: 1
 - NumNodesUsed: 0.0
 - ResourceUsage: 0.0/1.0 CPU, 0.0 GiB/1.21 GiB memory, 0.0 GiB/0.42 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0


==> /tmp/ray/session_latest/logs/monitor.err <==
2020-08-26 12:03:15,238 INFO autoscaler.py:467 -- Cluster status: 2/2 target nodes (0 pending) (2 updating) (bringup=True)
 - MostDelayedHeartbeats: {'172.17.0.6': 0.17289352416992188}
 - NodeIdleSeconds: Min=227 Mean=227 Max=227
 - NumNodesConnected: 1
 - NumNodesUsed: 0.0
 - ResourceUsage: 0.0/1.0 CPU, 0.0 GiB/1.21 GiB memory, 0.0 GiB/0.42 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0


==> /tmp/ray/session_2020-08-26_11-59-24_547664_135/logs/monitor.err <==
2020-08-26 12:03:20,269 INFO autoscaler.py:467 -- Cluster status: 1/2 target nodes (0 pending) (2 updating) (bringup=True)
 - MostDelayedHeartbeats: {'172.17.0.6': 0.15368938446044922}
 - NodeIdleSeconds: Min=233 Mean=233 Max=233
 - NumNodesConnected: 1
 - NumNodesUsed: 0.0
 - ResourceUsage: 0.0/1.0 CPU, 0.0 GiB/1.21 GiB memory, 0.0 GiB/0.42 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0

2020-08-26 12:03:20,280 INFO autoscaler.py:447 -- StandardAutoscaler: Queue 1 new nodes for launch
2020-08-26 12:03:20,282 INFO node_launcher.py:74 -- NodeLauncher1: Got 1 nodes to launch.
2020-08-26 12:03:20,292 INFO node_launcher.py:74 -- NodeLauncher1: Launching 1 nodes, type None.
2020-08-26 12:03:20,293 INFO node_provider.py:87 -- KubernetesNodeProvider: calling create_namespaced_pod (count=1).
2020-08-26 12:03:20,298 INFO autoscaler.py:467 -- Cluster status: 1/2 target nodes (1 pending) (2 updating) (bringup=True)
 - MostDelayedHeartbeats: {'172.17.0.6': 0.18278002738952637}
 - NodeIdleSeconds: Min=233 Mean=233 Max=233
 - NumNodesConnected: 1
 - NumNodesUsed: 0.0
 - ResourceUsage: 0.0/1.0 CPU, 0.0 GiB/1.21 GiB memory, 0.0 GiB/0.42 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0


==> /tmp/ray/session_latest/logs/monitor.err <==
2020-08-26 12:03:20,269 INFO autoscaler.py:467 -- Cluster status: 1/2 target nodes (0 pending) (2 updating) (bringup=True)
 - MostDelayedHeartbeats: {'172.17.0.6': 0.15368938446044922}
 - NodeIdleSeconds: Min=233 Mean=233 Max=233
 - NumNodesConnected: 1
 - NumNodesUsed: 0.0
 - ResourceUsage: 0.0/1.0 CPU, 0.0 GiB/1.21 GiB memory, 0.0 GiB/0.42 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0

2020-08-26 12:03:20,280 INFO autoscaler.py:447 -- StandardAutoscaler: Queue 1 new nodes for launch
2020-08-26 12:03:20,282 INFO node_launcher.py:74 -- NodeLauncher1: Got 1 nodes to launch.
2020-08-26 12:03:20,292 INFO node_launcher.py:74 -- NodeLauncher1: Launching 1 nodes, type None.
2020-08-26 12:03:20,293 INFO node_provider.py:87 -- KubernetesNodeProvider: calling create_namespaced_pod (count=1).
2020-08-26 12:03:20,298 INFO autoscaler.py:467 -- Cluster status: 1/2 target nodes (1 pending) (2 updating) (bringup=True)
 - MostDelayedHeartbeats: {'172.17.0.6': 0.18278002738952637}
 - NodeIdleSeconds: Min=233 Mean=233 Max=233
 - NumNodesConnected: 1
 - NumNodesUsed: 0.0
 - ResourceUsage: 0.0/1.0 CPU, 0.0 GiB/1.21 GiB memory, 0.0 GiB/0.42 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0


==> /tmp/ray/session_2020-08-26_11-59-24_547664_135/logs/monitor.err <==
2020-08-26 12:03:23,211 INFO log_timer.py:27 -- NodeUpdater: ray-worker-4gc2c: Got remote shell  [LogTimer=230969ms]
2020-08-26 12:03:23,211 INFO log_timer.py:27 -- NodeUpdater: ray-worker-4gc2c: Applied config 6a1cdc5f85138bb386579a27532ac2cc0bacd521  [LogTimer=231063ms]
2020-08-26 12:03:23,240 ERROR updater.py:101 -- NodeUpdater: ray-worker-4gc2c: Error executing: Unable to connect to node

Exception in thread Thread-6:
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/root/anaconda3/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 89, in run
    self.do_update()
  File "/root/anaconda3/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 246, in do_update
    self.wait_ready(deadline)
  File "/root/anaconda3/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 238, in wait_ready
    assert False, "Unable to connect to node"
AssertionError: Unable to connect to node


==> /tmp/ray/session_latest/logs/monitor.err <==
2020-08-26 12:03:23,211 INFO log_timer.py:27 -- NodeUpdater: ray-worker-4gc2c: Got remote shell  [LogTimer=230969ms]
2020-08-26 12:03:23,211 INFO log_timer.py:27 -- NodeUpdater: ray-worker-4gc2c: Applied config 6a1cdc5f85138bb386579a27532ac2cc0bacd521  [LogTimer=231063ms]
2020-08-26 12:03:23,240 ERROR updater.py:101 -- NodeUpdater: ray-worker-4gc2c: Error executing: Unable to connect to node

Exception in thread Thread-6:
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/root/anaconda3/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 89, in run
    self.do_update()
  File "/root/anaconda3/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 246, in do_update
    self.wait_ready(deadline)
  File "/root/anaconda3/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 238, in wait_ready
    assert False, "Unable to connect to node"
AssertionError: Unable to connect to node


==> /tmp/ray/session_2020-08-26_11-59-24_547664_135/logs/monitor.err <==
2020-08-26 12:03:25,383 INFO autoscaler.py:467 -- Cluster status: 2/2 target nodes (0 pending) (2 updating) (bringup=True)
 - MostDelayedHeartbeats: {'172.17.0.6': 0.15513324737548828}
 - NodeIdleSeconds: Min=238 Mean=238 Max=238
 - NumNodesConnected: 1
 - NumNodesUsed: 0.0
 - ResourceUsage: 0.0/1.0 CPU, 0.0 GiB/1.21 GiB memory, 0.0 GiB/0.42 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0

2020-08-26 12:03:25,406 ERROR autoscaler.py:129 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 127, in update
    self._update()
  File "/root/anaconda3/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 238, in _update
    self.load_metrics.mark_active(self.provider.internal_ip(node_id))
  File "/root/anaconda3/lib/python3.7/site-packages/ray/autoscaler/kubernetes/node_provider.py", line 65, in internal_ip
    pod = core_api().read_namespaced_pod_status(node_id, self.namespace)
  File "/root/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 19328, in read_namespaced_pod_status
    (data) = self.read_namespaced_pod_status_with_http_info(name, namespace, **kwargs)  # noqa: E501
  File "/root/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 19413, in read_namespaced_pod_status_with_http_info
    collection_formats=collection_formats)
  File "/root/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 345, in call_api
    _preload_content, _request_timeout)
  File "/root/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 176, in __call_api
    _request_timeout=_request_timeout)
  File "/root/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 366, in request
    headers=headers)
  File "/root/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 241, in GET
    query_params=query_params)
  File "/root/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 231, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'Date': 'Wed, 26 Aug 2020 19:03:25 GMT', 'Content-Length': '200'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"ray-worker-4gc2c\" not found","reason":"NotFound","details":{"name":"ray-worker-4gc2c","kind":"pods"},"code":404}



==> /tmp/ray/session_latest/logs/monitor.err <==
2020-08-26 12:03:25,383 INFO autoscaler.py:467 -- Cluster status: 2/2 target nodes (0 pending) (2 updating) (bringup=True)
 - MostDelayedHeartbeats: {'172.17.0.6': 0.15513324737548828}
 - NodeIdleSeconds: Min=238 Mean=238 Max=238
 - NumNodesConnected: 1
 - NumNodesUsed: 0.0
 - ResourceUsage: 0.0/1.0 CPU, 0.0 GiB/1.21 GiB memory, 0.0 GiB/0.42 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0

2020-08-26 12:03:25,406 ERROR autoscaler.py:129 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 127, in update
    self._update()
  File "/root/anaconda3/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 238, in _update
    self.load_metrics.mark_active(self.provider.internal_ip(node_id))
  File "/root/anaconda3/lib/python3.7/site-packages/ray/autoscaler/kubernetes/node_provider.py", line 65, in internal_ip
    pod = core_api().read_namespaced_pod_status(node_id, self.namespace)
  File "/root/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 19328, in read_namespaced_pod_status
    (data) = self.read_namespaced_pod_status_with_http_info(name, namespace, **kwargs)  # noqa: E501
  File "/root/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 19413, in read_namespaced_pod_status_with_http_info
    collection_formats=collection_formats)
  File "/root/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 345, in call_api
    _preload_content, _request_timeout)
  File "/root/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 176, in __call_api
    _request_timeout=_request_timeout)
  File "/root/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 366, in request
    headers=headers)
  File "/root/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 241, in GET
    query_params=query_params)
  File "/root/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 231, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'Date': 'Wed, 26 Aug 2020 19:03:25 GMT', 'Content-Length': '200'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"ray-worker-4gc2c\" not found","reason":"NotFound","details":{"name":"ray-worker-4gc2c","kind":"pods"},"code":404}



==> /tmp/ray/session_2020-08-26_11-59-24_547664_135/logs/monitor.err <==
2020-08-26 12:03:30,422 INFO autoscaler.py:467 -- Cluster status: 2/2 target nodes (0 pending) (1 updating) (1 failed to update) (bringup=True)
 - MostDelayedHeartbeats: {'172.17.0.6': 0.173508882522583}
 - NodeIdleSeconds: Min=243 Mean=243 Max=243
 - NumNodesConnected: 1
 - NumNodesUsed: 0.0
 - ResourceUsage: 0.0/1.0 CPU, 0.0 GiB/1.21 GiB memory, 0.0 GiB/0.42 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0

2020-08-26 12:03:30,462 INFO updater.py:74 -- NodeUpdater: ray-worker-6k5hj: Updating to 6a1cdc5f85138bb386579a27532ac2cc0bacd521
2020-08-26 12:03:30,500 INFO updater.py:194 -- NodeUpdater: ray-worker-6k5hj: Waiting for remote shell...

==> /tmp/ray/session_latest/logs/monitor.err <==
2020-08-26 12:03:30,422 INFO autoscaler.py:467 -- Cluster status: 2/2 target nodes (0 pending) (1 updating) (1 failed to update) (bringup=True)
 - MostDelayedHeartbeats: {'172.17.0.6': 0.173508882522583}
 - NodeIdleSeconds: Min=243 Mean=243 Max=243
 - NumNodesConnected: 1
 - NumNodesUsed: 0.0
 - ResourceUsage: 0.0/1.0 CPU, 0.0 GiB/1.21 GiB memory, 0.0 GiB/0.42 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0

2020-08-26 12:03:30,462 INFO updater.py:74 -- NodeUpdater: ray-worker-6k5hj: Updating to 6a1cdc5f85138bb386579a27532ac2cc0bacd521
2020-08-26 12:03:30,500 INFO updater.py:194 -- NodeUpdater: ray-worker-6k5hj: Waiting for remote shell...

==> /tmp/ray/session_2020-08-26_11-59-24_547664_135/logs/monitor.err <==
2020-08-26 12:03:35,459 INFO autoscaler.py:467 -- Cluster status: 2/2 target nodes (0 pending) (2 updating) (1 failed to update) (bringup=True)
 - MostDelayedHeartbeats: {'172.17.0.6': 0.15118980407714844}
 - NodeIdleSeconds: Min=248 Mean=248 Max=248
 - NumNodesConnected: 1
 - NumNodesUsed: 0.0
 - ResourceUsage: 0.0/1.0 CPU, 0.0 GiB/1.21 GiB memory, 0.0 GiB/0.42 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0


==> /tmp/ray/session_latest/logs/monitor.err <==
2020-08-26 12:03:35,459 INFO autoscaler.py:467 -- Cluster status: 2/2 target nodes (0 pending) (2 updating) (1 failed to update) (bringup=True)
 - MostDelayedHeartbeats: {'172.17.0.6': 0.15118980407714844}
 - NodeIdleSeconds: Min=248 Mean=248 Max=248
 - NumNodesConnected: 1
 - NumNodesUsed: 0.0
 - ResourceUsage: 0.0/1.0 CPU, 0.0 GiB/1.21 GiB memory, 0.0 GiB/0.42 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0


@raulchen assigning this to you for triage; feel free to send it back.

I'm able to get the same stack trace as @richardliaw on a cluster of 4 machines started with the Ray cluster launcher. The error message appears at the beginning of the job, I think before any application code has run.

Digging into this a bit more, it seems to have something to do with how Ray processes are torn down and/or the autoscaler. The stack trace appears to be caused by the object manager trying to bind a port that's already in use, probably held by the previous raylet that ran on the same node. Sometimes it's able to resolve itself, but sometimes the autoscaler gets into a bad loop where it keeps trying to restart a worker raylet only to have it crash again.
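
One quick check on an affected worker is to see which process already owns the object manager port (pinned to 8076 in the config above); a leftover raylet from a previous session should show up there. A sketch:

# Which process is listening on the object manager port?
sudo ss -ltnp 'sport = :8076'
# Equivalent check with lsof:
sudo lsof -iTCP:8076 -sTCP:LISTEN
# Any raylet that survived a previous teardown:
ps aux | grep raylet | grep -v grep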

@ericl, any ideas?

Okay, it does seem to have something to do with how ray processes get torn down. I ssh'ed into a worker node on a cluster that was experiencing the loop and noticed there was a leftover raylet process, probably from a previous version of the cluster. After running ray stop, I got the message saying that "Stopped all n Ray processes", but the raylet process was still alive.
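
For anyone else hitting this, a minimal way to confirm and clean up the stragglers on a node (assuming shell access to it) is:

# List Ray processes that survived a plain `ray stop`.
ps aux | grep -E 'raylet|plasma_store|gcs_server' | grep -v grep
# Kill whatever is left before restarting; --force kills forcefully instead of asking nicely.
ray stop --force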

@richardliaw @edoakes did we make any changes to teardown / port configuration that may affect this?

@stephanie-wang I can't think of any cause here, maybe bisect time?

I've tried bisecting, but didn't get very far. It's pretty finicky to reproduce since it seems to be based on timing. I'm pretty sure that this was introduced after 0.8.7, though.
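
For reference, the loop involved is roughly the following (a sketch only; the endpoints and repro.sh are placeholders rather than the exact commands used, and each step requires bringing the cluster up and tearing it down again, which is where the flakiness comes in):

# Placeholder endpoints: the 0.8.7 release as known-good, master as known-bad.
git bisect start
git bisect bad master
git bisect good ray-0.8.7        # assumed tag name for the 0.8.7 release
git bisect run ./repro.sh        # script exits non-zero when the worker raylet crash-loops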

Can you try reverting to 4f8fef134e1611e6b83a3156af8a26db762cb0d3 (or earlier)?

This was the last change to ray stop past 0.8.7.

This is:

commit 40b8e35d61982b149e88fc09effd001879649bff (HEAD) <------------- suspect?
Author: Maksim Smolin <[email protected]>
Date:   Tue Aug 11 09:58:23 2020 -0700

    [cli] New logging for the rest of the `ray` commands (#9984)

    Co-authored-by: Richard Liaw <[email protected]>

commit 4f8fef134e1611e6b83a3156af8a26db762cb0d3
Author: Amog Kamsetty <[email protected]>
Date:   Tue Aug 11 09:05:57 2020 -0700

    [Tune] Remove `checkpoint_at_end` from tune+serve docs (#10034)

commit 0400a88bf1b97a4e6b2021a3a2e9c25310b22170
Author: Basasuya <[email protected]>
Date:   Tue Aug 11 17:36:07 2020 +0800

    [EVENT] Basic Function and Definition (#9657)

Both of these commits seem to have the same issue :( Let me try bisecting again and see if I can figure it out.

Realizing now that my previous attempts to bisect were probably getting messed up because of a zombie raylet process from who knows which commit :/ It's still turning out to be pretty difficult to consistently check whether a commit is OK or not, but for now using ray stop --force instead of ray stop in the cluster yaml seems to work.
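
Concretely, that means changing the stop command in the cluster YAML above, e.g. in worker_start_ray_commands (and head_start_ray_commands, if the head shows the same symptom), from

    source /opt/conda/bin/activate && ray stop

to

    source /opt/conda/bin/activate && ray stop --force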

I couldn't find any clue in the error information. I suspect this is something related to the environment. Did you look into the core dump file?

Okay, this is the first bad commit. Here is the test script that I used. It will hang and should continually print the stack traces shown in the original issue if the commit is bad. You also have to make sure to first clean the cluster with ray stop --force, then restart it again with the normal ray stop.

Given that this PR is only needed for GCS restart (as I understand it), I think we should revert it ASAP. I tried to do so, but I wasn't totally clear on how to resolve the git conflicts. Can you help out, @raulchen @ffbin?
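
For anyone picking this up, the mechanical part is just a single-commit revert plus conflict resolution (the hash below is a placeholder for the offending commit on master):

git revert <bad-commit-sha>      # placeholder hash
# fix up the conflicting files, then:
git add -u
git revert --continue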

Hi @stephanie-wang, I'd prefer to add a configuration option with the detection turned off by default, so that the changes are minimal. What is your suggestion? If you agree with this proposal, I will submit a PR. Thanks.

Sure, but let's make sure to keep the number of configuration parameters minimal. Ideally, we should only have one for any code related to GCS restart. Thanks!

@ffbin, any update on this? This is high priority since the branch cutoff for 1.0 is very soon.

@stephanie-wang I will submit a PR today. The company has an outing these days, so it's a little delayed. Sorry.
