Ray: Dashboard failures with include_dashboard set to false

Created on 11 Nov 2020 · 38 comments · Source: ray-project/ray

Running Tune with A3C fails straight at the beginning with the following traceback:

2020-11-11 14:13:37,114 WARNING worker.py:1111 -- The agent on node *** failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 298, in <module>
    loop.run_until_complete(agent.run())
  File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 172, in run
    agent_ip_address=self.ip))
  File "/usr/local/lib/python3.6/dist-packages/grpc/experimental/aio/_call.py", line 286, in __await__
    self._cython_call._status)
grpc.experimental.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses"
    debug_error_string = "{"created":"@1605096817.110308830","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4090,"referenced_errors":[{"created":"@1605096817.110303917","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}"
>

This is obviously a dashboard-related exception, which is unexpected since include_dashboard is set to False.
It might be related to https://github.com/ray-project/ray/issues/11943, but it shouldn't happen when this flag is set to False, so it's a different issue.

Ray version and other system information (Python version, TensorFlow version, OS):
Ray installed via https://docs.ray.io/en/master/development.html#building-ray-python-only
on both latest master and releases/1.0.1

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

    ray.init(include_dashboard=False)
    tune.run(
        A3CTrainer,
        config=<any config>,
        stop={
            "timesteps_total": 50e6,
        },
    )
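
A self-contained sketch along these lines (my addition, not the reporter's original code; it assumes the registered Gym environment "CartPole-v0" as a stand-in for the custom environment, plus a minimal config) is intended to trigger the same dashboard-agent warning:

    import ray
    from ray import tune
    from ray.rllib.agents.a3c import A3CTrainer

    ray.init(include_dashboard=False)
    tune.run(
        A3CTrainer,
        # "CartPole-v0" and num_workers=1 are assumed placeholders, not from the report.
        config={"env": "CartPole-v0", "num_workers": 1},
        stop={"timesteps_total": 50e6},
    )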
  • [ ] I have verified my script runs in a clean environment and reproduces the issue.
  • [ ] I have verified the issue also occurs with the latest wheels.
Labels: P1, bug, dashboard, fix-error-msg


All 38 comments

Although this is documented in https://docs.ray.io/en/master/configure.html?highlight=ports#ports-configurations,
the include_dashboard flag doesn't work properly when running ray.init() without "ray start" beforehand.

This actually happens even if I run "ray start" with "--include-dashboard=false"

This looks like a bad bug. @mfitton can you take a look at it?

Yep, I'll take a look. Thanks for reporting, @roireshef, and apologies for the inconvenience. We're moving to a new dashboard backend that's currently in the nightly, so your bug report is really helpful for ironing out issues before we roll this out more broadly.

I've noticed this is because in the new dashboard architecture we start up the dashboard agent regardless of whether include_dashboard is specified. This could be because the dashboard agent is the entity that receives Ray stats via gRPC for export to Prometheus.

@fyrestone I'm planning on creating a PR to make the dashboard agent not start when include_dashboard is false. Am I missing any issues that doing this could cause?

The immediate issue I noticed is that people cannot export metrics?

Yep, that's true, they wouldn't be able to export metrics without running the dashboard.

I think that's not ideal though. I can imagine users who want to export metrics while they don't have the dashboard running.

Yeah. It's kind of a general problem that people might want to run certain backend dashboard modules without running the actual web ui. Especially when more APIs are introduced that are accessed via the dashboard.

That said, the main reason to want to not run the dashboard is performance-oriented, as it generally won't cause any other issues to have it run.

We might want to eventually move to an API where a user can specify which dashboard modules they want to run.

Not sure what the best thing to do is for now. I can't repro the issue yet because the repro script is incomplete, but I would be surprised if the dashboard warning message is actually linked to the training failing. I'm going to try to search for where the A3CTrainer comes from.

I found A3CTrainer in ray.rllib.agents.a3c. That said, the provided script fails with the following error:

Traceback (most recent call last):
  File "/Users/maxfitton/Development/ray/python/ray/tune/trial_runner.py", line 547, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/Users/maxfitton/Development/ray/python/ray/tune/ray_trial_executor.py", line 484, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/Users/maxfitton/Development/ray/python/ray/worker.py", line 1472, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::A3C.train() (pid=80578, ip=192.168.50.14)
  File "python/ray/_raylet.pyx", line 438, in ray._raylet.execute_task
    worker.memory_monitor.raise_if_low_memory()
  File "python/ray/_raylet.pyx", line 472, in ray._raylet.execute_task
    task_exception = True
  File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
    outputs = function_executor(*args, **kwargs)
  File "python/ray/_raylet.pyx", line 477, in ray._raylet.execute_task
    task_exception = False
  File "python/ray/_raylet.pyx", line 431, in ray._raylet.execute_task.function_executor
  File "/Users/maxfitton/Development/ray/python/ray/rllib/agents/trainer_template.py", line 106, in __init__
    Trainer.__init__(self, config, env, logger_creator)
  File "/Users/maxfitton/Development/ray/python/ray/rllib/agents/trainer.py", line 445, in __init__
    self._env_id = self._register_if_needed(env or config.get("env"))
  File "/Users/maxfitton/Development/ray/python/ray/rllib/agents/trainer.py", line 1179, in _register_if_needed
    "You can specify a custom env as either a class "
ValueError: None is an invalid env specification. You can specify a custom env as either a class (e.g., YourEnvCls) or a registered env id (e.g., "your_env").

@roireshef I'm happy to keep looking into this, but I need you to provide a script that is runnable as-is. I doubt that the dashboard error message you're seeing is impacting training, as the two processes should run separately and not overlap / crash one another.

@mfitton This issue should be reproducible without the Tune code, right?

Also, about the solution: we can probably collect stats only when include_dashboard is set to True. Otherwise, start only the agents, and we can stop collecting stats from the endpoints.

@mfitton The traceback shows that the agent can't register with the raylet; it seems the raylet process has crashed. The dashboard process has the head role; if the dashboard is not started, the agent just collects stats and doesn't report to the dashboard directly (it publishes to Redis instead).


@mfitton I can't attach the original script I'm using because it relies on a custom environment I developed whose code I can't expose outside of my company; I hope you understand. That said, I don't think this issue is related to the environment or RL algorithm implementation at all. Try any environment you have on hand. If you run it inside a docker container like I do, I'm pretty sure it will reproduce. The exception you currently see is because you didn't define any environment for Tune to run with.

Guys, if I may, from my perspective having metrics written to the log files so they can be viewed in TensorBoard is crucial, regardless of whether the web dashboard is working or not. I would kindly ask you not to break this functionality; I believe it's being heavily used by others as well...


@mfitton - could you please point me to where the dashboard agent is being started, so I could disable it as a local hotfix? If I do that, will it disable metrics as well? Or is there any way around that?

@roireshef I don't have any RLLib/Tune environments on hand, as I don't do any ML work.

The Tune metrics written to log files are not affected by whether the dashboard is running or not. You'll still be able to run TensorBoard by pointing it at the log directory that Tune writes to. We'll make sure not to break this functionality.
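
For example (assuming the default Tune results directory, ~/ray_results):

tensorboard --logdir ~/ray_results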

That said, what's crashing your program isn't the dashboard. The message you're seeing logged by the agent, like fyrestone said, isn't causing your Ray cluster to crash, but is rather a symptom that the Raylet (which handles scheduling tasks among other things) has crashed.

Could you include your raylet.err and raylet.out log files? Those would be more helpful as far as getting to the bottom of this.

@mfitton
A. Sure, I can send the raylet files, but where can I find them?
B. If you have Ray installed, you already have RLlib-ready environments at hand. See: https://github.com/ray-project/ray/tree/master/rllib/examples - I think you'll find it very useful to use one of them to close a training loop for debugging...

@mfitton - I'll try to provide some more information in the meantime:

This is how I set up the Ray cluster (handshakes between nodes):

Head (inside docker):
ray start --block --head --port=$redis_port --redis-password=$redis_password --node-ip-address=$head_node_ip --gcs-server-port=6005 --dashboard-port=6006 --node-manager-port=6007 --object-manager-port=6008 --redis-shard-ports=6400,6401,6402,6403,6404,6405,6406,6407,6408,6409 --min-worker-port=6100 --max-worker-port=6299 --include-dashboard=false

Worker Nodes (inside docker, different machine(s)):
ray start --block --address=$head_node_ip:$redis_port --redis-password=$redis_password --node-ip-address=$worker_node_ip --node-manager-port=6007 --object-manager-port=6008 --min-worker-port=6100 --max-worker-port=6299

After I do that, I call:

ray.init(address="$head_node_ip:$redis_port", _redis_password=$redis_password)
tune.run(
    A3CTrainer,
    config=<any config>,
    stop={
        "timesteps_total": 50e6,
    },
)

You can find the tune.run() part in the examples, including environment implementations. Alternatively, this also reproduces without setting up a Ray cluster, as I described in the body of this issue (above).

Observing the console shows a stream of exceptions that all look similar (note that this time I captured a more informative one than the one attached in the body of this issue; the *** parts are redacted for security reasons):

2020-11-12 16:53:56,179 WARNING worker.py:1111 -- The agent on node *** failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 298, in <module>
    loop.run_until_complete(agent.run())
  File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 123, in run
    modules = self._load_modules()
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 82, in _load_modules
    c = cls(self)
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/modules/reporter/reporter_agent.py", line 72, in __init__
    self._metrics_agent = MetricsAgent(dashboard_agent.metrics_export_port)
  File "/usr/local/lib/python3.6/dist-packages/ray/metrics_agent.py", line 42, in __init__
    namespace="ray", port=metrics_export_port)))
  File "/usr/local/lib/python3.6/dist-packages/ray/prometheus_exporter.py", line 334, in new_stats_exporter
    options=option, gatherer=option.registry, collector=collector)
  File "/usr/local/lib/python3.6/dist-packages/ray/prometheus_exporter.py", line 266, in __init__
    self.serve_http()
  File "/usr/local/lib/python3.6/dist-packages/ray/prometheus_exporter.py", line 321, in serve_http
    port=self.options.port, addr=str(self.options.address))
  File "/usr/local/lib/python3.6/dist-packages/prometheus_client/exposition.py", line 78, in start_wsgi_server
    httpd = make_server(addr, port, app, ThreadingWSGIServer, handler_class=_SilentHandler)
  File "/usr/lib/python3.6/wsgiref/simple_server.py", line 153, in make_server
    server = server_class((host, port), handler_class)
  File "/usr/lib/python3.6/socketserver.py", line 456, in __init__
    self.server_bind()
  File "/usr/lib/python3.6/wsgiref/simple_server.py", line 50, in server_bind
    HTTPServer.server_bind(self)
  File "/usr/lib/python3.6/http/server.py", line 136, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/usr/lib/python3.6/socketserver.py", line 470, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use

(pid=raylet, ip=***) Traceback (most recent call last):
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 308, in <module>
(pid=raylet, ip=***)     raise e
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 298, in <module>
(pid=raylet, ip=***)     loop.run_until_complete(agent.run())
(pid=raylet, ip=***)   File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
(pid=raylet, ip=***)     return future.result()
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 123, in run
(pid=raylet, ip=***)     modules = self._load_modules()
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 82, in _load_modules
(pid=raylet, ip=***)     c = cls(self)
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/modules/reporter/reporter_agent.py", line 72, in __init__
(pid=raylet, ip=***)     self._metrics_agent = MetricsAgent(dashboard_agent.metrics_export_port)
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/metrics_agent.py", line 42, in __init__
(pid=raylet, ip=***)     namespace="ray", port=metrics_export_port)))
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/prometheus_exporter.py", line 334, in new_stats_exporter
(pid=raylet, ip=***)     options=option, gatherer=option.registry, collector=collector)
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/prometheus_exporter.py", line 266, in __init__
(pid=raylet, ip=***)     self.serve_http()
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/prometheus_exporter.py", line 321, in serve_http
(pid=raylet, ip=***)     port=self.options.port, addr=str(self.options.address))
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/prometheus_client/exposition.py", line 78, in start_wsgi_server
(pid=raylet, ip=***)     httpd = make_server(addr, port, app, ThreadingWSGIServer, handler_class=_SilentHandler)
(pid=raylet, ip=***)   File "/usr/lib/python3.6/wsgiref/simple_server.py", line 153, in make_server
(pid=raylet, ip=***)     server = server_class((host, port), handler_class)
(pid=raylet, ip=***)   File "/usr/lib/python3.6/socketserver.py", line 456, in __init__
(pid=raylet, ip=***)     self.server_bind()
(pid=raylet, ip=***)   File "/usr/lib/python3.6/wsgiref/simple_server.py", line 50, in server_bind
(pid=raylet, ip=***)     HTTPServer.server_bind(self)
(pid=raylet, ip=***)   File "/usr/lib/python3.6/http/server.py", line 136, in server_bind
(pid=raylet, ip=***)     socketserver.TCPServer.server_bind(self)
(pid=raylet, ip=***)   File "/usr/lib/python3.6/socketserver.py", line 470, in server_bind
(pid=raylet, ip=***)     self.socket.bind(self.server_address)
(pid=raylet, ip=***) OSError: [Errno 98] Address already in use
2020-11-12 16:53:56,392 WARNING worker.py:1111 -- The agent on node *** failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 298, in <module>
    loop.run_until_complete(agent.run())
  File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 172, in run
    agent_ip_address=self.ip))
  File "/usr/local/lib/python3.6/dist-packages/grpc/experimental/aio/_call.py", line 286, in __await__
    self._cython_call._status)
grpc.experimental.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses"
    debug_error_string = "{"created":"@1605218036.477366833","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4090,"referenced_errors":[{"created":"@1605218036.477361267","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}"

raylet.out and raylet.err are explained at https://docs.ray.io/en/master/configure.html#logging-and-debugging. (This should be linked from https://docs.ray.io/en/master/debugging.html#backend-logging, but that link is broken at the moment; see https://github.com/ray-project/ray/pull/11956.)
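
If it helps, on a default installation those files typically sit under the current session's log directory (this path assumes the default temp dir; it differs if --temp-dir was changed):

/tmp/ray/session_latest/logs/raylet.out
/tmp/ray/session_latest/logs/raylet.err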

The above traceback shows that the Prometheus exporter has a port conflict. @mfitton Does the exporter use a fixed port? I am not familiar with the MetricsAgent.

@mfitton - Here are the two Raylet files:

raylet.err.txt
raylet.out.txt

According to the logs, I found two problems:

  1. The Prometheus exporter has a port conflict, so the agent exits with OSError: [Errno 98] Address already in use.
  2. The agent can't register with the raylet using ip = ray._private.services.get_node_ip_address(), port = node_manager_port, so the agent exits with grpc.experimental.aio._call.AioRpcError.

@roireshef Does your worker node have multiple network interface cards? I guess the second problem is caused by this:
the agent connects to the raylet with ip = ray._private.services.get_node_ip_address(), and that IP is different from the one in your worker command, --node-ip-address=$worker_node_ip.


As said, I'm working in docker containers, which abstract away the node's external IP (the only one that is valid to use if you were to connect from another machine). For applications _running inside a docker container_ that ask for an IP (and I'm assuming what you're doing there is similar to running "ifconfig" or "hostname" in a bash shell), Docker _will provide a different "virtual" IP_ that is accessible only from that same machine (or perhaps only from within the same docker container, I'm not entirely sure).

Since the node's valid IP is already passed in --node-ip-address=$worker_node_ip, why isn't that the only IP used across all services? If the user has already provided the application with the "right" IP of the machine, wouldn't propagating it across all services be the right thing to do here?


It feels like we are getting closer to solving (2); do you have any idea how to solve (1)?

BTW, when running v1.0.1 this doesn't happen. This might be because the new dashboard isn't initialized. So, if the problem only occurs when initializing the new dashboard, I suspect it's trying to do something different from the rest of the stack (Redis, Ray core, Tune, RLlib, etc.). It might be worth sticking to the Ray stack's standard communication protocols to avoid these edge cases. For instance, I'm not sure what Prometheus is, but was it used by the older dashboard, or is it used by the rest of the stack? If yes, let's understand how it works well in those areas...


Thanks. I will create a fix PR for (2) by passing the --node-ip-address value to the agent.

As for issue (1), it looks like when --metrics-export-port isn't passed as an argument to ray start, a random unused port is fetched for the metrics agent to use for Prometheus export.

The code is here https://github.com/ray-project/ray/blob/master/python/ray/node.py#L128 where the port gets selected and passed into the initialization of the Raylet as well as the dashboard agent.

It should be selecting an unused, random port, but does Docker block the use of internal ports unless they're explicitly specified? If so, it's possible that the issue is arising there. I'm going to reach out to a coworker who knows more about docker.
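
As a possible interim workaround (my suggestion, assuming a port such as 8080 is free inside the container, and published if it needs to be scraped from outside), the port can be pinned explicitly so the random selection is skipped:

ray start --head --metrics-export-port=8080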

@fyrestone thanks for contributing the fix for the IP issue, I reviewed it and will get it merged by a teammate ASAP.

def _get_unused_port(self, close_on_exit=True):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind(("", 0))
        port = s.getsockname()[1]

        # Try to generate a port that is far above the 'next available' one.
        # This solves issue #8254 where GRPC fails because the port assigned
        # from this method has been used by a different process.
        for _ in range(NUM_PORT_RETRIES):
            new_port = random.randint(port, 65535)
            new_s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            try:
                new_s.bind(("", new_port))
            except OSError:
                new_s.close()
                continue
            s.close()
            if close_on_exit:
                new_s.close()
            return new_port, new_s
        logger.error("Unable to succeed in selecting a random port.")
        if close_on_exit:
            s.close()
        return port, s

This is the code used to fetch an unused port, which for some reason fails to select a truly unused port on Docker.
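
One way this can go wrong (a minimal sketch of the time-of-check-to-time-of-use race, not Ray code) is that the port is only reserved while the probe socket stays bound; once it is closed and the number is handed off, anything else on the node, including another component given a port by the same routine, can bind it first:

import socket

# Probe: bind to an ephemeral port, note the number, then release it.
probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
probe.bind(("", 0))
port = probe.getsockname()[1]
probe.close()  # from this point on, nothing reserves `port`

# ... time passes while the number is handed to the raylet / dashboard agent ...

# Rebinding later fails if anything else grabbed the port in the meantime.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    server.bind(("", port))
except OSError as err:  # e.g. [Errno 98] Address already in use
    print("port was taken in the meantime:", err)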

@yiranwang52 @ijrsvt

Quick docker question: When not passed in explicitly, the metrics-export-port (which the MetricsAgent class is fed as an argument for where to host Prometheus metrics for scraping over HTTP) gets set to a random unused port on the machine. There's a user who's running into an issue starting the dashboard_agent because the MetricsAgent fails to bind to this port. Can either of you take a quick look and see if there are any red flags as to why this would fail in a docker environment?

@mfitton, @fyrestone I see you merged in a fix - thanks! Were you able to verify the dashboard works now when using Ray inside docker? Is this issue fixed completely?

@mfitton, @fyrestone I'm still experiencing this issue after installing the latest Ray wheel and verifying the fix is in it. Please see the attached logs in:

logs.tar.gz

I've already tried:

  • ray.init(address=.., redis_port=..., dashboard_host=..., dashboard_port=...)
  • initializing with ray start --block --head --port=6379 --redis-password=12345 --node-ip-address=10.67.34.148 --gcs-server-port=6005 --dashboard-port=6006 --dashboard-host=10.67.34.148 --node-manager-port=6007 --object-manager-port=6008 --redis-shard-ports=6400,6401,6402,6403,6404,6405,6406,6407,6408,6409 --min-worker-port=6100 --max-worker-port=6299 and pointing ray.init() to this address.


Thanks for the logs. It's weird that both the dashboard head -> GCS and dashboard agent -> raylet gRPC connections fail.

@roireshef Can you help confirm whether the environment variables http_proxy or https_proxy exist?
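
(A quick way to check, from a shell inside the same container the workers run in:)

env | grep -i proxy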

The GCS logs:

[2020-11-19 15:33:29,110 I 7616 7616] grpc_server.cc:74: GcsServer server started, listening on port 6005.
[2020-11-19 15:33:29,118 I 7616 7616] gcs_server.cc:273: Gcs server address = 10.67.34.148:6005
[2020-11-19 15:33:29,118 I 7616 7616] gcs_server.cc:277: Finished setting gcs server address: 10.67.34.148:6005

The Dashboard head logs:

2020-11-19 15:33:29,615 INFO head.py:161 -- Connect to GCS at b'10.67.34.148:6005'

2020-11-19 15:33:29,940 ERROR head.py:108 -- Got AioRpcError when updating nodes.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/head.py", line 74, in _update_nodes
    nodes = await self._get_nodes()
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/head.py", line 61, in _get_nodes
    request, timeout=2)
  File "/usr/local/lib/python3.6/dist-packages/grpc/experimental/aio/_call.py", line 286, in __await__
    self._cython_call._status)
grpc.experimental.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses"
    debug_error_string = "{"created":"@1605792809.939876173","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4090,"referenced_errors":[{"created":"@1605792809.939868059","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}"
>

It seems that the gRPC Python client uses the correct address, but can't connect to the gRPC server.
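
One way to isolate this (a sketch, assuming grpcio is installed in the container and using the GCS address 10.67.34.148:6005 from the logs above) is to test raw channel connectivity without any Ray code:

import grpc

# Try to open a connection to the GCS gRPC port from inside the container.
channel = grpc.insecure_channel("10.67.34.148:6005")
try:
    grpc.channel_ready_future(channel).result(timeout=5)
    print("channel is ready")
except grpc.FutureTimeoutError:
    print("could not connect (same UNAVAILABLE symptom)")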

Regarding the environment variables http_proxy and https_proxy. When I run the following script:

import time
import ray
import ray.services

@ray.remote
def f():
    time.sleep(8)
    return ray.services.get_node_ip_address()

if __name__ == "__main__":
    ray.init(num_cpus=1)
    IPaddresses = set(ray.get([f.remote() for _ in range(4)]))
    print('IPaddresses =', IPaddresses)
    ray.shutdown()

on a machine with http_proxy and https_proxy set, it spits out

Traceback (most recent call last):
  File "ray/new_dashboard/agent.py", line 305, in <module>
    loop.run_until_complete(agent.run())
  File "python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "python3.8/site-packages/ray/new_dashboard/agent.py", line 169, in run
    await raylet_stub.RegisterAgent(
  File "python3.8/site-packages/grpc/aio/_call.py", line 285, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4165,"referenced_errors":[{"description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":397,"grpc_status":14}]}"

This looks like the same error, right?
(The sleeps are necessary so that the script doesn't exit before the error appears; the length of sleep necessary presumably varies by machine.)

Obviously, on any machine with http_proxy and https_proxy set, no_proxy is also going to be set, presumably with localhost and 127.0.0.1...but no_proxy usually won't include the machine's external IP address. Ray is using that external IP address from get_node_ip_address().

For my machine, at least, adding the external IP address to no_proxy makes everything go through without that error message.

$ no_proxy="$(hostname -i),$no_proxy" python test_actors.py

I think @fyrestone hit the nail on the head.

Unfortunately, the problem being diagnosed is not the same thing as the problem being solved. Setting no_proxy that way works for a simple standalone script like that one, but for the more complicated operations such as ray start and tune, the new processes don't get started with the new value of no_proxy, even if you export no_proxy. The new processes must pull the values of the variables from some deeper level when they get started up, and I'm not sure where. Not .bashrc, I assume, since these new processes aren't starting in shells as such.

Looking at https://github.com/ray-project/ray/blob/master/python/ray/_private/services.py#L1438, the dashboard process doesn't have shell=True, so I'm really not sure where it's pulling the proxy information from. And yet setting no_proxy on the command line works when running a simple ray.init() script...?
https://stackoverflow.com/questions/12060863/python-subprocess-call-a-bash-alias

It's sort of baffling, because actors are also separate processes, but apparently those actors started from that script do somehow inherit the value of no_proxy. (export no_proxy="$(hostname -i),$no_proxy" makes that script go through just fine; it doesn't matter whether no_proxy is set on the same line, no_proxy just needs to be set.) Yet other workers created by ray start do not, so that

$ export no_proxy="$(hostname -i),$no_proxy"
$ ray start

still results in workers spitting out that error.

All the various processes started by Ray inherit no_proxy. I dunno how, but they do. You do need to set no_proxy on all machines involved, though, with the numerical IP addresses of all machines involved (comma-separated), including each machine's own. Remember that the IP address by which one machine can find another machine is not necessarily the same IP address that hostname -i brings up on the target machine.
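
The inheritance itself isn't mysterious, though (a minimal illustration, not Ray code): child processes started via Python's subprocess get a copy of the parent's environment unless env= is overridden, so an exported no_proxy in the shell that runs ray start should reach whatever processes it spawns:

import os
import subprocess
import sys

os.environ["no_proxy"] = "10.0.0.1,localhost"  # hypothetical value for illustration
# env=None (the default) means the child inherits the parent's environment.
subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['no_proxy'])"],
    check=True,
)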

You might not know in advance the IP address of every machine that will be joining. But you could probably brute-force that by adding the digits 0 through 9 to no_proxy; since no_proxy entries are typically matched as suffixes, that covers every numerical IP address (all of which end in a digit):

no_proxy="0,1,2,3,4,5,6,7,8,9,$no_proxy" python test_actors.py

(Presumably, the problem only happens because we're using raw numerical IP addresses; presumably, no_proxy is already set to cover all relevant domains.)
We could add that to the documentation as a recommendation to just always do this.

I'm not sure what to do about this with respect to the port-checking documentation. netcat and nmap (naturally, I think?) completely ignore http_proxy and https_proxy for non-HTTP traffic. (This isn't HTTP traffic, is it? This is a metric-export thing, and that's why it happens even with the dashboard disabled? I'm not sure why gRPC is using the proxy settings. I'm guessing there is some kind of gRPC-over-HTTP shenanigans going on, for some reason?)

(Okay, I guess they just always ignore proxy settings.

$ http_proxy=http://some.random.proxy:80 https_proxy=http://some.random.proxy:80 nc -vv -z www.google.com 80
Connection to www.google.com 80 port [tcp/http] succeeded!
$ http_proxy=http://some.random.proxy:80 https_proxy=http://some.random.proxy:80 nmap -p 80 www.google.com
PORT   STATE SERVICE
80/tcp open  http

I still don't get why gRPC is using the proxy. I assume this is dashboard-specific somehow, since nothing else goes wrong if you run http_proxy=http://some.imaginary.proxy:80 https_proxy=http://some.imaginary.proxy:80 python test_actors.py, just the dashboard thing. You even still get the correct answer, despite the error messages the dashboard is spitting out.)

(To be clear, if you literally use an imaginary proxy like http://some.random.proxy:80, you'll get a different error message. But the computation will still go through, so it's only the dashboard gRPC thing that's looking at http_proxy.)

In any case, we could have an error message that gives the IP and port that it failed to reach, possibly with a suggestion to add them to no_proxy.
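
If we wanted to sidestep proxies entirely rather than relying on users setting no_proxy, gRPC channels can be created with proxy resolution turned off via the "grpc.enable_http_proxy" channel argument. A hedged sketch of what that looks like at channel-creation time (whether Ray should set this on its internal channels is a separate decision):

import grpc

# Disable http_proxy/https_proxy handling for this channel only.
channel = grpc.insecure_channel(
    "10.67.34.148:6005",  # example address from the logs above
    options=[("grpc.enable_http_proxy", 0)],
)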
