Ray: [dashboard] "new_dashboard" is leaking processes

Created on 19 Nov 2020 · 6 Comments · Source: ray-project/ray

What is the problem?

Ray version and other system information (Python version, TensorFlow version, OS): Ray 1.1dev

OS: Ubuntu 18.04
Python 3.6

After Ray is started with ray.init() and then stopped, I see a leftover process like this:

swang    30660  4805  1 16:55 pts/1    00:00:02 /home/swang/anaconda3/envs/ray-36/bin/python -u /home/swang/ray/python/ray/new_dashboard/agent.py ...

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

import ray
ray.init()
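One way to check for a leftover agent after the script exits is to scan `ps -ef` output for the agent's script path. The following is a minimal sketch, not part of Ray: the helper names `find_agent_pids` and `list_leaked_agents` are hypothetical, and the marker string is taken from the process listing above. It assumes a Linux `ps` with the usual `UID PID PPID C STIME TTY TIME CMD` columns.

```python
import subprocess

# Path fragment seen in the leaked process's command line (assumption:
# the agent script lives under new_dashboard/, as in the listing above).
AGENT_MARKER = "new_dashboard/agent.py"


def find_agent_pids(ps_output, marker=AGENT_MARKER):
    """Return PIDs from `ps -ef` output whose command contains `marker`."""
    pids = []
    for line in ps_output.splitlines():
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # skip the header row and malformed lines
        if marker in fields[7]:
            pids.append(int(fields[1]))
    return pids


def list_leaked_agents():
    """Run `ps -ef` and return PIDs of dashboard agents still running."""
    result = subprocess.run(
        ["ps", "-ef"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return find_agent_pids(result.stdout)

# Usage (run after the reproduction script has exited):
#   print(list_leaked_agents())
```

An empty list after the driver exits would suggest the agent shut down cleanly; any PID in the list is a candidate leak worth inspecting.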

P0 bug release-blocker

All 6 comments

cc @mfitton

Max is out until next week. @fyrestone @mxz96102 could you please take a look? This might be linux-only because I'm not able to repro locally.

Closed by #12096 :)

The dashboard agent has a loop that checks whether its parent is alive: https://github.com/ray-project/ray/blob/master/dashboard/agent.py#L90. Are there any logs for the leaked dashboard agent?
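For readers unfamiliar with the pattern, a parent-liveness loop can be sketched roughly as follows. This is an illustration only, not the code at the linked line: the names `parent_is_alive` and `watch_parent` are hypothetical, and it relies on the POSIX behavior that an orphaned process is re-parented, so `os.getppid()` stops matching the original parent PID.

```python
import asyncio
import os


def parent_is_alive(expected_ppid, getppid=os.getppid):
    """Return True while this process is still a child of `expected_ppid`.

    On POSIX an orphaned process is re-parented (for example to init),
    so a changed parent PID indicates the original parent has exited.
    """
    return getppid() == expected_ppid


async def watch_parent(expected_ppid, on_parent_exit,
                       interval_s=2.0, getppid=os.getppid):
    """Poll the parent PID and call `on_parent_exit` once it changes."""
    while parent_is_alive(expected_ppid, getppid):
        await asyncio.sleep(interval_s)
    on_parent_exit()

# Usage sketch: record the parent PID at startup, then exit when it changes.
#   ppid = os.getppid()
#   asyncio.ensure_future(watch_parent(ppid, lambda: os._exit(0)))
```

A check like this only helps if the expected parent PID is recorded while the parent is still running, and if the loop is actually scheduled. If either assumption fails, the child can outlive its parent, which is one possible way for an agent to leak.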

@stephanie-wang did you confirm that this fixes the issue? Above it says you have processes left from ray.init() but #12096 seems to only change ray stop.

Oh I think you're right about that. It seems it's still leaking processes. Here is the output that I have from a dashboard agent (it repeats this over and over):

2020-11-20 11:21:07,215 INFO agent.py:69 -- Dashboard agent grpc address: XXX.XXX.XXX.XXX:63209
2020-11-20 11:21:07,221 INFO utils.py:201 -- Get all modules by type: DashboardAgentModule
2020-11-20 11:21:07,889 INFO agent.py:82 -- Loading DashboardAgentModule: <class 'ray.new_dashboard.modules.log.log_agent.LogAgent'>
2020-11-20 11:21:07,889 INFO agent.py:82 -- Loading DashboardAgentModule: <class 'ray.new_dashboard.modules.reporter.reporter_agent.ReporterAgent'>
2020-11-20 11:21:07,892 INFO agent.py:86 -- Loaded 2 modules.
2020-11-20 11:21:07,893 INFO agent.py:150 -- Dashboard agent http address: XXX.XXX.XXX.XXX:42441
2020-11-20 11:21:07,894 INFO agent.py:157 -- <ResourceRoute [GET] <StaticResource  /logs -> PosixPath('/tmp/ray/session_2020-11-20_11-21-05_809084_19322/logs')> -> <bound method StaticResource._handle of <StaticResource  /logs -> PosixPath('/tmp/ray/session_2020-11-20_11-21-05_809084_19322/logs')>>
2020-11-20 11:21:07,894 INFO agent.py:157 -- <ResourceRoute [OPTIONS] <StaticResource  /logs -> PosixPath('/tmp/ray/session_2020-11-20_11-21-05_809084_19322/logs')> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f4226661da0>>
2020-11-20 11:21:07,894 INFO agent.py:158 -- Registered 2 routes.
2020-11-20 11:21:10,437 ERROR reporter_agent.py:234 -- Error publishing node physical stats.
Traceback (most recent call last):
  File "/home/swang/ray/python/ray/new_dashboard/modules/reporter/reporter_agent.py", line 232, in _perform_iteration
    await aioredis_client.publish(self._key, jsonify_asdict(stats))
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/site-packages/aioredis/pool.py", line 257, in _wait_execute
    conn = await self.acquire(command, args)
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/site-packages/aioredis/pool.py", line 324, in acquire
    await self._fill_free(override_min=True)
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/site-packages/aioredis/pool.py", line 383, in _fill_free
    conn = await self._create_new_connection(self._address)
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/site-packages/aioredis/connection.py", line 113, in create_connection
    timeout)
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/asyncio/tasks.py", line 339, in wait_for
    return (yield from fut)
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/site-packages/aioredis/stream.py", line 24, in open_connection
    lambda: protocol, host, port, **kwds)
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/asyncio/base_events.py", line 798, in create_connection
    raise exceptions[0]
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/asyncio/base_events.py", line 785, in create_connection
    yield from self.sock_connect(sock, address)
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/asyncio/selector_events.py", line 439, in sock_connect
    return (yield from fut)
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/asyncio/selector_events.py", line 469, in _sock_connect_cb
    raise OSError(err, 'Connect call failed %s' % (address,))
ConnectionRefusedError: [Errno 111] Connect call failed ('XXX.XXX.XXX.XXX', 6379)