Ray version and other system information (Python version, TensorFlow version, OS): Ray 1.1dev
OS: Ubuntu 18.04
Python 3.6
When Ray is started and stopped via ray.init(), I see a leftover process like this:
swang 30660 4805 1 16:55 pts/1 00:00:02 /home/swang/anaconda3/envs/ray-36/bin/python -u /home/swang/ray/python/ray/new_dashboard/agent.py ...
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
import ray

ray.init()
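To check for the leftover process after the script exits, something like the following might help. This is only a sketch: `find_agent_lines` and `list_leftover_agents` are hypothetical helper names (not part of Ray), and it assumes a Linux host where `ps -ef` is available and the agent's command line contains `new_dashboard/agent.py`, as in the `ps` output above.

```python
import subprocess

# Assumption: the leaked agent's command line contains this path fragment,
# as in the ps output quoted in this report.
AGENT_MARKER = "new_dashboard/agent.py"


def find_agent_lines(ps_output, marker=AGENT_MARKER):
    """Return the lines of `ps` output that mention the dashboard agent."""
    return [line for line in ps_output.splitlines() if marker in line]


def list_leftover_agents():
    """Run `ps -ef` and return the lines for any dashboard agent process."""
    result = subprocess.run(
        ["ps", "-ef"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return find_agent_lines(result.stdout)
```

Calling `list_leftover_agents()` a few seconds after the driver has exited should return an empty list if nothing leaked.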
cc @mfitton
Max is out until next week. @fyrestone @mxz96102 could you please take a look? This might be linux-only because I'm not able to repro locally.
Closed by #12096 :)
The dashboard agent has a loop that checks whether its parent process is still alive (https://github.com/ray-project/ray/blob/master/dashboard/agent.py#L90). Are there any logs from the leaked dashboard agent?
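For readers unfamiliar with this pattern, here is a minimal sketch of a parent-liveness check loop. It is not the code at the link above: `watch_parent`, `get_ppid`, and `interval_s` are names made up for illustration, and it relies only on the fact that on Linux an orphaned process is re-parented, so its parent PID changes.

```python
import asyncio
import os


async def watch_parent(get_ppid=os.getppid, interval_s=2.0):
    """Poll the parent PID and return once it differs from the initial one.

    Returns the number of checks performed before the change was seen.
    `get_ppid` is injectable so the loop can be exercised without
    actually killing a parent process.
    """
    original_ppid = get_ppid()
    checks = 0
    while True:
        await asyncio.sleep(interval_s)
        checks += 1
        if get_ppid() != original_ppid:
            # The parent is gone and this process has been re-parented.
            return checks


async def exit_when_orphaned():
    """Hypothetical entry point: terminate this process once orphaned."""
    await watch_parent()
    os._exit(0)
```

If a loop like this stops running (for example because the task was never scheduled or died with an exception), the process would outlive its parent, which is one thing the agent logs could help confirm.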
@stephanie-wang did you confirm that this fixes the issue? Above it says you have processes left from ray.init() but #12096 seems to only change ray stop.
Oh I think you're right about that. It seems it's still leaking processes. Here is the output that I have from a dashboard agent (it repeats this over and over):
2020-11-20 11:21:07,215 INFO agent.py:69 -- Dashboard agent grpc address: XXX.XXX.XXX.XXX:63209
2020-11-20 11:21:07,221 INFO utils.py:201 -- Get all modules by type: DashboardAgentModule
2020-11-20 11:21:07,889 INFO agent.py:82 -- Loading DashboardAgentModule: <class 'ray.new_dashboard.modules.log.log_agent.LogAgent'>
2020-11-20 11:21:07,889 INFO agent.py:82 -- Loading DashboardAgentModule: <class 'ray.new_dashboard.modules.reporter.reporter_agent.ReporterAgent'>
2020-11-20 11:21:07,892 INFO agent.py:86 -- Loaded 2 modules.
2020-11-20 11:21:07,893 INFO agent.py:150 -- Dashboard agent http address: XXX.XXX.XXX.XXX:42441
2020-11-20 11:21:07,894 INFO agent.py:157 -- <ResourceRoute [GET] <StaticResource /logs -> PosixPath('/tmp/ray/session_2020-11-20_11-21-05_809084_19322/logs')> -> <bound method StaticResource._handle of <StaticResource /logs -> PosixPath('/tmp/ray/session_2020-11-20_11-21-05_809084_19322/logs')>>
2020-11-20 11:21:07,894 INFO agent.py:157 -- <ResourceRoute [OPTIONS] <StaticResource /logs -> PosixPath('/tmp/ray/session_2020-11-20_11-21-05_809084_19322/logs')> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f4226661da0>>
2020-11-20 11:21:07,894 INFO agent.py:158 -- Registered 2 routes.
2020-11-20 11:21:10,437 ERROR reporter_agent.py:234 -- Error publishing node physical stats.
Traceback (most recent call last):
  File "/home/swang/ray/python/ray/new_dashboard/modules/reporter/reporter_agent.py", line 232, in _perform_iteration
    await aioredis_client.publish(self._key, jsonify_asdict(stats))
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/site-packages/aioredis/pool.py", line 257, in _wait_execute
    conn = await self.acquire(command, args)
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/site-packages/aioredis/pool.py", line 324, in acquire
    await self._fill_free(override_min=True)
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/site-packages/aioredis/pool.py", line 383, in _fill_free
    conn = await self._create_new_connection(self._address)
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/site-packages/aioredis/connection.py", line 113, in create_connection
    timeout)
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/asyncio/tasks.py", line 339, in wait_for
    return (yield from fut)
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/site-packages/aioredis/stream.py", line 24, in open_connection
    lambda: protocol, host, port, **kwds)
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/asyncio/base_events.py", line 798, in create_connection
    raise exceptions[0]
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/asyncio/base_events.py", line 785, in create_connection
    yield from self.sock_connect(sock, address)
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/asyncio/selector_events.py", line 439, in sock_connect
    return (yield from fut)
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/asyncio/selector_events.py", line 469, in _sock_connect_cb
    raise OSError(err, 'Connect call failed %s' % (address,))
ConnectionRefusedError: [Errno 111] Connect call failed ('XXX.XXX.XXX.XXX', 6379)