Ray: pdb unusable due to invisible stdout

Created on 10 Feb 2019 · 17 Comments · Source: ray-project/ray

System information

Describe the problem

In the latest master, stdout somehow gets handled incorrectly, thus making pdb unusable. This happens at least for trainables run with tune.run_experiments. Things work fine before https://github.com/ray-project/ray/commit/ef527f84abf0cee7ac6ad832828ff92311440ee4.
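
For reference, a minimal sketch of the kind of script that hits this. The trainable and experiment names are made up for illustration and this is a sketch of the reported setup, not the reporter's actual script; the calls follow the Tune Trainable/run_experiments interface of that era:

import pdb

import ray
from ray import tune
from ray.tune import Trainable


class DebugTrainable(Trainable):
    def _setup(self, config):
        pass

    def _train(self):
        pdb.set_trace()  # the prompt is invisible because worker stdout is redirected
        return {"done": True}


ray.init()
tune.run_experiments({
    "pdb_repro": {
        "run": DebugTrainable,
        "stop": {"training_iteration": 1},
    },
})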

Label: regression

Most helpful comment

Can we actually make this issue a high priority?

In a separate project I'm working on, it's incredibly hard to debug things now... (this is not specific to Tune)

All 17 comments

@robertnishihara

@hartikainen, is ipdb being set on a worker or on the driver? If on a worker, then in order to get this to work the worker would presumably need to be running in tmux. If that was working before, how did you actually connect to the worker?

This is run locally with just a single trial at a time. I'm not sure about the terminology here (e.g. the distinction between the driver and worker), but basically I just ran the code with python ${RAY_PATH}/python/ray/tune/examples/logging_example.py and used the debugger in that session/process without separately connecting to it.

When not in cluster mode (just calling ray.init()), ipdb worked previously because it would just hijack stdin. There was no separate connecting needed.
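
Roughly what that pre-regression workflow looked like (a sketch, assuming ipdb is installed; the function name is illustrative):

import ipdb
import ray

ray.init()  # local, non-cluster mode

@ray.remote
def train_step():
    x = 40 + 2
    ipdb.set_trace()  # previously dropped into a usable prompt right in this terminal
    return x

print(ray.get(train_step.remote()))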

I see. I could see that working in very specific cases, but it would get ugly
really quickly, e.g., what if two workers drop into ipdb at the same time?

I think the way I’d prefer to address this is to allow workers to be
started in tmux so that you can attach to the one that drops into ipdb.
That will probably prevent us from redirecting logs for that worker (it’d
probably all just go to tmux), so it wouldn’t be the default.

Does that sound like a good development experience?

I think it's probably good to maintain the behavior from before (I've always
been able to work around the case of multiple workers wanting to use ipdb at once).

On the other hand, providing some debug mode where all workers were started
in tmux would be super helpful too.


I think if I had to manually connect to the worker, that would slow down my workflow enough that I would find some other way around it (for example, by running my development flow without Tune). If Tune supported local mode and I ran things using that, this problem wouldn't exist, right?

That's true, if Tune supported local_mode=True, then it would just work out of the box. The relevant issue is https://github.com/ray-project/ray/issues/2796.
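
For plain Ray (outside Tune), local mode already behaves this way, since tasks execute serially in the driver process and nothing is redirected. A sketch, using the local_mode flag referenced above:

import pdb
import ray

ray.init(local_mode=True)  # tasks execute in the driver process

@ray.remote
def f():
    pdb.set_trace()  # works: there is no separate worker process whose stdout is redirected
    return 1

print(ray.get(f.remote()))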

@robertnishihara is this a bug?
Note the flag in ray.init.

In [1]: import ray
In [2]: ray.init(redirect_worker_output=False)
2019-02-13 00:21:57,165 INFO node.py:276 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-02-13_00-21-57_43262/logs.
2019-02-13 00:21:57,272 INFO services.py:368 -- Waiting for redis server at 127.0.0.1:27250 to respond...
2019-02-13 00:21:57,393 INFO services.py:368 -- Waiting for redis server at 127.0.0.1:51126 to respond...
2019-02-13 00:21:57,396 INFO services.py:759 -- Starting Redis shard with 10.0 GB max memory.
2019-02-13 00:21:57,414 INFO services.py:1309 -- Starting the Plasma object store with 6.871947672999999 GB memory using /tmp.

======================================================================
View the web UI at http://localhost:8889/notebooks/ray_ui.ipynb?token=3162cde8b114e5abc7faeb6d4657d58bedf5946ca9e3b624
======================================================================

Out[2]:
{'node_ip_address': None,
 'object_store_address': '/tmp/ray/session_2019-02-13_00-21-57_43262/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2019-02-13_00-21-57_43262/sockets/raylet',
 'redis_address': '10.142.38.214:27250',
 'webui_url': 'http://localhost:8889/notebooks/ray_ui.ipynb?token=3162cde8b114e5abc7faeb6d4657d58bedf5946ca9e3b624'}

In [3]: @ray.remote
   ...: def hi():
   ...:     print("hiii")
   ...:

In [4]: hi.remote()
Out[4]: ObjectID(01000000c1aa68a4c2a883f77ecdd61886cf8db9)

In [5]: ray.get(_)

BTW, @hartikainen, a workaround for this issue is to set redirect_output=False. This (probably accidentally) causes the worker output to be fed to the driver console.

It's probably showing up in the raylet log file: the raylet now redirects its output, and since workers are forked from the raylet, if we don't redirect the worker output it ends up in the raylet's output. So I don't think this workaround currently works. You could say this is a bug, but redirect_worker_output is deprecated in #4025, so I wouldn't expect this workaround to work for very long (or even currently). If you want to use this workaround, you would also have to not redirect the raylet output (e.g., via redirect_output=False, which is also being deprecated).
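
For completeness, the deprecated workaround being discussed would look roughly like this. Both flags appear earlier in the thread and are being deprecated, so this is a sketch of the old invocation rather than a recommendation:

import ray

ray.init(redirect_output=False,         # don't redirect raylet output
         redirect_worker_output=False)  # don't redirect worker stdout/stderr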

what is the argument that should be used instead? log_to_driver?

There is currently no argument that allows worker stdin to be taken from the driver. I'm still thinking about possibilities here. Starting workers in tmux is one option, which is probably a good idea regardless; making local mode work well is probably another good one.

Can we actually make this issue a high priority?

In a separate project I'm working on, it's incredibly hard to debug things now... (this is not specific to Tune)

Somehow, it still works for me (Linux console here):

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 2/4 CPUs, 0/0 GPUs
Memory usage on this node: 2.4/12.3 GB
Result logdir: /home/eric/ray_results/default
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
 - PG_CartPole-v0_0:    RUNNING

(pid=12480) 2019-03-06 01:06:33,371 INFO policy_evaluator.py:275 -- Creating policy evaluation worker 0 on CPU (please ignore any CUDA init errors)
(pid=12480) 2019-03-06 01:06:33.372261: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
(pid=12478) 2019-03-06 01:06:39,520 INFO policy_evaluator.py:275 -- Creating policy evaluation worker 1 on CPU (please ignore any CUDA init errors)
(pid=12478) 2019-03-06 01:06:39.521582: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
(pid=12478) 
(pid=12478) > /usr/local/lib/python3.5/dist-packages/ray/rllib/evaluation/policy_evaluator.py(395)sample()
(pid=12478)     394         # This avoids over-running the target batch size.
(pid=12478) --> 395         if self.batch_mode == "truncate_episodes":
(pid=12478)     396             max_batches = self.num_envs
(pid=12478) 
(pid=12478) ipdb> 
2 + 2
(pid=12478) 4

Is this an OSX specific issue where stdin is inadvertently getting hijacked?

Hmm, this is a bit crazy: as you can see from the (pid=12478) prefix, the worker STDOUT is still being redirected to a file and then streamed to the driver.

But I guess we never redirected the worker STDIN, and so it is still inherited from the driver (via the raylet)...
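
A generic illustration of that file-descriptor behavior (not Ray's code): a child process whose stdout is redirected still inherits the parent's stdin, so it can read from the terminal even though its prompt text goes to the log file, which mirrors the invisible-prompt symptom above.

import subprocess
import sys

# Redirect the child's stdout to a log file but leave stdin alone.
with open("child.log", "w") as log:
    subprocess.run(
        [sys.executable, "-c", "print(input('type something: '))"],
        stdout=log,  # the prompt and the echoed line end up in child.log
    )                # stdin is inherited, so typing at the terminal still reaches the child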

My favorite solution is still to try to make the workflow of running workers in tmux work well.

Stale - please open a new issue if this is still relevant.
