Ray: [tune] tf.summary.FileWriter extensibility for custom TensorBoard metrics

Created on 10 May 2019 · 20 comments · Source: ray-project/ray

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
  • Ray installed from (source or binary): source
  • Ray version: 0.6.6
  • Python version: 3.6.7
  • Exact command to reproduce: NA

Context: I rely on tune and TensorBoard to visualize training, using callbacks to define custom metrics in the result dictionary that is then passed to TFLogger.

Problem: ray saves scalars only, and all of them end up under the same 'ray' tab in TensorBoard. Having tens of metrics under the same tab hurts readability, particularly when the end user adds custom metrics. It would be a great feature to let users access TFLogger._file_writer so that they can add custom metrics (not just scalars) under custom tabs. Note that creating a second tf.summary.FileWriter is not an option, as two FileWriters sharing the same logdir are not supported at this time. Question: what's the recommended way to achieve this?

Attempts: using a custom Logger instance is not an option, since the trainer is never passed to it (only the results), which limits access to the custom metrics of interest. The on_train_result callback does pass the trainer (info['trainer']), but from there I don't see how to access TFLogger._file_writer to save custom metrics to TensorBoard.
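
A minimal sketch of the setup described above, assuming the rllib callbacks API of this era, where on_train_result receives an info dict containing the trainer and the result ("custom_metric" is an illustrative key):

def on_train_result(info):
    trainer = info["trainer"]       # the Trainer is reachable here...
    result = info["result"]
    result["custom_metric"] = 1.0   # ...but TFLogger._file_writer is not

config = {
    "callbacks": {"on_train_result": on_train_result},
}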

tune

Most helpful comment

I'm happy to give it a try when you give me the ok.

All 20 comments

A partial solution to save custom scalars under custom tabs in TensorBoard is to:
(1) save the scalars in the result dictionary via the on_train_result callback, and
(2) subclass TFLogger and implement its on_result method, which does have access to self._file_writer.
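
Concretely, a minimal sketch of that subclass, assuming the Ray 0.6.x logger API where TFLogger exposes _file_writer (the result key and tag are illustrative):

import tensorflow as tf
from ray.tune.logger import TFLogger

class CustomTFLogger(TFLogger):
    def on_result(self, result):
        super().on_result(result)  # keep the default scalar logging
        value = result.get("custom_metric")  # stashed by on_train_result
        if value is not None:
            summary = tf.Summary(value=[
                tf.Summary.Value(tag="custom/metric", simple_value=value)
            ])
            self._file_writer.add_summary(summary, result["training_iteration"])
            self._file_writer.flush()

The subclass can then be passed to tune via the loggers argument in place of the default TFLogger.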

This solution is partial because it still does not give access to the FileWriter and the trainer at the same time. I've tried passing the trainer or the computation graph as a value in the result dictionary, e.g.
result['graph'] = info['trainer'].get_policy().sess.graph
but I got the error:
# TypeError: can't pickle _thread.RLock objects
The result dictionary is pickled when it is shipped between processes, and the TF graph holds unpicklable lock objects.

I think the difference is that on_train_result runs on the Trainer, which is in a separate process from the one hosting the TFLogger. One possibility is to subclass the Trainer so it owns a tf.summary.FileWriter attribute, call that writer from on_train_result, and turn off the TFLogger.

This brings up a good point/larger discussion, which is that maybe we should do logging on the tune.Trainable rather than locally, and let rsync handle the data moving across the network back to the driver.

cc @ericl @hartikainen

This brings up a good point/larger discussion, which is that maybe we should do logging on the tune.Trainable rather than locally, and let rsync handle the data moving across the network back to the driver.

If the FileWriter were accessible from the trainer, then any sort of custom logging could just be implemented in callbacks, which would be more elegant and easier, I think (probably I'm just rephrasing your point).

As an aside: with a stronger integration between FileWriter and Trainer, it would be great to add to TensorBoard things like computation graphs and the L2 norms of weights/gradients.

Instead of a stronger integration, how about we log to an entirely separate FileWriter? I believe that file sync will automatically move those event logs to the result directory on the head node as well.

To be concrete, in the trainable:

  • create a separate tf file writer object
  • log to that directly, bypassing tune apis

Any downsides of this approach?
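
A hedged sketch of that idea, assuming a TF 1.x Trainable (the class name, subdirectory, tag, and placeholder values are illustrative):

import os
import tensorflow as tf
from ray import tune

class MyTrainable(tune.Trainable):
    def _setup(self, config):
        # A separate writer pointed at a subdirectory of the trial logdir,
        # so it does not collide with tune's own event file.
        self._writer = tf.summary.FileWriter(os.path.join(self.logdir, "custom"))

    def _train(self):
        # ... one training iteration ...
        summary = tf.Summary(value=[
            tf.Summary.Value(tag="custom/l2_norm", simple_value=0.0)
        ])
        self._writer.add_summary(summary, self._iteration)
        self._writer.flush()
        return {"episode_reward_mean": 0.0}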

Yeah, I was thinking that too. I guess the thing I was specifically looking at was moving the logging logic entirely into the trainable and keeping the syncing logic on the driver (we're already doing something like this in #4450).

A workaround for now is just making a subdirectory in the Trainable Setup and creating a SummaryWriter that points directly to that directory. @FedericoFontana does that sound viable for you?

A workaround for now is just making a subdirectory in the Trainable Setup and creating a SummaryWriter that points directly to that directory. @FedericoFontana does that sound viable for you?

It sounds like this solution would involve instantiating a SummaryWriter in both TFLogger and the Trainable, which might be a bit of a hack considering that multiple SummaryWriters are not supported. When I tried to create a second SummaryWriter with a different logdir (a subdir), TensorBoard displayed the same trial under two different names. When both writers share the same logdir, as soon as the second SummaryWriter is flushed, the first one stops writing.

If you are suggesting to stop using TFLogger and use a single SummaryWriter in the Trainable, then I'm happy to give it a try.

Yeah, I would suggest removing TFLogger for now and using a single SummaryWriter in the trainable.
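
In tune that could look like the following sketch (the loggers argument is the one mentioned later in this thread; MyTrainable is illustrative):

from ray import tune
from ray.tune.logger import JsonLogger, CSVLogger

# Keep JSON/CSV logging but omit TFLogger, so the Trainable's own
# SummaryWriter is the only one writing TensorBoard event files.
tune.run(
    MyTrainable,
    loggers=[JsonLogger, CSVLogger],
)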


@richardliaw I did as you recommended (no TFLogger, FileWriter instantiated in Trainer._init) and it works like a charm. Note that I provided the computation graph when initializing the FileWriter. I think rllib/tune should save the computation graph by default, as it is invaluable both for developers (debugging) and for users (understanding/visualizing the policy network without going through any source code).

What should the next step be (e.g. if PR, what should the PR change)?

import tensorflow as tf
from ray.rllib.agents.ppo import PPOTrainer

class PPO(PPOTrainer):
    def _init(self, config, env_creator):
        super()._init(config, env_creator)
        # Attach a FileWriter to the trial logdir, passing the policy's
        # graph so TensorBoard can render the computation graph.
        self._file_writer = tf.summary.FileWriter(
            logdir=self.logdir,
            graph=self.get_policy().sess.graph,
        )
        self._file_writer.flush()


That's a good point; there are a couple of parts of the code currently under active development that may block this:

  1. #4450 will separate the Logger from its uploading functionality
  2. #4362 will provide a much more extensible and flexible logging API

So I guess it'd be nice to get those things merged before merging this change... it'll probably be a couple of weeks from now, since deadlines are taking priority, but I'll keep you updated when they are merged, if you want to be the one to make the PR?

I'm happy to give it a try when you give me the ok.


Has this been implemented already?
If so, what changes are required to see the graph in tensorboard?

@OnTheRicky, did you ever get any more feedback on this? I also want to know how to visualize the policy graph in TensorBoard. There are several other issues that discuss this, but none have worked for me "out of the box".


One issue with the FileWriter-in-the-Trainable approach is that, if I'm not mistaken, each FileWriter gets its own events file. If you also want to use the native tune logging (e.g. logging the results returned from the .step() method), you end up with two separate events files in the same directory. This tends to mess up TensorBoard, in that it doesn't refresh correctly.

@ethanabrooks to work around this, you can disable the Tune logger (tune.run(loggers=None)) or enable only a subset of the loggers. Also, if you want to capture results, you can do:

class PPO(..):
    def _log_result(self, result):
        self._file_writer.log(result)
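
Since FileWriter itself has no log() method, a runnable variant of that hook might look like this sketch (assumes TF 1.x and the PPOTrainer subclass shown earlier; the tag and result keys are illustrative):

class PPO(PPOTrainer):
    def _log_result(self, result):
        # Wrap the values of interest in a tf.Summary and hand them to the
        # writer created in _init; add_summary is the actual FileWriter API.
        summary = tf.Summary(value=[
            tf.Summary.Value(tag="custom/reward",
                             simple_value=result["episode_reward_mean"])
        ])
        self._file_writer.add_summary(summary, result["training_iteration"])
        self._file_writer.flush()
        super()._log_result(result)  # forward to whatever Tune loggers remain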

@richardliaw thanks for the suggestions. I use a lot of features of the Tune logger, so I don't want to disable it. As for overriding _log_result: I am serializing large arrays, so I would rather not have them printed to my screen (it looks like ray prints the entire result dictionary to the screen each iteration). Is there a way to suppress printing of some values in the result dictionary?

As a hack, you might be able to use _log_result and pop those values out inside that method.
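
For instance (a sketch; "large_array" stands in for whichever bulky key is being serialized):

class PPO(PPOTrainer):
    def _log_result(self, result):
        result.pop("large_array", None)  # drop bulky values before they reach the loggers
        super()._log_result(result)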

You can also do tune.run(verbose=1) to choose which values you want to display.

@richardliaw it looks like the values get printed before _log_result runs, so popping the values out does not change the printed output. If I used verbose=1, how would I designate which values get printed?
