Tensorboard: Tensorboard delete option for runs

Created on 2 Dec 2017 · 10 Comments · Source: tensorflow/tensorboard

Since many of us train with TensorFlow on a remote server and use TensorBoard to monitor and evaluate the training, I think it would be useful to add a delete option for specific runs on the TensorBoard dashboard.
When I debug a training process, I monitor what the network is doing and sometimes restart the process very quickly. As a result, my data directory easily fills up with failed runs. Deleting them from the command line is a little tedious because you always have to check in TensorBoard which ones you might want to keep. And usually you have to close TensorBoard to be able to delete them at all.

Maybe you guys have a suggestion for an improved workflow! Otherwise I think being able to delete runs directly from Tensorboard would be a nice addition. As part of this an 'active' flag for whether a run is still running might be interesting as well. Obviously this flag would disable the deletion option.
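
One way to approximate such an 'active' flag outside of TensorBoard is to check how recently a run's event files were written. The sketch below is only a heuristic, not anything TensorBoard provides: the name `is_run_active` and the 120-second idle window are made up for illustration, and it assumes event files have `tfevents` in their file name.

```python
import os
import time


def is_run_active(run_dir, max_idle_seconds=120.0, now=None):
    """Guess whether a run is still being written to.

    Hypothetical helper: a run counts as active if any event file in
    run_dir was modified within the last max_idle_seconds.
    """
    now = time.time() if now is None else now
    for name in os.listdir(run_dir):
        # Assumption: event files contain "tfevents" in their name.
        if "tfevents" not in name:
            continue
        mtime = os.path.getmtime(os.path.join(run_dir, name))
        if now - mtime <= max_idle_seconds:
            return True
    return False
```

A training job that logs summaries less often than the idle window would be misreported as inactive, so the window would need to match the logging interval.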

usability feature

All 10 comments

Right now I'm working on some code to be able to store summaries in SQLite which is going to help us with that sort of thing.

Any update on this issue?

No updates. We're working on the DB backend which should at some point at least support more of an "active" indicator.

Building functionality into TensorBoard to delete runs from disk may be out of scope, since it converts TensorBoard from being essentially read-only into something that can be used to permanently alter data on the machine it runs on, which puts it into a different realm in terms of security and access control. So while we might add that feature at some point, it would require a lot of careful consideration.

The original issue for this was opened in the TensorFlow repo and has more context: https://github.com/tensorflow/tensorflow/issues/5039

@stephanwlee is there more to say about experiment management here?

The issue as I see it is not that there are a bunch of runs littering the disk, but that you easily end up with a lot of pointless runs cluttering the interface, making it hard to find the relevant ones.
So couldn't this be done by keeping a list of 'deleted' runs and, instead of actually modifying the file system, just hiding those from the interface? The list could live in web storage, I guess, and if people want to delete the runs from the file system for real, I don't think it would be too complicated to generate a shell command you could copy-paste into your terminal.

I think it would be nice to have an indication of how long each run is in the list itself (time or steps). Short runs tend to be failures, as this issue is about; slightly longer runs tend to be testing stuff; long runs are the ones you see potential in and want to properly evaluate.
Anyway, I find it hard to organize the runs when you have a bunch and all you have is a search bar. Sometimes the runs I want to compare end up with the same color as well, because there just happened to be that number of failed runs between them.
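
The "generate a shell command" idea could look roughly like the sketch below. It is an illustration only: `build_delete_command` is a hypothetical helper, not part of TensorBoard, and it assumes the hidden runs are directory names relative to the log directory on a POSIX system. It only builds the command text and deletes nothing itself.

```python
import posixpath
import shlex


def build_delete_command(logdir, hidden_runs):
    """Return an `rm -rf` command for the hidden runs, or '' if there are none.

    Hypothetical helper: hidden_runs are run directory names relative to
    logdir. Each path is shell-quoted so names with spaces stay intact.
    """
    if not hidden_runs:
        return ""
    paths = [
        shlex.quote(posixpath.join(logdir, run)) for run in sorted(hidden_runs)
    ]
    # "--" stops option parsing, so a run name starting with "-" is safe.
    return "rm -rf -- " + " ".join(paths)


print(build_delete_command("logs", ["run_a", "run b"]))
# prints: rm -rf -- 'logs/run b' logs/run_a
```

Because the user copies and runs the command themselves, the dashboard stays read-only, which sidesteps the security concern raised earlier in the thread.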

What I would love would be a command line flag that was like...

tensorboard --logdir . --delete_smaller_than 50

Where 50 was the number of global steps, or some other relevant metric for specifying that a run is short.

@DuaneNielsen would this literally delete the data files, or mark them as not to be shown in the UI?

Literally delete the files. I believe the request is to get rid of "dead" files that are created during development or from poorly thought out run commands that get aborted right away, but leave a tf.event file on disk.

Here's a utility that does something like that.

It checks a log directory, figures out the max step of each run, and then deletes every run under a threshold.

It's a bit slow. It looks like the max step isn't stored in a header, so you need to read the entire contents to figure out how many steps there are, but it gets the job done.

"""Deletes all folders with small tensorboard run files."""
from argparse import ArgumentParser
import shutil

from tensorboard.backend.event_processing.event_file_inspector import (
    get_dict_to_print,
    get_inspection_units,
)

parser = ArgumentParser('delete small runs')
parser.add_argument('--logdir', type=str, default='.')
parser.add_argument('--delete_smaller_than', type=int, default=18000)
# With --list, only report what would be deleted (dry run).
parser.add_argument('--list', action='store_true')
args = parser.parse_args()

# Map each run directory to the largest step recorded in any of its fields.
run_len = {}
for run in get_inspection_units(logdir=args.logdir):
    max_length = 0
    for value in get_dict_to_print(run.field_to_obs).values():
        # Empty fields are None, and not every field reports a 'max_step'.
        if value is not None and 'max_step' in value:
            max_length = max(max_length, value['max_step'])
    run_len[run.name] = max_length

for run, length in run_len.items():
    if length < args.delete_smaller_than:
        if args.list:
            print(f'{run} is {length} steps long and so will be deleted')
        else:
            try:
                shutil.rmtree(run)
                print(f'{run} is {length} steps long and was deleted')
            except OSError:
                print(f"OS didn't let us delete {run}")
    else:
        print(f'{run} is {length} steps long and is good')