TensorBoard: Mesh plugin hits KeyError when new mesh summary tags are added to a run

Created on 20 Aug 2019 · 5 Comments · Source: tensorflow/tensorboard

cc @podlipensky

The following steps will recreate the issue against today's tb-nightly:

  1. Create a virtualenv w/ tf-nightly-2.0-preview and the current tb-nightly
  2. Build the mesh_demo_v2 binary as updated in https://github.com/tensorflow/tensorboard/pull/2578
  3. wget https://people.sc.fsu.edu/~jburkardt/data/ply/teapot.ply
  4. bazel-bin/tensorboard/plugins/mesh/mesh_demo_v2 --mesh_path=teapot.ply --tag_name=mesh1
  5. tensorboard --logdir /tmp/mesh_demo
  6. Open TensorBoard tab, confirm that tag mesh1 appears as expected in TB
  7. bazel-bin/tensorboard/plugins/mesh/mesh_demo_v2 --mesh_path=teapot.ply --tag_name=mesh2
  8. Reload TensorBoard tab

The mesh visualizations fail to load because the tags request hits a 500 internal server error, due to the handler crashing on a KeyError here:

Traceback (most recent call last):
  File "/usr/local/google/home/nickfelt/.tf-venvs/tf-nightly-2.0-preview-py2/lib/python2.7/site-packages/werkzeug/serving.py", line 270, in run_wsgi
    execute(self.server.app)
  File "/usr/local/google/home/nickfelt/.tf-venvs/tf-nightly-2.0-preview-py2/lib/python2.7/site-packages/werkzeug/serving.py", line 258, in execute
    application_iter = app(environ, start_response)
  File "/usr/local/google/home/nickfelt/.tf-venvs/tf-nightly-2.0-preview-py2/lib/python2.7/site-packages/tensorboard/backend/application.py", line 380, in __call__
    return self.exact_routes[clean_path](environ, start_response)
  File "/usr/local/google/home/nickfelt/.tf-venvs/tf-nightly-2.0-preview-py2/lib/python2.7/site-packages/werkzeug/wrappers.py", line 308, in application
    resp = f(*args[:-2] + (request,))
  File "/usr/local/google/home/nickfelt/.tf-venvs/tf-nightly-2.0-preview-py2/lib/python2.7/site-packages/tensorboard/plugins/mesh/mesh_plugin.py", line 107, in _serve_tags
    tag = self._instance_tag_to_tag[(run, instance_tag)]
KeyError: ('.', u'mesh2_FACE')

The problem is that the mesh plugin permanently caches any non-empty result from PluginRunToTagToContent() here:
https://github.com/tensorflow/tensorboard/blob/2b96c2a18da8cfe5f387cc07d03e8227706ca914/tensorboard/plugins/mesh/mesh_plugin.py#L56-L76

Later on, _serve_tags() calls PluginRunToTagToContent() again and then prepare_metadata(), but the latter is a no-op because we already have non-empty metadata cached for the mesh1 tag. When we then index into the tag dict with the new mesh2 instance tags, we get the KeyError.
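
For illustration, the failure mode can be reproduced in isolation with a toy model. This is only a sketch, not the plugin's code: FakeMultiplexer and AllOrNothingCache are hypothetical stand-ins, and the "<tag>_<SUFFIX>" naming of instance tags is an assumption inferred from the mesh2_FACE key in the traceback.

```python
class FakeMultiplexer:
    """Hypothetical stand-in for the event multiplexer (not the real class)."""

    def __init__(self):
        self._data = {}  # run -> {instance_tag: content}

    def add(self, run, instance_tag, content=b""):
        self._data.setdefault(run, {})[instance_tag] = content

    def PluginRunToTagToContent(self, plugin_name):
        # Return a copy so callers cannot mutate the fake's state.
        return {run: dict(tags) for run, tags in self._data.items()}


class AllOrNothingCache:
    """Models the bug: once anything is cached, the cache never refreshes."""

    def __init__(self, multiplexer):
        self._mux = multiplexer
        self._instance_tag_to_tag = {}

    def prepare_metadata(self):
        if self._instance_tag_to_tag:
            return  # Non-empty cache: skip, even if new tags have appeared.
        mapping = self._mux.PluginRunToTagToContent("mesh")
        for run, tags in mapping.items():
            for instance_tag in tags:
                # Assumed naming: "<tag>_<SUFFIX>", e.g. "mesh2_FACE".
                tag = instance_tag.rsplit("_", 1)[0]
                self._instance_tag_to_tag[(run, instance_tag)] = tag

    def serve_tags(self):
        self.prepare_metadata()
        mapping = self._mux.PluginRunToTagToContent("mesh")
        result = {}
        for run, tags in mapping.items():
            for instance_tag in tags:
                tag = self._instance_tag_to_tag[(run, instance_tag)]
                result.setdefault(run, set()).add(tag)
        return result


mux = FakeMultiplexer()
mux.add(".", "mesh1_VERTEX")
cache = AllOrNothingCache(mux)
first = cache.serve_tags()  # Works: only mesh1 exists so far.

mux.add(".", "mesh2_FACE")  # A new tag is written to the same run.
try:
    cache.serve_tags()
    failed_key = None
except KeyError as e:
    failed_key = e.args[0]  # The key that was never cached.
```

The second call fails for the same reason as the real handler: the cache was filled once and is never extended.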

Right now, the workaround is restarting TensorBoard, since on a fresh load it will correctly cache both tags.

I think a sufficient fix would just be to cache at the granularity of an individual (run, tag) pair; you could have a lookup helper to check the cache and populate it on a cache miss, instead of prepare_metadata(). That might also resolve the issue mentioned in
https://github.com/tensorflow/tensorboard/blob/2b96c2a18da8cfe5f387cc07d03e8227706ca914/tensorboard/plugins/mesh/mesh_plugin.py#L192-L195
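
A minimal sketch of such a lookup helper, under stated assumptions: PerKeyTagCache and load_tag are hypothetical names, and the loader here derives the tag from an assumed "<tag>_<SUFFIX>" naming rather than consulting the multiplexer as a real plugin would.

```python
class PerKeyTagCache:
    """Caches one entry per (run, instance_tag) and fills it in on a miss."""

    def __init__(self, load_tag):
        # load_tag(run, instance_tag) -> tag. Hypothetical loader; in a real
        # plugin it would consult the multiplexer for that one entry.
        self._load_tag = load_tag
        self._cache = {}

    def lookup(self, run, instance_tag):
        key = (run, instance_tag)
        try:
            return self._cache[key]
        except KeyError:
            # Cache miss: load just this entry and remember it.
            tag = self._load_tag(run, instance_tag)
            self._cache[key] = tag
            return tag


calls = []


def load_tag(run, instance_tag):
    # Assumption: instance tags are named "<tag>_<SUFFIX>".
    calls.append((run, instance_tag))
    return instance_tag.rsplit("_", 1)[0]


cache = PerKeyTagCache(load_tag)
a = cache.lookup(".", "mesh1_VERTEX")
b = cache.lookup(".", "mesh2_FACE")   # New tag: populated on the miss.
c = cache.lookup(".", "mesh2_FACE")   # Served from the cache, no reload.
```

Because entries are added one key at a time, a tag that appears after startup is picked up on first access instead of raising. The sketch is not thread-safe and never evicts entries.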

Note that it's also best practice in general not to call into the multiplexer during the plugin construction (as is happening now via the prepare_metadata() call at the end of __init__()), since if this is slow it will delay startup for TensorBoard as a whole.

mesh bug

All 5 comments

Hi

Is there any update regarding this issue?
Just to confirm, this is still happening in TB 2.0.0

Many thanks

Hello

Apologies for chasing you on this, but I'm reporting that the same bug is present in TensorBoard 2.1.0.

Would it be possible to get some update on that issue?

It makes it difficult to use TB to track mesh values during training, as every time something is updated the whole TB session may need to be restarted, which also means reloading all the data (which takes time).

I believe this issue deserves a fairly high severity rating, as it strongly affects the usability of TB for its primary intended purpose (i.e., monitoring training).

All the best

Hi @mbahri—we don’t have an update on this issue, no. I’ll see if I can
take a look at it today and at least get back to you.

Took a quick look; some observations:

  • Replacing with a memoized lookup function works for two of the three
    caches, but not so easily for _tag_to_instance_tags, which is more
    like an inverted index where the others are forward indices.

  • We could use memoized lookup functions for the two “easy” caches,
    and for _tag_to_instance_tags use a memoized lookup function that
    calls PluginRunToTagToContent to determine the universe of active
    tags, expanding its cache if there are new, never-before-seen tags.

  • Even with these fixes, the caches are still fundamentally broken
    because they won’t handle deletions at all. We could work around
    this for the two “easy” caches by validating that the tag still
    exists at each access. We could do something similar for the
    inverted index, revalidating all of its registered tags. This would
    still be broken, because if a run is deleted and then re-created
    with different data without making any requests to the mesh plugin
    in the middle then the cache dirtying will not be detected.
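
The last limitation can be made concrete with a toy model. This is a sketch only: ValidatingCache is a hypothetical name, and the plain dict `live` stands in for whatever the multiplexer currently holds.

```python
class ValidatingCache:
    """Forward cache that re-checks key existence on every access."""

    def __init__(self, live_data):
        # live_data: dict mapping (run, instance_tag) -> content.
        # Hypothetical stand-in for the multiplexer's current state.
        self._live = live_data
        self._cache = {}

    def lookup(self, run, instance_tag):
        key = (run, instance_tag)
        if key not in self._live:
            # The tag no longer exists: evict it and report the miss.
            self._cache.pop(key, None)
            raise KeyError(key)
        if key not in self._cache:
            self._cache[key] = self._live[key]
        return self._cache[key]


live = {
    ("run1", "mesh1_VERTEX"): "v1",
    ("run1", "mesh2_VERTEX"): "w1",
}
cache = ValidatingCache(live)
cache.lookup("run1", "mesh1_VERTEX")
cache.lookup("run1", "mesh2_VERTEX")

# A plain deletion is detected on the next access.
del live[("run1", "mesh2_VERTEX")]
try:
    cache.lookup("run1", "mesh2_VERTEX")
    deletion_detected = False
except KeyError:
    deletion_detected = True

# Delete and re-create with different data, with no request in between.
del live[("run1", "mesh1_VERTEX")]
live[("run1", "mesh1_VERTEX")] = "v2"
stale = cache.lookup("run1", "mesh1_VERTEX")  # Existence check passes.
```

The existence check catches the plain deletion, but after the delete-and-re-create the cache still returns the old value, because nothing it checks has visibly changed.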

This raises the question of whether it’s worth keeping the caches around
at all. We can substantially cheapen the full scans by performing them
at the accumulator level: i.e., _tag_to_instance_tags[(run, tag)] need
only look at mux.GetAccumulator(run).PluginTagToContent("mesh").
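
A rough sketch of that cache-free lookup, under stated assumptions: FakeAccumulator and FakeMultiplexer are hypothetical stand-ins that expose only the two method names mentioned above, instance_tags_for is a hypothetical helper, and the tag is derived from an assumed "<tag>_<SUFFIX>" naming rather than from parsed summary metadata.

```python
class FakeAccumulator:
    """Hypothetical stand-in exposing only PluginTagToContent."""

    def __init__(self, tag_to_content):
        self._tag_to_content = tag_to_content  # Shared, mutable dict.

    def PluginTagToContent(self, plugin_name):
        # Snapshot of the current state at call time.
        return dict(self._tag_to_content)


class FakeMultiplexer:
    """Hypothetical stand-in exposing only GetAccumulator."""

    def __init__(self, accumulators):
        self._accumulators = accumulators

    def GetAccumulator(self, run):
        return self._accumulators[run]


def instance_tags_for(mux, run, tag):
    """Computes the inverted lookup on demand by scanning a single run."""
    contents = mux.GetAccumulator(run).PluginTagToContent("mesh")
    # Assumption: instance tags are named "<tag>_<SUFFIX>".
    return sorted(t for t in contents if t.rsplit("_", 1)[0] == tag)


train_tags = {"mesh1_VERTEX": b"", "mesh1_FACE": b""}
mux = FakeMultiplexer({"train": FakeAccumulator(train_tags)})

before = instance_tags_for(mux, "train", "mesh2")
train_tags["mesh2_VERTEX"] = b""  # New data arrives for the run.
after = instance_tags_for(mux, "train", "mesh2")
mesh1 = instance_tags_for(mux, "train", "mesh1")
```

Since nothing is stored between calls, additions and deletions are both reflected on the next request, at the cost of one scan over a single run's tags.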

Hi @wchargin

Thank you for the quick reply and for having a first look at that issue!

I know you guys have a lot on your plate and that this plugin is only a part of the whole system, so I appreciate you taking the time.

Hopefully, a fix can be found. The mesh plugin is really handy for us researchers working with 3D data :+1:
