Tensorboard: Keras callback creating .profile-empty file blocks loading data

Created on 3 Apr 2019  Â·  12Comments  Â·  Source: tensorflow/tensorboard

Repro steps:

  1. Create a virtualenv with tf-nightly-2.0-preview==2.0.0.dev20190402
    and open two terminals in this environment.

  2. In one terminal, run the following simple Python script (but
    continue to the next step while this script is still running):

    from __future__ import absolute_import
    from __future__ import division
    from __future__ import print_function
    
    import tensorflow as tf
    
    
    DATASET = tf.keras.datasets.mnist
    INPUT_SHAPE = (28, 28)
    OUTPUT_CLASSES = 10
    
    
    def model_fn():
      model = tf.keras.models.Sequential([
          tf.keras.layers.Input(INPUT_SHAPE),
          tf.keras.layers.Flatten(),
          tf.keras.layers.Dense(128, activation="relu"),
          tf.keras.layers.BatchNormalization(),
          tf.keras.layers.Dense(256, activation="relu"),
          tf.keras.layers.Dropout(0.2),
          tf.keras.layers.Dense(OUTPUT_CLASSES, activation="softmax"),
      ])
      model.compile(
          loss="sparse_categorical_crossentropy",
          optimizer="adagrad",
          metrics=["accuracy"],
      )
      return model
    
    
    def main():
      model = model_fn()
      ((x_train, y_train), (x_test, y_test)) = DATASET.load_data()
      model.fit(
          x=x_train,
          y=y_train,
          validation_data=(x_test, y_test),
          callbacks=[tf.keras.callbacks.TensorBoard()],
          epochs=5,
      )
    
    
    if __name__ == "__main__":
      main()
    
  3. Wait for (say) epoch 2/5 to finish training. Then, in the other
    terminal, launch tensorboard --logdir ./logs.

  4. Open TensorBoard and observe that both training and validation runs
    appear with two epochs’ worth of data:

    Screenshot just after launching TensorBoard

  5. As training continues, refresh TensorBoard and/or reload the page.
    Observe that validation data continues to appear, but training data
    has stalled—even after well after the training has completed, the
    plot is incomplete:

    Screenshot of bad state

  6. Kill the TensorBoard process and restart it. Note that the data
    appears as desired:

    Screenshot of good state after TensorBoard relaunch

The same problem occurs in tf-nightly (non-2.0-preview), but
manifests differently: because there is only one run (named .) instead
of separate train/validation, _all_ data stops being displayed after
the epoch in which TensorBoard is opened.

Note as a special case of this that if TensorBoard is running before
training starts, then train data may not appear at all:

Screenshot of validation-only data

backend bug

Most helpful comment

Just to note explicitly: setting profile_batch=0 in the Keras callback
options is a workaround that disables profiling entirely.

All 12 comments

What is the event file structure on disk after running this? This sounds like a case where there are multiple active event files in the run and TensorBoard advances to the last one too quickly and thus misses new data from the earlier one.

Yep, I suspect so, too:

$ find logs/
logs/
logs/validation
logs/validation/events.out.tfevents.1554257194.HOSTNAME.5031.18577.v2
logs/train
logs/train/events.out.tfevents.1554257189.HOSTNAME.5031.1212.v2
logs/train/events.out.tfevents.1554257189.HOSTNAME.profile-empty
logs/train/plugins
logs/train/plugins/profile
logs/train/plugins/profile/2019-04-02_19-06-29
logs/train/plugins/profile/2019-04-02_19-06-29/local.trace

Note that the fake .profile-empty events file only appears in the
train/ directory.

So, related: #1867.

Yeah, that must be it. I thought the code that was writing our the .profile-empty file was supposed to check if there were any events.out.tfevents files pre-existing in the directory, which doesn't totally avoid this happening (due to races) but could at least reduce the incidence. Perhaps that isn't actually the case. Another hacky solution until #1867 is actually ready could be changing the suffix to something that sorts earlier in the alphabet.

supposed to check if there were any events.out.tfevents files
pre-existing in the directory

It only checks whether a .profile-empty events file exists:

_EVENT_FILE_SUFFIX = '.profile-empty'

# …

  for file_name in gfile.ListDirectory(logdir):
    if file_name.endswith(_EVENT_FILE_SUFFIX):
      return
  # TODO(b/127330388): Use summary_ops_v2.create_file_writer instead.
  event_writer = pywrap_tensorflow.EventsWriter(
      compat.as_bytes(os.path.join(logdir, 'events')))
  event_writer.InitWithSuffix(compat.as_bytes(_EVENT_FILE_SUFFIX))

(from python/eager/profiler.py).

Just to note explicitly: setting profile_batch=0 in the Keras callback
options is a workaround that disables profiling entirely.

FYI: The problem is still here, in 1cf0898dd of TF v2.0.0. Workaround above works.

@qqfish Just checking if you are working on this issue, as you are the current assignee. If you're not, I'll be happy to take it.

cc @psybuzz

Thank you Shanqing. Please take it.

@qqfish Done.

Okay, an easy fix for this suggested by @wchargin would be to just update TensorBoard to ignore any events file that ends in .profile-empty. This is simpler than changing TensorFlow and means updating TB will fix this for even existing logdirs.

Background: right now whether something is an events file is determined by the IsTensorFlowEventsFile predicate here:

https://github.com/tensorflow/tensorboard/blob/d0abd84df457a535a38fc7c90e63d2974aeb3fd1/tensorboard/backend/event_processing/io_wrapper.py#L47-L61

The predicate is used for (A) the code that finds sub-directories within a root logdir that contain any event files (which become the run directories), which happens here:

https://github.com/tensorflow/tensorboard/blob/d0abd84df457a535a38fc7c90e63d2974aeb3fd1/tensorboard/backend/event_processing/io_wrapper.py#L215

However the predicate is also used for (B) the code that lists event files within a run directory, which happens here:

https://github.com/tensorflow/tensorboard/blob/d0abd84df457a535a38fc7c90e63d2974aeb3fd1/tensorboard/backend/event_processing/plugin_event_accumulator.py#L627-L641

The point of the .profile-empty files is to trigger the detection in (A) so that directories that otherwise only contain non-tfevents profile data files are still considered to be TensorBoard run directories. The problematic part is that these files are also detected by (B). So we could fix this issue just by splitting these into two predicates, maybe an unchanged IsTensorFlowEventsFile for (A) and then something new like IsSummaryEventsFile for (B).

For the record, here are the commits where .profile-empty was introduced:
https://github.com/tensorflow/tensorflow/commit/23d8e38f8728b7107d78a773e03f8b1a8166237a
https://github.com/tensorflow/tensorflow/commit/c66b603990b9404dc1eb57de9d595aa0ffc8197f

Thanks for the detailed write-up of the root cause of the issue and the proposal for how to fix it. I"ve opened #3108 to implement the suggested fix.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

neouyghur picture neouyghur  Â·  3Comments

yegortokmakov picture yegortokmakov  Â·  3Comments

hadim picture hadim  Â·  4Comments

dniku picture dniku  Â·  3Comments

ectg picture ectg  Â·  3Comments