Tensorboard: Keras callback creating .profile-empty file blocks loading data

Created on 3 Apr 2019 · 12Comments · Source: tensorflow/tensorboard

Repro steps:

Create a virtualenv with tf-nightly-2.0-preview==2.0.0.dev20190402
and open two terminals in this environment.

In one terminal, run the following simple Python script (but
continue to the next step while this script is still running):

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf


DATASET = tf.keras.datasets.mnist
INPUT_SHAPE = (28, 28)
OUTPUT_CLASSES = 10


def model_fn():
  model = tf.keras.models.Sequential([
      tf.keras.layers.Input(INPUT_SHAPE),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(128, activation="relu"),
      tf.keras.layers.BatchNormalization(),
      tf.keras.layers.Dense(256, activation="relu"),
      tf.keras.layers.Dropout(0.2),
      tf.keras.layers.Dense(OUTPUT_CLASSES, activation="softmax"),
  ])
  model.compile(
      loss="sparse_categorical_crossentropy",
      optimizer="adagrad",
      metrics=["accuracy"],
  )
  return model


def main():
  model = model_fn()
  ((x_train, y_train), (x_test, y_test)) = DATASET.load_data()
  model.fit(
      x=x_train,
      y=y_train,
      validation_data=(x_test, y_test),
      callbacks=[tf.keras.callbacks.TensorBoard()],
      epochs=5,
  )


if __name__ == "__main__":
  main()

Wait for (say) epoch 2/5 to finish training. Then, in the other
terminal, launch tensorboard --logdir ./logs.
Open TensorBoard and observe that both training and validation runs
appear with two epochs’ worth of data:
As training continues, refresh TensorBoard and/or reload the page.
Observe that validation data continues to appear, but training data
has stalled—even after well after the training has completed, the
plot is incomplete:
Kill the TensorBoard process and restart it. Note that the data
appears as desired:

The same problem occurs in tf-nightly (non-2.0-preview), but
manifests differently: because there is only one run (named .) instead
of separate train/validation, _all_ data stops being displayed after
the epoch in which TensorBoard is opened.

Note as a special case of this that if TensorBoard is running before
training starts, then train data may not appear at all:

Screenshot of validation-only data

backend bug

Source

wchargin

👍7

Most helpful comment

Just to note explicitly: setting profile_batch=0 in the Keras callback
options is a workaround that disables profiling entirely.

wchargin on 15 Apr 2019

👍16 🎉4

All 12 comments

What is the event file structure on disk after running this? This sounds like a case where there are multiple active event files in the run and TensorBoard advances to the last one too quickly and thus misses new data from the earlier one.

nfelt on 3 Apr 2019

Yep, I suspect so, too:

$ find logs/
logs/
logs/validation
logs/validation/events.out.tfevents.1554257194.HOSTNAME.5031.18577.v2
logs/train
logs/train/events.out.tfevents.1554257189.HOSTNAME.5031.1212.v2
logs/train/events.out.tfevents.1554257189.HOSTNAME.profile-empty
logs/train/plugins
logs/train/plugins/profile
logs/train/plugins/profile/2019-04-02_19-06-29
logs/train/plugins/profile/2019-04-02_19-06-29/local.trace

Note that the fake .profile-empty events file only appears in the
train/ directory.

wchargin on 3 Apr 2019

So, related: #1867.

wchargin on 3 Apr 2019

Yeah, that must be it. I thought the code that was writing our the .profile-empty file was supposed to check if there were any events.out.tfevents files pre-existing in the directory, which doesn't totally avoid this happening (due to races) but could at least reduce the incidence. Perhaps that isn't actually the case. Another hacky solution until #1867 is actually ready could be changing the suffix to something that sorts earlier in the alphabet.

nfelt on 3 Apr 2019

supposed to check if there were any events.out.tfevents files
pre-existing in the directory

It only checks whether a .profile-empty events file exists:

_EVENT_FILE_SUFFIX = '.profile-empty'

# …

  for file_name in gfile.ListDirectory(logdir):
    if file_name.endswith(_EVENT_FILE_SUFFIX):
      return
  # TODO(b/127330388): Use summary_ops_v2.create_file_writer instead.
  event_writer = pywrap_tensorflow.EventsWriter(
      compat.as_bytes(os.path.join(logdir, 'events')))
  event_writer.InitWithSuffix(compat.as_bytes(_EVENT_FILE_SUFFIX))

(from python/eager/profiler.py).

wchargin on 3 Apr 2019

Just to note explicitly: setting profile_batch=0 in the Keras callback
options is a workaround that disables profiling entirely.

wchargin on 15 Apr 2019

👍16 🎉4

FYI: The problem is still here, in 1cf0898dd of TF v2.0.0. Workaround above works.

grwlf on 12 Nov 2019

👍4

@qqfish Just checking if you are working on this issue, as you are the current assignee. If you're not, I'll be happy to take it.

cc @psybuzz

caisq on 22 Dec 2019

Thank you Shanqing. Please take it.

qqfish on 30 Dec 2019

@qqfish Done.

caisq on 30 Dec 2019

Okay, an easy fix for this suggested by @wchargin would be to just update TensorBoard to ignore any events file that ends in .profile-empty. This is simpler than changing TensorFlow and means updating TB will fix this for even existing logdirs.

Background: right now whether something is an events file is determined by the IsTensorFlowEventsFile predicate here:

https://github.com/tensorflow/tensorboard/blob/d0abd84df457a535a38fc7c90e63d2974aeb3fd1/tensorboard/backend/event_processing/io_wrapper.py#L47-L61

The predicate is used for (A) the code that finds sub-directories within a root logdir that contain any event files (which become the run directories), which happens here:

https://github.com/tensorflow/tensorboard/blob/d0abd84df457a535a38fc7c90e63d2974aeb3fd1/tensorboard/backend/event_processing/io_wrapper.py#L215

However the predicate is also used for (B) the code that lists event files within a run directory, which happens here:

https://github.com/tensorflow/tensorboard/blob/d0abd84df457a535a38fc7c90e63d2974aeb3fd1/tensorboard/backend/event_processing/plugin_event_accumulator.py#L627-L641

The point of the .profile-empty files is to trigger the detection in (A) so that directories that otherwise only contain non-tfevents profile data files are still considered to be TensorBoard run directories. The problematic part is that these files are also detected by (B). So we could fix this issue just by splitting these into two predicates, maybe an unchanged IsTensorFlowEventsFile for (A) and then something new like IsSummaryEventsFile for (B).

For the record, here are the commits where .profile-empty was introduced:
https://github.com/tensorflow/tensorflow/commit/23d8e38f8728b7107d78a773e03f8b1a8166237a
https://github.com/tensorflow/tensorflow/commit/c66b603990b9404dc1eb57de9d595aa0ffc8197f

nfelt on 3 Jan 2020

Thanks for the detailed write-up of the root cause of the issue and the proposal for how to fix it. I"ve opened #3108 to implement the suggested fix.

caisq on 5 Jan 2020

Was this page helpful?

0 / 5 - 0 ratings