Repro steps:
Create a virtualenv with tf-nightly-2.0-preview==2.0.0.dev20190402
and open two terminals in this environment.
In one terminal, run the following simple Python script (but
continue to the next step while this script is still running):
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
DATASET = tf.keras.datasets.mnist
INPUT_SHAPE = (28, 28)
OUTPUT_CLASSES = 10
def model_fn():
model = tf.keras.models.Sequential([
tf.keras.layers.Input(INPUT_SHAPE),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(128, activation="relu"),
tf.keras.layers.BatchNormalization(),
tf.keras.layers.Dense(256, activation="relu"),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(OUTPUT_CLASSES, activation="softmax"),
])
model.compile(
loss="sparse_categorical_crossentropy",
optimizer="adagrad",
metrics=["accuracy"],
)
return model
def main():
model = model_fn()
((x_train, y_train), (x_test, y_test)) = DATASET.load_data()
model.fit(
x=x_train,
y=y_train,
validation_data=(x_test, y_test),
callbacks=[tf.keras.callbacks.TensorBoard()],
epochs=5,
)
if __name__ == "__main__":
main()
Wait for (say) epoch 2/5 to finish training. Then, in the other
terminal, launch tensorboard --logdir ./logs.
Open TensorBoard and observe that both training and validation runs
appear with two epochs’ worth of data:

As training continues, refresh TensorBoard and/or reload the page.
Observe that validation data continues to appear, but training data
has stalled—even after well after the training has completed, the
plot is incomplete:

Kill the TensorBoard process and restart it. Note that the data
appears as desired:

The same problem occurs in tf-nightly (non-2.0-preview), but
manifests differently: because there is only one run (named .) instead
of separate train/validation, _all_ data stops being displayed after
the epoch in which TensorBoard is opened.
Note as a special case of this that if TensorBoard is running before
training starts, then train data may not appear at all:

What is the event file structure on disk after running this? This sounds like a case where there are multiple active event files in the run and TensorBoard advances to the last one too quickly and thus misses new data from the earlier one.
Yep, I suspect so, too:
$ find logs/
logs/
logs/validation
logs/validation/events.out.tfevents.1554257194.HOSTNAME.5031.18577.v2
logs/train
logs/train/events.out.tfevents.1554257189.HOSTNAME.5031.1212.v2
logs/train/events.out.tfevents.1554257189.HOSTNAME.profile-empty
logs/train/plugins
logs/train/plugins/profile
logs/train/plugins/profile/2019-04-02_19-06-29
logs/train/plugins/profile/2019-04-02_19-06-29/local.trace
Note that the fake .profile-empty events file only appears in the
train/ directory.
So, related: #1867.
Yeah, that must be it. I thought the code that was writing our the .profile-empty file was supposed to check if there were any events.out.tfevents files pre-existing in the directory, which doesn't totally avoid this happening (due to races) but could at least reduce the incidence. Perhaps that isn't actually the case. Another hacky solution until #1867 is actually ready could be changing the suffix to something that sorts earlier in the alphabet.
supposed to check if there were any events.out.tfevents files
pre-existing in the directory
It only checks whether a .profile-empty events file exists:
_EVENT_FILE_SUFFIX = '.profile-empty'
# …
for file_name in gfile.ListDirectory(logdir):
if file_name.endswith(_EVENT_FILE_SUFFIX):
return
# TODO(b/127330388): Use summary_ops_v2.create_file_writer instead.
event_writer = pywrap_tensorflow.EventsWriter(
compat.as_bytes(os.path.join(logdir, 'events')))
event_writer.InitWithSuffix(compat.as_bytes(_EVENT_FILE_SUFFIX))
(from python/eager/profiler.py).
Just to note explicitly: setting profile_batch=0 in the Keras callback
options is a workaround that disables profiling entirely.
FYI: The problem is still here, in 1cf0898dd of TF v2.0.0. Workaround above works.
@qqfish Just checking if you are working on this issue, as you are the current assignee. If you're not, I'll be happy to take it.
cc @psybuzz
Thank you Shanqing. Please take it.
@qqfish Done.
Okay, an easy fix for this suggested by @wchargin would be to just update TensorBoard to ignore any events file that ends in .profile-empty. This is simpler than changing TensorFlow and means updating TB will fix this for even existing logdirs.
Background: right now whether something is an events file is determined by the IsTensorFlowEventsFile predicate here:
The predicate is used for (A) the code that finds sub-directories within a root logdir that contain any event files (which become the run directories), which happens here:
However the predicate is also used for (B) the code that lists event files within a run directory, which happens here:
The point of the .profile-empty files is to trigger the detection in (A) so that directories that otherwise only contain non-tfevents profile data files are still considered to be TensorBoard run directories. The problematic part is that these files are also detected by (B). So we could fix this issue just by splitting these into two predicates, maybe an unchanged IsTensorFlowEventsFile for (A) and then something new like IsSummaryEventsFile for (B).
For the record, here are the commits where .profile-empty was introduced:
https://github.com/tensorflow/tensorflow/commit/23d8e38f8728b7107d78a773e03f8b1a8166237a
https://github.com/tensorflow/tensorflow/commit/c66b603990b9404dc1eb57de9d595aa0ffc8197f
Thanks for the detailed write-up of the root cause of the issue and the proposal for how to fix it. I"ve opened #3108 to implement the suggested fix.
Most helpful comment
Just to note explicitly: setting
profile_batch=0in the Keras callbackoptions is a workaround that disables profiling entirely.