Does TensorBoard support multiple processes creating "tf.summary.FileWriter(logdir, ...)" objects with the same logdir?
I found that this causes error messages like the following:
E0301 04:53:39.120702 Reloader directory_watcher.py:241] File /efs/runs/psbench3/events.out.tfevents.1519880018.ip-192-168-5-194 updated even though the current file is /efs/runs/psbench3/events.out.tfevents.1519880018.ip-192-168-5-221
When I see this message (TensorBoard version 1.5.1), TensorBoard fails to show events from one of the files.
Either restarting TensorBoard after the event files have stopped changing or removing one of the FileWriters makes things work.
Ah, that is indeed problematic. Specifically, no two file writers (regardless of which process they are used in) can write to the same run directory. They can, however, write to a directory and to its subdirectories, since those are different runs.
Each TensorBoard run (corresponding to a directory) monitors a current events file. If TensorBoard detects that a different events file in the directory has changed, it advances to that new file and never goes back to reading updates from the previous file.
https://github.com/tensorflow/tensorboard/blob/bf1e23229b8f9efbc5348ffbd7d9418c5fcb8b3c/tensorboard/backend/event_processing/directory_watcher.py#L144
Restarting TensorBoard makes all data appear again because TensorBoard reads all events files during startup.
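To make the "advance and never go back" behavior concrete, here is a toy model in plain Python. It is not TensorBoard's actual code: the class name ToyDirectoryWatcher, the file names, and the event strings are all made up for illustration, and it assumes files are visited in sorted name order.

```python
class ToyDirectoryWatcher:
    """Toy model of a watcher that only moves forward; not TensorBoard's code."""

    def __init__(self, files):
        # files: dict mapping a (hypothetical) events-file name to its events
        self.files = files
        self.current = None
        self.offset = 0

    def load(self):
        """Return unseen events, reading only the current file and later ones."""
        seen = []
        while True:
            if self.current is not None:
                events = self.files[self.current]
                seen.extend(events[self.offset:])
                self.offset = len(events)
            # Advance to the next file in sorted order, if there is one.
            later = sorted(p for p in self.files
                           if self.current is None or p > self.current)
            if not later:
                return seen
            self.current, self.offset = later[0], 0


files = {"events.worker-194": ["a1"], "events.worker-221": ["b1"]}
watcher = ToyDirectoryWatcher(files)
first = watcher.load()  # reads both files and ends on worker-221

files["events.worker-194"].append("a2")  # the earlier file keeps growing
files["events.worker-221"].append("b2")
second = watcher.load()  # only the current file's update is picked up

restarted = ToyDirectoryWatcher(files).load()  # a fresh watcher rereads everything
```

In this sketch the second load misses "a2" because the watcher has already moved past the first file, while a fresh watcher (the analogue of restarting TensorBoard) sees all four events.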
Yes, this is a known limitation: https://github.com/tensorflow/tensorboard/blob/1.6.0/README.md#tensorboard-is-showing-only-some-of-my-data-or-isnt-properly-updating
Closing since I think we're fairly unlikely to fix this for the log directory case, but this shouldn't be an issue for the new SQLite backend we're working on. In the meantime, the workarounds are to use FileWriterCache so that the same FileWriter instance is shared, to have each FileWriter write to a separate directory (e.g. a separate subdirectory of "logdir"), or to just restart TensorBoard when you want to see the full data.
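As a rough sketch of the separate-subdirectory workaround, each worker could derive its own run directory from a shared base. The helper name worker_logdir and the "worker_N" naming scheme are assumptions for illustration, not something TensorBoard requires. The FileWriter call is left as a comment so the snippet runs without TensorFlow installed.

```python
import os


def worker_logdir(base_logdir, task_index):
    """Return a per-worker run directory under base_logdir (hypothetical layout)."""
    return os.path.join(base_logdir, "worker_{}".format(task_index))


# Two workers sharing one base directory get two distinct run directories.
logdirs = [worker_logdir("/efs/runs/psbench3", i) for i in range(2)]

# Each worker would then create its own writer, for example:
#   writer = tf.summary.FileWriter(worker_logdir(logdir, task_index))
```

Pointing TensorBoard at the base directory should then show each subdirectory as its own run, so no two writers share a run directory.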
Thanks for the explanation. I've had multiple workers logging worker-specific statistics such as latency locally. It looks like I should refactor my code to send those statistics to the chief worker so that they are logged through a single FileWriter.