Tensorboard: TensorBoard very slow on GCS

Created on 26 Jun 2017 · 9 Comments · Source: tensorflow/tensorboard

From @bw4sz on https://github.com/tensorflow/tensorboard/issues/80:

Loading O(200MB) of logs via TensorBoard from GCS takes more than ten minutes, but if they copy the event files locally, the logs load in seconds.

I am guessing we are running into an issue where on GCS we are loading the events without readahead chunks, so we are doing round trip requests incredibly frequently, maybe as often as for every individual event. This makes for horrible performance.

backend bug

decentralion

👍3

All 9 comments

Related to that (and more important): we have also found that using TensorBoard to read directly from GCS with the default reload interval incurs incredible costs.

There is a cost associated with the number of API calls made to GCS, and it seems that every reload directly from GCS causes every single file to be accessed separately via the GCS API. When you have a larger (but not outrageous) number of log files (a few thousand is enough), it can very quickly get out of control, and with TensorBoard's default reload frequency your GCS costs can become unreasonably high.

In our case it was about 4 USD/day with just around 3000 files in the logs (!). We got about 4 million API requests in 4 days (!) just by running one TensorBoard instance. Because of that, we are now changing strategy to sync the GCS data locally (using gsutil rsync) and read it from there.
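
As a rough check of the figures quoted above, the arithmetic can be worked through in a few lines of shell. This is only a back-of-the-envelope sketch: the inputs are the approximate numbers from this comment, integer division rounds down, and the result says nothing about actual GCS pricing tiers.

```shell
# Approximate figures quoted in the comment above.
total_requests=4000000   # API requests observed
days=4                   # observation window in days
files=3000               # approximate number of log files
usd_per_day=4            # reported daily cost in USD

# Integer arithmetic only; results are rounded down.
requests_per_day=$((total_requests / days))
requests_per_file_per_day=$((requests_per_day / files))
seconds_between_requests_per_file=$((86400 / requests_per_file_per_day))
usd_per_million_requests=$((usd_per_day * 1000000 / requests_per_day))

echo "requests/day: ${requests_per_day}"
echo "requests/file/day: ${requests_per_file_per_day}"
echo "seconds between requests to one file: ${seconds_between_requests_per_file}"
echo "implied USD per million requests: ${usd_per_million_requests}"
```

With these inputs, each file is requested about 333 times a day (roughly once every 259 seconds), and the reported bill works out to about 4 USD per million requests.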

potiuk on 7 Aug 2017

Just a comment - gsutil rsync has another problem (it deletes and recreates event files) - I opened a new issue for it: https://github.com/tensorflow/tensorboard/issues/349
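
The local-mirror workaround described in these two comments might be sketched as below. This is a minimal sketch, not a tested recipe: the bucket path `gs://my-bucket/logs`, the directory `./tb-logs`, the 300-second interval, and the helper names `sync_once` and `sync_forever` are all hypothetical, and the delete-and-recreate caveat mentioned above still applies to `gsutil rsync`.

```shell
# Hypothetical names: replace with your own bucket path and local directory.
GCS_LOGDIR="${GCS_LOGDIR:-gs://my-bucket/logs}"
LOCAL_LOGDIR="${LOCAL_LOGDIR:-./tb-logs}"
SYNC_INTERVAL="${SYNC_INTERVAL:-300}"  # seconds between syncs (assumed value)
GSUTIL="${GSUTIL:-gsutil}"             # overridable so the sync can be dry-run

# Mirror the bucket into the local directory once.
sync_once() {
  mkdir -p "${LOCAL_LOGDIR}" &&
    "${GSUTIL}" -m rsync -r "${GCS_LOGDIR}" "${LOCAL_LOGDIR}"
}

# Repeat the sync until interrupted, reporting failures without stopping.
sync_forever() {
  while true; do
    sync_once || echo "sync failed; retrying in ${SYNC_INTERVAL}s" >&2
    sleep "${SYNC_INTERVAL}"
  done
}
```

One way to use it would be to start `sync_forever &` in the background and then run `tensorboard --logdir="${LOCAL_LOGDIR}"`, so that TensorBoard only ever reads local files and GCS is contacted once per sync rather than once per reload.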

potiuk on 12 Aug 2017

Are there any updates on this issue?
After switching to GCS for storing checkpoints and events, refreshing logs is now nearly unusable.
When TensorBoard launches, it fetches logs from the selected directory, but it never updates them or fetches anything afterwards. In our case, even leaving TensorBoard running for hours won't refresh the stats unless it is fully restarted.
Neither specifying the GCS bucket directly:

tensorboard --logdir=${GCS_LOGS_BUCKET} --reload_interval=2

nor using gcsfuse to synchronize a local directory with the bucket and pointing TensorBoard at that local directory works; in both cases TensorBoard must be restarted to refresh the logs.
Any fix or workaround would be much appreciated.

MtDersvan on 11 Sep 2017

👍5

Status?

carlthome on 12 Mar 2018

👍5

We discovered an issue (#1225) involving bad interaction between TensorBoard and the underlying TensorFlow tf.gfile API when running against GCS logdirs, and this issue causes excessive network usage and API calls, as @potiuk describes above in https://github.com/tensorflow/tensorboard/issues/158#issuecomment-320565668. PR #1226 should address this in the upcoming 1.9 release of TensorBoard as long as you are also using TensorFlow 1.9+.

This fix should hopefully address some of the general slowness involved in GCS logdirs, but there may still be other sources of unoptimized slow performance.

The best way to ensure good performance and low network bandwidth when using a GCS logdir is to run TensorBoard within the same Google Cloud Platform location where GCS egress traffic is free of charge. For example, you can run TensorBoard on a GCE instance in the same region as your GCS bucket, and optionally port forward using SSH if you want to continue to access TensorBoard at localhost:6006.

nfelt on 5 Jun 2018

🎉2 👍1

Performance aside, if you want to know whether or not GCS network egress is free, here's a helpful shell function:

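is_gcs_free() {
  GCS_BUCKET=$1
  # Zone of this GCE instance, e.g. projects/NNN/zones/us-central1-a, read from the
  # v1 metadata endpoint (needs the Metadata-Flavor header); fails when not on GCE.
  GCE_ZONE=$(curl -sfLH Metadata-Flavor:Google metadata.google.internal/computeMetadata/v1/instance/zone) || { echo Not in GCE fleet >&2; return 1; }
  # Drop the zone suffix to get the region, e.g. us-central1.
  GCE_REGION=$(printf %s\\n "${GCE_ZONE}" | sed -n 's!.*/\([^-/]*-[^-/]*\).*!\1!p')
  # OAuth access token for the instance's default service account.
  GCP_TOKEN=$(curl -sfLH Metadata-Flavor:Google metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token | sed -n 's/.*"access_token": *"\([^"]*\)".*/\1/p')
  # Bucket location, lowercased, e.g. us-central1, or us for a multi-region bucket.
  GCS_REGION=$(curl -sfLH "Authorization:Bearer ${GCP_TOKEN}" "https://www.googleapis.com/storage/v1/b/${GCS_BUCKET}" | sed -n 's/.*"location": *"\([^"]*\)".*/\1/p' | tr A-Z a-z)
  # Succeed only if the bucket location is a prefix of the instance region.
  [ "$(expr "${GCE_REGION}" : "${GCS_REGION}")" -gt 0 ]
}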

I've been considering possibly integrating that into TensorFlow, so we can show a warning, since many people will run TensorBoard on their local machines, rather than something like gcloud compute ssh instance -- -L 6006:localhost:6006 tensorboard --logdir=gs://foo/bar. Would something like this be useful?
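
The comparison logic inside `is_gcs_free` can be tried without any network access. The sketch below isolates its two pure steps, using the same `sed` and `expr` expressions, under hypothetical helper names (`region_from_zone`, `location_covers_region`) and a made-up sample zone path. It only illustrates how the prefix match behaves and makes no claim about how egress is actually billed.

```shell
# Extract the region ("us-central1") from a zone path such as
# "projects/123456/zones/us-central1-a" (sample value, not real output).
region_from_zone() {
  printf '%s\n' "$1" | sed -n 's!.*/\([^-/]*-[^-/]*\).*!\1!p'
}

# Succeed when the bucket location ($2) matches the start of the instance
# region ($1), so a multi-region location such as "us" matches "us-central1".
location_covers_region() {
  [ "$(expr "$1" : "$2")" -gt 0 ]
}

region=$(region_from_zone "projects/123456/zones/us-central1-a")
echo "region: ${region}"
for location in us-central1 us europe-west1; do
  if location_covers_region "${region}" "${location}"; then
    echo "${location}: same location"
  else
    echo "${location}: different location"
  fi
done
```

For the sample zone, `us-central1` and `us` are reported as the same location and `europe-west1` as a different one. An empty bucket location (for example, when the bucket lookup fails) never matches, so the check fails closed.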

jart on 5 Jun 2018

👍3 🎉1

I think #1087 should also address a lot of the slowness involved in scanning the GCS logdirs for event files. That PR should also be included in TensorBoard 1.9.

nfelt on 7 Jun 2018

@jart : IMO integrating that GCS egress check into TB would be very helpful to our users.

amygdala on 11 Jun 2018

I'm hoping that the fixes from June 2018 were sufficient to address the problems, since I haven't seen any followup reports on this issue, so I'm going to close it out.

If you see any problems where using TensorBoard with GCS is drastically slower than using it against local files, please file a new issue and we'll take a look.

nfelt on 18 Dec 2019
