Tensorboard: TensorBoard very slow on GCS

Created on 26 Jun 2017 · 9 Comments · Source: tensorflow/tensorboard

From @bw4sz on https://github.com/tensorflow/tensorboard/issues/80:

Loading on the order of 200 MB of logs via TensorBoard on GCS takes more than ten minutes, but if the event files are copied locally, they load in seconds.

I am guessing we are running into an issue where on GCS we are loading the events without readahead chunks, so we are doing round trip requests incredibly frequently, maybe as often as for every individual event. This makes for horrible performance.
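
To make the round-trip cost concrete, here is a back-of-the-envelope sketch in shell. The 1 KiB average event size and the 8 MiB readahead chunk are assumptions picked for illustration, not measured values from TensorBoard or GCS, and `round_trips` is a hypothetical helper.

```shell
# round_trips SIZE_BYTES CHUNK_BYTES
# Number of sequential reads needed to fetch SIZE_BYTES when each request
# returns at most CHUNK_BYTES (ceiling division).
round_trips() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

LOG_BYTES=$(( 200 * 1024 * 1024 ))    # ~200 MB of event files, as reported above
EVENT_BYTES=1024                      # ASSUMPTION: average event record size
CHUNK_BYTES=$(( 8 * 1024 * 1024 ))    # ASSUMPTION: readahead buffer size

echo "one request per event: $(round_trips "${LOG_BYTES}" "${EVENT_BYTES}") round trips"
echo "with readahead chunks: $(round_trips "${LOG_BYTES}" "${CHUNK_BYTES}") round trips"
```

Under these assumptions the per-event pattern needs 204800 requests where chunked reads need 25, so even a few milliseconds of latency per request adds up to minutes.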

backend bug

All 9 comments

Related to that (and more important) - we have also found that using tensorboard to read directly from GCS using default reload intervals incurs incredible costs.

There is a cost associated with the number of API calls made to access GCS, and it seems that every reload directly from GCP causes every single file to be accessed separately via the GCS API. When you have a larger (but not outrageous) number of log files (a few thousand is enough), it can very quickly get out of control, and with TensorBoard's default reload frequency your GCS costs might become unreasonably high.

In our case it was about 4 USD/day with only around 3000 files in the logs (!). We got about 4 million API requests generated in 4 days (!) just by running one TensorBoard instance. Because of that, we are now changing strategy to sync the GCS data locally (using gsutil rsync) and read it from there.
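
For a sense of scale, the figures quoted above can be turned into rates with a few lines of shell. This is only arithmetic on the approximate numbers from the comment; the variable names are introduced here for illustration, and the per-request price is the one implied by those numbers, not a quoted GCS price.

```shell
# Figures quoted in the comment above; all of them are approximate.
TOTAL_REQUESTS=4000000   # about 4 million API requests
DAYS=4
LOG_FILES=3000
USD_PER_DAY=4

REQUESTS_PER_DAY=$(( TOTAL_REQUESTS / DAYS ))
REQUESTS_PER_SECOND=$(( REQUESTS_PER_DAY / 86400 ))            # integer division
REQUESTS_PER_FILE_PER_DAY=$(( REQUESTS_PER_DAY / LOG_FILES ))  # integer division
# Implied price per 10,000 requests; awk handles the fractional result.
USD_PER_10K=$(awk -v usd="${USD_PER_DAY}" -v n="${REQUESTS_PER_DAY}" \
  'BEGIN { printf "%.2f", usd / n * 10000 }')

echo "requests per day:           ${REQUESTS_PER_DAY}"
echo "requests per second:        ${REQUESTS_PER_SECOND}"
echo "requests per file per day:  ${REQUESTS_PER_FILE_PER_DAY}"
echo "implied USD per 10k calls:  ${USD_PER_10K}"
```

That works out to roughly 11 requests per second sustained, or about 333 requests per log file per day, from a single TensorBoard instance.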

Just a comment - gsutil rsync has another problem (it deletes and recreates event files) - I opened a new issue for it: https://github.com/tensorflow/tensorboard/issues/349
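
The local-sync workaround could look roughly like the sketch below. `sync_gcs_logs` and the `GSUTIL` override variable are hypothetical names introduced here, and the bucket path and local directory in the example are placeholders. Given the caveat above about rsync deleting and recreating event files, treat this as a starting point rather than a fix.

```shell
# sync_gcs_logs GCS_URL LOCAL_DIR
# Mirror a GCS log directory into LOCAL_DIR so TensorBoard can read local files.
# GSUTIL is a hypothetical override (handy for dry runs); it defaults to gsutil.
sync_gcs_logs() {
  src=$1
  dst=$2
  mkdir -p "${dst}" || return 1
  # -m parallelizes the transfer; rsync -r recurses into run subdirectories.
  "${GSUTIL:-gsutil}" -m rsync -r "${src}" "${dst}"
}

# Example with placeholder paths (uncomment to run):
#   while sync_gcs_logs gs://my-bucket/logs /tmp/tb-logs; do sleep 60; done &
#   tensorboard --logdir=/tmp/tb-logs
```

Syncing on a fixed interval means the number of GCS requests is driven by the sync loop, not by TensorBoard's reload interval.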

Are there any updates on this issue?
After switching to GCS for storing checkpoints and events, refreshing logs has become nearly unusable.
When launched, TensorBoard fetches the logs from the selected directory, but it never updates them or fetches anything afterwards. In our case, even leaving TensorBoard running for hours won't refresh the stats unless it is fully restarted.
We tried specifying the GCS bucket directly:

tensorboard --logdir=${GCS_LOGS_BUCKET} --reload_interval=2

and using gcsfuse to synchronize a local directory with the bucket and then pointing TensorBoard at the local directory. Neither works; in both cases it is necessary to restart TensorBoard to refresh the logs.
Any fix or workaround would be much appreciated.

Status?

We discovered an issue (#1225) involving bad interaction between TensorBoard and the underlying TensorFlow tf.gfile API when running against GCS logdirs, and this issue causes excessive network usage and API calls, as @potiuk describes above in https://github.com/tensorflow/tensorboard/issues/158#issuecomment-320565668. PR #1226 should address this in the upcoming 1.9 release of TensorBoard as long as you are also using TensorFlow 1.9+.

This fix should hopefully address some of the general slowness involved in GCS logdirs, but there may still be other unoptimized sources of slow performance.

The best way to ensure good performance and low network bandwidth when using a GCS logdir is to run TensorBoard within the same Google Cloud Platform location where GCS egress traffic is free of charge. For example, you can run TensorBoard on a GCE instance in the same region as your GCS bucket, and optionally port forward using SSH if you want to continue to access TensorBoard at localhost:6006.

Performance aside, if you want to know whether or not GCS network egress is free, here's a helpful shell function:

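is_gcs_free() {
  # Succeeds if this GCE instance is in the same location as the given GCS bucket.
  GCS_BUCKET=$1
  # The zone looks like projects/<number>/zones/us-central1-a; the lookup fails outside GCE.
  GCE_ZONE=$(curl -sfLH Metadata-Flavor:Google metadata.google.internal/computeMetadata/v1/instance/zone) || { echo "Not in GCE fleet" >&2; return 1; }
  # Drop the zone suffix to get the region, e.g. us-central1.
  GCE_REGION=$(printf %s\\n "${GCE_ZONE}" | sed -n 's!.*/\([^-/]*-[^-/]*\).*!\1!p')
  # Access token for the instance's default service account.
  GCP_TOKEN=$(curl -sfLH Metadata-Flavor:Google metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token | sed -n 's/.*"access_token": *"\([^"]*\)".*/\1/p')
  # Bucket location, lowercased: a region (us-central1) or a multi-region (us).
  GCS_REGION=$(curl -sfLH "Authorization:Bearer ${GCP_TOKEN}" "https://www.googleapis.com/storage/v1/b/${GCS_BUCKET}" | sed -n 's/.*"location": *"\([^"]*\)".*/\1/p' | tr A-Z a-z)
  # Prefix match, so a bucket located in "us" matches an instance in "us-central1".
  [ "$(expr "${GCE_REGION}" : "${GCS_REGION}")" -gt 0 ]
}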
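
Since is_gcs_free takes a bare bucket name while TensorBoard's --logdir takes a gs:// URL, a small wrapper can bridge the two. This is a sketch only: `gcs_bucket_of` and `warn_unless_gcs_free` are hypothetical helpers, not part of TensorBoard, and the warning text is made up.

```shell
# gcs_bucket_of LOGDIR
# Print the bucket name of a gs://bucket/path logdir; fail for other paths.
gcs_bucket_of() {
  case $1 in
    gs://?*) bucket=${1#gs://}; echo "${bucket%%/*}" ;;
    *) return 1 ;;
  esac
}

# warn_unless_gcs_free LOGDIR
# Print a warning when LOGDIR is on GCS and is_gcs_free (defined above) fails.
warn_unless_gcs_free() {
  bucket=$(gcs_bucket_of "$1") || return 0   # not a GCS logdir: nothing to check
  is_gcs_free "${bucket}" 2>/dev/null ||
    echo "Warning: reading gs://${bucket} from here may incur egress charges" >&2
}
```

It could be called as `warn_unless_gcs_free gs://foo/bar` right before launching `tensorboard --logdir=gs://foo/bar`.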

I've been considering possibly integrating that into TensorFlow, so we can show a warning, since many people will run TensorBoard on their local machines rather than using something like gcloud compute ssh instance -- -L 6006:localhost:6006 tensorboard --logdir=gs://foo/bar. Would something like this be useful?

I think #1087 should also address a lot of the slowness involved in scanning the GCS logdirs for event files. That PR should also be included in TensorBoard 1.9.

@jart : IMO integrating that GCS egress check into TB would be very helpful to our users.

I'm hoping that the fixes from June 2018 were sufficient to address the problems, since I haven't seen any followup reports on this issue, so I'm going to close it out.

If you see any problems where using TensorBoard with GCS is drastically slower than using it against local files, please file a new issue and we'll take a look.
