Models: cifar10_multi_gpu_train.py

Created on 14 Jan 2017 · 9 Comments · Source: tensorflow/models

I used git to download the code. TensorFlow was first installed from the nightly build; I then compiled TensorFlow from source.

# python cifar10_multi_gpu_train.py --num_gpus=4
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcurand.so.8.0 locally
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
WARNING:tensorflow:From /home/***/Downloads/models/tutorials/image/cifar10/cifar10_input.py:135: image_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.image. Note that tf.summary.image uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, the max_images argument was renamed to max_outputs.
Traceback (most recent call last):
  File "cifar10_multi_gpu_train.py", line 273, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "cifar10_multi_gpu_train.py", line 269, in main
    train()
  File "cifar10_multi_gpu_train.py", line 171, in train
    loss = tower_loss(scope)
  File "cifar10_multi_gpu_train.py", line 78, in tower_loss
    logits = cifar10.inference(images)
  File "/home/***/Downloads/models/tutorials/image/cifar10/cifar10.py", line 207, in inference
    wd=0.0)
  File "/home/***/Downloads/models/tutorials/image/cifar10/cifar10.py", line 137, in _variable_with_weight_decay
    weight_decay = tf.mul(tf.nn.l2_loss(var), wd, name='weight_loss')
AttributeError: 'module' object has no attribute 'mul'
Labels: help wanted, bug

All 9 comments

@drpngx It looks like cl/141623422 (Dec 9th) didn't get synced to tensorflow/models to rename tf.mul -> tf.multiply. Possibly because December 9th was also the day that https://github.com/tensorflow/models/commit/86ecc9730d751c1f72e3bfecac958166390f4125 moved cifar10 from tensorflow/tensorflow to tensorflow/models.

I'm also curious why tf.mul is throwing AttributeError since the old name is only supposed to be deprecated.

We don't sync tensorflow/models. The models there are probably all broken by now. @aselle would know more about tf.mul. I think we actually removed mul, sub, etc. to get a clean slate for the next release.

@weigei123 the new symbol is tf.multiply, would you mind submitting a pull request to fix this?
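
For anyone who needs cifar10.py to run on builds both before and after the rename, here is a minimal sketch of a lookup that prefers `multiply` and falls back to `mul`. The helper name `resolve_multiply` is hypothetical and not part of TensorFlow; the only facts taken from this thread are that the new symbol is tf.multiply and the old one was tf.mul. If you only target the newer build, simply replacing tf.mul with tf.multiply on the failing line is enough.

```python
def resolve_multiply(tf_module):
    """Return the elementwise multiply function of a TensorFlow-like module.

    Prefers the newer name `multiply` and falls back to the older `mul`.
    `tf_module` is whatever object you imported as `tf`.
    """
    for name in ("multiply", "mul"):
        fn = getattr(tf_module, name, None)
        if fn is not None:
            return fn
    raise AttributeError("module has neither 'multiply' nor 'mul'")


# Hypothetical usage inside cifar10.py (not executed here):
#   multiply = resolve_multiply(tf)
#   weight_decay = multiply(tf.nn.l2_loss(var), wd, name='weight_loss')
```

The lookup happens once at import time, so it adds no per-step cost.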

We don't do sync tensorflow/models.

unnamed-14

@drpngx Thanks for your help, it would be my pleasure to submit a pull request. But I got another error.
:P

# python cifar10_multi_gpu_train.py --num_gpus=4
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcurand.so.8.0 locally
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
WARNING:tensorflow:From /home/***/Downloads/models/tutorials/image/cifar10/cifar10_input.py:135: image_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.image. Note that tf.summary.image uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, the max_images argument was renamed to max_outputs.
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
WARNING:tensorflow:From /home/***/Downloads/models/tutorials/image/cifar10/cifar10_input.py:135: image_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.image. Note that tf.summary.image uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, the max_images argument was renamed to max_outputs.
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
WARNING:tensorflow:From /home/***/Downloads/models/tutorials/image/cifar10/cifar10_input.py:135: image_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.image. Note that tf.summary.image uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, the max_images argument was renamed to max_outputs.
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
WARNING:tensorflow:From /home/***/Downloads/models/tutorials/image/cifar10/cifar10_input.py:135: image_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.image. Note that tf.summary.image uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, the max_images argument was renamed to max_outputs.
Traceback (most recent call last):
  File "cifar10_multi_gpu_train.py", line 273, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "cifar10_multi_gpu_train.py", line 269, in main
    train()
  File "cifar10_multi_gpu_train.py", line 210, in train
    variables_averages_op = variable_averages.apply(tf.trainable_variables())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/moving_averages.py", line 373, in apply
    colocate_with_primary=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/slot_creator.py", line 101, in create_slot
    return _create_slot_var(primary, val, '')
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/slot_creator.py", line 55, in _create_slot_var
    slot = variable_scope.get_variable(scope, initializer=val, trainable=False)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 987, in get_variable
    custom_getter=custom_getter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 889, in get_variable
    custom_getter=custom_getter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 347, in get_variable
    validate_shape=validate_shape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 332, in _true_getter
    caching_device=caching_device, validate_shape=validate_shape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 656, in _get_single_variable
    "VarScope?" % name)
ValueError: Variable conv1/weights/ExponentialMovingAverage/ does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=None in VarScope?

That error seems to be what #892 is addressing and is hopefully fixed by PR #909. Closing this issue out as a duplicate of #892.

@asimshankar This scope issue is discussed in tensorflow/tensorflow#6220. I have submitted a PR #911 to resolve these two issues.
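
To illustrate why the ExponentialMovingAverage variable "does not exist", here is a toy model in plain Python. It is not TensorFlow code: `ToyVarStore`, `build_towers` and `train` are hypothetical names, and the store only mimics the rule that once reuse is switched on, get_variable may fetch existing variables but not create new ones. My understanding, which is an assumption and not a statement of what PR #911 contains, is that the usual remedy is to build the towers inside `with tf.variable_scope(tf.get_variable_scope()):` so the reuse flag set by reuse_variables() does not leak into the code that creates the moving-average variables.

```python
import contextlib


class ToyVarStore(object):
    """Toy stand-in for TensorFlow's variable store; NOT the real API."""

    def __init__(self):
        self.variables = {}
        self.reuse = False

    def get_variable(self, name):
        if self.reuse:
            # With reuse on, only existing variables may be fetched.
            if name not in self.variables:
                raise ValueError("Variable %s does not exist" % name)
            return self.variables[name]
        if name in self.variables:
            raise ValueError("Variable %s already exists" % name)
        self.variables[name] = object()
        return self.variables[name]

    @contextlib.contextmanager
    def scope(self):
        """Restore the reuse flag on exit, so changes inside stay local."""
        saved = self.reuse
        try:
            yield
        finally:
            self.reuse = saved


def build_towers(store, num_gpus):
    for _ in range(num_gpus):
        store.get_variable("conv1/weights")
        store.reuse = True  # plays the role of reuse_variables()


def train(num_gpus, isolate_reuse):
    store = ToyVarStore()
    if isolate_reuse:
        with store.scope():
            build_towers(store, num_gpus)
    else:
        build_towers(store, num_gpus)
    # The moving-average step needs to create a brand-new variable.
    return store.get_variable("conv1/weights/ExponentialMovingAverage")
```

Calling `train(4, isolate_reuse=False)` raises the same kind of ValueError as in the traceback above, while `train(4, isolate_reuse=True)` creates the averaged variable without error.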

@wookayin Thanks a lot. It works.
: )
But each GPU shows less than 25% GPU-Util (I have 4x TITAN X (Pascal)).
What do you think is limiting GPU usage?

Yes, mine (4x Pascal Titan X) also show low utilization (15%~20%). I presume this is because the Pascal Titan X consumes input faster (!) than it is produced into the queue (including the time for I/O and the session.run() overhead), but I'm not 100% confident.
