Hi team!
I'm testing the performance of the new Transformer model on an English-to-Italian translation task. I created my own data generator and problem. I trained an engine on a single-GPU AWS machine (Tesla K80, instance type p2.xlarge).
Everything went well, with very good results in terms of translation quality! I then wanted to test Transformer in a multi-GPU environment, but here's the problem: while TensorFlow correctly creates multiple devices, only one GPU is used during training, so the process does not speed up at all.
Here are the relevant parts of the log on my 8-GPU machine (Tesla K80), AWS instance p2.8xlarge:
2017-07-10 10:57:09.588828: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:17.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-07-10 10:57:09.722229: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x8521ce0 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-07-10 10:57:09.722779: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[...]
2017-07-10 10:57:10.582707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 7 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:1e.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-07-10 10:57:10.614015: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 1 2 3 4 5 6 7
2017-07-10 10:57:10.614038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y Y Y Y Y Y Y Y
2017-07-10 10:57:10.614046: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 1: Y Y Y Y Y Y Y Y
2017-07-10 10:57:10.614057: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 2: Y Y Y Y Y Y Y Y
2017-07-10 10:57:10.614063: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 3: Y Y Y Y Y Y Y Y
2017-07-10 10:57:10.614068: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 4: Y Y Y Y Y Y Y Y
2017-07-10 10:57:10.614073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 5: Y Y Y Y Y Y Y Y
2017-07-10 10:57:10.614077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 6: Y Y Y Y Y Y Y Y
2017-07-10 10:57:10.614082: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 7: Y Y Y Y Y Y Y Y
2017-07-10 10:57:10.614097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:17.0)
2017-07-10 10:57:10.614104: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:00:18.0)
2017-07-10 10:57:10.614121: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K80, pci bus id: 0000:00:19.0)
2017-07-10 10:57:10.614127: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K80, pci bus id: 0000:00:1a.0)
2017-07-10 10:57:10.614131: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:4) -> (device: 4, name: Tesla K80, pci bus id: 0000:00:1b.0)
2017-07-10 10:57:10.614136: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:5) -> (device: 5, name: Tesla K80, pci bus id: 0000:00:1c.0)
2017-07-10 10:57:10.614140: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:6) -> (device: 6, name: Tesla K80, pci bus id: 0000:00:1d.0)
2017-07-10 10:57:10.614144: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:7) -> (device: 7, name: Tesla K80, pci bus id: 0000:00:1e.0)
2017-07-10 10:57:18.742725: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 1266 get requests, put_count=1200 evicted_count=1000 eviction_rate=0.833333 and unsatisfied allocation rate=0.921011
2017-07-10 10:57:18.742770: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
INFO:tensorflow:Saving checkpoints for 1 into /home/ubuntu/workspace/t2t_train/my_custom_problem/transformer-transformer_base/model.ckpt.
INFO:tensorflow:loss = 9.96959, step = 1
2017-07-10 10:58:17.319308: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 4750 get requests, put_count=4663 evicted_count=1000 eviction_rate=0.214454 and unsatisfied allocation rate=0.233684
2017-07-10 10:58:17.319347: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 256 to 281
INFO:tensorflow:global_step/sec: 1.34529
INFO:tensorflow:loss = 8.00578, step = 101 (74.334 sec)
Here's the output of the command nvidia-smi:
| NVIDIA-SMI 375.26 Driver Version: 375.26 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 0000:00:17.0 Off | 0 |
| N/A 82C P0 136W / 149W | 10917MiB / 11439MiB | 91% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 On | 0000:00:18.0 Off | 0 |
| N/A 49C P0 72W / 149W | 10877MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 On | 0000:00:19.0 Off | 0 |
| N/A 63C P0 60W / 149W | 10877MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 On | 0000:00:1A.0 Off | 0 |
| N/A 53C P0 71W / 149W | 10877MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 On | 0000:00:1B.0 Off | 0 |
| N/A 63C P0 59W / 149W | 10875MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla K80 On | 0000:00:1C.0 Off | 0 |
| N/A 51C P0 70W / 149W | 10875MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla K80 On | 0000:00:1D.0 Off | 0 |
| N/A 64C P0 59W / 149W | 10873MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla K80 On | 0000:00:1E.0 Off | 0 |
| N/A 53C P0 70W / 149W | 10873MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
The time needed for 100 steps is exactly the same as in the 1-GPU case (~75 sec).
Here's the script I used for the training:
PROBLEM=my_custom_problem
MODEL=transformer
HPARAMS=transformer_base # it was transformer_base_single_gpu on 1-GPU test
DATA_DIR=$(pwd)/t2t_data
TMP_DIR=$(pwd)/t2t_datagen
TRAIN_DIR=$(pwd)/t2t_train/$PROBLEM/$MODEL-$HPARAMS
mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR
# Generate data
t2t-datagen \
--data_dir=$DATA_DIR \
--tmp_dir=$TMP_DIR \
--num_shards=100 \
--problem=$PROBLEM
mv $TMP_DIR/tokens.vb $DATA_DIR
t2t-trainer --data_dir=$DATA_DIR --problems=$PROBLEM --model=$MODEL --hparams_set=$HPARAMS --output_dir=$TRAIN_DIR --hparams='batch_size=2048'
Am I doing something wrong? How can I train on multiple GPUs with the Transformer model?
Thanks in advance for your help.
Davide
UPDATE: after reading the first reply to issue #99, I decided to add the flag --worker_gpu=8 to my t2t-trainer command, so now it looks like this:
t2t-trainer --data_dir=$DATA_DIR --problems=$PROBLEM --model=$MODEL --hparams_set=$HPARAMS --output_dir=$TRAIN_DIR --hparams='batch_size=2048' --worker_gpu=8
Here's a comparison of the first 1000 steps in both cases:
Single GPU
INFO:tensorflow:loss = 10.1301, step = 1
INFO:tensorflow:loss = 8.98003, step = 101 (76.338 sec)
INFO:tensorflow:loss = 8.32426, step = 201 (77.305 sec)
INFO:tensorflow:loss = 7.93614, step = 301 (77.471 sec)
INFO:tensorflow:loss = 7.56578, step = 401 (77.584 sec)
INFO:tensorflow:loss = 7.00716, step = 501 (77.871 sec)
INFO:tensorflow:loss = 6.46784, step = 601 (77.493 sec)
INFO:tensorflow:loss = 6.42706, step = 701 (77.233 sec)
INFO:tensorflow:loss = 5.50415, step = 801 (88.698 sec)
INFO:tensorflow:loss = 5.73775, step = 901 (77.630 sec)
INFO:tensorflow:loss = 4.82792, step = 1001 (77.974 sec)
Multiple (8) GPUs
INFO:tensorflow:loss = 10.2379, step = 1
INFO:tensorflow:loss = 7.91347, step = 101 (104.119 sec)
INFO:tensorflow:loss = 6.46561, step = 201 (106.224 sec)
INFO:tensorflow:loss = 5.05295, step = 301 (105.891 sec)
INFO:tensorflow:loss = 4.85656, step = 401 (107.138 sec)
INFO:tensorflow:loss = 5.04184, step = 501 (106.947 sec)
INFO:tensorflow:loss = 3.94963, step = 601 (152.102 sec)
INFO:tensorflow:loss = 4.29208, step = 701 (108.727 sec)
INFO:tensorflow:loss = 3.95084, step = 801 (108.000 sec)
INFO:tensorflow:loss = 2.87029, step = 901 (108.402 sec)
INFO:tensorflow:loss = 3.39836, step = 1001 (107.186 sec)
I expected the time needed for 100 steps to be divided by 6-8 (depending on the degree of parallelism), but instead it increased by 40%.
Is this normal?
Yes, this is expected. According to https://github.com/tensorflow/tensor2tensor/issues/17#issuecomment-310268149 and https://github.com/tensorflow/tensor2tensor/issues/17#issuecomment-310495062, each step now processes a batch that is 8 times larger.
Is it working well for you? As @martinpopel says, this is expected because your effective batch size is now 8x larger. I'm closing this for now, but please, let us know how it's working and reopen if there are any problems.
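To make the "8x larger effective batch" point concrete, here is a rough sketch that compares throughput rather than time per step. It assumes that batch_size=2048 applies per GPU (as the replies above indicate) and uses approximate step times read off the logs in this thread (about 77.5 sec and 107 sec per 100 steps). The throughput helper is hypothetical and is not part of tensor2tensor.

```shell
#!/bin/sh
# Hypothetical helper: batch units processed per second.
# $1 = number of GPUs, $2 = per-GPU batch size, $3 = steps, $4 = seconds
throughput() {
  awk -v g="$1" -v b="$2" -v s="$3" -v t="$4" \
    'BEGIN { printf "%.0f\n", g * b * s / t }'
}

# Approximate timings taken from the logs above (assumed typical values).
single=$(throughput 1 2048 100 77.5)
multi=$(throughput 8 2048 100 107.0)

# Ratio of the two throughputs, rounded to one decimal place.
speedup=$(awk -v a="$multi" -v b="$single" 'BEGIN { printf "%.1f\n", a / b }')

echo "1 GPU:   $single batch units/sec"
echo "8 GPUs:  $multi batch units/sec"
echo "speedup: ${speedup}x"
```

Under these assumptions, each step takes about 40% longer on 8 GPUs but covers 8 times as much data, so the overall throughput is roughly 5.8 times higher. This also fits the logs above, where the loss falls faster per step in the 8-GPU run.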
Thanks, now I definitely have a better picture of how the process works. I'm running some more tests and will let you know if there are any problems.
Thanks again!
As a matter of fact, I have no idea how to use a specific GPU in my command. Can you give me some suggestions? @davidecaroselli @lukaszkaiser
Set it with CUDA: the CUDA_VISIBLE_DEVICES environment variable controls which GPUs are visible within a shell session.
@liesun1994: I use the CUDA_VISIBLE_DEVICES environment variable. E.g. export CUDA_VISIBLE_DEVICES=0,2,3 exposes the first, third and fourth GPUs and hides the second one.
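As a small sketch of how this could be combined with the --worker_gpu flag mentioned earlier in the thread: the snippet below exposes three GPUs and counts them so that the flag can match. The num_gpus variable is only an illustrative name, and the trainer command is left as a comment because it depends on your own setup.

```shell
#!/bin/sh
# Expose only physical GPUs 0, 2 and 3 to processes started from this shell.
# Inside those processes they are renumbered as devices 0, 1 and 2.
export CUDA_VISIBLE_DEVICES=0,2,3

# Count the visible devices (illustrative variable name).
num_gpus=$(echo "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | wc -l | tr -d ' ')
echo "visible GPUs: $CUDA_VISIBLE_DEVICES (count: $num_gpus)"

# The training command from this thread would then become, for example:
# t2t-trainer --data_dir=$DATA_DIR --problems=$PROBLEM --model=$MODEL \
#   --hparams_set=$HPARAMS --output_dir=$TRAIN_DIR \
#   --hparams='batch_size=2048' --worker_gpu=$num_gpus
```

The variable only affects processes launched from the shell where it was exported, so a different terminal session can use a different set of GPUs.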
@davidecaroselli Would you share your better idea on Multi-GPUs?
Lost in translation :) I mean I have now a better picture of how the system works.