Tensor2tensor: Multi-GPU training with Transformer model

Created on 10 Jul 2017 · 9 Comments · Source: tensorflow/tensor2tensor

Hi team!

I'm testing the performance of the new Transformer model on an English-to-Italian translation task. I created my own data generator and problem. I trained an engine on a single-GPU machine (Tesla K80) on AWS, instance type p2.xlarge.

Everything went well, with very good results in terms of translation quality! I then wanted to test Transformer in a multi-GPU environment, but here's the problem: while TensorFlow correctly creates multiple devices, only one GPU is used during training, so the process does not speed up at all.

Here are the relevant log parts from my 8-GPU machine (Tesla K80), AWS instance p2.8xlarge:

2017-07-10 10:57:09.588828: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:17.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-07-10 10:57:09.722229: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x8521ce0 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-07-10 10:57:09.722779: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

[...]

2017-07-10 10:57:10.582707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 7 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:1e.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-07-10 10:57:10.614015: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 1 2 3 4 5 6 7 
2017-07-10 10:57:10.614038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y Y Y Y Y Y Y Y 
2017-07-10 10:57:10.614046: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 1:   Y Y Y Y Y Y Y Y 
2017-07-10 10:57:10.614057: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 2:   Y Y Y Y Y Y Y Y 
2017-07-10 10:57:10.614063: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 3:   Y Y Y Y Y Y Y Y 
2017-07-10 10:57:10.614068: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 4:   Y Y Y Y Y Y Y Y 
2017-07-10 10:57:10.614073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 5:   Y Y Y Y Y Y Y Y 
2017-07-10 10:57:10.614077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 6:   Y Y Y Y Y Y Y Y 
2017-07-10 10:57:10.614082: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 7:   Y Y Y Y Y Y Y Y 
2017-07-10 10:57:10.614097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:17.0)
2017-07-10 10:57:10.614104: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:00:18.0)
2017-07-10 10:57:10.614121: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K80, pci bus id: 0000:00:19.0)
2017-07-10 10:57:10.614127: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K80, pci bus id: 0000:00:1a.0)
2017-07-10 10:57:10.614131: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:4) -> (device: 4, name: Tesla K80, pci bus id: 0000:00:1b.0)
2017-07-10 10:57:10.614136: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:5) -> (device: 5, name: Tesla K80, pci bus id: 0000:00:1c.0)
2017-07-10 10:57:10.614140: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:6) -> (device: 6, name: Tesla K80, pci bus id: 0000:00:1d.0)
2017-07-10 10:57:10.614144: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:7) -> (device: 7, name: Tesla K80, pci bus id: 0000:00:1e.0)
2017-07-10 10:57:18.742725: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 1266 get requests, put_count=1200 evicted_count=1000 eviction_rate=0.833333 and unsatisfied allocation rate=0.921011
2017-07-10 10:57:18.742770: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
INFO:tensorflow:Saving checkpoints for 1 into /home/ubuntu/workspace/t2t_train/my_custom_problem/transformer-transformer_base/model.ckpt.
INFO:tensorflow:loss = 9.96959, step = 1
2017-07-10 10:58:17.319308: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 4750 get requests, put_count=4663 evicted_count=1000 eviction_rate=0.214454 and unsatisfied allocation rate=0.233684
2017-07-10 10:58:17.319347: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 256 to 281
INFO:tensorflow:global_step/sec: 1.34529
INFO:tensorflow:loss = 8.00578, step = 101 (74.334 sec)

Here's the output of the command nvidia-smi:

| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:00:17.0     Off |                    0 |
| N/A   82C    P0   136W / 149W |  10917MiB / 11439MiB |     91%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 0000:00:18.0     Off |                    0 |
| N/A   49C    P0    72W / 149W |  10877MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 0000:00:19.0     Off |                    0 |
| N/A   63C    P0    60W / 149W |  10877MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 0000:00:1A.0     Off |                    0 |
| N/A   53C    P0    71W / 149W |  10877MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           On   | 0000:00:1B.0     Off |                    0 |
| N/A   63C    P0    59W / 149W |  10875MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           On   | 0000:00:1C.0     Off |                    0 |
| N/A   51C    P0    70W / 149W |  10875MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           On   | 0000:00:1D.0     Off |                    0 |
| N/A   64C    P0    59W / 149W |  10873MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           On   | 0000:00:1E.0     Off |                    0 |
| N/A   53C    P0    70W / 149W |  10873MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

The time needed for 100 steps is the same as in the 1-GPU case (~75 sec).
Here's the script I used for the training:

PROBLEM=my_custom_problem
MODEL=transformer
HPARAMS=transformer_base # it was transformer_base_single_gpu on 1-GPU test

DATA_DIR=$(pwd)/t2t_data
TMP_DIR=$(pwd)/t2t_datagen
TRAIN_DIR=$(pwd)/t2t_train/$PROBLEM/$MODEL-$HPARAMS

mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR

# Generate data
t2t-datagen \
  --data_dir=$DATA_DIR \
  --tmp_dir=$TMP_DIR \
  --num_shards=100 \
  --problem=$PROBLEM

mv $TMP_DIR/tokens.vb $DATA_DIR

t2t-trainer --data_dir=$DATA_DIR --problems=$PROBLEM --model=$MODEL --hparams_set=$HPARAMS --output_dir=$TRAIN_DIR --hparams='batch_size=2048'

Am I doing something wrong? How can I train on multiple GPUs with the Transformer model?

Thanks in advance for your help.
Davide

All 9 comments

UPDATE: after reading the first reply to issue #99, I decided to add the flag --worker_gpu=8 to my t2t-trainer command, so now it looks like this:

t2t-trainer --data_dir=$DATA_DIR --problems=$PROBLEM --model=$MODEL --hparams_set=$HPARAMS --output_dir=$TRAIN_DIR --hparams='batch_size=2048' --worker_gpu=8

The results are:

  • Now all 8 GPUs of the system are used (GPU-Util is always above 80%).
  • BUT the time needed for 100 steps increased.

Here's a comparison of the first 1000 steps in both cases:

Single GPU

INFO:tensorflow:loss = 10.1301, step = 1
INFO:tensorflow:loss = 8.98003, step = 101 (76.338 sec)
INFO:tensorflow:loss = 8.32426, step = 201 (77.305 sec)
INFO:tensorflow:loss = 7.93614, step = 301 (77.471 sec)
INFO:tensorflow:loss = 7.56578, step = 401 (77.584 sec)
INFO:tensorflow:loss = 7.00716, step = 501 (77.871 sec)
INFO:tensorflow:loss = 6.46784, step = 601 (77.493 sec)
INFO:tensorflow:loss = 6.42706, step = 701 (77.233 sec)
INFO:tensorflow:loss = 5.50415, step = 801 (88.698 sec)
INFO:tensorflow:loss = 5.73775, step = 901 (77.630 sec)
INFO:tensorflow:loss = 4.82792, step = 1001 (77.974 sec)

Multiple (8) GPUs

INFO:tensorflow:loss = 10.2379, step = 1
INFO:tensorflow:loss = 7.91347, step = 101 (104.119 sec)
INFO:tensorflow:loss = 6.46561, step = 201 (106.224 sec)
INFO:tensorflow:loss = 5.05295, step = 301 (105.891 sec)
INFO:tensorflow:loss = 4.85656, step = 401 (107.138 sec)
INFO:tensorflow:loss = 5.04184, step = 501 (106.947 sec)
INFO:tensorflow:loss = 3.94963, step = 601 (152.102 sec)
INFO:tensorflow:loss = 4.29208, step = 701 (108.727 sec)
INFO:tensorflow:loss = 3.95084, step = 801 (108.000 sec)
INFO:tensorflow:loss = 2.87029, step = 901 (108.402 sec)
INFO:tensorflow:loss = 3.39836, step = 1001 (107.186 sec)

I expected the time needed for 100 steps to be divided by 6-8 (depending on the degree of parallelism), but instead it increased by about 40%.

Is this normal?

Yes, this is expected. According to https://github.com/tensorflow/tensor2tensor/issues/17#issuecomment-310268149 and https://github.com/tensorflow/tensor2tensor/issues/17#issuecomment-310495062, each step is now 8 times bigger: every GPU processes its own batch, so one step covers 8 times as much data as a single-GPU step.
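
As a rough check of what that means for the timings posted above, the sketch below compares data processed per second rather than steps per second. It assumes that batch_size=2048 is applied per GPU (which is what an 8x larger effective batch implies), and it uses approximate values read off the logs: about 77.5 sec per 100 steps on 1 GPU and about 107 sec per 100 steps on 8 GPUs. The function names throughput and speedup are hypothetical helpers for this illustration, not part of tensor2tensor.

```shell
#!/bin/sh
# throughput BATCH GPUS SECONDS_PER_100_STEPS
# Prints the batch units processed per second, assuming each GPU
# handles its own batch of size BATCH in every step.
throughput() {
  awk -v b="$1" -v g="$2" -v s="$3" 'BEGIN { printf "%.0f\n", 100 * b * g / s }'
}

# speedup SINGLE_SEC MULTI_SEC GPUS
# Prints the ratio of multi-GPU to single-GPU throughput.
speedup() {
  awk -v s="$1" -v m="$2" -v g="$3" 'BEGIN { printf "%.1f\n", g * s / m }'
}

# Approximate timings taken from the two logs in this thread.
echo "1 GPU:   $(throughput 2048 1 77.5) units/sec"
echo "8 GPUs:  $(throughput 2048 8 107) units/sec"
echo "speedup: $(speedup 77.5 107 8)x"
```

Under these assumptions the 8-GPU run is slower per step but processes roughly 5.8 times as much data per second, which is in line with the faster drop in loss per step visible in the multi-GPU log.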

Is it working well for you? As @martinpopel says, this is expected because your effective batch size is now 8x larger. I'm closing this for now, but please, let us know how it's working and reopen if there are any problems.

Thanks, now I definitely have a better picture of how the process works. I'm running some more tests, and I'll let you know if there are any problems.

Thanks again!

As a matter of fact, I have no idea how to use a specific GPU in my command. Can you give me some suggestions? @davidecaroselli @lukaszkaiser

Set it through CUDA: the CUDA device environment variables control which GPUs are visible to a command-line session.

@liesun1994: I use the CUDA_VISIBLE_DEVICES environment variable. E.g. export CUDA_VISIBLE_DEVICES=0,2,3 uses the first, third and fourth GPU and hides the second one.
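
A minimal sketch of how this could be combined with the --worker_gpu flag mentioned earlier in the thread. Matching --worker_gpu to the number of visible devices is an assumption of this example, and count_visible_gpus is a hypothetical helper, not a tensor2tensor or CUDA command. Note that the visible devices are renumbered from 0 inside the process, so physical GPUs 0, 2 and 3 appear as devices 0, 1 and 2.

```shell
#!/bin/sh
# Expose only physical GPUs 0, 2 and 3 to commands started from this shell.
export CUDA_VISIBLE_DEVICES=0,2,3

# count_visible_gpus LIST
# Hypothetical helper: counts the comma-separated entries in LIST.
count_visible_gpus() {
  echo "$1" | awk -F',' '{ print NF }'
}

NUM_GPUS=$(count_visible_gpus "$CUDA_VISIBLE_DEVICES")

# Print the flag that would be appended to the t2t-trainer command.
echo "--worker_gpu=$NUM_GPUS"
```

To limit a single command instead of the whole session, the variable can be set as a prefix on that command line rather than exported.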

@davidecaroselli Would you share your better idea on Multi-GPUs?

Lost in translation :) I mean I have now a better picture of how the system works.
