Rasa NLU + TensorFlow: no performance gain with a way bigger machine and GPU

Created on 18 Sep 2018 · 15 comments · Source: RasaHQ/rasa

Rasa NLU version: 0.13.3

Operating system (windows, osx, ...):

System 1: (AWS p3.8xlarge)
NAME="Ubuntu"
VERSION="16.04.5 LTS (Xenial Xerus)"

vCPUs: 32
GPUs: 4x Tesla V100
GPU Memory: 64GB

System 2:
NAME="Ubuntu"
VERSION="18.04.1 LTS (Bionic Beaver)"

Intel® Core™ i7-4770 Quad-Core
RAM: 32 GB DDR3 RAM

Content of model configuration file:

System 1:

language: "de"
num_threads: 1000

pipeline:
- name: "tokenizer_whitespace"
- name: "ner_crf"
- name: "ner_synonyms"
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"

System 2

language: "de"
num_threads: 100

pipeline:
- name: "tokenizer_whitespace"
- name: "ner_crf"
- name: "ner_synonyms"
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"

Issue:

Hey,

I have a training dataset (20 MB, maybe 30k-40k sentences) and trained it on both systems with TensorFlow. On System 1 I could confirm that all 4 GPUs are used. The stats look like this:

System 1:
2018-09-18 13:03:07.512768: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2, 3
Epochs: 100%|██████████| 300/300 [12:27<00:00, 2.49s/it, loss=0.002, acc=1.000]

System 2:
Epochs: 100%|██████████| 300/300 [13:38<00:00, 2.27s/it, loss=0.001, acc=1.000]

Does anyone know why there is no speed gain even though System 1 is way more powerful than System 2? Or is something wrong with the configuration?

Thanks in advance.

Best Flo

All 15 comments

Have you definitely installed tensorflow to run on GPUs?

Yes, I did, and I could confirm it via the console log.
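
(For anyone reading along: here is a quick way to double-check that from Python. These are plain TF 1.x calls from that era, nothing Rasa-specific, and meant only as a sanity-check sketch.)

# Sanity check: does this TensorFlow build actually see the GPUs?
# (TF 1.x APIs, matching the Rasa NLU 0.13 timeframe.)
import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)                   # installed TensorFlow version
print(tf.test.is_built_with_cuda())     # False means a CPU-only wheel is installed
print(tf.test.is_gpu_available())       # True means at least one GPU is usable
print([d.name for d in device_lib.list_local_devices()])  # expect '/device:GPU:0', ...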

The reason for no significant performance gain could be that by default the internal neural networks have only zero or two hidden layers, which is not really deep, and there is a bit of numpy-based calculation for negative sampling in between epochs.
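
(If you want to experiment with a deeper model, the depth is tunable from the pipeline config. The hidden_layers_sizes_a / hidden_layers_sizes_b options below are the embedding classifier's hidden-layer settings from that generation of Rasa NLU; the sizes shown are purely illustrative, not a recommendation:)

pipeline:
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"
  hidden_layers_sizes_a: [512, 256, 128]   # layers for the message embedding
  hidden_layers_sizes_b: [256, 128]        # layers for the intent-label embedding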

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

This issue has been automatically closed due to inactivity. Please create a new issue if you need more help.

@Ghostvv I spent a great deal of time optimizing training time with CUDA. I got the following results:

35:33<00:00 gpu
36:09<00:00 no gpu

I noticed your post on the forum; can you achieve better times with a GPU?

:-)

we are working on that. It looks like the bottleneck is the numpy calculations in between batches. We're working on using tf.data.Dataset for training, which should boost performance.
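
(Roughly what that change looks like, as a minimal sketch rather than Rasa's actual training code: the featurized numpy arrays are handed to a tf.data pipeline once, and shuffling/batching then happen inside the TensorFlow runtime instead of in numpy between steps. Shapes and names below are made up.)

# Minimal tf.data sketch in TF 1.x style; arrays are hypothetical placeholders.
import numpy as np
import tensorflow as tf

X = np.random.rand(1000, 64).astype(np.float32)   # featurized messages (made up)
y = np.random.randint(0, 10, size=1000)           # intent ids (made up)

dataset = (tf.data.Dataset.from_tensor_slices((X, y))
           .shuffle(buffer_size=1000)
           .batch(64)
           .repeat())
features, labels = dataset.make_one_shot_iterator().get_next()
# `features` and `labels` are now graph tensors, so the training op can consume
# them directly and batch preparation no longer blocks on Python/numpy.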

Vladimir, how is the work going? What's the time horizon for this particular task? #MeInTheLoop

sorry, there is no timeline for it. It combines several major changes that we're trying out.

Hello there,
any update on this topic?
Best

@dimitriosp tf.data.Dataset is implemented in the current version of Rasa.

@Ghostvv

Can you elaborate on how to activate this?
When I try to switch to tensorflow-gpu with the newest Rasa, the versions mismatch (which one should I use? I could not find a tutorial covering GPU & Rasa).

And do you have a comparison for the performance gain?

best

it is activated by default. Which versions mismatch? The training time will be about the same.
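
(On the version question: one generic way to see which TensorFlow release the installed Rasa distribution pins, so a matching tensorflow-gpu wheel can be chosen. This is a hypothetical setuptools-based helper, not an official Rasa recipe.)

# Print the TensorFlow requirement declared by the installed rasa package.
import pkg_resources

rasa_dist = pkg_resources.get_distribution("rasa")
for req in rasa_dist.requires():
    if "tensorflow" in req.project_name:
        print(req)   # install the tensorflow-gpu wheel matching this pin
# If nothing is printed, the pin may sit behind an extra; `pip show rasa`
# also lists the declared requirement names.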

@Ghostvv "The training time will be about the same" so this means no performance gain on training?

As I said above, the default model is relatively shallow, so why would you expect a performance gain from GPUs?
