Wav2letter: saving emissions - slow decoding

Created on 5 Mar 2018 · 12 Comments · Source: flashlight/wav2letter

In #29, it was mentioned that transcribing on CPU should take about 1 second per sentence, but there were no suggestions on how to debug decoding when it is slow (minutes per sentence). Has anybody discovered the cause of the slow decoding?

From #29:
"It can't be 6 days, something must be wrong with your system. In my experience transcribing on CPU should take 1s per sequence, not 3 mins."

[Viterbi: reallocating with T=2000 N=30]
<|mister|quilter|is|the|apostle|of|the|mid1le|clas1es|and|we|are|glad|to|welcome|his|gospel|>
<|mister|quilter|is|the|apostle|of|the|mid1le|clas1es|and|we|are|glad|to|welcome|his|gospel|>
[Sentence WER: 000.00%, dataset WER: 000.00%]
<|nor|is|mister|qulter's|man1er|les1|interesting|than|his|mat1er|>......................] ETA: 0ms | Step: 0ms
<|nor|is|mister|quilter's|man1er|les1|interesting|than|his|mat1er|>
[Sentence WER: 010.00%, dataset WER: 003.70%]
<|he|tel1s|us|that|at|this|festive|season|of|the|year|with|christmas|and|roast|be1f|loving|before|us|similes|drawn|from|eating|and|its|results|ocur|most|readily|to|the|mind|>
<|he|tel1s|us|that|at|this|festive|season|of|the|year|with|christmas|and|roast|be1f|lo1ming|before|us|similes|drawn|from|eating|and|its|results|oc1ur|most|readily|to|the|mind|>
[Sentence WER: 006.25%, dataset WER: 005.08%]
<|it|is|obviously|un1eces1ary|for|us|to|point|out|how|luminous|these|criticisms|are|how|delicate|in|expres1ion|>5s
<|it|is|obviously|un1eces1ary|for|us|to|point|out|how|luminous|these|criticisms|are|how|delicate|in|expres1ion|>
[Sentence WER: 000.00%, dataset WER: 003.90%]
[........................................ 4/2703 ......................................] ETA: 6D11h | Step: 3m27s

jojo05
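
The ETA shown in the progress bar above is consistent with simple arithmetic: remaining sentences times the current step time. A minimal shell sketch of that check follows. The function name `eta` is made up for illustration, and it assumes whole-second step times.

```shell
#!/bin/sh
# Hypothetical helper (not part of wav2letter): estimate the remaining time
# as (total - completed) * seconds_per_step, printed as days/hours/minutes.
# Arguments: total sentences, sentences completed, seconds per step (integers).
eta() {
    total=$1
    completed=$2
    step=$3
    secs=$(( (total - completed) * step ))
    days=$(( secs / 86400 ))
    hours=$(( (secs % 86400) / 3600 ))
    mins=$(( (secs % 3600) / 60 ))
    printf '%dD%02dh%02dm\n' "$days" "$hours" "$mins"
}
```

With the numbers from the log above (4 of 2703 sentences done, 3m27s = 207 s per step), `eta 2703 4 207` prints `6D11h11m`, which matches the reported `6D11h`. At the 1 s per step quoted from #29, `eta 2703 4 1` prints `0D00h44m`, so the whole dev-clean set would be expected to finish in well under an hour.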

Most helpful comment

What a difference MKL makes!

After correctly installing Intel MKL, I tested and re-tested, and my scripts (https://github.com/grahamimac/wav2letter-docker) now work for both setup and testing. Processing takes about 5-6 seconds per step/sentence, down significantly from 2-4 minutes previously.

<|he|paused|they|never|did|to|me|hy|>...........] ETA: 3h17m | Step: 4s642ms    
<|he|paused|they|never|did|to|me|>
[Sentence WER: 014.29%, dataset WER: 038.21%]
 [=>................. 153/2703 .................] ETA: 3h17m | Step: 4s636ms

@vineelpratap thank you for the help! If there is anything else besides MKL that makes the Facebook infrastructure about 5x faster, or a command that would make this run even faster in Lua, I'd love to hear it. If not, thanks for the MKL lead! Currently, I'm testing with the command:

~/usr/bin/luajit /wav2letter/test.lua /librispeech-glu-highdropout-cpu.bin -progress -show -test dev-clean -save -datadir /librispeech-proc/ -dictdir /librispeech-proc/ -gfsai

grahamimac on 4 May 2018

👍2

All 12 comments

Can you let us know the command you are using?

vineelpratap on 6 Mar 2018

Thanks. Below are the command used, details of the model dir and the librispeech-proc dir (dictdir / datadir), and the initial output (slightly different from the previous one).

D=/home/ml/asr/librispeech-proc/
luajit wav2letter/test.lua /home/ml/asr/librispeech-glu-highdropout-cpu.bin -progress -show -test dev-clean -save -dictdir $D -datadir $D -gfsai

[root@i asr]# ll /home/ml/asr
total 1631800
-rwxr--r--. 1 joe joe 317 Mar 5 13:08 dl-model.sh
-rwxr--r--. 1 joe joe 185 Mar 5 12:05 dl.sh
drwxr-xr-x. 9 root root 4096 Mar 5 13:39 LibriSpeech
-rw-r--r--. 1 root root 1670805187 Jan 11 06:59 librispeech-glu-highdropout-cpu.bin
drwxr-xr-x. 9 root root 4096 Mar 6 09:59 librispeech-proc
-rw-r--r--. 1 root root 126976 Mar 6 10:17 output-dev-clean.bin
-rw-r--r--. 1 root root 3725 Mar 6 10:13 transitions-dev-clean.bin

[root@i asr]# ll /home/ml/asr/librispeech-proc
total 157312
-rw-r--r--. 1 root root 38757477 Mar 5 16:16 3-gram.pruned.3e-7.arpa
-rw-r--r--. 1 root root 37458439 Mar 4 22:02 3-gram.pruned.3e-7.bin
drwxr-xr-x. 2 root root 765952 Mar 5 16:54 dev-clean
drwxr-xr-x. 2 root root 831488 Mar 5 16:54 dev-other
-rw-r--r--. 1 root root 3475117 Mar 5 16:16 dict.lst
-rw-r--r--. 1 root root 56 Mar 5 14:56 letters.lst
-rw-r--r--. 1 root root 60 Mar 5 16:13 letters-rep.lst
drwxr-xr-x. 2 root root 745472 Mar 5 16:54 test-clean
drwxr-xr-x. 2 root root 856064 Mar 5 16:55 test-other
drwxr-xr-x. 2 root root 7979008 Mar 5 16:28 train-clean-100
drwxr-xr-x. 2 root root 29356032 Mar 5 16:40 train-clean-360
drwxr-xr-x. 2 root root 40824832 Mar 5 16:54 train-other-500

[Viterbi: reallocating with T=2000 N=30]
<|nor|is|mister|qulter's|man1er|les1|interesting|than|his|mat1er|>
<|nor|is|mister|quilter's|man1er|les1|interesting|than|his|mat1er|>
[Sentence WER: 010.00%, dataset WER: 010.00%]
<|mister|quilter|is|the|apostle|of|the|mid1le|clas1es|and|we|are|glad|to|welcome|his|gospel|>A: 0ms | Step: 0ms
<|mister|quilter|is|the|apostle|of|the|mid1le|clas1es|and|we|are|glad|to|welcome|his|gospel|>
[Sentence WER: 000.00%, dataset WER: 003.70%]
[........................................ 2/2703 ......................................] ETA: 4D12h | Step: 2m24s

jojo05 on 6 Mar 2018

Is anybody running the CPU model with 1 s decoding times?

jojo05 on 7 Mar 2018

Hi,
I tried to verify again and it takes around 1sec for me for each step (I made sure GPU is disabled for the process).
[Screenshot: 2018-03-07 at 2:10:07 pm]

Can you remove the -save option and see if there is a significant change in time? I wonder whether disk reading/writing is taking the time.
Otherwise, one would have to do some benchmarking and see which step is taking the most time in https://github.com/facebookresearch/wav2letter/blob/master/test.lua.

vineelpratap on 7 Mar 2018

Removing -save doesn't make a difference. I am using an i7-6700K CPU and a hard disk.
Are you using an SSD?

jojo05 on 12 Mar 2018

If the code is doing random seeks within the files, then the difference between an SSD and a hard disk would explain the slow decoding.

jojo05 on 13 Mar 2018

Hi,
While I am using an SSD, I'm not 100% sure that this could cause such a huge regression. We use ParallelDatasetIterator, which should make preprocessing time almost zero.

It may be best to benchmark the code to see what's happening. I'm not able to repro this, so can you follow these steps to see which step is taking the most time?

1. Find where the sgdengine.lua file is on your machine. It comes with the torchnet package.
   find ~/ -name '*sgdengine*'
2. Measure the time taken by each step in https://github.com/torchnet/torchnet/blob/master/engine/sgdengine.lua#L170-L185 .
3. If none of the hooks takes a significant amount of time, then it is most likely file reading that is causing the regression.

vineelpratap on 14 Mar 2018

I was able to replicate the same issue, with similar expected wait times of 2-4 minutes per step, or 4-8 days total.

I tried with and without the -save option on an AWS c5.2xlarge instance with an SSD, running the setup in a fresh Ubuntu Docker container. The results were the same either way.

So I added print statements between each line in the code of sgdengine.lua as @vineelpratap suggested, and visually inspected how long each step took. Sure enough, one of the lines of code, specifically

state.network:forward(sample.input)

seems to be causing the entire delay. So I'm assuming it's not file reading but whatever is going on in this step. Here is the outline of a Dockerfile I've started that shows all the steps I took to set up wav2letter, if it's helpful: https://github.com/grahamimac/wav2letter-docker/blob/master/Dockerfile

Any ideas from here/where to look next?

grahamimac on 14 Apr 2018

👍1
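
One low-effort way to do the kind of print-statement timing described above is to prefix every output line with the seconds elapsed since the start of the run, so that large gaps between consecutive prints stand out. This is only a sketch: `stamp` is a hypothetical helper, not part of wav2letter or torchnet, and it has one-second resolution.

```shell
#!/bin/sh
# Hypothetical filter (not part of wav2letter or torchnet): read lines from
# stdin and prefix each with the whole seconds elapsed since the filter started.
stamp() {
    start=$(date +%s)
    while IFS= read -r line; do
        now=$(date +%s)
        printf '[+%ds] %s\n' "$(( now - start ))" "$line"
    done
}
```

It would be used by piping the test command from this thread into it, for example `luajit wav2letter/test.lua /home/ml/asr/librispeech-glu-highdropout-cpu.bin -show -test dev-clean -dictdir $D -datadir $D -gfsai | stamp`. Two caveats: output may be buffered when stdout is a pipe, so the added print statements may need an `io.stdout:flush()` after them for the timestamps to be meaningful, and the -progress bar is left out here because it redraws a single line rather than emitting new ones.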

Hi, I'm not able to repro this on my machine.
Are you using Intel MKL (https://github.com/facebookresearch/wav2letter#mkl)? It can make a huge difference during inference, especially for the convolutions and matrix multiplications in the network.

vineelpratap on 26 Apr 2018

Ah, it turns out MKL was not installed, and there does not seem to have been any default BLAS such as OpenBLAS either, given that it was a clean Ubuntu Docker container. I have updated my setup code at the GitHub link above. Since it takes a few hours to build and then test, I'll aim to test when I have a bit more time and report the results. Thanks!

grahamimac on 3 May 2018

👍2
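
To check which BLAS a Torch build actually picked up, one option is to inspect the shared-library dependencies of Torch's tensor library with `ldd`. The sketch below only classifies `ldd`-style output read from stdin. `blas_backend` is a made-up name, and the library path in the usage line afterwards is an assumption about a default Torch install that may differ on your machine.

```shell
#!/bin/sh
# Hypothetical helper: classify the BLAS backend named in ldd-style output.
# Reads the dependency listing from stdin and prints one of:
#   mkl, openblas, generic, none
blas_backend() {
    deps=$(cat)
    case "$deps" in
        *libmkl*)                          echo mkl ;;
        *libopenblas*)                     echo openblas ;;
        *libblas*|*libcblas*|*libatlas*)   echo generic ;;
        *)                                 echo none ;;
    esac
}
```

Assuming the tensor library lives at `~/torch/install/lib/libTH.so`, the check would be `ldd ~/torch/install/lib/libTH.so | blas_backend`. A result of `none` is only a hint, since a statically linked BLAS would not appear in `ldd` output, but on a clean container it is a reasonable cue to revisit the MKL setup instructions linked above.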

@grahamimac Good to know it worked for you! I used the same command you mentioned for my initial benchmark. Since, as you mentioned, the network forward time is the bottleneck, please make sure you are running the latest version of the torch nn package (if you are not already).

vineelpratap on 9 May 2018
