In #29, it was mentioned that transcribing on CPU should take about 1 second per sentence, but there were no suggestions on how to debug slow decoding (minutes per sentence). Has anybody discovered the cause of slow decoding?
From #29,
"It can't be 6 days, something must be wrong with your system. In my experience transcribing on CPU should take 1s per sequence, not 3 mins."
[Viterbi: reallocating with T=2000 N=30]
<|mister|quilter|is|the|apostle|of|the|mid1le|clas1es|and|we|are|glad|to|welcome|his|gospel|>
<|mister|quilter|is|the|apostle|of|the|mid1le|clas1es|and|we|are|glad|to|welcome|his|gospel|>
[Sentence WER: 000.00%, dataset WER: 000.00%]
<|nor|is|mister|qulter's|man1er|les1|interesting|than|his|mat1er|>
<|nor|is|mister|quilter's|man1er|les1|interesting|than|his|mat1er|>
[Sentence WER: 010.00%, dataset WER: 003.70%]
<|he|tel1s|us|that|at|this|festive|season|of|the|year|with|christmas|and|roast|be1f|loving|before|us|similes|drawn|from|eating|and|its|results|ocur|most|readily|to|the|mind|>
<|he|tel1s|us|that|at|this|festive|season|of|the|year|with|christmas|and|roast|be1f|lo1ming|before|us|similes|drawn|from|eating|and|its|results|oc1ur|most|readily|to|the|mind|>
[Sentence WER: 006.25%, dataset WER: 005.08%]
<|it|is|obviously|un1eces1ary|for|us|to|point|out|how|luminous|these|criticisms|are|how|delicate|in|expres1ion|>
<|it|is|obviously|un1eces1ary|for|us|to|point|out|how|luminous|these|criticisms|are|how|delicate|in|expres1ion|>
[Sentence WER: 000.00%, dataset WER: 003.90%]
[........................................ 4/2703 ......................................] ETA: 6D11h | Step: 3m27s
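As a sanity check on the log above, the ETA in the progress line is consistent with multiplying the remaining sentences by the per-step time. The sketch below reproduces that arithmetic. `format_eta` is a hypothetical helper written for this note, not part of wav2letter or torchnet, and it assumes the progress bar extrapolates linearly from the step time.

```shell
#!/bin/sh
# Hypothetical helper: estimate the remaining time from the number of
# sentences left and the per-step time in milliseconds.
format_eta() {
    remaining=$1      # sentences still to decode
    step_ms=$2        # time per sentence, in milliseconds
    total_s=$(( remaining * step_ms / 1000 ))
    days=$(( total_s / 86400 ))
    hours=$(( (total_s % 86400) / 3600 ))
    minutes=$(( (total_s % 3600) / 60 ))
    if [ "$days" -gt 0 ]; then
        printf '%dD%dh\n' "$days" "$hours"
    else
        printf '%dh%dm\n' "$hours" "$minutes"
    fi
}

# 2703 - 4 = 2699 sentences left at 3m27s (207000 ms) per step.
format_eta 2699 207000
# For comparison: the 1 s per sentence quoted from #29.
format_eta 2699 1000
```

The first call prints 6D11h, matching the progress line. At 1 s per sentence the same run would take about 45 minutes, so the slowdown here is roughly a factor of 200.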
Can you let us know the command you are using? Thanks.
See below:
D=/home/ml/asr/librispeech-proc/
luajit wav2letter/test.lua /home/ml/asr/librispeech-glu-highdropout-cpu.bin -progress -show -test dev-clean -save -dictdir $D -datadir $D -gfsai
[root@i asr]# ll /home/ml/asr
total 1631800
-rwxr--r--. 1 joe joe 317 Mar 5 13:08 dl-model.sh
-rwxr--r--. 1 joe joe 185 Mar 5 12:05 dl.sh
drwxr-xr-x. 9 root root 4096 Mar 5 13:39 LibriSpeech
-rw-r--r--. 1 root root 1670805187 Jan 11 06:59 librispeech-glu-highdropout-cpu.bin
drwxr-xr-x. 9 root root 4096 Mar 6 09:59 librispeech-proc
-rw-r--r--. 1 root root 126976 Mar 6 10:17 output-dev-clean.bin
-rw-r--r--. 1 root root 3725 Mar 6 10:13 transitions-dev-clean.bin
[root@i asr]# ll /home/ml/asr/librispeech-proc
total 157312
-rw-r--r--. 1 root root 38757477 Mar 5 16:16 3-gram.pruned.3e-7.arpa
-rw-r--r--. 1 root root 37458439 Mar 4 22:02 3-gram.pruned.3e-7.bin
drwxr-xr-x. 2 root root 765952 Mar 5 16:54 dev-clean
drwxr-xr-x. 2 root root 831488 Mar 5 16:54 dev-other
-rw-r--r--. 1 root root 3475117 Mar 5 16:16 dict.lst
-rw-r--r--. 1 root root 56 Mar 5 14:56 letters.lst
-rw-r--r--. 1 root root 60 Mar 5 16:13 letters-rep.lst
drwxr-xr-x. 2 root root 745472 Mar 5 16:54 test-clean
drwxr-xr-x. 2 root root 856064 Mar 5 16:55 test-other
drwxr-xr-x. 2 root root 7979008 Mar 5 16:28 train-clean-100
drwxr-xr-x. 2 root root 29356032 Mar 5 16:40 train-clean-360
drwxr-xr-x. 2 root root 40824832 Mar 5 16:54 train-other-500
[Viterbi: reallocating with T=2000 N=30]
<|nor|is|mister|qulter's|man1er|les1|interesting|than|his|mat1er|>
<|nor|is|mister|quilter's|man1er|les1|interesting|than|his|mat1er|>
[Sentence WER: 010.00%, dataset WER: 010.00%]
<|mister|quilter|is|the|apostle|of|the|mid1le|clas1es|and|we|are|glad|to|welcome|his|gospel|>
<|mister|quilter|is|the|apostle|of|the|mid1le|clas1es|and|we|are|glad|to|welcome|his|gospel|>
[Sentence WER: 000.00%, dataset WER: 003.70%]
[........................................ 2/2703 ......................................] ETA: 4D12h | Step: 2m24s
Is anybody running the CPU model with 1 s decoding times?
Hi,
I tried to verify again and it takes around 1sec for me for each step (I made sure GPU is disabled for the process).

Can you remove the -save option and see if there is a significant change in time? I wonder whether disk reading/writing is what is taking the time.
Otherwise, one would have to do some benchmarking and see which step is taking the most time in https://github.com/facebookresearch/wav2letter/blob/master/test.lua.
Removing -save doesn't make a difference. I am using an i7-6700k CPU and a hard disk.
Are you using an SSD?
If the code is doing random seeks in the file, then the difference between an SSD and a hard disk would explain the slow decoding.
Hi,
While I'm using an SSD, I'm not 100% sure this could cause such a huge slowdown. We use ParallelDatasetIterator, which should make preprocessing time almost zero.
It may be best to benchmark the code to see what's happening. I'm not able to repro this; can you follow these steps to see which step is taking the most time?
The sgdengine.lua file is on your machine; it comes with the torchnet package. You can locate it with:
find ~/ -name '*sgdengine*'
I was able to replicate the same issue, with similar expected wait times of 2-4 minutes per step, or 4-8 days total.
I tried with and without the -save option, using an AWS c5.2xlarge instance with SSD, running the setup in a fresh Ubuntu Docker container. The results were the same with and without -save.
So I added print statements between each line in the code of sgdengine.lua as @vineelpratap suggested, and visually inspected how long each step took. Sure enough, one of the lines of code, specifically
state.network:forward(sample.input)
seems to be causing the entire delay. So I'm assuming it's not file reading but whatever is going on in this step. Here is the outline of a Dockerfile I've started that shows all the steps I took to set up wav2letter, in case it's helpful: https://github.com/grahamimac/wav2letter-docker/blob/master/Dockerfile
Any ideas from here/where to look next?
Hi, I'm not able to repro this on my machine.
Are you using Intel MKL (https://github.com/facebookresearch/wav2letter#mkl)? It can make a huge difference during inference, especially for the convolutions and matrix multiplications in the network.
Ah, it turns out MKL was not installed, and it does not seem like there was any default BLAS such as OpenBLAS either, given it was a clean Ubuntu Docker container. I have updated my setup code at the GitHub link above. Since it takes a few hours to build and then test, I'll aim to test when I have a bit more time and report the results. Thanks!
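A quick way to check for this situation before rebuilding is to ask the dynamic linker which BLAS implementations it knows about. The sketch below is a rough heuristic only. `list_blas_libs` is a hypothetical helper that filters the text produced by `ldconfig -p`, and an MKL installed under a non-standard prefix (for example under /opt/intel) may not appear in the linker cache even when it is present on the machine.

```shell
#!/bin/sh
# Hypothetical helper: read linker-cache text on stdin and print the
# lines that look like MKL, OpenBLAS, ATLAS, or reference BLAS libraries.
# "|| true" keeps the exit status at 0 when nothing matches.
list_blas_libs() {
    grep -i -E 'libmkl|libopenblas|libatlas|libblas' || true
}

# Collect what the dynamic linker can see on this machine, if ldconfig
# is available on PATH.
if command -v ldconfig >/dev/null 2>&1; then
    found=$(ldconfig -p 2>/dev/null | list_blas_libs)
else
    found=""
fi

if [ -n "$found" ]; then
    printf 'BLAS-like libraries visible to the linker:\n%s\n' "$found"
else
    printf 'No MKL/OpenBLAS/BLAS library found in the linker cache.\n'
fi
```

An empty result on a clean container is a hint that the build may have fallen back to unoptimized routines, which would be consistent with the forward pass dominating the run time.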
What a difference MKL makes!
I tested and re-tested by correctly installing Intel MKL, and my scripts https://github.com/grahamimac/wav2letter-docker work successfully for setup and testing, producing about 5-6 second processing times per step/sentence, which is down significantly from 2 - 4 minutes previously.
<|he|paused|they|never|did|to|me|hy|>
<|he|paused|they|never|did|to|me|>
[Sentence WER: 014.29%, dataset WER: 038.21%]
[=>................. 153/2703 .................] ETA: 3h17m | Step: 4s636ms
@vineelpratap thank you for the help! If you have anything else you can share with us that gives the Facebook infrastructure 5x faster performance besides MKL, or a command that will help this go even faster in lua, I'd love to hear it! But if not, thanks for the MKL lead! Currently, I'm testing with the command:
~/usr/bin/luajit /wav2letter/test.lua /librispeech-glu-highdropout-cpu.bin -progress -show -test dev-clean -save -datadir /librispeech-proc/ -dictdir /librispeech-proc/ -gfsai
@grahamimac Good to know it worked for you! I used the same command you mentioned for my initial benchmark. Since, as you mentioned, the network forward time is the bottleneck, please make sure you are running the latest version of the torch nn package (if you are not already).