DeepSpeech: TFLite + Quantization

Created on 23 Jan 2019 · 9 Comments · Source: mozilla/DeepSpeech

With post_training_quantize=True in TOCO, on a Google Pixel 2 device:

walleye:/data/local/tmp $ ./lite_benchmark_model --graph=output_graph_non_quant.tflite --show_flops --input_layer=input_node,previous_state_c,previous_state_h --input_layer_type=float,float,float --input_layer_shape=1,16,19,26:1:1,2048:1,2048>
STARTING!
The number of items in --input_layer_shape (1,16,19,26:1:1,2048:1,2048, with 4 items) must match the number of items in --input_layer (input_node,previous_state_c,previous_state_h, with 3 items). For example --input_layer=input1,input2 --input_layer_shape=1,224,224,4:1,20
Num runs: [50]
Inter-run delay (seconds): [-1]
Num threads: [1]
Benchmark name: []
Output prefix: []
Warmup runs: [1]
Graph: [output_graph_non_quant.tflite]
Input layers: [input_node,previous_state_c,previous_state_h]
Input shapes: [1,16,19,26:1:1,2048:1,2048]
Use nnapi : [0]
Loaded model output_graph_non_quant.tflite
resolved reporter
Initialized session in 38.631ms
Running benchmark for 1 iterations 
count=1 curr=581397

Running benchmark for 50 iterations 
count=50 first=480581 curr=480893 min=470353 max=487364 avg=479385 std=3728

Average inference timings in us: Warmup: 581397, Init: 38631, no stats: 479385
walleye:/data/local/tmp $ ./lite_benchmark_model --graph=output_graph_quant.tflite  --show_flops --input_layer=input_node,previous_state_c,previous_state_h --input_layer_type=float,float,float --input_layer_shape=1,16,19,26:1:1,2048:1,2048 --ou
STARTING!
The number of items in --input_layer_shape (1,16,19,26:1:1,2048:1,2048, with 4 items) must match the number of items in --input_layer (input_node,previous_state_c,previous_state_h, with 3 items). For example --input_layer=input1,input2 --input_layer_shape=1,224,224,4:1,20
Num runs: [50]
Inter-run delay (seconds): [-1]
Num threads: [1]
Benchmark name: []
Output prefix: []
Warmup runs: [1]
Graph: [output_graph_quant.tflite]
Input layers: [input_node,previous_state_c,previous_state_h]
Input shapes: [1,16,19,26:1:1,2048:1,2048]
Use nnapi : [0]
Loaded model output_graph_quant.tflite
resolved reporter
Initialized session in 36.913ms
Running benchmark for 1 iterations 
count=1 curr=288370

Running benchmark for 50 iterations 
count=50 first=121900 curr=123450 min=121673 max=126796 avg=122527 std=1159

Average inference timings in us: Warmup: 288370, Init: 36913, no stats: 122527
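
As a rough sanity check on the two Pixel 2 runs above, the 50-iteration summary lines can be parsed and compared. This is only an illustrative sketch: the helper name `parse_avg_us` is hypothetical, and the two strings are copied verbatim from the benchmark output. It assumes the `avg=` field is in microseconds, as the "Average inference timings in us" line suggests.

```python
import re


def parse_avg_us(summary_line):
    """Extract the avg= field (assumed microseconds) from a benchmark summary line."""
    # The character class also accepts scientific notation such as 1.78988e+06.
    match = re.search(r"avg=([0-9.eE+]+)", summary_line)
    if match is None:
        raise ValueError("no avg= field found")
    return float(match.group(1))


# Summary lines copied from the Pixel 2 runs above.
non_quant = "count=50 first=480581 curr=480893 min=470353 max=487364 avg=479385 std=3728"
quant = "count=50 first=121900 curr=123450 min=121673 max=126796 avg=122527 std=1159"

avg_non_quant_us = parse_avg_us(non_quant)
avg_quant_us = parse_avg_us(quant)
speedup = avg_non_quant_us / avg_quant_us

print(f"non-quantized: {avg_non_quant_us / 1000:.1f} ms per run")
print(f"quantized:     {avg_quant_us / 1000:.1f} ms per run")
print(f"speedup:       {speedup:.2f}x")
```

On these numbers the quantized graph comes out roughly 3.9x faster per benchmark run on a single thread.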

Accuracy example with native client:

walleye:/data/local/tmp/arm64 $ LD_LIBRARY_PATH=/data/local/tmp/arm64/ ./deepspeech --model /sdcard/deepspeech/output_graph_non_quant.tflite --alphabet /sdcard/deepspeech/alphabet.txt --audio ../test-alex.en.wav -t
TensorFlow: v1.12.0-10-ge232881
DeepSpeech: v0.4.1-18-g5d842c2
audio_format=1
num_channels=1
sample_rate=16000
bits_per_sample=16
res.buffer_size=116000
i headlor hendo helow
cpu_time_overall=7.89830
walleye:/data/local/tmp/arm64 $ LD_LIBRARY_PATH=/data/local/tmp/arm64/ ./deepspeech --model /sdcard/deepspeech/output_graph_quant.tflite --alphabet /sdcard/deepspeech/alphabet.txt --audio ../test-alex.en.wav -t    
TensorFlow: v1.12.0-10-ge232881
DeepSpeech: v0.4.1-18-g5d842c2
audio_format=1
num_channels=1
sample_rate=16000
bits_per_sample=16
res.buffer_size=116000
a hearlor helo helo
cpu_time_overall=3.01929
walleye:/data/local/tmp/arm64 $
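
The two `cpu_time_overall` values can be put in context by comparing them with the length of the audio clip. The sketch below assumes that `res.buffer_size` is the size in bytes of the raw PCM data and that `cpu_time_overall` is in seconds; neither is stated in the thread. The helper name `audio_seconds` is hypothetical.

```python
def audio_seconds(buffer_size_bytes, sample_rate, bits_per_sample, num_channels):
    """Duration of raw PCM audio, assuming buffer_size_bytes counts PCM bytes only."""
    bytes_per_frame = (bits_per_sample // 8) * num_channels
    return buffer_size_bytes / (bytes_per_frame * sample_rate)


# Values copied from the client output above.
duration = audio_seconds(116000, sample_rate=16000, bits_per_sample=16, num_channels=1)

# Real-time factor: processing time divided by audio duration (below 1.0 is faster than real time).
rtf_non_quant = 7.89830 / duration
rtf_quant = 3.01929 / duration

print(f"audio duration:    {duration:.3f} s")
print(f"RTF non-quantized: {rtf_non_quant:.2f}")
print(f"RTF quantized:     {rtf_quant:.2f}")
```

Under those assumptions the clip is about 3.6 s long, so the non-quantized model runs at roughly 2.2x real time and the quantized model at roughly 0.83x, which is faster than real time on this device.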

Non-quantized vs. quantized model on LePotato:

lepotato@lepotato:~/ds$ ./lite_benchmark_model --graph=output_graph_non_quant.tflite --show_flops --input_layer=input_node,previous_state_c,previous_state_h --input_layer_type=float,float,float --input_layer_shape=1,16,19,26:1:1,2048:1,2048 --output_layer=logits,new_state_c,new_state_h
STARTING!
The number of items in --input_layer_shape (1,16,19,26:1:1,2048:1,2048, with 4 items) must match the number of items in --input_layer (input_node,previous_state_c,previous_state_h, with 3 items). For example --input_layer=input1,input2 --input_layer_shape=1,224,224,4:1,20
Num runs: [50]
Inter-run delay (seconds): [-1]
Num threads: [1]
Benchmark name: []
Output prefix: []
Warmup runs: [1]
Graph: [output_graph_non_quant.tflite]
Input layers: [input_node,previous_state_c,previous_state_h]
Input shapes: [1,16,19,26:1:1,2048:1,2048]
Use nnapi : [0]
Loaded model output_graph_non_quant.tflite
resolved reporter
Initialized session in 5.257ms
Running benchmark for 1 iterations
count=1 curr=1887383

Running benchmark for 50 iterations
count=50 first=1791325 curr=1790110 min=1787146 max=1792665 avg=1.78988e+06 std=1173

Average inference timings in us: Warmup: 1.88738e+06, Init: 5257, no stats: 1.78988e+06
lepotato@lepotato:~/ds$ ./lite_benchmark_model --graph=output_graph_quant.tflite --show_flops --input_layer=input_node,previous_state_c,previous_state_h --input_layer_type=float,float,float --input_layer_shape=1,16,19,26:1:1,2048:1,2048 --output_layer=logits,new_state_c,new_state_h 
STARTING!
The number of items in --input_layer_shape (1,16,19,26:1:1,2048:1,2048, with 4 items) must match the number of items in --input_layer (input_node,previous_state_c,previous_state_h, with 3 items). For example --input_layer=input1,input2 --input_layer_shape=1,224,224,4:1,20
Num runs: [50]
Inter-run delay (seconds): [-1]
Num threads: [1]
Benchmark name: []
Output prefix: []
Warmup runs: [1]
Graph: [output_graph_quant.tflite]
Input layers: [input_node,previous_state_c,previous_state_h]
Input shapes: [1,16,19,26:1:1,2048:1,2048]
Use nnapi : [0]
Loaded model output_graph_quant.tflite
resolved reporter
Initialized session in 4.922ms
Running benchmark for 1 iterations 
count=1 curr=711037

Running benchmark for 50 iterations 
count=50 first=650351 curr=650633 min=650351 max=651962 avg=651087 std=386

Average inference timings in us: Warmup: 711037, Init: 4922, no stats: 651087
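
Every run above prints the same warning about `--input_layer_shape` having 4 items against 3 input layers. Judging from the example embedded in the message itself, layer names are comma-separated and shapes are colon-separated. The sketch below reproduces that counting; it is an assumption about how the tool splits its arguments, not its actual source, and `count_items` is a hypothetical helper.

```python
def count_items(input_layer, input_layer_shape):
    """Split the two flag values the way the warning message implies."""
    layers = input_layer.split(",")        # layer names are comma-separated
    shapes = input_layer_shape.split(":")  # shapes are colon-separated
    return layers, shapes


# Flag values copied from the benchmark command lines above.
layers, shapes = count_items(
    "input_node,previous_state_c,previous_state_h",
    "1,16,19,26:1:1,2048:1,2048",
)

print(len(layers), "layers:", layers)
print(len(shapes), "shapes:", shapes)
if len(layers) != len(shapes):
    print("mismatch: the number of shapes differs from the number of layers")
```

The shape string splits into `1,16,19,26`, `1`, `1,2048` and `1,2048`, so there is one more shape than there are named layers. The thread does not say which entry is the extra one or whether the mismatch affected the reported timings.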

Non-quantized TFLite model, v0.4.1, testing on LibriSpeech test-clean dataset:

Test - WER: 0.104306, CER: 0.044113, loss: 0.000000

real    52m54,435s
user    1610m25,233s
sys     2m26,146s

Quantized TFLite model, v0.4.1, testing on LibriSpeech test-clean dataset:

Test - WER: 0.107406, CER: 0.046574, loss: 0.000000

real    10m42,961s
user    313m45,772s
sys     0m40,347s

TF model v0.4.1, testing on LibriSpeech test-clean dataset:

Test - WER: 0.084925, CER: 0.035407, loss: 0.000000

real    46m7,826s
user    1282m18,824s
sys     101m11,893s

Given the low WER impact and the large performance increase, let's enable post-training quantization.
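
To make the trade-off concrete, the WER and wall-clock figures reported above can be compared directly. This is a minimal sketch: `parse_real_seconds` is a hypothetical helper for the comma-decimal `time` output shown in the thread, and the comparison assumes the test runs were made under comparable conditions, which the thread does not confirm.

```python
import re


def parse_real_seconds(value):
    """Convert a `time` value such as '10m42,961s' (comma or dot decimal) to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)[,.](\d+)s", value)
    if match is None:
        raise ValueError(f"unrecognized time format: {value!r}")
    minutes, seconds, fraction = match.groups()
    return int(minutes) * 60 + float(f"{seconds}.{fraction}")


# WER and "real" times copied from the LibriSpeech test-clean results above.
results = {
    "TF": {"wer": 0.084925, "real": "46m7,826s"},
    "TFLite non-quantized": {"wer": 0.104306, "real": "52m54,435s"},
    "TFLite quantized": {"wer": 0.107406, "real": "10m42,961s"},
}

baseline = results["TFLite non-quantized"]
quantized = results["TFLite quantized"]

# Relative WER increase of the quantized model over the non-quantized TFLite model.
wer_increase = (quantized["wer"] - baseline["wer"]) / baseline["wer"]
# How many times shorter the quantized test run was in wall-clock time.
wall_clock_ratio = parse_real_seconds(baseline["real"]) / parse_real_seconds(quantized["real"])

print(f"relative WER increase: {wer_increase:.1%}")
print(f"wall-clock ratio:      {wall_clock_ratio:.2f}x")
```

On these figures, quantization raises WER by about 3% relative (0.31 points absolute) while the test run finishes roughly 4.9x sooner.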
