With post_training_quantize=True in TOCO, on a Google Pixel 2 device:
walleye:/data/local/tmp $ ./lite_benchmark_model --graph=output_graph_non_quant.tflite --show_flops --input_layer=input_node,previous_state_c,previous_state_h --input_layer_type=float,float,float --input_layer_shape=1,16,19,26:1:1,2048:1,2048>
STARTING!
The number of items in --input_layer_shape (1,16,19,26:1:1,2048:1,2048, with 4 items) must match the number of items in --input_layer (input_node,previous_state_c,previous_state_h, with 3 items). For example --input_layer=input1,input2 --input_layer_shape=1,224,224,4:1,20
Num runs: [50]
Inter-run delay (seconds): [-1]
Num threads: [1]
Benchmark name: []
Output prefix: []
Warmup runs: [1]
Graph: [output_graph_non_quant.tflite]
Input layers: [input_node,previous_state_c,previous_state_h]
Input shapes: [1,16,19,26:1:1,2048:1,2048]
Use nnapi : [0]
Loaded model output_graph_non_quant.tflite
resolved reporter
Initialized session in 38.631ms
Running benchmark for 1 iterations
count=1 curr=581397
Running benchmark for 50 iterations
count=50 first=480581 curr=480893 min=470353 max=487364 avg=479385 std=3728
Average inference timings in us: Warmup: 581397, Init: 38631, no stats: 479385
walleye:/data/local/tmp $ ./lite_benchmark_model --graph=output_graph_quant.tflite --show_flops --input_layer=input_node,previous_state_c,previous_state_h --input_layer_type=float,float,float --input_layer_shape=1,16,19,26:1:1,2048:1,2048 --ou
STARTING!
The number of items in --input_layer_shape (1,16,19,26:1:1,2048:1,2048, with 4 items) must match the number of items in --input_layer (input_node,previous_state_c,previous_state_h, with 3 items). For example --input_layer=input1,input2 --input_layer_shape=1,224,224,4:1,20
Num runs: [50]
Inter-run delay (seconds): [-1]
Num threads: [1]
Benchmark name: []
Output prefix: []
Warmup runs: [1]
Graph: [output_graph_quant.tflite]
Input layers: [input_node,previous_state_c,previous_state_h]
Input shapes: [1,16,19,26:1:1,2048:1,2048]
Use nnapi : [0]
Loaded model output_graph_quant.tflite
resolved reporter
Initialized session in 36.913ms
Running benchmark for 1 iterations
count=1 curr=288370
Running benchmark for 50 iterations
count=50 first=121900 curr=123450 min=121673 max=126796 avg=122527 std=1159
Average inference timings in us: Warmup: 288370, Init: 36913, no stats: 122527
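To put a number on the two Pixel 2 runs above, here is a minimal sketch that parses the two "Average inference timings" summary lines and divides the non-quantized average by the quantized one. The helper names `parse_summary` and `speedup` are made up for this illustration and are not part of the benchmark tool. The only inputs are the two log lines copied verbatim from above.

```python
import re

# Summary lines copied verbatim from the two Pixel 2 runs above.
NON_QUANT = "Average inference timings in us: Warmup: 581397, Init: 38631, no stats: 479385"
QUANT = "Average inference timings in us: Warmup: 288370, Init: 36913, no stats: 122527"


def parse_summary(line):
    """Return the Warmup / Init / no stats values of a summary line, in microseconds.

    Hypothetical helper for this sketch. It also accepts values printed in
    scientific notation, such as 1.78988e+06.
    """
    pairs = re.findall(r"(Warmup|Init|no stats): ([0-9.e+]+)", line)
    return {name: float(value) for name, value in pairs}


def speedup(baseline_line, candidate_line, key="no stats"):
    """Ratio of the baseline timing to the candidate timing for one field."""
    return parse_summary(baseline_line)[key] / parse_summary(candidate_line)[key]


pixel2_speedup = speedup(NON_QUANT, QUANT)
print(f"Pixel 2 average inference speedup: {pixel2_speedup:.2f}x")
```

With the figures above this works out to roughly 3.9x on average inference time, for a single thread and 50 runs.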
Before going further, we need to document accuracy: https://github.com/tensorflow/tensorflow/blob/r1.12/tensorflow/contrib/lite/tools/accuracy/README.md
Accuracy example with native client:
walleye:/data/local/tmp/arm64 $ LD_LIBRARY_PATH=/data/local/tmp/arm64/ ./deepspeech --model /sdcard/deepspeech/output_graph_non_quant.tflite --alphabet /sdcard/deepspeech/alphabet.txt --audio ../test-alex.en.wav -t
TensorFlow: v1.12.0-10-ge232881
DeepSpeech: v0.4.1-18-g5d842c2
audio_format=1
num_channels=1
sample_rate=16000
bits_per_sample=16
res.buffer_size=116000
i headlor hendo helow
cpu_time_overall=7.89830
walleye:/data/local/tmp/arm64 $ LD_LIBRARY_PATH=/data/local/tmp/arm64/ ./deepspeech --model /sdcard/deepspeech/output_graph_quant.tflite --alphabet /sdcard/deepspeech/alphabet.txt --audio ../test-alex.en.wav -t
TensorFlow: v1.12.0-10-ge232881
DeepSpeech: v0.4.1-18-g5d842c2
audio_format=1
num_channels=1
sample_rate=16000
bits_per_sample=16
res.buffer_size=116000
a hearlor helo helo
cpu_time_overall=3.01929
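The two native client runs above can also be compared end to end. The sketch below is only a rough estimate and rests on one assumption that the logs do not state: that `res.buffer_size` is a byte count, so that the audio duration follows from the sample rate, channel count and bit depth printed by the client. Note also that `cpu_time_overall` is CPU time rather than wall-clock time, so the "real-time factor" here is an approximation.

```python
# Values copied from the two native client runs above.
BUFFER_SIZE = 116000      # res.buffer_size
SAMPLE_RATE = 16000       # sample_rate
BITS_PER_SAMPLE = 16      # bits_per_sample
NUM_CHANNELS = 1          # num_channels
CPU_TIME = {"non_quant": 7.89830, "quant": 3.01929}  # cpu_time_overall, in seconds

# ASSUMPTION: res.buffer_size counts bytes, not samples.
bytes_per_second = SAMPLE_RATE * NUM_CHANNELS * BITS_PER_SAMPLE // 8
audio_seconds = BUFFER_SIZE / bytes_per_second

# How much less CPU time the quantized model needs for the same file.
end_to_end_speedup = CPU_TIME["non_quant"] / CPU_TIME["quant"]

# CPU seconds spent per second of audio; below 1.0 means faster than real time.
rtf = {name: seconds / audio_seconds for name, seconds in CPU_TIME.items()}

print(f"audio duration (assumed): {audio_seconds:.3f} s")
print(f"end-to-end speedup: {end_to_end_speedup:.2f}x")
for name, value in rtf.items():
    print(f"{name}: {value:.2f} CPU seconds per second of audio")
```

Under that assumption, the quantized model drops below one CPU second per second of audio on this file, while the non-quantized one needs a bit more than two.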
walleye:/data/local/tmp/arm64 $
Non-quantized vs. quantized model on LePotato:
lepotato@lepotato:~/ds$ ./lite_benchmark_model --graph=output_graph_non_quant.tflite --show_flops --input_layer=input_node,previous_state_c,previous_state_h --input_layer_type=float,float,float --input_layer_shape=1,16,19,26:1:1,2048:1,2048 --output_layer=logits,new_state_c,new_state_h
STARTING!
The number of items in --input_layer_shape (1,16,19,26:1:1,2048:1,2048, with 4 items) must match the number of items in --input_layer (input_node,previous_state_c,previous_state_h, with 3 items). For example --input_layer=input1,input2 --input_layer_shape=1,224,224,4:1,20
Num runs: [50]
Inter-run delay (seconds): [-1]
Num threads: [1]
Benchmark name: []
Output prefix: []
Warmup runs: [1]
Graph: [output_graph_non_quant.tflite]
Input layers: [input_node,previous_state_c,previous_state_h]
Input shapes: [1,16,19,26:1:1,2048:1,2048]
Use nnapi : [0]
Loaded model output_graph_non_quant.tflite
resolved reporter
Initialized session in 5.257ms
Running benchmark for 1 iterations
count=1 curr=1887383
Running benchmark for 50 iterations
count=50 first=1791325 curr=1790110 min=1787146 max=1792665 avg=1.78988e+06 std=1173
Average inference timings in us: Warmup: 1.88738e+06, Init: 5257, no stats: 1.78988e+06
lepotato@lepotato:~/ds$ ./lite_benchmark_model --graph=output_graph_quant.tflite --show_flops --input_layer=input_node,previous_state_c,previous_state_h --input_layer_type=float,float,float --input_layer_shape=1,16,19,26:1:1,2048:1,2048 --output_layer=logits,new_state_c,new_state_h
STARTING!
The number of items in --input_layer_shape (1,16,19,26:1:1,2048:1,2048, with 4 items) must match the number of items in --input_layer (input_node,previous_state_c,previous_state_h, with 3 items). For example --input_layer=input1,input2 --input_layer_shape=1,224,224,4:1,20
Num runs: [50]
Inter-run delay (seconds): [-1]
Num threads: [1]
Benchmark name: []
Output prefix: []
Warmup runs: [1]
Graph: [output_graph_quant.tflite]
Input layers: [input_node,previous_state_c,previous_state_h]
Input shapes: [1,16,19,26:1:1,2048:1,2048]
Use nnapi : [0]
Loaded model output_graph_quant.tflite
resolved reporter
Initialized session in 4.922ms
Running benchmark for 1 iterations
count=1 curr=711037
Running benchmark for 50 iterations
count=50 first=650351 curr=650633 min=650351 max=651962 avg=651087 std=386
Average inference timings in us: Warmup: 711037, Init: 4922, no stats: 651087
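The same comparison can be made for the LePotato runs, this time from the `count=50 ...` statistics lines, which also carry the standard deviation. The sketch below is illustrative only: `parse_stats` is a hypothetical helper, and the inputs are the two lines copied from the logs above.

```python
# count=50 lines copied verbatim from the two LePotato runs above.
NON_QUANT = "count=50 first=1791325 curr=1790110 min=1787146 max=1792665 avg=1.78988e+06 std=1173"
QUANT = "count=50 first=650351 curr=650633 min=650351 max=651962 avg=651087 std=386"


def parse_stats(line):
    """Split space-separated key=value tokens into a dict of floats.

    Hypothetical helper for this sketch. Timings are in microseconds.
    """
    return {key: float(value) for key, value in (tok.split("=") for tok in line.split())}


non_quant = parse_stats(NON_QUANT)
quant = parse_stats(QUANT)

# Ratio of average inference times, non-quantized over quantized.
lepotato_speedup = non_quant["avg"] / quant["avg"]

# Relative spread (std / avg): a small value means the 50 runs were stable.
spread = {
    "non_quant": non_quant["std"] / non_quant["avg"],
    "quant": quant["std"] / quant["avg"],
}

print(f"LePotato average inference speedup: {lepotato_speedup:.2f}x")
for name, value in spread.items():
    print(f"{name}: std/avg = {value:.4%}")
```

On this board the gain is closer to 2.7x, and in both runs the standard deviation stays below 0.1% of the average, so the difference is well outside the run-to-run noise.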
Non-quantized TFLite model, v0.4.1, testing on LibriSpeech test-clean dataset:
Test - WER: 0.104306, CER: 0.044113, loss: 0.000000
real 52m54,435s
user 1610m25,233s
sys 2m26,146s
Quantized TFLite model, v0.4.1, testing on LibriSpeech test-clean dataset:
Test - WER: 0.107406, CER: 0.046574, loss: 0.000000
real 10m42,961s
user 313m45,772s
sys 0m40,347s
TF model v0.4.1, testing on LibriSpeech test-clean dataset:
Test - WER: 0.084925, CER: 0.035407, loss: 0.000000
real 46m7,826s
user 1282m18,824s
sys 101m11,893s
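To make the trade-off between the two TFLite runs above explicit, here is a small sketch that computes the WER difference and the wall-clock ratio. The `RESULTS` dictionary and the `to_seconds` helper are introduced only for this illustration. The numbers are copied from the three reports above, and the `real` values use a comma as the decimal separator, which the helper handles.

```python
import re

# WER, CER and wall-clock ("real") values copied from the three reports above.
RESULTS = {
    "tflite_non_quant": {"wer": 0.104306, "cer": 0.044113, "real": "52m54,435s"},
    "tflite_quant": {"wer": 0.107406, "cer": 0.046574, "real": "10m42,961s"},
    "tf": {"wer": 0.084925, "cer": 0.035407, "real": "46m7,826s"},
}


def to_seconds(real):
    """Convert a `time` value such as '52m54,435s' (comma decimal) to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+[,.]\d+)s", real)
    if match is None:
        raise ValueError(f"unexpected time format: {real!r}")
    return int(match.group(1)) * 60 + float(match.group(2).replace(",", "."))


base = RESULTS["tflite_non_quant"]
quant = RESULTS["tflite_quant"]

wer_delta = quant["wer"] - base["wer"]        # absolute WER change
wer_relative = wer_delta / base["wer"]        # relative to the non-quantized TFLite model
wall_clock_ratio = to_seconds(base["real"]) / to_seconds(quant["real"])

print(f"WER change: {wer_delta:+.4f} absolute, {wer_relative:+.2%} relative")
print(f"test-clean wall-clock ratio: {wall_clock_ratio:.2f}x")
```

Between the two TFLite models, that is about 0.3 points of WER (roughly 3% relative) for a test run that finishes almost five times sooner.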
Given the low WER impact and the large performance increase, let's enable that.