Models: Quantized SSD-MobileNet Checkpoints Missing Min/Max?

Created on 16 Jul 2018 · 35 Comments · Source: tensorflow/models

System information

  • What is the top-level directory of the model you are using: models/research
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • TensorFlow installed from (source or binary): Source
  • TensorFlow version (use command below): ('v1.9.0-rc2-400-geb5af9ad02', '1.9.0-rc0')
  • Bazel version (if compiling from source): bazel release 0.14.1
  • CUDA/cuDNN version: 9.2/7.1.4
  • GPU model and memory: NVIDIA P40/12
  • Exact command to reproduce: CUDA_VISIBLE_DEVICES=-1 python object_detection/export_tflite_ssd_graph.py --pipeline_config_path=object_detection/graphs/ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync_2018_07_03/pipeline.config --trained_checkpoint_prefix=object_detection/graphs/ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync_2018_07_03/model.ckpt --output_directory=/tmp/ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync_2018_07_03/ --add_postprocessing_op=true

Describe the problem

@achowdhery Are the "quantized" checkpoints in the model zoo supposed to contain the FakeQuant (Min/Max) variables?

I tried ssd_mobilenet_v1_0.75_depth_quantized_coco and ssd_mobilenet_v1_quantized_coco.

I used the command from the tutorial to export a quantized TF-Lite model.

Source code / logs

parvizp@cent-nano-0:~/Git/tensorflow.new/models/research$ CUDA_VISIBLE_DEVICES=-1 python object_detection/export_tflite_ssd_graph.py --pipeline_config_path=object_detection/graphs/ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync_2018_07_03/pipeline.config --trained_checkpoint_prefix=object_detection/graphs/ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync_2018_07_03/model.ckpt --output_directory=/tmp/ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync_2018_07_03/  --add_postprocessing_op=true
2018-07-16 10:47:11.378546: E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2018-07-16 10:47:11.378601: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: cent-nano-0
2018-07-16 10:47:11.378610: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: cent-nano-0
2018-07-16 10:47:11.378641: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: 396.26.0
2018-07-16 10:47:11.378670: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 396.26.0
2018-07-16 10:47:11.378678: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:305] kernel version seems to match DSO: 396.26.0
2018-07-16 10:47:13.255871: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key BoxPredictor_0/BoxEncodingPredictor/act_quant/max not found in checkpoint
Traceback (most recent call last):
  File "object_detection/export_tflite_ssd_graph.py", line 137, in <module>
    tf.app.run(main)
  File "/home/parvizp/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "object_detection/export_tflite_ssd_graph.py", line 133, in main
    FLAGS.max_classes_per_detection)
  File "/home/parvizp/Git/tensorflow.new/models/research/object_detection/export_tflite_ssd_graph_lib.py", line 261, in export_tflite_graph
    initializer_nodes='')
  File "/home/parvizp/Git/tensorflow.new/models/research/object_detection/exporter.py", line 72, in freeze_graph_with_def_protos
    saver.restore(sess, input_checkpoint)
  File "/home/parvizp/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1743, in restore
    err, "a Variable name or other graph key that is missing")
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key BoxPredictor_0/BoxEncodingPredictor/act_quant/max not found in checkpoint
         [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Caused by op u'save/RestoreV2', defined at:
  File "object_detection/export_tflite_ssd_graph.py", line 137, in <module>
    tf.app.run(main)
  File "/home/parvizp/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "object_detection/export_tflite_ssd_graph.py", line 133, in main
    FLAGS.max_classes_per_detection)
  File "/home/parvizp/Git/tensorflow.new/models/research/object_detection/export_tflite_ssd_graph_lib.py", line 261, in export_tflite_graph
    initializer_nodes='')
  File "/home/parvizp/Git/tensorflow.new/models/research/object_detection/exporter.py", line 67, in freeze_graph_with_def_protos
    tf.import_graph_def(input_graph_def, name='')
  File "/home/parvizp/.local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
    return func(*args, **kwargs)
  File "/home/parvizp/.local/lib/python2.7/site-packages/tensorflow/python/framework/importer.py", line 442, in import_graph_def
    _ProcessNewOps(graph)
  File "/home/parvizp/.local/lib/python2.7/site-packages/tensorflow/python/framework/importer.py", line 234, in _ProcessNewOps
    for new_op in graph._add_new_tf_operations(compute_devices=False):  # pylint: disable=protected-access
  File "/home/parvizp/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3360, in _add_new_tf_operations
    for c_op in c_api_util.new_tf_operations(self)
  File "/home/parvizp/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3251, in _create_op_from_tf_operation
    ret = Operation(c_op, self)
  File "/home/parvizp/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1716, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key BoxPredictor_0/BoxEncodingPredictor/act_quant/max not found in checkpoint
         [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
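
One way to confirm whether a checkpoint actually carries the FakeQuant ranges is to list its variable names and look for min/max entries such as the `act_quant/max` key named in the error. The following is a minimal sketch, not part of the Object Detection API: the helper names `find_quant_range_keys` and `list_checkpoint_keys` are hypothetical, the `weights_quant` scope is an assumption about how the quantization rewriter names weight ranges, and the second helper assumes a TensorFlow version that provides `tf.train.list_variables`.

```python
import re

# Matches FakeQuant range variables such as ".../act_quant/max".
# "weights_quant" is an assumed scope name; only "act_quant" appears in the log above.
_QUANT_RANGE = re.compile(r"(?:act_quant|weights_quant)/(?:min|max)$")


def find_quant_range_keys(variable_names):
    """Return the names that look like FakeQuant min/max variables, sorted."""
    return sorted(n for n in variable_names if _QUANT_RANGE.search(n))


def list_checkpoint_keys(checkpoint_prefix):
    """Read all variable names from a checkpoint (requires TensorFlow)."""
    import tensorflow as tf  # imported lazily so the helper above works without TF

    return [name for name, _shape in tf.train.list_variables(checkpoint_prefix)]


# Illustrative names only; for a real checkpoint use
# list_checkpoint_keys("path/to/model.ckpt") instead.
example_names = [
    "BoxPredictor_0/BoxEncodingPredictor/weights",
    "BoxPredictor_0/BoxEncodingPredictor/act_quant/min",
    "BoxPredictor_0/BoxEncodingPredictor/act_quant/max",
]
quant_keys = find_quant_range_keys(example_names)
```

If the returned list is empty for a real checkpoint, the export graph expects range variables that the checkpoint does not contain, which would be consistent with the `NotFoundError` above.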

All 35 comments

@parvizp Have you tried the checkpoint in the tutorial also copied here? https://storage.googleapis.com/download.tensorflow.org/models/tflite/ssd_mobilenet_v1_0.75_depth_300x300_quant_pets_2018_06_29.zip

We often see this error when the checkpoint has not been run through export_tflite_ssd_graph.py. Please double-check that you have already passed the checkpoint through that script to get the frozen graph. I can double-check the ones in the model zoo.

@achowdhery Thanks, I just tried your URL and the export succeeds.

Thanks. I have verified that the models from the model zoo convert as well.

@achowdhery I got similar errors when I converted models from https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md

NotFoundError (see above for traceback): Key BoxPredictor_0/BoxEncodingPredictor/act_quant/max not found in checkpoint

The model I am trying to convert is


Please give exact instructions to reproduce. We need to make sure we see the same issue.

@achowdhery I followed your blog https://medium.com/tensorflow/training-and-serving-a-realtime-mobile-object-detector-in-30-minutes-with-cloud-tpus-b78971cf1193. The only difference is the model file I tried to export.

Export the model with:

python object_detection/export_tflite_ssd_graph.py \
--pipeline_config_path object_detection/samples/configs/ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync.config \
--trained_checkpoint_prefix ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync_2018_07_03/model.ckpt \
--output_directory ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync_2018_07_03/tflite \
--add_postprocessing_op true

Hi, I ran into the same problem as melody-rain.
The checkpoint from
https://storage.googleapis.com/download.tensorflow.org/models/tflite/ssd_mobilenet_v1_0.75_depth_300x300_quant_pets_2018_06_29.zip
works, but the export fails when starting from a model zoo checkpoint.

@achowdhery Hi, I want to export ssdlite_mobilenetv2 and I hit the same issue:
Tensor name "BoxPredictor_0/BoxEncodingPredictor/biases" not found in checkpoint files

@RichardLiee What checkpoint are you using? Please provide a link.

@RichardLiee have you checked this?

hi @achowdhery ,
I tried to train a quantized model for mobile devices.
But when I converted the model to tflite, I got this:

tensorflow/lite/toco/tooling_util.cc:1694] Array FeatureExtractor/MobilenetV1/MobilenetV1/Squeeze_excitation_Conv2d_3_depthwise/mul, which is an input to the Conv operator producing the output array FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Relu6, is lacking min/max data, which is necessary for quantization.
If accuracy matters, either target a non-quantized output format, or run quantized training with your model from a floating point checkpoint to change the input graph to contain min/max information.
If you don't care about accuracy, you can pass --default_ranges_min= and --default_ranges_max= for easy experimentation.
Aborted (core dumped)

Please help with this. Does the training process need additional configuration?

When I added --default_ranges_min=0 --default_ranges_max=6, the tflite accuracy dropped badly. In some cases, though, it works with only a small decrease in accuracy.
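
The accuracy drop with default ranges can be illustrated with a little arithmetic. The sketch below uses a simplified 8-bit affine quantizer, which is an assumption for illustration and not TOCO's exact implementation; the function name `fake_quantize` is hypothetical. It shows that an activation outside the assumed range [0, 6] is clipped, while a range that matches the data preserves it closely.

```python
def fake_quantize(x, range_min, range_max, num_bits=8):
    """Quantize x to an integer grid over [range_min, range_max], then dequantize."""
    levels = 2 ** num_bits - 1
    scale = (range_max - range_min) / levels
    zero_point = round(-range_min / scale)
    q = round(x / scale) + zero_point
    q = max(0, min(levels, q))  # values outside the range are clipped
    return (q - zero_point) * scale


# A negative activation is lost entirely when the range is assumed to be [0, 6] ...
clipped = fake_quantize(-0.5, 0.0, 6.0)
# ... but survives with small error when the range covers the real data.
preserved = fake_quantize(-0.5, -1.0, 1.0)
```

Any tensor whose real values fall outside [0, 6], or occupy only a small part of it, loses information this way, which may explain why the flags work for some inputs and not for others.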

I am facing exactly the same problem as you, @oopsodd. Can someone give a hint or a solution?

I didn't solve the problem. The (default_ranges_min=0, default_ranges_max=6) option works only for some specific input image sizes of the same network.

Thank you @oopsodd for your reply. As you said, "dummy quantization" does not perform well.

I hit the same problem when training my own quantized model. How can I fix it?

Key BoxPredictor_0/BoxEncodingPredictor/act_quant/max not found in checkpoint
I faced this issue when I tried to train with http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v2_quantized_300x300_coco_2018_09_14.tar.gz

I fixed that problem by using the downloaded model as a pretrained model. Then, before retraining, this must be added to the end of the configuration file used for training:
graph_rewriter { quantization { delay: 48000 weight_bits: 8 activation_bits: 8 } }
With this in place, the model is prepared for later quantization and the necessary min/max information is written to the checkpoint.
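
For anyone scripting this step, here is a minimal sketch that appends the block above to a pipeline config only when no `graph_rewriter` section is present. It treats the config as plain text rather than parsing the protobuf, and the helper name `ensure_quantization_rewriter` is hypothetical.

```python
# The same settings as quoted above, laid out one field per line.
QUANT_BLOCK = """graph_rewriter {
  quantization {
    delay: 48000
    weight_bits: 8
    activation_bits: 8
  }
}
"""


def ensure_quantization_rewriter(config_text):
    """Append the quantization graph_rewriter block if the config lacks one."""
    if "graph_rewriter" in config_text:
        return config_text  # leave an existing section untouched
    return config_text.rstrip("\n") + "\n" + QUANT_BLOCK


# Illustrative config text; a real pipeline.config would be read from disk.
updated = ensure_quantization_rewriter("model {\n}\n")
```

The plain substring check is deliberately simple; a config that mentions `graph_rewriter` only in a comment would be skipped, so check the result by eye.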

@chrissaher I tried what you suggested but I still get the same error.

Can you please provide the configuration file you are using for training?

@chrissaher https://github.com/tensorflow/models/blob/master/research/object_detection/samples/configs/ssd_mobilenet_v2_quantized_300x300_coco.config

This is the config file I used, with the proper paths set in place of the PATH_TO_BE_CONFIGURED placeholders.

@chrissaher When I downloaded this http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v2_quantized_300x300_coco_2018_09_14.tar.gz model from the model zoo, it doesn't have a checkpoint file in it.

@raj-shah14 I successfully transformed that model to tflite using the following script:
tflite_convert \
--output_file="object_detection/zoo/ssd_mobilenet_v2_quantized_300x300_coco_2018_09_14/model.tflite" \
--graph_def_file="object_detection/zoo/ssd_mobilenet_v2_quantized_300x300_coco_2018_09_14/tflite_graph.pb" \
--inference_type=QUANTIZED_UINT8 \
--input_arrays="normalized_input_image_tensor" \
--output_arrays="TFLite_Detection_PostProcess","TFLite_Detection_PostProcess:1","TFLite_Detection_PostProcess:2","TFLite_Detection_PostProcess:3" \
--mean_values=128 \
--std_dev_values=128 \
--input_shapes=1,300,300,3 \
--change_concat_input_ranges=false \
--allow_nudging_weights_to_use_fast_gemm_kernel=true \
--allow_custom_ops
Please modify the path to your files correctly.
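
As a side note on `--mean_values=128` and `--std_dev_values=128`: as far as I understand the converter's convention, a quantized input value maps to a real value as (quantized - mean) / std_dev. The sketch below just works through that arithmetic; the function name `uint8_to_real` is hypothetical.

```python
def uint8_to_real(q, mean_value=128.0, std_dev_value=128.0):
    """Map a uint8 input value to the real value the model is assumed to see."""
    return (q - mean_value) / std_dev_value


# The ends and midpoint of the uint8 range under mean=128, std_dev=128.
lowest = uint8_to_real(0)
middle = uint8_to_real(128)
highest = uint8_to_real(255)
```

Under that assumption, the uint8 pixel range 0..255 covers roughly [-1, 1), which matches a model trained on inputs normalized to that interval.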

@chrissaher Thanks for your reply. I also was able to do this, but this is post training quantization and affects the accuracy too much. I was trying to do quantization aware training.
It would be great if you could guide me with that.

Hey @raj-shah14 I have the same issue.

@chrissaher adding
graph_rewriter { quantization { delay: 48000 weight_bits: 8 activation_bits: 8 } }
didn't work for me either.

I'm trying to train ssd_mobilenet_v2_quantized_300x300_coco using the legacy train.py and then freeze the checkpoint.
It fails on trying to freeze it.

When I train with --num_clones=1 the freeze succeeds but with --num_clones=4 it fails.

Did anyone solve it?

@oopsodd @NorwayLobster I hit the same issue when training a quantized model and trying to convert it to TF Lite. Do you have any ideas about this?

@achowdhery Yes, the SSD model works fine for me. But when I use the shared architecture (PPN) and train it with quantization, toco reports the following error:
Array WeightSharedConvolutionalBoxPredictor/PredictionTower/conv2d_0/Conv2D, which is an input to the Mul operator producing the output array WeightSharedConvolutionalBoxPredictor/Relu6, is lacking min/max data, which is necessary for quantization.

@doronAtuar I also hit the same problem. I looked into the saved checkpoint and found that something is wrong in it: there are nodes like

clone_1/FeatureExtractor/MobilenetV2/expanded_conv_6/expand/act_quant/clone_1/FeatureExtractor/MobilenetV2/expanded_conv_6/expand/act_quant/max/biased

but it should actually be

FeatureExtractor/MobilenetV2/expanded_conv_6/expand/act_quant/max/biased

I fixed this by rewriting the node names.
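
The commenter did not share how the names were rewritten. The following is one possible sketch, under the assumption that the correct name is whatever follows the last `clone_N/` scope, as in the example above. The helper names are hypothetical, and `rewrite_checkpoint` assumes a TensorFlow version that provides `tensorflow.compat.v1`; it has not been run against the checkpoints discussed in this thread.

```python
import re

# Matches a "clone_N/" scope at the start of a name or after a "/".
_CLONE_SCOPE = re.compile(r"(?:^|/)clone_\d+/")


def strip_clone_scopes(name):
    """Keep only the part of the name after the last clone_N/ scope."""
    return _CLONE_SCOPE.split(name)[-1]


def build_rename_map(names):
    """Map old names to cleaned names, skipping any that would collide."""
    rename, seen = {}, set()
    for old in sorted(names):
        new = strip_clone_scopes(old)
        if new in seen:
            continue  # another variable already claimed this cleaned name
        seen.add(new)
        rename[old] = new
    return rename


def rewrite_checkpoint(src_prefix, dst_prefix):
    """Save a copy of the checkpoint with cleaned variable names (requires TensorFlow)."""
    import tensorflow.compat.v1 as tf  # lazy import; the helpers above need no TF

    reader = tf.train.load_checkpoint(src_prefix)
    rename = build_rename_map(reader.get_variable_to_shape_map().keys())
    with tf.Graph().as_default(), tf.Session() as sess:
        for old, new in rename.items():
            tf.Variable(reader.get_tensor(old), name=new)
        sess.run(tf.global_variables_initializer())
        tf.train.Saver().save(sess, dst_prefix)
```

When two variables clean to the same name, this sketch keeps the first in sorted order and drops the rest, so inspect the rename map before relying on the rewritten checkpoint.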

@chrissaher Thanks for your reply. I also was able to do this, but this is post training quantization and affects the accuracy too much. I was trying to do quantization aware training.
It would be great if you could guide me with that.

Did you find help on quantization-aware training?
I am trying to use ssd_mobilenet_v1_quantized_coco as my pretrained model for training, but I get an error. When I use ssd_mobilenet_v1_coco it works, but the loss does not converge and training is too slow.

I hit the problem when I tried to train MobileNetV3 (quantization-aware). TF version: 1.15.2, Ubuntu 18.04.
image

@doronAtuar I also hit the same problem. I looked into the saved checkpoint and found that something is wrong in it: there are nodes like

clone_1/FeatureExtractor/MobilenetV2/expanded_conv_6/expand/act_quant/clone_1/FeatureExtractor/MobilenetV2/expanded_conv_6/expand/act_quant/max/biased

but it should actually be

FeatureExtractor/MobilenetV2/expanded_conv_6/expand/act_quant/max/biased

I fixed this by rewriting the node names.

I am facing the same problem when using multi training. How did you rewrite the node names?

Thank you
