Models: Evaluation/Finetuning of Resnet 50 in TF 2.X

Created on 19 May 2020 · 24 Comments · Source: tensorflow/models

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [x] I am using the latest TensorFlow Model Garden release and TensorFlow 2.
  • [x] I am reporting the issue to the correct repository. (Model Garden official or research directory)
  • [x] I checked to make sure that this issue has not been filed already.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/tree/master/official/vision/image_classification/resnet
For pretrained checkpoints, I used the ones linked in the README (https://github.com/tensorflow/models/tree/master/official/vision/image_classification/resnet#pretrained-models)

2. Describe the bug

1) I'm trying to evaluate and finetune the Resnet 50 model available at the URL mentioned above. However, I get near-zero accuracy when I evaluate. I would like to know how to evaluate and finetune using the existing RN50 checkpoint.

The command I use for evaluating the existing model:
python3 resnet_ctl_imagenet_main.py --model_dir=checkpoints/ --num_gpus=1 --batch_size=32 --train_epochs=1 --train_steps=1 --use_synthetic_data=false --data_dir imagenet_tfr_data/

The model_dir is set to the checkpoints directory, which contains the downloaded checkpoint (from the README link). The checkpoint manager picks up this checkpoint, but the weights do not seem to load: I get many unresolved-object warnings where the layer names mismatch.

W0518 15:35:57.377599 139869265008448 util.py:144] Unresolved object in checkpoint: (root).layer_with_weights-4.kernel
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer_with_weights-5.axis

2) Currently I can only get this resnet model running on TensorFlow 2.2. There were multiple errors with 2.0 and 2.1. One such error comes from the line from tensorflow.python.keras.layers.preprocessing import image_preprocessing as image_ops, which fails with ImportError: cannot import name 'image_preprocessing'.
This may not be relevant to the actual issue, but for me TF 2.0 and TF 2.1 give import and attribute-not-found errors, which drove me to try TF 2.2.

3) Evaluation works when I do the following

In resnet_runnable.py, I use the Keras way of loading the checkpoint:
self.model.load_weights(flags_obj.pretrained_filepath)

This probably loads the checkpoint according to network topology rather than by names (as used by tf.train.CheckpointManager). Is this the correct way of loading?
I disable training manually and run self._evaluate_once(current_step) to get 76.476. I just wanted to confirm whether this is the same accuracy that you obtained.

The questions are

  • Is there a plan to add a standalone eval script to this repo?
  • If the way of evaluation described in 3) is recommended, can it be added to the repo, and can the documentation be updated with the eval/finetuning steps?
    I would be happy to make a PR if required :)

Thank you !!

@saberkun Can you please provide some details here ? Thank you

Hi @peri044, the evaluation accuracy is reasonable. Sorry, I do not have a record of the exact evaluation metric.
In terms of checkpoint loading, the problem in 1) arises because the released checkpoint was saved by model.save_weights(). A checkpoint written by model.save_weights() is not compatible with the training checkpoint used by the checkpoint manager, which is tf.train.Checkpoint(model=xxx): the root object is different. You should use model.load_weights() for this checkpoint.
The resnet runnable CTL implementation is probably a one-off. Would you please use https://github.com/tensorflow/models/blob/master/official/vision/image_classification/classifier_trainer.py#L193 for most usages? Thanks
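
The root-object mismatch described above can be pictured with a toy model. The sketch below uses nested dicts in place of TensorFlow's object graph and is not TensorFlow's actual restore algorithm; the two-layer model and the flatten helper are hypothetical, and only the path style mimics the warnings quoted in the issue.

```python
# Toy illustration only: nested dicts stand in for TensorFlow's object graph.
# This is NOT how TensorFlow implements checkpoint restoration.

def flatten(tree, prefix="(root)"):
    """Return the set of dotted paths to every leaf in a nested dict."""
    paths = set()
    for name, child in tree.items():
        path = f"{prefix}.{name}"
        if isinstance(child, dict):
            paths |= flatten(child, path)
        else:
            paths.add(path)
    return paths

# Hypothetical two-layer model; layer names mimic those in the warnings.
model = {
    "layer_with_weights-0": {"kernel": "k0", "bias": "b0"},
    "layer_with_weights-1": {"kernel": "k1"},
}

# model.save_weights(): the model itself is the root of the saved graph.
saved_by_save_weights = flatten(model)

# tf.train.Checkpoint(model=model): the model hangs off a "model" attribute.
expected_by_training_checkpoint = flatten({"model": model})

# Every saved path lacks the "model" hop, so nothing lines up.
unresolved = sorted(saved_by_save_weights - expected_by_training_checkpoint)
for path in unresolved:
    print("Unresolved object in checkpoint:", path)
```

In this toy, none of the saved paths match, which parallels why model.load_weights() (same root as model.save_weights()) is the suggested way to read the released checkpoint.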

Hello @saberkun, Thanks for getting back and providing some info on the checkpoint loading.
I did try the classifier trainer implementation, but I'm running into issues with it. Here is the error I get when I try to finetune the network. I'm using TF 2.2. Can you please provide some help or suggestions on this issue?
Command used:
python3 classifier_trainer.py --mode=train_and_eval --model_type=resnet --dataset=imagenet --model_dir=resnet/checkpoints/ --data_dir=/mnt/cdr/ImageNet/train-val-tfrecord/ --config_file=configs/examples/resnet/imagenet/gpu.yaml
resnet/checkpoints is the directory that contains the pre-trained checkpoints.
Error stack trace (shortened a bit for easy viewing):

File "classifier_trainer.py", line 220, in resume_from_checkpoint
model.load_weights(latest_checkpoint)
File "/home/dperi/Downloads/py3/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 250, in load_weights
return super(Model, self).load_weights(filepath, by_name, skip_mismatch)
File "/home/dperi/Downloads/py3/lib/python3.6/site-packages/tensorflow/python/keras/engine/network.py", line 1237, in load_weights
status = self._trackable_saver.restore(filepath)
File "/home/dperi/Downloads/py3/lib/python3.6/site-packages/tensorflow/python/training/tracking/util.py", line 1304, in restore
checkpoint=checkpoint, proto_id=0).restore(self._graph_view.root)
File "/home/dperi/Downloads/py3/lib/python3.6/site-packages/tensorflow/python/training/tracking/base.py", line 209, in restore
restore_ops = trackable._restore_from_checkpoint_position(self) # pylint: disable=protected-access
File "/home/dperi/Downloads/py3/lib/python3.6/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py", line 668, in __getattribute__
return super(OptimizerV2, self).__getattribute__(name)
AttributeError: 'LossScaleOptimizer' object has no attribute '_slots'

Hello @saberkun, I was able to get past the above error by setting the datatype to tf.float32 instead of the default tf.float16; tf.float16 was using the mixed_precision loss scale optimizer.
However, I'm now facing a new error:

ValueError: Failed to find data adapter that can handle input: ,

I'm using a single GPU and tfrecords dataset for my finetuning experiment.
Command used:
python3 classifier_trainer.py --mode=train_and_eval --model_type=resnet --dataset=imagenet --model_dir=resnet/checkpoints/ --data_dir=/mnt/cdr/ImageNet/train-val-tfrecord/ --config_file=configs/examples/resnet/imagenet/gpu.yaml --params_override='runtime.num_gpus=1'

The error occurs at the model.fit() call. Upon digging further, the issue happens here: https://github.com/tensorflow/tensorflow/blob/r2.2/tensorflow/python/keras/engine/data_adapter.py#L683
Any suggestions to resolve this? Thank you

I know for sure that handling for tensorflow.python.distribute.input_lib.DistributedDatasetsFromFunction was added in tf-nightly.
@omalleyt12 Is it added for TF 2.2?

Hi @peri044, classifier_trainer.py is not included in the Model Garden 2.2 release. It was added after that release and targets the upcoming 2.3 release.

@peri044 Thanks for the issue!

Unfortunately it looks like that support didn't make it into 2.2. At head we handle distributed datasets: code

In TF 2.2, we expect model.fit to be passed a non-distributed dataset, and then we call tf.distribute.Strategy.distribute_dataset on it.

Hi @peri044, please use tf-nightly with the master head, or use TF 2.2 with the 2.2 release tag. Thanks

Hello @saberkun @omalleyt12 Thanks for the info. I'm using tf-nightly-gpu (version: 2.2.0-dev20200506) and this resolves the error.

However, the evaluation now results in near-zero accuracy:

{'accuracy_top_1': 0.0033410692121833563, 'eval_loss': 8.260212898254395, 'step_timestamp_log': [], 'train_finish_time': 1590609771.049209

The pre-trained checkpoint seems to load fine:

seed2 arg is deprecated.Use sample_distorted_bounding_box_v2 instead.
I0527 13:02:47.665966 140359968945984 dataset_factory.py:354] Using TFRecords to load data.
I0527 13:02:48.020327 140359968945984 dataset_info.py:430] Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: imagenet2012/5.0.0
I0527 13:02:48.199653 140359968945984 dataset_info.py:361] Load dataset info from /tmp/tmpvkch3y2atfds
I0527 13:02:48.207713 140359968945984 dataset_info.py:401] Field info.description from disk and from code do not match. Keeping the one from code.
I0527 13:02:48.207938 140359968945984 dataset_info.py:401] Field info.citation from disk and from code do not match. Keeping the one from code.
I0527 13:02:48.370851 140359968945984 classifier_trainer.py:328] Global batch size: 64
I0527 13:02:49.645716 140359968945984 optimizer_factory.py:371] Using Piecewise constant decay with warmup. Parameters: batch_size: 64, epoch_size: 1281167, warmup_epochs: 5
, boundaries: [30, 60, 80], multipliers: [1.0, 0.1, 0.01, 0.001]
I0527 13:02:49.645902 140359968945984 optimizer_factory.py:273] Building momentum optimizer with params {'name': 'momentum', 'decay': 0.9, 'epsilon': 0.001, 'momentum': 0.9,
'nesterov': None, 'moving_average_decay': None, 'lookahead': None, 'beta_1': None, 'beta_2': None}
I0527 13:02:49.645954 140359968945984 optimizer_factory.py:281] Using momentum optimizer
I0527 13:02:49.676531 140359968945984 classifier_trainer.py:210] Load from checkpoint is enabled.
I0527 13:02:49.677724 140359968945984 classifier_trainer.py:212] latest_checkpoint: resnet/checkpoints/model.ckpt-0090
I0527 13:02:49.677812 140359968945984 classifier_trainer.py:218] Checkpoint file resnet/checkpoints/model.ckpt-0090 found and restoring from checkpoint
I0527 13:02:50.907355 140359968945984 classifier_trainer.py:221] Completed loading from checkpoint.
I0527 13:02:50.907483 140359968945984 classifier_trainer.py:222] Resuming from epoch 225180

Can you please shed some light on what might be happening here? I'm thinking of sticking with classifier_trainer for my use cases (as per your recommendation) and would like to evaluate with it (instead of the codebase in resnet/ mentioned in the main description of this issue). Thank you!

@allenwang28 Can you check this issue?

The current implementation of resume_from_checkpoint (here) loads the checkpoint weights and updates the optimizer iteration. This code path assumes the checkpoint was created by this script (for instance, if you cancel your training job and want to resume it later).

Since you're loading a checkpoint and fine-tuning, I would suggest disabling resume_from_checkpoint (e.g. --params_override='runtime.num_gpus=1,train.resume_checkpoint=False') and inserting a model.load_weights call with the provided path here. This way, optimizer.iterations starts at 0.
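
One possible reading of the "Resuming from epoch 225180" log line is that the printed value is the restored optimizer step count rather than an epoch number. The arithmetic below checks that reading. It rests on two assumptions that are not stated in the thread: that the released checkpoint was trained for 90 epochs (suggested only by the file name model.ckpt-0090) and that it used a global batch size of 512. The epoch size and the fine-tuning batch size of 64 come from the log.

```python
# Values taken from the log in this thread.
EPOCH_SIZE = 1281167          # epoch_size reported by optimizer_factory.py
LOGGED_VALUE = 225180         # printed after "Resuming from epoch"
FINETUNE_BATCH_SIZE = 64      # "Global batch size: 64"

# Assumptions, NOT stated in the thread.
ASSUMED_PRETRAIN_EPOCHS = 90        # guessed from the name model.ckpt-0090
ASSUMED_PRETRAIN_BATCH_SIZE = 512   # guessed global batch size

# Steps per epoch during the assumed pre-training run.
pretrain_steps_per_epoch = EPOCH_SIZE // ASSUMED_PRETRAIN_BATCH_SIZE
print(pretrain_steps_per_epoch)                            # 2502
print(pretrain_steps_per_epoch * ASSUMED_PRETRAIN_EPOCHS)  # 225180

# If that step count is restored into a run with batch size 64, the
# step-based schedule would start this many epochs in, not at epoch 0.
finetune_steps_per_epoch = EPOCH_SIZE // FINETUNE_BATCH_SIZE
position_in_epochs = LOGGED_VALUE / finetune_steps_per_epoch
print(round(position_in_epochs, 2))                        # 11.25
```

Under these assumptions the numbers match exactly, which is consistent with the advice to load only the weights so that optimizer.iterations starts at 0.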

Furthermore, some other probable differences from the checkpoint you're using are:

  • classifier_trainer mean-subtracts and standardizes the data, whereas old versions only mean-subtract. You can turn off standardization by setting standardize=False here and here.
  • I'm not sure if that checkpoint uses label smoothing, but you might try turning it off by setting this to 0.0.
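
To see why the standardization difference matters, the sketch below applies both preprocessing styles to a single pixel. The per-channel mean and standard deviation are the commonly quoted ImageNet statistics scaled to 0-255; that the Model Garden uses exactly these values is an assumption, and both helper functions are hypothetical names for illustration.

```python
# Assumed ImageNet channel statistics on the 0-255 scale (not taken from the thread).
MEAN_RGB = (0.485 * 255, 0.456 * 255, 0.406 * 255)
STDDEV_RGB = (0.229 * 255, 0.224 * 255, 0.225 * 255)

def mean_subtract(pixel):
    """Older style: subtract the channel mean only."""
    return tuple(v - m for v, m in zip(pixel, MEAN_RGB))

def mean_subtract_and_standardize(pixel):
    """classifier_trainer style: subtract the mean, then divide by the stddev."""
    return tuple((v - m) / s for v, m, s in zip(pixel, MEAN_RGB, STDDEV_RGB))

pixel = (255.0, 255.0, 255.0)  # a pure white RGB pixel

old_style = mean_subtract(pixel)
new_style = mean_subtract_and_standardize(pixel)

print("mean-subtracted only:       ", old_style)
print("mean-subtracted + stddev:   ", new_style)

# The two inputs differ by the per-channel stddev, i.e. a factor of roughly 57.
scale = tuple(o / n for o, n in zip(old_style, new_style))
print("ratio between the two:      ", scale)
```

A network trained on one input scale and evaluated on the other sees inputs that are off by more than an order of magnitude, which fits the near-zero accuracy reported above.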

@allenwang28 Thanks a lot for that information. standardize is the main reason for the zero accuracy in https://github.com/tensorflow/models/issues/8530#issuecomment-634915764. I was trying to debug why the implementation in resnet/ works with the load_weights() call while the classifier_trainer implementation does not, and your comment resolved it. It would be great if this detail could be added to the README so that the checkpoint can be used with both implementations.

Three months after the date of the post, this problem still persists.
Maintainers, please help.

from tensorflow.python.keras.layers.preprocessing import image_preprocessing as image_ops. ImportError: cannot import name 'image_preprocessing'

I still get this error with TensorFlow 2.1.0 and Python 3.6.
Those of us who are using Windows are not yet lucky enough to get TensorFlow 2.2.0.

Here is the full error

When I execute the following

python model_main_tf2.py --alsologtostderr --model_dir=$out_dir --checkpoint_every_n=500  \
                         --pipeline_config_path=../ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8.config \
                         --eval_on_train_data 2>&1 | tee $out_dir/train.log

I get

2020-08-06 19:05:15.707941: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
Traceback (most recent call last):
  File "model_main_tf2.py", line 31, in <module>
    from object_detection import model_lib_v2
  File "C:\Users\user\AppData\Roaming\Python\Python36\site-packages\object_detection\model_lib_v2.py", line 29, in <module>
    from object_detection import inputs
  File "C:\Users\user\AppData\Roaming\Python\Python36\site-packages\object_detection\inputs.py", line 26, in <module>
    from object_detection.builders import model_builder
  File "C:\Users\user\AppData\Roaming\Python\Python36\site-packages\object_detection\builders\model_builder.py", line 65, in <module>
    from object_detection.models import ssd_efficientnet_bifpn_feature_extractor as ssd_efficientnet_bifpn
  File "C:\Users\user\AppData\Roaming\Python\Python36\site-packages\object_detection\models\ssd_efficientnet_bifpn_feature_extractor.py", line 33, in <module>
    from official.vision.image_classification.efficientnet import efficientnet_model
  File "C:\Users\user\AppData\Roaming\Python\Python36\site-packages\official\vision\image_classification\efficientnet\efficientnet_model.py", line 37, in <module>

    from official.vision.image_classification import preprocessing
  File "C:\Users\user1\AppData\Roaming\Python\Python36\site-packages\official\vision\image_classification\preprocessing.py", line 25, in <module>
    from official.vision.image_classification import augment
  File "C:\Users\user1\AppData\Roaming\Python\Python36\site-packages\official\vision\image_classification\augment.py", line 30, in <module>
    from tensorflow.python.keras.layers.preprocessing import image_preprocessing as image_ops
ImportError: cannot import name 'image_preprocessing'

What am I doing wrong ?

Hmm, your issue is different. That import of image_ops is unfortunately only available with TF 2.2+ and is required for autoaugment. I would suggest using the r2.1.0 branch of the code, but that doesn't contain EfficientNet.

The best solutions would be:

  1. Upgrade to TF 2.2/2.3 (but I understand that's difficult)
  2. Use the r2.2.0 branch and comment out from tensorflow.python.keras.layers.preprocessing import image_preprocessing as image_ops within augment.py. Do note that you won't be able to use autoaugment.

Hi allenwang28, thank you for shedding some light on this issue. It has been frustrating.

Unfortunately, I cannot upgrade to TF2.2 because it is not available on Windows 10 yet ;(

I do not actually care about EfficientNet; I want to try SSD and MaskRCNN. I will give your hints a go! Thank you very much!

No worries! If you have any further issues feel free to open an issue and assign it to me. Good luck!

Sorry allenwang28, just to check if I understood your comment correctly. Did you mean that I should pip install tensorflow==2.1.0rc2 and try that one instead?

No, this might be better:

pip install tensorflow==2.1.0
git clone -b r2.2.0 https://github.com/tensorflow/models.git

then comment or delete this line: https://github.com/tensorflow/models/blob/r2.2.0/official/vision/image_classification/augment.py#L30
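
As an alternative to deleting the line, the import could be guarded so that the same file loads on both older and newer TensorFlow versions. The following is a minimal sketch and not code from the repository; AUTOAUGMENT_AVAILABLE and require_autoaugment are hypothetical names introduced for illustration.

```python
try:
    # Private TensorFlow module; per this thread it is importable only on TF 2.2+.
    from tensorflow.python.keras.layers.preprocessing import (
        image_preprocessing as image_ops,
    )
except ImportError:
    # Older TensorFlow (or no TensorFlow at all): keep the module importable.
    image_ops = None

# Hypothetical flag that callers can check before using autoaugment.
AUTOAUGMENT_AVAILABLE = image_ops is not None


def require_autoaugment():
    """Return image_ops, or fail with a clear message when it is unavailable."""
    if image_ops is None:
        raise RuntimeError(
            "autoaugment needs image_preprocessing, which this TensorFlow "
            "version does not provide"
        )
    return image_ops


print("autoaugment available:", AUTOAUGMENT_AVAILABLE)
```

With this guard the error surfaces only when autoaugment is actually requested, rather than when augment.py is imported.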

@allenwang28 Can you please take a quick look at this Efficientnet pretrained model question and let me know your thoughts ? Thanks !!

Hi @peri044, I am facing a similar issue to the one you had initially: I am getting an evaluation accuracy of around 0.001 with the pre-trained resnet50 checkpoint. I have two questions.
1) On which line did you add self.model.load_weights(flags_obj.model_dir) in resnet_runnable.py to make it work?
2) For classifier_trainer.py, could you share the script you run for the evaluation? For some reason, my latest checkpoint isn't being detected by TF. Here is my script.

python classifier_trainer.py \
--mode=train_and_eval \
--model_type=resnet \
--dataset=imagenet \
--model_dir=$CHECKPOINT_DIR \
--data_dir=$DATA_DIR \
--config_file=configs/examples/resnet/imagenet/gpu.yaml \
--params_override='runtime.num_gpus=0, train_dataset.builder=records, validation_dataset.builder=records, train_dataset.batch_size=128, validation_dataset.batch_size=128, train_dataset.dtype=float32, train.resume_checkpoint=False, train.steps=1, train.epochs=1'

@allenwang28 Thanks for this, I could finally train on the TPU and export as tflite.

@ashiqimranintel
1) I added the model.load_weights call immediately after defining the model: https://github.com/tensorflow/models/blob/master/official/vision/image_classification/resnet/resnet_runnable.py#L64

2) I used the Resnet CTL implementation for training and evaluation: https://github.com/tensorflow/models/tree/master/official/vision/image_classification/resnet#resnet-custom-training-loop
For evaluation using this implementation, I call evaluate manually just before training starts (with the pre-trained weights loaded).

For the non-CTL classifier_trainer.py implementation, I believe the workflow would be the same:
define the model, load the weights, and call model.evaluate before model.fit. I'm not sure why your checkpoint is not being detected. Try disabling this and see if it works manually.

Thanks @peri044, it worked for me. I tried with CTL implementation.

@peri044, I tried with classifier_trainer.py and am getting the following error:
ValueError("Shapes %s and %s are incompatible" % (self, other)) ValueError: Shapes (1000,) and (1001,) are incompatible

Were you able to use classifier_trainer.py for the evaluation?

@ashiqimranintel you can fix that issue in classifier_trainer.py by adding a line here:

  model_params:
    num_classes: 1001
    rescale_inputs: False