Mmdetection: How to get the loss on validation set after each epoch?

Created on 16 Apr 2019  ·  9 Comments  ·  Source: open-mmlab/mmdetection

Is there any way to get the loss on the validation set after each epoch during training,
rather than only running COCO-style evaluation on the validation set?
Thanks for any ideas!

All 9 comments

Take 'Faster-R50' as an example. You can change the workflow in this line to
workflow = [('train', 1), ('val', 1)]
https://github.com/open-mmlab/mmdetection/blob/900968046507c57d25d0d965474937ed4a16775e/configs/faster_rcnn_r50_fpn_1x.py#L156
You also need to modify data_loaders in this line: append your valset_loader to the end of data_loaders.
https://github.com/open-mmlab/mmdetection/blob/900968046507c57d25d0d965474937ed4a16775e/mmdet/apis/train.py#L64
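
To illustrate why the validation loader has to be appended, here is a minimal sketch of how a workflow list and a data_loaders list could be paired up, one loader per workflow entry. It is a simplified stand-in written for this thread, not mmcv's actual runner code: the function name run_workflow and the placeholder loaders are assumptions, and the loader sizes (772 and 218) are only borrowed from the log further down.

```python
def run_workflow(data_loaders, workflow, max_epochs):
    """Simplified stand-in for a runner loop (hypothetical, not mmcv code).

    Returns the order in which (mode, epoch, num_iters) would be executed.
    """
    # One loader per workflow entry, so a ('val', 1) entry needs its own loader.
    assert len(data_loaders) == len(workflow)

    schedule = []
    epoch = 0
    while epoch < max_epochs:
        for (mode, epochs), loader in zip(workflow, data_loaders):
            for _ in range(epochs):
                if mode == 'train':
                    if epoch >= max_epochs:
                        return schedule
                    epoch += 1  # only train epochs advance the epoch counter
                schedule.append((mode, epoch, len(loader)))
    return schedule


# Placeholder "loaders": any sized iterable works for this sketch.
train_loader = range(772)
val_loader = range(218)

workflow = [('train', 1), ('val', 1)]
data_loaders = [train_loader, val_loader]  # val loader appended at the end

schedule = run_workflow(data_loaders, workflow, max_epochs=2)
for step in schedule:
    print(step)
```

In this sketch each 'val' entry runs only after the preceding 'train' entry has finished a whole epoch, which is the behaviour discussed later in this thread.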

@yhcao6 Thanks! That solved the problem. Another question: can I get this loss at a given iteration interval (such as every 200 iterations, rather than once per epoch of, say, 1000 iterations)? I need more points to plot the loss curves for both the training and validation sets.

You need to modify mmcv. To log every n iterations during a val epoch, add an after_val_iter method, similar to after_train_iter, that logs every n iterations:
https://github.com/open-mmlab/mmcv/blob/master/mmcv/runner/hooks/logger/base.py
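
As a rough sketch of that idea, the toy class below logs on a fixed interval in both train and val iterations. It is self-contained and does not use mmcv: the class name IntervalLoggerSketch and the simplified method signatures (taking inner_iter and loss directly instead of a runner object) are assumptions made for illustration only.

```python
class IntervalLoggerSketch:
    """Toy stand-in for a logger hook (hypothetical, not mmcv's LoggerHook)."""

    def __init__(self, interval):
        self.interval = interval
        self.records = []  # collected (mode, iteration, loss) tuples

    def _maybe_log(self, mode, inner_iter, loss):
        # inner_iter is 0-based, so log when (inner_iter + 1) hits the interval.
        if (inner_iter + 1) % self.interval == 0:
            self.records.append((mode, inner_iter + 1, loss))

    def after_train_iter(self, inner_iter, loss):
        self._maybe_log('train', inner_iter, loss)

    def after_val_iter(self, inner_iter, loss):
        # Same interval check as in training, applied to val iterations.
        self._maybe_log('val', inner_iter, loss)


hook = IntervalLoggerSketch(interval=100)
for i in range(218):  # pretend the val set has 218 iterations
    hook.after_val_iter(i, loss=0.0)

logged_iters = [iteration for _, iteration, _ in hook.records]
print(logged_iters)
```

A real hook would also need to average the buffered losses over the interval and reset the buffer afterwards; that part is left out here.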

I tried making some modifications in base.py, but they didn't work. The 'val epoch' always runs after the 'train epoch', as in the following log output, and I can't make them alternate within the same epoch.

2019-04-17 18:50:43,930 - INFO - Epoch [1][600/772] lr: 0.00500, eta: 0:01:34, time: 0.569, data_time: 0.004, memory: 5116, loss_rpn_cls: 0.0190, loss_rpn_reg: 0.0082, s0.loss_cls: 0.0461, s0.acc: 98.0684, s0.loss_reg: 0.0441, s0.loss_mask: 0.2014, s1.loss_cls: 0.0256, s1.acc: 97.9508, s1.loss_reg: 0.0803, s1.loss_mask: 0.1152, s2.loss_cls: 0.0169, s2.acc: 97.0288, s2.loss_reg: 0.0633, s2.loss_mask: 0.0582, loss: 0.6783
2019-04-17 18:51:41,695 - INFO - Epoch [1][700/772] lr: 0.00500, eta: 0:00:39, time: 0.578, data_time: 0.004, memory: 5116, loss_rpn_cls: 0.0165, loss_rpn_reg: 0.0075, s0.loss_cls: 0.0470, s0.acc: 98.0078, s0.loss_reg: 0.0465, s0.loss_mask: 0.1982, s1.loss_cls: 0.0268, s1.acc: 97.7954, s1.loss_reg: 0.0817, s1.loss_mask: 0.1110, s2.loss_cls: 0.0168, s2.acc: 97.2259, s2.loss_reg: 0.0688, s2.loss_mask: 0.0556, loss: 0.6763
2019-04-17 18:52:53,718 - INFO - Epoch(val) [1][100] eta: 0:00:00, time: 0.204, data_time: 0.005, memory: 5217, loss_rpn_cls: 0.0201, loss_rpn_reg: 0.0082, s0.loss_cls: 0.0562, s0.acc: 97.6641, s0.loss_reg: 0.0473, s0.loss_mask: 0.2076, s1.loss_cls: 0.0334, s1.acc: 97.2258, s1.loss_reg: 0.0849, s1.loss_mask: 0.1132, s2.loss_cls: 0.0187, s2.acc: 96.7638, s2.loss_reg: 0.0774, s2.loss_mask: 0.0572, loss: 0.7242
2019-04-17 18:53:13,847 - INFO - Epoch(val) [1][200] eta: 0:00:00, time: 0.201, data_time: 0.004, memory: 5217, loss_rpn_cls: 0.0164, loss_rpn_reg: 0.0073, s0.loss_cls: 0.0591, s0.acc: 97.5947, s0.loss_reg: 0.0493, s0.loss_mask: 0.2116, s1.loss_cls: 0.0347, s1.acc: 97.3185, s1.loss_reg: 0.0888, s1.loss_mask: 0.1161, s2.loss_cls: 0.0195, s2.acc: 96.7061, s2.loss_reg: 0.0779, s2.loss_mask: 0.0584, loss: 0.7390
2019-04-17 18:53:17,392 - INFO - Epoch(val) [1][218] eta: 0:00:00, time: 0.202, data_time: 0.004, memory: 5217, loss_rpn_cls: 0.0179, loss_rpn_reg: 0.0076, s0.loss_cls: 0.0568, s0.acc: 97.6536, s0.loss_reg: 0.0475, s0.loss_mask: 0.2110, s1.loss_cls: 0.0337, s1.acc: 97.3061, s1.loss_reg: 0.0856, s1.loss_mask: 0.1156, s2.loss_cls: 0.0190, s2.acc: 96.7671, s2.loss_reg: 0.0766, s2.loss_mask: 0.0583, loss: 0.7294

Then I found the run method in https://github.com/open-mmlab/mmcv/blob/master/mmcv/runner/runner.py. It seems that the train epoch and the val epoch each have to run for at least one full epoch, in the order given by cfg.workflow. Am I missing any important details?

Sorry, I misunderstood your point. You may need to insert some code into the train method at this line in "runner.py":

https://github.com/open-mmlab/mmcv/blob/506455af0bb4edf6b13cec5c7d171f26c9531d52/mmcv/runner/runner.py#L269

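Maybe something like this (pseudocode):

if (i + 1) % interval == 0:
    self.eval()  # run validation here
# then restore the train epoch state before continuing (restore_train_epoch_state)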
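
The pseudocode above could be fleshed out roughly as follows. This is a minimal, framework-free sketch: the function train_epoch_with_interleaved_val and the run_val callback are hypothetical names, and the actual training step and state restoration are only marked by comments.

```python
def train_epoch_with_interleaved_val(num_train_iters, interval, run_val):
    """Run one train epoch, calling run_val() every `interval` iterations.

    Hypothetical sketch: returns the sequence of events instead of training.
    """
    events = []
    for i in range(num_train_iters):
        # ... the real training step (forward, backward, optimizer) goes here ...
        events.append(('train', i + 1))

        if (i + 1) % interval == 0:
            val_loss = run_val()  # compute the loss over the validation set
            events.append(('val', i + 1, val_loss))
            # Restore the training state here before the next iteration,
            # e.g. switch the model back to train mode and reset any
            # counters that the validation pass changed.
    return events


# Dummy validation callback that always reports the same loss.
events = train_epoch_with_interleaved_val(
    num_train_iters=5, interval=2, run_val=lambda: 0.5)
val_events = [e for e in events if e[0] == 'val']
print(val_events)
```

With interval=200 and an epoch of about 1000 iterations, this pattern would give several validation points per epoch, at the cost of one full validation pass each time.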

Thank you for helping! Now the modification is working properly.

@yhcao6 You mentioned here (https://github.com/open-mmlab/mmdetection/issues/505#issuecomment-483954471) that I have to add a val_loader in _non_dist_train, but I do not understand how to get it inside this function.

Could you please clarify it?

@yhcao6
I don't understand exactly what you mean.
Do you mean I have to create and import a new Python file in 'mmdet/datasets/loader', or add a method to 'mmdet/datasets/loader/build_loader.py'?

@forestriveral
How did you solve the problem of getting validation loss?

Running tools/dist_train.sh with --validate will get you on the way.

However, I don't know how 'workflow' in the config file relates to --validate on the command line.
