Detectron2: Learning Rate is increasing instead of annealing

Created on 24 Oct 2019 · 5 Comments · Source: facebookresearch/detectron2

โ“ Questions and Help

Hey! Amazing work. I struggled a couple of times with tuning the previous version of Detectron; this new version works really well and makes it possible to tune a model with minimal time and effort.

My question: why is the learning rate increasing instead of annealing step by step?
The training log is shown below:

[10/24 10:09:06 d2.engine.train_loop]: Starting training from iteration 0
[10/24 10:09:18 d2.utils.events]: eta: 0:18:36  iter: 19  total_loss: 1.902  loss_cls: 1.429  loss_box_reg: 0.487  time: 0.5684  data_time: 0.0405  lr: 0.000002  max_mem: 10456M
[10/24 10:09:29 d2.utils.events]: eta: 0:18:28  iter: 39  total_loss: 1.930  loss_cls: 1.435  loss_box_reg: 0.491  time: 0.5692  data_time: 0.0068  lr: 0.000004  max_mem: 10456M
[10/24 10:09:41 d2.utils.events]: eta: 0:18:17  iter: 59  total_loss: 1.896  loss_cls: 1.399  loss_box_reg: 0.500  time: 0.5695  data_time: 0.0078  lr: 0.000006  max_mem: 10456M
[10/24 10:09:52 d2.utils.events]: eta: 0:18:06  iter: 79  total_loss: 1.918  loss_cls: 1.416  loss_box_reg: 0.494  time: 0.5697  data_time: 0.0060  lr: 0.000008  max_mem: 10456M
[10/24 10:10:04 d2.utils.events]: eta: 0:17:55  iter: 99  total_loss: 1.888  loss_cls: 1.392  loss_box_reg: 0.496  time: 0.5695  data_time: 0.0071  lr: 0.000010  max_mem: 10456M
[10/24 10:10:15 d2.utils.events]: eta: 0:17:44  iter: 119  total_loss: 1.876  loss_cls: 1.382  loss_box_reg: 0.493  time: 0.5699  data_time: 0.0072  lr: 0.000012  max_mem: 10456M
[10/24 10:10:27 d2.utils.events]: eta: 0:17:34  iter: 139  total_loss: 1.894  loss_cls: 1.398  loss_box_reg: 0.485  time: 0.5705  data_time: 0.0064  lr: 0.000014  max_mem: 10456M
[10/24 10:10:38 d2.utils.events]: eta: 0:17:23  iter: 159  total_loss: 1.939  loss_cls: 1.448  loss_box_reg: 0.488  time: 0.5702  data_time: 0.0068  lr: 0.000016  max_mem: 10458M
[10/24 10:10:49 d2.utils.events]: eta: 0:17:12  iter: 179  total_loss: 1.910  loss_cls: 1.422  loss_box_reg: 0.502  time: 0.5701  data_time: 0.0068  lr: 0.000018  max_mem: 10458M
[10/24 10:11:01 d2.utils.events]: eta: 0:17:02  iter: 199  total_loss: 1.918  loss_cls: 1.435  loss_box_reg: 0.489  time: 0.5711  data_time: 0.0069  lr: 0.000020  max_mem: 10458M
[10/24 10:11:13 d2.utils.events]: eta: 0:16:51  iter: 219  total_loss: 1.922  loss_cls: 1.442  loss_box_reg: 0.478  time: 0.5716  data_time: 0.0069  lr: 0.000022  max_mem: 10458M
[10/24 10:11:24 d2.utils.events]: eta: 0:16:39  iter: 239  total_loss: 1.935  loss_cls: 1.441  loss_box_reg: 0.494  time: 0.5722  data_time: 0.0069  lr: 0.000024  max_mem: 10458M
[10/24 10:11:36 d2.utils.events]: eta: 0:16:28  iter: 259  total_loss: 1.924  loss_cls: 1.430  loss_box_reg: 0.495  time: 0.5722  data_time: 0.0106  lr: 0.000026  max_mem: 10458M
[10/24 10:11:47 d2.utils.events]: eta: 0:16:18  iter: 279  total_loss: 1.903  loss_cls: 1.400  loss_box_reg: 0.497  time: 0.5728  data_time: 0.0068  lr: 0.000028  max_mem: 10458M
[10/24 10:11:59 d2.utils.events]: eta: 0:16:07  iter: 299  total_loss: 1.929  loss_cls: 1.437  loss_box_reg: 0.486  time: 0.5733  data_time: 0.0068  lr: 0.000030  max_mem: 10458M
[10/24 10:12:10 d2.utils.events]: eta: 0:15:56  iter: 319  total_loss: 1.956  loss_cls: 1.467  loss_box_reg: 0.479  time: 0.5731  data_time: 0.0069  lr: 0.000032  max_mem: 10458M
[10/24 10:12:22 d2.utils.events]: eta: 0:15:44  iter: 339  total_loss: 1.910  loss_cls: 1.429  loss_box_reg: 0.491  time: 0.5736  data_time: 0.0068  lr: 0.000034  max_mem: 10458M
[10/24 10:12:33 d2.utils.events]: eta: 0:15:33  iter: 359  total_loss: 1.904  loss_cls: 1.409  loss_box_reg: 0.483  time: 0.5734  data_time: 0.0068  lr: 0.000036  max_mem: 10458M
[10/24 10:12:45 d2.utils.events]: eta: 0:15:22  iter: 379  total_loss: 1.951  loss_cls: 1.463  loss_box_reg: 0.488  time: 0.5735  data_time: 0.0067  lr: 0.000038  max_mem: 10458M
[10/24 10:12:56 d2.utils.events]: eta: 0:15:11  iter: 399  total_loss: 1.918  loss_cls: 1.423  loss_box_reg: 0.484  time: 0.5739  data_time: 0.0067  lr: 0.000040  max_mem: 10458M
[10/24 10:13:08 d2.utils.events]: eta: 0:15:00  iter: 419  total_loss: 1.881  loss_cls: 1.418  loss_box_reg: 0.490  time: 0.5743  data_time: 0.0067  lr: 0.000042  max_mem: 10458M
[10/24 10:13:20 d2.utils.events]: eta: 0:14:49  iter: 439  total_loss: 1.878  loss_cls: 1.404  loss_box_reg: 0.486  time: 0.5747  data_time: 0.0067  lr: 0.000044  max_mem: 10458M
[10/24 10:13:31 d2.utils.events]: eta: 0:14:37  iter: 459  total_loss: 1.890  loss_cls: 1.393  loss_box_reg: 0.489  time: 0.5749  data_time: 0.0069  lr: 0.000046  max_mem: 10458M
[10/24 10:13:43 d2.utils.events]: eta: 0:14:26  iter: 479  total_loss: 1.900  loss_cls: 1.409  loss_box_reg: 0.485  time: 0.5750  data_time: 0.0149  lr: 0.000048  max_mem: 10458M
[10/24 10:13:54 d2.utils.events]: eta: 0:14:15  iter: 499  total_loss: 1.906  loss_cls: 1.423  loss_box_reg: 0.482  time: 0.5749  data_time: 0.0067  lr: 0.000050  max_mem: 10458M
[10/24 10:14:06 d2.utils.events]: eta: 0:14:04  iter: 519  total_loss: 1.886  loss_cls: 1.405  loss_box_reg: 0.483  time: 0.5751  data_time: 0.0071  lr: 0.000052  max_mem: 10458M
[10/24 10:14:18 d2.utils.events]: eta: 0:13:52  iter: 539  total_loss: 1.855  loss_cls: 1.369  loss_box_reg: 0.480  time: 0.5752  data_time: 0.0070  lr: 0.000054  max_mem: 10458M
[10/24 10:14:29 d2.utils.events]: eta: 0:13:41  iter: 559  total_loss: 1.888  loss_cls: 1.351  loss_box_reg: 0.483  time: 0.5755  data_time: 0.0168  lr: 0.000056  max_mem: 10458M
[10/24 10:14:41 d2.utils.events]: eta: 0:13:30  iter: 579  total_loss: 1.895  loss_cls: 1.415  loss_box_reg: 0.473  time: 0.5755  data_time: 0.0071  lr: 0.000058  max_mem: 10458M
[10/24 10:14:52 d2.utils.events]: eta: 0:13:19  iter: 599  total_loss: 1.913  loss_cls: 1.411  loss_box_reg: 0.487  time: 0.5756  data_time: 0.0073  lr: 0.000060  max_mem: 10458M
[10/24 10:15:04 d2.utils.events]: eta: 0:13:08  iter: 619  total_loss: 1.899  loss_cls: 1.422  loss_box_reg: 0.485  time: 0.5756  data_time: 0.0068  lr: 0.000062  max_mem: 10458M
[10/24 10:15:15 d2.utils.events]: eta: 0:12:56  iter: 639  total_loss: 1.911  loss_cls: 1.440  loss_box_reg: 0.475  time: 0.5757  data_time: 0.0068  lr: 0.000064  max_mem: 10458M
[10/24 10:15:27 d2.utils.events]: eta: 0:12:45  iter: 659  total_loss: 1.890  loss_cls: 1.413  loss_box_reg: 0.469  time: 0.5756  data_time: 0.0067  lr: 0.000066  max_mem: 10458M
[10/24 10:15:39 d2.utils.events]: eta: 0:12:34  iter: 679  total_loss: 1.928  loss_cls: 1.432  loss_box_reg: 0.478  time: 0.5759  data_time: 0.0071  lr: 0.000068  max_mem: 10458M
[10/24 10:15:50 d2.utils.events]: eta: 0:12:22  iter: 699  total_loss: 1.893  loss_cls: 1.415  loss_box_reg: 0.476  time: 0.5760  data_time: 0.0070  lr: 0.000070  max_mem: 10458M
[10/24 10:16:02 d2.utils.events]: eta: 0:12:11  iter: 719  total_loss: 1.861  loss_cls: 1.403  loss_box_reg: 0.461  time: 0.5760  data_time: 0.0072  lr: 0.000072  max_mem: 10458M
[10/24 10:16:14 d2.utils.events]: eta: 0:12:00  iter: 739  total_loss: 1.921  loss_cls: 1.435  loss_box_reg: 0.472  time: 0.5763  data_time: 0.0071  lr: 0.000074  max_mem: 10458M
[10/24 10:16:25 d2.utils.events]: eta: 0:11:48  iter: 759  total_loss: 1.896  loss_cls: 1.397  loss_box_reg: 0.466  time: 0.5764  data_time: 0.0074  lr: 0.000076  max_mem: 10458M
[10/24 10:16:37 d2.utils.events]: eta: 0:11:37  iter: 779  total_loss: 1.904  loss_cls: 1.462  loss_box_reg: 0.460  time: 0.5763  data_time: 0.0071  lr: 0.000078  max_mem: 10458M
[10/24 10:16:48 d2.utils.events]: eta: 0:11:26  iter: 799  total_loss: 1.847  loss_cls: 1.406  loss_box_reg: 0.467  time: 0.5764  data_time: 0.0072  lr: 0.000080  max_mem: 10458M
[10/24 10:17:00 d2.utils.events]: eta: 0:11:14  iter: 819  total_loss: 1.859  loss_cls: 1.404  loss_box_reg: 0.463  time: 0.5766  data_time: 0.0064  lr: 0.000082  max_mem: 10458M
[10/24 10:17:12 d2.utils.events]: eta: 0:11:03  iter: 839  total_loss: 1.850  loss_cls: 1.400  loss_box_reg: 0.455  time: 0.5768  data_time: 0.0068  lr: 0.000084  max_mem: 10458M
[10/24 10:17:23 d2.utils.events]: eta: 0:10:51  iter: 859  total_loss: 1.881  loss_cls: 1.419  loss_box_reg: 0.458  time: 0.5767  data_time: 0.0067  lr: 0.000086  max_mem: 10458M
[10/24 10:17:35 d2.utils.events]: eta: 0:10:40  iter: 879  total_loss: 1.885  loss_cls: 1.439  loss_box_reg: 0.455  time: 0.5767  data_time: 0.0085  lr: 0.000088  max_mem: 10458M
[10/24 10:17:46 d2.utils.events]: eta: 0:10:29  iter: 899  total_loss: 1.907  loss_cls: 1.454  loss_box_reg: 0.456  time: 0.5769  data_time: 0.0067  lr: 0.000090  max_mem: 10458M
[10/24 10:17:58 d2.utils.events]: eta: 0:10:17  iter: 919  total_loss: 1.859  loss_cls: 1.437  loss_box_reg: 0.445  time: 0.5770  data_time: 0.0086  lr: 0.000092  max_mem: 10458M
[10/24 10:18:10 d2.utils.events]: eta: 0:10:06  iter: 939  total_loss: 1.906  loss_cls: 1.447  loss_box_reg: 0.443  time: 0.5771  data_time: 0.0067  lr: 0.000094  max_mem: 10458M
[10/24 10:18:21 d2.utils.events]: eta: 0:09:55  iter: 959  total_loss: 1.858  loss_cls: 1.403  loss_box_reg: 0.438  time: 0.5773  data_time: 0.0067  lr: 0.000096  max_mem: 10458M
[10/24 10:18:34 d2.utils.events]: eta: 0:09:43  iter: 979  total_loss: 1.894  loss_cls: 1.418  loss_box_reg: 0.454  time: 0.5783  data_time: 0.0068  lr: 0.000098  max_mem: 10458M
[10/24 10:18:45 d2.utils.events]: eta: 0:09:32  iter: 999  total_loss: 1.826  loss_cls: 1.381  loss_box_reg: 0.435  time: 0.5783  data_time: 0.0068  lr: 0.000100  max_mem: 10458M

One more thing: the modified version of the trainer (seen in the Jupyter Notebook) does not support multi-GPU training.
I had a look at a couple of issues where the model diverges and the loss becomes NaN after a couple of hundred iterations, and as far as I can tell, training is very sensitive to the batch size and the learning rate.
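One common heuristic for this kind of batch-size sensitivity is the linear scaling rule: when the total batch size changes, scale the learning rate by the same ratio. The sketch below only illustrates that rule. The function name `scale_lr` and the example numbers are hypothetical and do not come from this thread or from Detectron2's code, and the rule itself is a starting point rather than a guarantee against divergence.

```python
def scale_lr(reference_lr, reference_batch_size, new_batch_size):
    """Scale a learning rate linearly with the total batch size.

    Hypothetical helper: `reference_lr` is assumed to be a value known to
    train stably at `reference_batch_size` images per iteration.
    """
    if reference_batch_size <= 0 or new_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return reference_lr * new_batch_size / reference_batch_size


# Illustrative numbers only: a schedule tuned for 16 images per iteration,
# reused on a single GPU that fits 2 images per iteration.
print(scale_lr(reference_lr=0.02, reference_batch_size=16, new_batch_size=2))
```

If the loss still becomes NaN after scaling, lowering the learning rate further or lengthening the warmup are the usual next things to try.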

I've changed a couple of lines in detectron2/engine/defaults.py to support DistributedDataParallel directly. I'll check how this approach works and post the results in the comments :)

All 5 comments

The learning rate will increase during the warmup phase.
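To make the warmup behaviour concrete, here is a minimal sketch of a linear warmup. It is not Detectron2's implementation. The function name `linear_warmup_lr` is hypothetical, and the base learning rate (1e-4), warmup length (1000 iterations) and starting factor (0.001) are assumptions chosen to be consistent with the log in the question, not values stated anywhere in this thread.

```python
def linear_warmup_lr(iteration, base_lr=1e-4, warmup_iters=1000, warmup_factor=1e-3):
    """Learning rate at `iteration` under a linear warmup (illustrative sketch).

    All default values are assumptions inferred from the posted log.
    """
    if iteration >= warmup_iters:
        # Warmup is over; a real schedule would start decaying from here.
        return base_lr
    alpha = iteration / warmup_iters
    # Interpolate the multiplier linearly from warmup_factor up to 1.
    scale = warmup_factor * (1 - alpha) + alpha
    return base_lr * scale


for it in (19, 499, 999, 1000):
    print(f"iter: {it}  lr: {linear_warmup_lr(it):.6f}")
```

Under these assumed settings the learning rate climbs linearly for the first 1000 iterations, which is the pattern the log shows. Any annealing would only become visible after the warmup ends.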

Understood. Could you please elaborate on multi-GPU training using the modified trainer from the Jupyter Notebook?

I don't think PyTorch's DistributedDataParallel can support training inside a Jupyter Notebook for now.

I'm executing it as a script (via terminal).

Regarding "Could you please elaborate on multi-gpu training using modified trainer in Jupyter Notebook?": I don't quite understand what you would like to know.
