Models: [deeplab] How much loss to stop training

Created on 21 Mar 2018 · 7 comments · Source: tensorflow/models

I'm using 8 GPUs to train on the Cityscapes dataset in order to reproduce the results in the paper, but I don't know when to stop training the segmentation model.
I've noticed that when training for segmentation, the learning rate is usually smaller than for classification.
Can anyone share their training log on the Cityscapes dataset, along with the hyper-parameters?

All 7 comments

I hope the authors can share the details of their experiments on the Cityscapes dataset.
I'm reading the DeepLabv3+ paper, and it only details the procedure for the VOC dataset.

Does this repository release a pretrained DeepLabv3+ model?

We will add the Cityscapes experiments to the updated DeepLabv3+ paper soon.

Some important hyper-parameters we use when training on the train_fine set: learning rate = 1e-2, training crop_size = 769x769, and training iterations = 90K. Most importantly, you need to fine-tune the batch-norm parameters during training, which, however, requires a large batch size.
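
For concreteness, a training command along those lines might look like the sketch below. It is only an illustration assembled from the numbers quoted above; the paths, the train_split name, the batch size, and the exact flag syntax (depending on the repo version, train_crop_size and atrous_rates are either comma-separated lists or repeated flags) are assumptions rather than values confirmed in this thread.

    # Hypothetical full-training sketch using the quoted hyper-parameters:
    # learning rate 1e-2, 769x769 crops, 90K steps, batch norm fine-tuned.
    # All paths are placeholders.
    python deeplab/train.py \
      --logtostderr \
      --dataset="cityscapes" \
      --train_split="train_fine" \
      --model_variant="xception_65" \
      --atrous_rates=6 \
      --atrous_rates=12 \
      --atrous_rates=18 \
      --output_stride=16 \
      --train_crop_size="769,769" \
      --train_batch_size=16 \
      --base_learning_rate=0.01 \
      --training_number_of_steps=90000 \
      --fine_tune_batch_norm=true \
      --tf_initial_checkpoint=/path/to/imagenet_pretrained/model.ckpt \
      --train_logdir=/path/to/train_logdir \
      --dataset_dir=/path/to/cityscapes/tfrecord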

Given the limited resources at hand, we would suggest you simply fine-tune from our provided checkpoints, whose batch-norm parameters have already been trained (i.e., train with a smaller learning rate, set fine_tune_batch_norm = false, and use more training iterations, since the learning rate is small); a sketch of this fine-tuning setup is shown after the list below. If you really would like to train by yourself, we would suggest:

  1. Set output_stride = 16 or maybe even 32 (remember to change the flag atrous_rates accordingly, e.g., atrous_rates = [3, 6, 9] for output_stride = 32).

  2. Use as many GPUs as possible (change the flag num_clones in train.py) and set train_batch_size as large as possible.

  3. Adjust the train_crop_size in train.py. Maybe set it smaller, e.g., 513x513 (or even 321x321), so that you can use a larger batch size.

  4. Use a smaller network backbone, such as MobileNet-v2, which will be supported soon.
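
For the fine-tuning route mentioned before the list, a hedged sketch is below. The specific learning rate, step count, batch size, clone count, crop size, and checkpoint paths are illustrative assumptions, not values given in this thread; only the general recipe (smaller learning rate, fine_tune_batch_norm = false, longer schedule, smaller crop, output_stride = 16) comes from the comment above.

    # Hypothetical limited-resource sketch: fine-tune from a provided checkpoint
    # whose batch-norm statistics are already trained, so batch norm stays frozen
    # (fine_tune_batch_norm=false), the learning rate is small, and the schedule
    # is longer. num_clones should match the number of available GPUs.
    python deeplab/train.py \
      --logtostderr \
      --dataset="cityscapes" \
      --train_split="train_fine" \
      --model_variant="xception_65" \
      --atrous_rates=6 \
      --atrous_rates=12 \
      --atrous_rates=18 \
      --output_stride=16 \
      --train_crop_size="513,513" \
      --train_batch_size=8 \
      --num_clones=2 \
      --base_learning_rate=0.0001 \
      --training_number_of_steps=150000 \
      --fine_tune_batch_norm=false \
      --tf_initial_checkpoint=/path/to/provided_checkpoint/model.ckpt \
      --train_logdir=/path/to/train_logdir \
      --dataset_dir=/path/to/cityscapes/tfrecord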

@aquariusjay Here is my training script for the Cityscapes dataset; it only achieves 0.16 mIoU. I think there are some errors:
training_number_of_steps=30K
model_variant='xception_65'
atrous_rates=6,12,18
output_stride=4
train_crop_size=769,769
train_batch_size=16
fine_tune_batch_norm=True
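
Note that the reply above says atrous_rates must be changed to match output_stride (e.g., 3,6,9 for output_stride = 32); atrous_rates = 6,12,18 correspond to output_stride = 16, so output_stride = 4 in this script is the most likely cause of the low mIoU, and 30K steps is also well short of the recommended 90K. A configuration closer to the recommendation above would be roughly the following sketch (the crop size, batch size, and batch-norm setting are simply carried over from the original script):

    # Hypothetical corrected flag settings: output_stride raised to 16 so it
    # matches atrous_rates=6,12,18; learning rate and step count taken from
    # the recommendation above; the remaining flags are kept unchanged.
    training_number_of_steps=90000
    model_variant='xception_65'
    atrous_rates=6,12,18
    output_stride=16
    base_learning_rate=1e-2
    train_crop_size=769,769
    train_batch_size=16
    fine_tune_batch_norm=True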

Great!

@aquariusjay Hi, I have a few additional questions:

  1. Can you say how much GPU memory (or how many GPU cards) is needed to train with the configuration you used to achieve the highest mIOU?
  2. Given that I have only one GPU (a Tesla K80), do you think I can achieve your results using CPU clones?
  3. If the answer to 2 is "potentially yes", is the training time on CPU clones completely prohibitive?

Regards.

Hi There,
We are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing it.
