Detectron: Trouble training custom dataset

Created on 20 Feb 2018 · 30 Comments · Source: facebookresearch/Detectron

Training Detectron on custom dataset

I'm trying to train Mask R-CNN on my custom dataset to segment new classes that neither COCO nor ImageNet has seen.

I first converted my dataset to COCO format so that it can be loaded by pycocotools.
I added my dataset path to dataset_catalog.py and set the correct paths to the images directory and the annotations file.
The config file I used is based on configs/getting_started/tutorial_1gpu_e2e_faster_rcnn_R-50-FPN.yaml. My dataset contains only 4 classes, not counting background, so I set NUM_CLASSES to 5 (4 does not work either). When I try to train using the command below:
python2 tools/train_net.py --cfg configs/encov/copy_maskrcnn_R-101-FPN.yaml OUTPUT_DIR /tmp/detectron-output/

ERROR 1:

I get the following error (complete log file is here output.txt)
At:
  /home/encov/Softwares/Detectron/lib/roi_data/fast_rcnn.py(269): _expand_bbox_targets
  /home/encov/Softwares/Detectron/lib/roi_data/fast_rcnn.py(181): _sample_rois
  /home/encov/Softwares/Detectron/lib/roi_data/fast_rcnn.py(112): add_fast_rcnn_blobs
  /home/encov/Softwares/Detectron/lib/ops/collect_and_distribute_fpn_rpn_proposals.py(62): forward
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
  what(): [enforce fail at pybind_state.h:423] . Exception encountered running PythonOp function: ValueError: could not broadcast input array from shape (4) into shape (0)

This error comes from the procedure that expands the bounding box targets to 4 values per class (see roi_data/fast_rcnn.py). It takes the first element of each row, which is the class index, checks that it is not 0 (the background), and copies the target values into the 4 columns starting at class index x 4. The error happens because the class index is greater than or equal to the NUM_CLASSES parameter that was used to allocate the output array.

ERROR 2:

I tried the same training with NUM_CLASSES set to 81, the number of classes used for COCO training (which works on my setup, by the way).
The error described above does not appear, but in the very first iterations the bounding box areas are zero, which causes divisions by zero.
output2.txt

Has anyone experienced the same issue when training Fast R-CNN or Mask R-CNN on a custom dataset?
I strongly suspect an error in my COCO-like JSON file, because training on the COCO dataset works correctly.
Thank you for your help.

System information

Operating system: Ubuntu 16.04
Compiler version: GCC 5.4.0
CUDA version: 8.0
cuDNN version: 7.0
NVIDIA driver version: 384
GPU model: GeForce GTX 1080 (x1)
python --version output: Python 2.7.12

community help wanted

francoto

All 30 comments

How many classes do you have in your custom dataset? If you have N classes, then you should set NUM_CLASSES: N+1 in your yaml config file. For example, for six classes you should set NUM_CLASSES: 7. For 80 classes COCO you should set it to 81.
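As a rough sanity check of the N+1 rule, the value can be derived from the annotation file itself. This is a minimal sketch, assuming a COCO-style dictionary with a `categories` list and assuming the data loader adds the background class on its own (as reported later in this thread). `expected_num_classes` and the category names are hypothetical and not part of Detectron.

```python
def expected_num_classes(coco_dict):
    """Return the NUM_CLASSES value implied by a COCO-style annotation dict."""
    # Count distinct category ids so that repeated entries are not counted twice.
    category_ids = {cat["id"] for cat in coco_dict["categories"]}
    # One extra slot for the background class, assumed to be added by the loader.
    return len(category_ids) + 1


# Hypothetical 4-class dataset.
annotations = {
    "categories": [
        {"id": 1, "name": "class_a"},
        {"id": 2, "name": "class_b"},
        {"id": 3, "name": "class_c"},
        {"id": 4, "name": "class_d"},
    ]
}
print(expected_num_classes(annotations))  # 5
```

If the number printed here differs from what you expect, the annotation file probably contains more categories than intended.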

realwecan on 22 Feb 2018

😄3 👍1

Thank you 👍. I have 4 classes, so I should set NUM_CLASSES to 5.
Now I know that I must use this value, but I had already tried it and got ERROR 1 described above.

The error (from what I understood in lib/roi_data/fast_rcnn.py) comes from the fact that _expand_bbox_targets creates an array whose size is defined by the NUM_CLASSES parameter. When this array is filled in the for loop, the first element of each box is used as the class index, and the error happens when this class index is greater than or equal to NUM_CLASSES. It is weird that I can get a class index this large.

For the record, here are the lines of code I am talking about (in lib/roi_data/fast_rcnn.py):

l.251 num_bbox_reg_classes = cfg.MODEL.NUM_CLASSES

l.256 bbox_targets = blob_utils.zeros((clss.size, 4 * num_bbox_reg_classes))

ll.260-270

    inds = np.where(clss > 0)[0]
    # print("DEBUG: inds value is {}".format(inds))
    for ind in inds:
        cls = int(clss[ind])
        start = 4 * cls
        end = start + 4
        bbox_targets[ind, start:end] = bbox_target_data[ind, 1:]
        bbox_inside_weights[ind, start:end] = (1.0, 1.0, 1.0, 1.0)

Error occurs when cls is greater than or equal to cfg.MODEL.NUM_CLASSES
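The failure can be reproduced outside Detectron with a few lines of NumPy. The sketch below is a simplified stand-in for the loop quoted above, not the library code: `expand_bbox_targets` and the sample rows are made up for illustration, and `np.zeros` replaces `blob_utils.zeros`.

```python
import numpy as np


def expand_bbox_targets(bbox_target_data, num_classes):
    """Simplified stand-in for the expansion loop quoted above.

    bbox_target_data has one row per RoI: (class, tx, ty, tw, th).
    """
    clss = bbox_target_data[:, 0]
    bbox_targets = np.zeros((clss.size, 4 * num_classes), dtype=np.float32)
    for ind in np.where(clss > 0)[0]:
        cls = int(clss[ind])
        start = 4 * cls
        end = start + 4
        # If cls >= num_classes, start:end lies past the last column, the
        # slice is empty, and NumPy cannot fit 4 values into 0 slots.
        bbox_targets[ind, start:end] = bbox_target_data[ind, 1:]
    return bbox_targets


ok_rows = np.array([[4, 0.1, 0.2, 0.3, 0.4]], dtype=np.float32)
bad_rows = np.array([[5, 0.1, 0.2, 0.3, 0.4]], dtype=np.float32)

print(expand_bbox_targets(ok_rows, num_classes=5).shape)  # (1, 20)
try:
    expand_bbox_targets(bad_rows, num_classes=5)
except ValueError as err:
    print("ValueError:", err)
```

With 5 classes the valid class indices are 0 to 4, so a label of 5 is already enough to trigger the broadcast error.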

francoto on 22 Feb 2018

@francoto I have a question: how did you convert your dataset to COCO format?
Thanks in advance.

raninbowlalala on 24 Feb 2018

@raninbowlalala
Starting from my initial dataset (not a COCO-like dataset), I wrote a Python script to fill every field of the COCO dataset dict.
You can find the COCO dataset format here.
I also installed pycocotools and copied coco.py as mycustomdataset.py.
Then you "just" have to redefine the constructor so that it creates a dataset in a similar format.
Make sure it works by loading your final .json file with the COCO API.
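For readers who would rather write the JSON directly than subclass coco.py, here is a minimal sketch of such a conversion script. It is not the script used in this thread: `build_coco_dict`, the file name and the class names are hypothetical, and only the commonly used COCO fields are filled in.

```python
import json


def build_coco_dict(images, annotations, category_names):
    """Assemble a minimal COCO-style dictionary.

    images: list of (file_name, width, height) tuples.
    annotations: list of (image_index, category_name, bbox, polygon) tuples,
        with bbox as [x, y, width, height] and polygon as a flat
        [x1, y1, x2, y2, ...] list, both in pixels.
    """
    # Category ids start at 1 and no background entry is added here.
    category_ids = {name: i + 1 for i, name in enumerate(category_names)}
    coco = {
        "images": [
            {"id": i + 1, "file_name": file_name, "width": width, "height": height}
            for i, (file_name, width, height) in enumerate(images)
        ],
        "categories": [
            {"id": i + 1, "name": name} for i, name in enumerate(category_names)
        ],
        "annotations": [],
    }
    for i, (image_index, category_name, bbox, polygon) in enumerate(annotations):
        coco["annotations"].append({
            "id": i + 1,
            "image_id": image_index + 1,  # matches the image ids assigned above
            "category_id": category_ids[category_name],
            "bbox": list(bbox),
            "area": bbox[2] * bbox[3],  # box area used as a stand-in for mask area
            "segmentation": [list(polygon)],
            "iscrowd": 0,
        })
    return coco


coco = build_coco_dict(
    images=[("img_0001.png", 1599, 1903)],
    annotations=[(0, "class_a", [10.0, 20.0, 30.0, 40.0],
                  [10.0, 20.0, 40.0, 20.0, 40.0, 60.0, 10.0, 60.0])],
    category_names=["class_a", "class_b"],
)
text = json.dumps(coco, indent=2)  # write this string to your .json file
print(json.loads(text)["annotations"][0]["category_id"])  # 1
```

Loading the resulting file with the pycocotools `COCO` class, as suggested above, remains the real test that the structure is acceptable.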

I hope this helps.

francoto on 26 Feb 2018

@francoto Thanks for your help, I converted my dataset to coco format successfully.

raninbowlalala on 27 Feb 2018

I finally made it:

First, the bounding box coordinates in my dataset were wrong. I realized my mistake when I tried to visualize them using the pycocotools API (which, by the way, has no dedicated method for showing them by default).
Second, I had misunderstood the part about needing a 'background' class (for labelling every pixel not in the other classes), so I added one to my dataset, but json_dataset.py actually creates its own. Deleting the 'background' label from my dataset finally allowed me to start training.
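Both mistakes can be caught before training with a small check over the annotation file. This is only a sketch under assumptions: `find_annotation_problems`, the `min_side` threshold and the sample data are hypothetical, and the background rule assumes that the loader adds its own background class, as described above.

```python
def find_annotation_problems(coco_dict, min_side=1.0):
    """Return human-readable warnings for a COCO-style annotation dict."""
    problems = []
    categories = coco_dict.get("categories", [])
    known_ids = {cat["id"] for cat in categories}

    for cat in categories:
        name = cat["name"].strip().lower()
        # A hand-made background entry (id 0 or a background-like name) is
        # suspicious if the loader already adds its own background class.
        if cat["id"] == 0 or name in ("background", "__background__"):
            problems.append("category id %s (%r) looks like a manual background class"
                            % (cat["id"], cat["name"]))

    for ann in coco_dict.get("annotations", []):
        x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
        if w <= min_side or h <= min_side:
            problems.append("annotation %s has a degenerate box %r"
                            % (ann["id"], ann["bbox"]))
        if ann["category_id"] not in known_ids:
            problems.append("annotation %s uses unknown category id %s"
                            % (ann["id"], ann["category_id"]))
    return problems


sample = {
    "categories": [{"id": 0, "name": "background"}, {"id": 1, "name": "class_a"}],
    "annotations": [
        {"id": 1, "category_id": 1, "bbox": [10.0, 20.0, 1.0, 1.0]},
        {"id": 2, "category_id": 1, "bbox": [630.0, 482.0, 82.0, 59.0]},
    ],
}
for problem in find_annotation_problems(sample):
    print(problem)
```

On this sample it reports the manual background category and the 1-pixel box, and leaves the second annotation alone.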

francoto on 7 Mar 2018

👍5

Hi francoto,
I am also training Mask R-CNN on my own data, but I have a problem: the bbox precision is satisfactory (mAP 0.5+, mAR 0.6+), while the mask accuracy is poor (mAP 0.2, mAR 0.2). Did you achieve good performance on instance segmentation?

YanWang2014 on 16 Mar 2018

Hello @YanWang2014,
In my case, I got similar performances for bbox and mask (AP ~ 0.8).
My current dataset is quite small (~350 images for test and 40 images for validation) so I don't know if the number I gave is relevant.
Good luck for your task.

francoto on 16 Mar 2018

I'm sorry but I'm still struggling with training on a different number of classes. I have 2 classes in my annotation file so I set the number of classes in my config file to 3. I added some lines in the net.py to prevent the class related layers from loading (after this line):

if keyname in ('cls_score_w', 'cls_score_b', 'bbox_pred_w', 'bbox_pred_b'):
    logger.info('ignore: ' + keyname)
    continue

That way Detectron should not load the weights from these layers and leave them in the dimensions as configured in the .yaml file.
That's the only code I've changed but I still get the error: could not broadcast input array from shape (4) into shape (0)
@francoto How did you solve this problem or did you train from scratch?

I'm happy for any help.

mattifrind on 4 Apr 2018

Hello @mattifrind !
From my perspective, you should let Detectron handle the configuration you describe in your .yaml file. I reused the model weights from the getting_started/yaml* examples.

I would say that you should not 'force' Detectron to discard weights.
The only issue I had was that the class names displayed in the PDF results remained the 'old' ones: 'person', 'bicycle', etc.

francoto on 5 Apr 2018

@francoto are you using inference to produce your PDF results? I was initially doing that, and infer_simple.py uses a dummy dataset, dummy_coco_dataset = dummy_datasets.get_coco_dataset(), with the COCO dataset labels. Also, do the bounding boxes you get make sense? I get decent masks, but the bounding boxes are not around these masks.

gabriellapizzuto on 5 Apr 2018

Hey @francoto! Thanks for your help.
I tried this because of a tip from Kaiming He in this issue. I tried to understand the code and found that the model structure defined in the .yaml file is overridden by the weights in the .pkl file. So if I configure 3 classes, the cls_score layer, for example, which should have 3 outputs, is replaced by the layer from the .pkl file with 81 outputs. Am I wrong?
Unfortunately, I get errors with or without my code change in net.py.

mattifrind on 5 Apr 2018

Hey @gabriellap,
the commands I use are:
to train:

$ python2 tools/train_net.py \
--cfg configs/<custom_config>.yaml \
OUTPUT_DIR /tmp/detectron-output

to test:

$ python2 tools/infer_simple.py --cfg configs/<custom_config> \
--output-dir /tmp/detection-visualizations \
--image-ext png \
--wts /tmp/detectron-output/<output_train_directory>/generalized_rcnn/model_final.pkl \
demo # location of the images

I can't share my results publicly, but my bounding box locations and masks are quite good (I obviously have some errors, but considering my dataset is only ~350 images, I think it's pretty amazing). As I said, though, I still get the COCO dataset labels. I need to check the infer_simple.py file.

francoto on 5 Apr 2018

Hey @mattifrind, from what I remember, the error could not broadcast input array from shape (4) into shape (0) happened in my case when the cfg.MODEL.NUM_CLASSES parameter did not match clss in lib/roi_data/fast_rcnn.py. I guess that even after you apply your fix to manually drop the weights for the classes you don't use, your data may still contain a class index that is greater than or equal to your cfg.MODEL.NUM_CLASSES.

For the record, here are the lines of code I am talking about (in lib/roi_data/fast_rcnn.py):

l.251 num_bbox_reg_classes = cfg.MODEL.NUM_CLASSES

l.256 bbox_targets = blob_utils.zeros((clss.size, 4 * num_bbox_reg_classes))

ll.260-270

    inds = np.where(clss > 0)[0]
    # print("DEBUG: inds value is {}".format(inds))
    for ind in inds:
        cls = int(clss[ind])
        start = 4 * cls
        end = start + 4
        bbox_targets[ind, start:end] = bbox_target_data[ind, 1:]
        bbox_inside_weights[ind, start:end] = (1.0, 1.0, 1.0, 1.0)

Error occurs when cls is greater than or equal to cfg.MODEL.NUM_CLASSES

Have you tried training without the code change for the weights?
Have you added a 'background' label to your dataset? In my case, I tried adding one manually, and that messed everything up.

Hope that helps,

francoto on 6 Apr 2018

Hey @francoto, thanks for your help!
Without changing the code I get this error:

Traceback (most recent call last):
  File "tools/train_net.py", line 128, in <module>
    main()
  File "tools/train_net.py", line 110, in main
    checkpoints = utils.train.train_model()
  File "/home/ubuntu/detectron/lib/utils/train.py", line 58, in train_model
    setup_model_for_training(model, weights_file, output_dir)
  File "/home/ubuntu/detectron/lib/utils/train.py", line 161, in setup_model_for_training
    nu.initialize_gpu_from_weights_file(model, weights_file, gpu_id=0)
  File "/home/ubuntu/detectron/lib/utils/net.py", line 119, in initialize_gpu_from_weights_file
    src_blobs[src_name].shape)
AssertionError: Workspace blob cls_score_w with shape (3, 1024) does not match weights file shape (81, 1024)

Didn't you have this problem too when you changed the number of classes and used a pre-trained model?

With my change, I get the broadcast error. My dataset has no background class, and my 2 categories have the indices 1 and 2 (I also tried 0 and 1, with the same result).

mattifrind on 7 Apr 2018

Hello @mattifrind, I haven't seen this kind of error, so I can't really help you with this.
Good luck 🤞

francoto on 9 Apr 2018

@mattifrind and @francoto I got that error because I tried a pre-trained model with 81 classes. To fix it, I just used the ImageNet pre-trained model from MODEL_ZOO.
Did you find any way to train without WEIGHTS? I tried WEIGHTS: '' (empty) and got AssertionError: Negative areas founds. Any ideas?

ambigus9 on 17 Apr 2018

Did you solve the problem? I ran into the same one. @mattifrind Thanks in advance.

ZSSNIKE on 17 May 2018

@ZSSNIKE Because I needed to get my task done, I stopped trying to fix that. Using 81 classes works for me as a workaround. Good luck!

mattifrind on 18 May 2018

@mattifrind How do you set 81 classes? Changing only NUM_CLASSES to 81 is not enough, right? Do you also need to convert the annotations so that they contain 81 categories?

chenweisomebody126 on 5 Jun 2018

@chenweisomebody126 Yes, the pre-trained models from Detectron have 81 classes, and so do the configuration files (.yaml). I wrote a Java program to convert my dataset to the COCO format. After the conversion, the program deletes 2 classes of the original COCO dataset and adds my two. That's how I train.

mattifrind on 5 Jun 2018

@francoto I am getting exactly the same error as you:
ValueError: could not broadcast input array from shape (4) into shape (0)
My custom dataset has 4 classes and I have set NUM_CLASSES to 5. I have added the dataset to dataset_catalog.py and generated the JSON file for the dataset. A sample annotation in the JSON file looks like the following:

{'id': 6, 'image_id': 1, 'category_id': 1, 'iscrowd': 0, 'area': 4674, 'bbox': [630.0, 482.0, 82.0, 59.0], 'segmentation': [[650.0, 540.5, 629.5, 540.0, 630.0, 483.5, 711.5, 482.0, 711.0, 538.5, 650.0, 540.5]], 'width': 1599, 'height': 1903}

You have listed the steps, but I can't understand them clearly. Can you please elaborate on the steps you took, i.e.:
"bounding box coordinates in my dataset were wrong": How were they wrong, and how did you correct them?
"Finally, I misunderstood the part where I need a 'background' class": How did you correct this part?

Thanks in advance

vsd550 on 18 Jun 2018

Hello @vsd550, it has been a while since I posted this, and I haven't used Detectron since I got my first results, but I will try to explain.

"bounding box coordinates in my dataset were wrong": as I said, I converted my custom dataset to the COCO-like form on my own, and I was not using the correct parameters to compute the bounding box from the segmentation polygon (if I remember correctly, my bounding box was only 1 pixel high and wide).
"background": previously I was manually adding a 'background' class with id=0 to my COCO dataset, but without any occurrence in the dataset. My problem was solved when I removed this 'background' class from the dataset I had built. I think Detectron actually creates this background class at the very beginning of training, when it loads your dataset.

I hope this makes my steps clear (or clearer) for you.
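For the bounding box part, a box in the COCO [x, y, width, height] convention can be derived from a flat polygon as sketched below. `polygon_to_bbox` is a hypothetical helper, not the code used in this thread; the polygon is the one from the sample annotation posted above.

```python
def polygon_to_bbox(polygon):
    """Return the [x, y, width, height] box enclosing a flat polygon.

    polygon is a flat list [x1, y1, x2, y2, ...] in pixel coordinates.
    """
    xs = polygon[0::2]  # every even position is an x coordinate
    ys = polygon[1::2]  # every odd position is a y coordinate
    x_min, y_min = min(xs), min(ys)
    return [x_min, y_min, max(xs) - x_min, max(ys) - y_min]


polygon = [650.0, 540.5, 629.5, 540.0, 630.0, 483.5,
           711.5, 482.0, 711.0, 538.5, 650.0, 540.5]
print(polygon_to_bbox(polygon))  # [629.5, 482.0, 82.0, 58.5]
```

The result is within one pixel of the [630.0, 482.0, 82.0, 59.0] box stored in that annotation, so that sample looks consistent. A box that comes out around 1 pixel wide or high would point to the kind of conversion mistake described here.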

francoto on 20 Jun 2018

👍1

I hit Error 1. My dataset has 150 classes, so I set NUM_CLASSES to 151 in the yaml file, and I still got Error 1. I was sure this setting was correct because I had trained successfully on other data before, so I checked the data for this run. It turned out that one class name, "Bao_yan_sheng_chou_1750ml", was written in two ways, one of them with an extra leading space, so there were actually 151 classes, which caused the error. After I deleted the extra space, everything worked.

So I think that about 80% of Error 1 cases are caused by problems in one's own data.

My English is so poor!
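This kind of mistake is easy to detect automatically. The sketch below groups category names that differ only by leading or trailing whitespace; `find_whitespace_duplicates` and the `other_class` entry are hypothetical.

```python
from collections import defaultdict


def find_whitespace_duplicates(names):
    """Map each stripped name to its variants when more than one spelling exists."""
    groups = defaultdict(set)
    for name in names:
        groups[name.strip()].add(name)
    return {key: sorted(variants)
            for key, variants in groups.items() if len(variants) > 1}


names = ["Bao_yan_sheng_chou_1750ml", " Bao_yan_sheng_chou_1750ml", "other_class"]
print(find_whitespace_duplicates(names))
```

An empty result means no two names collapse to the same string after stripping. Any entry in the result is a class that is being counted more than once.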

JaosonMa on 6 Aug 2018

Hello @YanWang2014,
In my case, I got similar performances for bbox and mask (AP ~ 0.8).
My current dataset is quite small (~350 images for test and 40 images for validation) so I don't know if the number I gave is relevant.
Good luck for your task.

Hi! I have the same problem when training on my custom dataset. The box AP is ~0.6, while the mask AP is ~0.5. Did you find the cause? I look forward to your reply!

lssily on 7 Oct 2018

@francoto

The only issue I got was that the name of the classes detected displayed in the pdf results remains the 'old' ones: 'person', 'bicycle', etc.

I have the same problem. Did you fix it? If so, how? Thanks.

maiff on 15 Jun 2019

Hello @maiff,
I found out that the category names were written directly in the file detectron/datasets/dummy_datasets.py, in the method get_coco_dataset(), so I just created my own get_custom_dataset() method with the category names I wanted. Then you update tools/infer_simple.py to use your new method. It did the trick for me.
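A rough sketch of what such a method could look like is shown below. It is written from the description in this comment, not copied from Detectron: `ClassNameDataset` is a hypothetical stand-in for whatever object get_coco_dataset() returns, the class names are placeholders, and it assumes the visualization code only needs a `classes` mapping from class index to name with index 0 reserved for background.

```python
class ClassNameDataset(object):
    """Hypothetical stand-in that only carries an index -> name mapping."""

    def __init__(self, class_names):
        # Index 0 is reserved for the background class (assumption).
        names = ["__background__"] + list(class_names)
        self.classes = {i: name for i, name in enumerate(names)}


def get_custom_dataset():
    """Return a dummy dataset carrying the custom class names."""
    # Placeholder names; replace them with the categories of your dataset,
    # in the same order as in the annotation file.
    return ClassNameDataset(["class_a", "class_b", "class_c", "class_d"])


dataset = get_custom_dataset()
print(dataset.classes[1])  # class_a
```

The order of the names matters: it has to match the order in which the categories were indexed during training, otherwise the labels drawn on the results will be swapped.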

(I'm still using an old version from January 2019)

Good luck :)

francoto on 17 Jun 2019

@francoto Thank you very much, I have solved it

maiff on 17 Jun 2019

@francoto Which cloud service did you use, or do you have a GPU on your personal computer?