Mmdetection: [Dev plan] Re-design the dataset API

Created on 3 Jul 2019 · 9 comments · Source: open-mmlab/mmdetection

The goal of this refactoring is to make it easier to define new datasets and pre-processing pipelines. After the refactoring, users should be able to add new datasets and pipelines without modifying existing code, as well as reuse more components. There will be breaking changes; any discussion is welcome.

There will be at least 2 PRs to implement it.

  • [x] Use registry to manage datasets. (#924)
  • [x] Use a list of transforms to define the data pre-processing pipeline. (#935)
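For context, the registry pattern behind the first PR can be sketched roughly as follows (a simplified, hypothetical version; the actual mmdetection Registry has more validation and bookkeeping):

```python
# Simplified sketch of a registry; the real mmdetection Registry
# does more validation and bookkeeping.
class Registry:
    def __init__(self, name):
        self.name = name
        self.module_dict = {}

    def register_module(self, cls):
        # Used as a decorator: @DATASETS.register_module
        self.module_dict[cls.__name__] = cls
        return cls

    def get(self, name):
        return self.module_dict[name]


DATASETS = Registry('dataset')


def build_from_cfg(cfg, registry):
    # Instantiate a registered class from a config dict whose
    # 'type' key names the class.
    args = dict(cfg)
    cls = registry.get(args.pop('type'))
    return cls(**args)
```

With this in place, adding a dataset is just defining a decorated class; no central list needs editing.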

The dataset definition will look like this.

from .coco import CocoDataset
from .registry import DATASETS


@DATASETS.register_module
class MyDataset(CocoDataset):

    CLASSES = ('class1', 'class2', 'class3')

    def __init__(self,
                 ann_file,
                 pipeline,
                 img_prefix=None,
                 seg_prefix=None,
                 proposal_file=None,
                 test_mode=False):
        pass

The config file will look like this.

train_pipeline = [
    dict(type='LoadImage'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=False),
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='ToTensor', keys=['gt_bboxes', 'gt_labels']),
    dict(
        type='ToDataContainer',
        fields=[
            dict(key='img', stack=True),
            dict(key='gt_bboxes'),
            dict(key='gt_labels')
        ]),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]
train_set = dict(
    type='CocoDataset',
    ann_file=data_root + 'annotations/instances_train2017.json',
    img_prefix=data_root + 'train2017/',
    pipeline=train_pipeline)
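The list-of-transforms design implies that the pipeline is a chain of callables applied to a shared results dict. A minimal sketch of that composition (hypothetical, simplified relative to the real implementation):

```python
class Compose:
    # Chains transforms; each transform takes and returns a results
    # dict (img, gt_bboxes, ...), or None to drop the sample.
    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, results):
        for transform in self.transforms:
            results = transform(results)
            if results is None:
                return None
        return results
```

Because every transform shares the same dict-in/dict-out contract, new transforms compose with existing ones without any changes elsewhere.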


All 9 comments

This will be a great improvement! It would be great to make it doable through config files also.

I'd love to help with this, as I had to jump through a few hoops to get our own dataset format working. Here are some suggestions:

  • Don't assume the __init__ function signature is going to be the same for all dataset formats, e.g. using an LMDB file doesn't require img_prefix or seg_prefix.
  • Think about hiding some details from the config file. As a user, I wouldn't know that img needs stack=True for ToDataContainer; this also wouldn't usually change between runs. I've gotten around this by implementing such settings as class variables of the dataset class.

@dseuss Thanks for your suggestions! Good points.

  1. For the signature of the __init__ method, I have considered using only two arguments, such as loader and pipeline. The complication is that we want to provide a simple API for using multiple datasets. A common example is that the training set of COCO 2014 usually consists of two parts, 2014train and 2014valminusminival, which we need to wrap with ConcatDataset, as in this line. The config file can then be as simple as:
train_set = dict(
    type='CocoDataset',
    ann_files=['ann_file1', 'ann_file2'],
    img_prefix=['prefix1', 'prefix2'],
    # or just the same prefix
    # img_prefix='some prefix'
    pipeline=train_pipeline)
  2. I agree that we should hide as many details as possible, but that sometimes conflicts with flexibility. We cannot use the key img to determine whether to stack it, since there may be other image fields like img1 and img2. There may also be other fields that are 3-dimensional, so we cannot use dimensionality to decide the stacking behavior either. Thus we leave the argument in the config file.
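One way to read the list-valued config from point 1: expand it into one config per annotation file, then wrap the resulting datasets with ConcatDataset. A hypothetical sketch of the expansion step (expand_dataset_cfg is an assumed helper, not mmdetection code):

```python
# Hypothetical: expand list-valued ann_files/img_prefix into one config
# per annotation file; the built datasets would then be wrapped in
# something like torch.utils.data.ConcatDataset.
def expand_dataset_cfg(cfg):
    ann_files = cfg['ann_files']
    if not isinstance(ann_files, (list, tuple)):
        return [cfg]
    prefixes = cfg.get('img_prefix')
    if not isinstance(prefixes, (list, tuple)):
        # the same prefix is shared by all annotation files
        prefixes = [prefixes] * len(ann_files)
    return [dict(cfg, ann_files=a, img_prefix=p)
            for a, p in zip(ann_files, prefixes)]
```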

We would appreciate any ideas on the above problems and further discussion. If you come across any other problems, we can also consider them before the API is finalized. We could provide an LMDBDataset just like JsonDataset and XMLDataset, and you may help with the implementation. We can try more datasets to ensure that the API is flexible and that users do not need to hack the codebase to implement new datasets.

Great points; I haven't had to deal with those situations so far, so I didn't anticipate them.

  1. The way I've dealt with such a situation is to have a class method from_config that takes the Config object and deals with such special cases. This way the get_dataset function just has to call e.g. from_config(cfg.data.train), and the dataset itself can decide how to handle the special cases. A default implementation could be something like:
@classmethod
def from_config(cls, cfg):
    config_dict = dict(**cfg)
    config_dict.pop('type', None)
    return cls(**config_dict)

A user could then simply overload this function if they want to e.g. concatenate multiple datasets in certain situations.

  2. OK, makes sense. There are ways of dealing with that (e.g. having different classes for datasets with just boxes vs. boxes + masks), but I'm not sure it would be simpler.
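To make the overloading idea in point 1 concrete, here is a hypothetical dataset whose from_config expands a list of annotation files into one instance per file (MultiFileDataset is an illustrative name; a caller would then wrap the returned list, e.g. with ConcatDataset):

```python
# Hypothetical example of overloading from_config for the
# multiple-annotation-files case.
class MultiFileDataset:
    def __init__(self, ann_file, **kwargs):
        self.ann_file = ann_file

    @classmethod
    def from_config(cls, cfg):
        config_dict = dict(**cfg)
        config_dict.pop('type', None)
        ann_file = config_dict.pop('ann_file')
        if isinstance(ann_file, (list, tuple)):
            # one dataset instance per annotation file
            return [cls(ann_file=a, **config_dict) for a in ann_file]
        return cls(ann_file=ann_file, **config_dict)
```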

I would like to reiterate that the training pipeline should allow loading transforms from external user modules. The example given does not address this and assumes all augmentations are handled by callables inside mmdetection. This is a serious limitation, especially for augmentations, which are often fine-tuned to a specific model/dataset. I would like to suggest adding an option to load modules and use them in the pipeline list, something like:

train_pipeline = [
    dict(type='LoadModules', modules=[
            dict(path='/mydir/aug1', alias='aug1'),  # 'as' is a reserved word in Python, hence 'alias'
            dict(path='/mydir/aug2', alias='aug2')
    ]),
    dict(type='LoadImage'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=False),
    dict(type='aug1.Resize', img_scale=(1333, 800), keep_ratio=True), #NOTE THE PREFIX
    dict(type='aug2.RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='ToTensor', keys=['gt_bboxes', 'gt_labels']),
    dict(
        type='ToDataContainer',
        fields=[
            dict(key='img', stack=True),
            dict(key='gt_bboxes'),
            dict(key='gt_labels')
        ]),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]
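A LoadModules-style step could be built on importlib. A hypothetical sketch of loading a user module from an arbitrary path under a chosen alias (function name assumed):

```python
import importlib.util
import sys

def load_user_module(path, alias):
    # Load a Python file from `path` and register it in sys.modules
    # under `alias`, so later pipeline entries can refer to
    # 'alias.SomeTransform'.
    spec = importlib.util.spec_from_file_location(alias, path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[alias] = module
    spec.loader.exec_module(module)
    return module
```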

@mosheliv Thanks for your advice. As I replied in #998, our goal is to make mmdetection a backend that does not need to be modified when developing other methods. A preload script is necessary not only for transforms/augmentations, but also for any other user modules such as backbones, detectors, and ops. We plan to implement a preloading scheme that registers all user modules before the high-level scripts (train.py, test.py, etc.) run.

Thanks. I remembered your answer, but I wanted it in this thread so it won't be forgotten.

@hellock Is there any possibility of making data_root overridable via a command-line argument in the dataset API/config changes?

To use absolute paths for dataset locations without keeping data inside the working directory, one currently has to modify or duplicate the config files. For several reasons I don't feel that linking datasets into the current directory is a good solution; it works for some people, but has drawbacks.

It'd be nice to be able to run python tools/train.py ${CONFIG_FILE} --data_root /abs/path/to/data/root and override the default ./data/dataset prefix, very similarly to how the work_dir argument works. I think it's a valid option for the same reasons one wants to specify work_dir.

A change to support this with minimal impact would require:

  1. moving the data_root + ann_file/img_prefix concatenation out of the config files and into CustomDataset.__init__, with data_root added as an argument so that it can be overridden
    def __init__(self,
                 data_root,
                 ann_file,
                 img_prefix,
                 img_scale,
                 ...):
        # prefix of image paths
        self.img_prefix = data_root + img_prefix
        ann_file = data_root + ann_file
  2. overriding cfg.data_root from the argument in the tool scripts, as with work_dir
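Step 2 could mirror the existing work_dir handling in the tool scripts. A hypothetical sketch (flag name and apply_overrides helper are assumptions, not existing mmdetection code):

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description='Train a detector')
    parser.add_argument('config', help='train config file path')
    parser.add_argument('--data_root', help='override the dataset root directory')
    return parser.parse_args(argv)

def apply_overrides(cfg, args):
    # Mirror the work_dir pattern: a command-line value wins over the
    # value baked into the config.
    if args.data_root is not None:
        cfg['data_root'] = args.data_root
    return cfg
```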

Additionally, we could refactor the tools directory into entry points for the Python module. Then you could run e.g.

python -m mmdet train 

from anywhere. This would also fix the current issue that the training/evaluation scripts are not installed when mmdet is pip-installed directly from GitHub without cloning the repo.
