The goal of this code refactoring is to make it easier to define new datasets and pre-processing pipelines. After the refactoring, users should be able to add new datasets/pipelines without modifying existing code, as well as reuse more components. There will be breaking changes, and any discussion is welcome.
There will be at least 2 PRs to implement it.
The dataset definition will look like this.
from .coco import CocoDataset
from .registry import DATASETS


@DATASETS.register_module
class MyDataset(CocoDataset):

    CLASSES = ('class1', 'class2', 'class3')

    def __init__(self,
                 ann_file,
                 pipeline,
                 img_prefix=None,
                 seg_prefix=None,
                 proposal_file=None,
                 test_mode=False):
        pass
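The register_module decorator only makes the class discoverable by name; building an instance from a config dict is then a simple lookup. A minimal sketch of what that could look like (the helper name build_from_cfg and the registry's get method are illustrative assumptions, not the final API):

def build_from_cfg(cfg, registry):
    args = dict(cfg)              # copy so the original config dict is untouched
    obj_type = args.pop('type')   # registered class name, e.g. 'MyDataset'
    obj_cls = registry.get(obj_type)  # look the class up in the registry
    return obj_cls(**args)        # instantiate with the remaining keys as kwargs

# e.g. dataset = build_from_cfg(train_set, DATASETS)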
The config file will look like this.
train_pipeline = [
    dict(type='LoadImage'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=False),
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='ToTensor', keys=['gt_bboxes', 'gt_labels']),
    dict(
        type='ToDataContainer',
        fields=[
            dict(key='img', stack=True),
            dict(key='gt_bboxes'),
            dict(key='gt_labels')
        ]),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]

train_set = dict(
    type='CocoDataset',
    ann_file=data_root + 'annotations/instances_train2017.json',
    img_prefix=data_root + 'train2017/',
    pipeline=train_pipeline)
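For context on how such a pipeline config would be consumed: each dict in train_pipeline names a transform registered in a pipeline registry, and the dataset composes the built transforms into a single callable applied to each sample. A rough sketch under these assumptions (the PIPELINES registry, the Compose class, and the reuse of the build_from_cfg helper sketched above are illustrative, not the final API):

class Compose(object):

    def __init__(self, transforms):
        # each entry is a config dict like dict(type='Resize', ...) that is
        # turned into a callable transform object via the (assumed) registry
        self.transforms = [build_from_cfg(t, PIPELINES) for t in transforms]

    def __call__(self, results):
        # `results` is the dict of sample data flowing through the pipeline
        for t in self.transforms:
            results = t(results)
            if results is None:
                return None
        return results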
This will be a great improvement! It would also be great to make this doable purely through config files.
I'd love to help on this as I had to jump through a few hoops to get our own dataset format to work. Here are some suggestions:
The __init__ function signature is not going to be the same for all dataset formats; e.g. using an LMDB file doesn't require img_prefix or seg_prefix.
Settings such as img needing stack=True for ToDataContainer wouldn't usually change between runs. I've gotten around this by implementing certain settings as class variables of the dataset class.

@dseuss Thanks for your suggestions! Good points.
Regarding the __init__ method, I have considered using only two arguments such as loader and pipeline. The annoying thing is that we want to provide a simple API for using multiple datasets. A common example is that the training set of COCO 2014 usually consists of two parts, 2014train and 2014valminusminival. We need to use ConcatDataset to wrap them, as in this line. The config file can then be as simple as follows (a rough sketch of such a builder is given after the config).

train_set = dict(
    type='CocoDataset',
    ann_files=['ann_file1', 'ann_file2'],
    img_prefix=['prefix1', 'prefix2'],
    # or just the same prefix
    # img_prefix='some prefix'
    pipeline=train_pipeline)
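To make that concrete, here is a rough sketch of how a builder could expand list-valued ann_files into per-file datasets and wrap them. The function name get_dataset, the reuse of the build_from_cfg helper sketched above, and the use of torch's ConcatDataset are all illustrative assumptions, not the final API:

from torch.utils.data import ConcatDataset

def get_dataset(data_cfg):
    # Build one dataset per annotation file and concatenate them, so the
    # config can simply list several ann_files (and optionally one prefix).
    cfg = dict(data_cfg)
    ann_files = cfg.pop('ann_files')
    prefixes = cfg.pop('img_prefix')
    if not isinstance(prefixes, (list, tuple)):
        prefixes = [prefixes] * len(ann_files)  # broadcast a single prefix
    datasets = [
        build_from_cfg(dict(cfg, ann_file=a, img_prefix=p), DATASETS)
        for a, p in zip(ann_files, prefixes)
    ]
    return datasets[0] if len(datasets) == 1 else ConcatDataset(datasets)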
Regarding stack=True, we cannot rely on the key name img to determine whether to stack it or not, since there may be other fields like img1 and img2. There may also be other fields that are 3-dimensional, so we cannot use the dimension to decide the stacking behavior either. Thus we just leave the argument in the config file.

We would appreciate any ideas for the above problems and further discussion. If you come across any other problems, we can also consider them before the API is finalized. We can provide an LMDBDataset just like JsonDataset and XMLDataset, and you may help with the implementation. We can try more datasets to ensure that the API is flexible and users do not need to hack the codebase to implement new datasets.
Great points, I didn't have to deal with those situations so far, so I didn't anticipate them.
What about adding a classmethod from_config that takes the Config object and deals with such special cases? This way the get_dataset function just has to call e.g. from_config(cfg.data.train), and the dataset itself can decide how to deal with these special cases. A default implementation for this function could be something like:

@classmethod
def from_config(cls, cfg):
    config_dict = dict(**cfg)
    config_dict.pop('type', None)
    return cls(**config_dict)
A user could then simply overload this function if they want to e.g. concatenate multiple datasets in certain situations.
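For example, a sketch of such an overload (assuming a torch-style ConcatDataset is in scope and that the default implementation above is being replaced) could look like:

@classmethod
def from_config(cls, cfg):
    # If several annotation files are given, build one dataset per file and
    # wrap them in a ConcatDataset; otherwise fall back to a single dataset.
    config_dict = dict(**cfg)
    config_dict.pop('type', None)
    ann_file = config_dict.pop('ann_file')
    if isinstance(ann_file, (list, tuple)):
        return ConcatDataset(
            [cls(ann_file=f, **config_dict) for f in ann_file])
    return cls(ann_file=ann_file, **config_dict)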
I would like to add here again that the train pipeline should allow loading from external user modules. The example given does not address this and assumes all augmentations are handled by callable hooks inside mmdetection. This is a serious limitation, especially for augmentations, which you fine-tune to a specific model/dataset. I would like to suggest adding an option to load modules and use them in the training dictionary, something like:
train_pipeline = [
    # 'as' is a reserved word in Python, so an alias-style key is used here
    dict(type='LoadModules', modules=[
        dict(path='/mydir/aug1', alias='aug1'),
        dict(path='/mydir/aug2', alias='aug2')
    ]),
    dict(type='LoadImage'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=False),
    dict(type='aug1.Resize', img_scale=(1333, 800), keep_ratio=True),  # NOTE THE PREFIX
    dict(type='aug2.RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='ToTensor', keys=['gt_bboxes', 'gt_labels']),
    dict(
        type='ToDataContainer',
        fields=[
            dict(key='img', stack=True),
            dict(key='gt_bboxes'),
            dict(key='gt_labels')
        ]),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]
@mosheliv Thanks for your advice. As I replied in #998, it's our goal to make mmdetection a backend that does not need to be modified when developing other methods. The preload script is necessary not only for the transforms/augmentations, but also for any other user modules like backbones, detectors, and ops. We plan to implement a preloading scheme to register all user modules before the execution of high-level scripts (train.py, test.py, etc.).
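As an illustration of what such a preloading scheme could look like (the helper function and the custom_modules config key below are assumptions, not a committed design), simply importing a user module is enough to trigger its register_module decorators:

import importlib

def preload_user_modules(module_names):
    # Importing a module runs its @DATASETS.register_module /
    # @PIPELINES.register_module decorators at import time, so the classes
    # become resolvable by name from the config afterwards.
    for name in module_names:
        importlib.import_module(name)

# e.g. in train.py/test.py, before building datasets or models:
# preload_user_modules(cfg.get('custom_modules', []))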
Thanks, I remembered your answer but wanted it to be in this thread, so it won't be forgotten.
@hellock Is there any possibility of making data_root overridable via a cmd line argument with any changes to the dataset API/config?
In order to use absolute paths for dataset locations and not have data within the PWD, one must modify/duplicate the config files. For numerous reasons I do not feel that linking datasets into the current dir is a good solution. It works for some people, but has drawbacks.
It'd be nice to be able to run python tools/train.py ${CONFIG_FILE} --data_root /abs/path/to/data/root and override the default ./data/dataset prefix. This is very similar to how the work_dir arg works, and I think it's a valid option for the same reasons you want to specify work_dir.
A change to support this with minimal impact would require:
CustomDataset.__init__ with data_root added as an arg so that data_root can be overridden:

def __init__(self,
             data_root,
             ann_file,
             img_prefix,
             img_scale,
             ...):
    # prefix of images path
    self.img_prefix = data_root + img_prefix
    ann_file = data_root + ann_file
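On the command-line side, the override could be wired up roughly as follows (a sketch only; the flag name matches the proposal above, and the cfg keys mirror the config examples rather than any finalized API):

import argparse

# Sketch: accept --data_root in tools/train.py and push it into the config,
# similar to how --work_dir already overrides cfg.work_dir.
parser = argparse.ArgumentParser(description='Train a detector')
parser.add_argument('config', help='train config file path')
parser.add_argument('--data_root', help='override the dataset root directory')
args = parser.parse_args()

# after cfg has been loaded from args.config:
# if args.data_root is not None:
#     for split in ('train', 'val', 'test'):
#         cfg.data[split]['data_root'] = args.data_root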
Additionally, we could refactor the tools directory into entrypoints for the python module. Then you could run e.g.
python -m mmdet train
from anywhere. This would also fix the current issue that the training/evaluation scripts are not installed when we pip-install mmdet from github directly without cloning it.
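A minimal sketch of such an entry point (the module path mmdet/__main__.py and the relocated script locations are assumptions about how it could be packaged):

# Hypothetical mmdet/__main__.py so that `python -m mmdet train ...` works,
# assuming the training/testing scripts are moved into the package
# (e.g. as mmdet.tools.train / mmdet.tools.test).
import sys

def main():
    command = sys.argv[1] if len(sys.argv) > 1 else ''
    sys.argv = sys.argv[:1] + sys.argv[2:]  # strip the sub-command for argparse
    if command == 'train':
        from mmdet.tools.train import main as run
    elif command == 'test':
        from mmdet.tools.test import main as run
    else:
        print('usage: python -m mmdet {train,test} [args...]')
        sys.exit(1)
    run()

if __name__ == '__main__':
    main()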