Dali: Accuracy decreasing when training ImageNet if using DALI

Created on 27 Nov 2019 · 7 Comments · Source: NVIDIA/DALI

Thanks for your wonderful work! But when I train ShuffleNetV2 1.0x, the accuracy decreases if I use DALI. Here are the details:
Accuracy in paper: 69.40%
Accuracy if using DALI: 68.29%
Accuracy without DALI: 68.86% (This is because some images in my ImageNet are broken)

Data augmentation code (using DALI):

class HybridTrainPipe(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, data_dir, crop, dali_cpu=False, local_rank=4, world_size=1):
        super(HybridTrainPipe, self).__init__(batch_size, num_threads, device_id, seed=12)
        dali_device = "gpu"
        self.input = ops.FileReader(file_root=data_dir, shard_id=device_id, num_shards=world_size,
                                    shuffle_after_epoch=True)
        self.decode = ops.ImageDecoder(device="mixed", output_type=types.RGB)
        self.res = ops.RandomResizedCrop(device="gpu", size=crop, random_area=[0.08, 1])
        self.cmnp = ops.CropMirrorNormalize(device="gpu",
                        crop=(224, 224),
                        output_dtype=types.FLOAT,
                        output_layout=types.NCHW,
                        image_type=types.RGB,
                        # mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                        # std=[0.229 * 255, 0.224 * 255, 0.225 * 255]
                        )
        self.color = ops.ColorTwist(device='gpu', brightness=uniform(0.6, 1.4))
        self.contrast = ops.Contrast(device='gpu', contrast=uniform(0.6, 1.4))
        self.saturation = ops.Saturation(device='gpu', saturation=uniform(0.6, 1.4))
        self.coin = ops.CoinFlip(probability=0.5)
        print('DALI "{0}" variant'.format(dali_device))

class HybridValPipe(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, data_dir, crop, size, local_rank=0, world_size=1):
        super(HybridValPipe, self).__init__(batch_size, num_threads, device_id, seed=12 + device_id,prefetch_queue_depth=1)
        self.input = ops.FileReader(file_root=data_dir, shard_id=device_id, num_shards=world_size,random_shuffle=False)
        self.decode = ops.ImageDecoder(device="mixed", output_type=types.RGB)
        self.res = ops.Resize(device="gpu", resize_shorter=size, interp_type=types.INTERP_TRIANGULAR)
        self.cmnp = ops.CropMirrorNormalize(device="gpu",output_dtype=types.FLOAT,output_layout=types.NCHW,crop=(crop, crop),image_type=types.RGB)
    def define_graph(self):
        self.jpegs, self.labels = self.input(name="Reader")
        images = self.decode(self.jpegs)
        images = self.res(images)
        output = self.cmnp(images)
        return [output, self.labels]

Data augmentation code (without DALI):

class OpencvResize(object):
    def __init__(self, size=256):
        self.size = size
    def __call__(self, img):
        assert isinstance(img, PIL.Image.Image)
        img = np.asarray(img) # (H,W,3) RGB
        img = img[:,:,::-1] # 2 BGR
        img = np.ascontiguousarray(img)
        H, W, _ = img.shape
        target_size = (int(self.size/H * W + 0.5), self.size) if H < W else (self.size, int(self.size/W * H + 0.5))
        img = cv2.resize(img, target_size, interpolation=cv2.INTER_LINEAR)
        img = img[:,:,::-1] # 2 RGB
        img = np.ascontiguousarray(img)
        img = Image.fromarray(img)
        return img
class ToBGRTensor(object):
    def __call__(self, img):
        assert isinstance(img, (np.ndarray, PIL.Image.Image))
        if isinstance(img, PIL.Image.Image):
            img = np.asarray(img)
        img = img[:,:,::-1] # 2 BGR
        img = np.transpose(img, [2, 0, 1]) # 2 (3, H, W)
        img = np.ascontiguousarray(img)
        img = torch.from_numpy(img).float()
        return img

    assert os.path.exists(args.train_dir)
    train_dataset = datasets.ImageFolder(
        args.train_dir,
        transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
            transforms.RandomHorizontalFlip(0.5),
            ToBGRTensor(),
        ])
    )
    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=args.batch_size, shuffle=True,
        num_workers=1, pin_memory=use_gpu)
    train_dataprovider = DataIterator(train_loader)

    assert os.path.exists(args.val_dir)
    val_loader = torch.utils.data.DataLoader(
        datasets.ImageFolder(args.val_dir, transforms.Compose([
            OpencvResize(256),
            transforms.CenterCrop(224),
            ToBGRTensor(),
        ])),
        batch_size=200, shuffle=False,
        num_workers=4, pin_memory=use_gpu
    )
    val_dataprovider = DataIterator(val_loader)

I think these two data augmentation pipelines are the same, but they give different accuracy.

Another problem is that if I use DALI, CUDA_VISIBLE_DEVICES loses its effect. For example, if I set os.environ["CUDA_VISIBLE_DEVICES"] = "3,4,5,6,7" and model = nn.DataParallel(model,device_ids='1,2,3'), the model still trains on GPUs 1, 2, 3 instead of GPUs 4, 5, 6. Can you help solve these two problems?

question

Hou9612
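
For reference, the GPU numbering the question expects can be sketched in a few lines. When CUDA_VISIBLE_DEVICES is honored, the process sees only the listed GPUs, renumbered from 0 in the listed order, so logical ids 1, 2, 3 would correspond to physical GPUs 4, 5, 6. The variable has to be set before CUDA is initialized in the process. Whether a late assignment is what happened here is not confirmed in this thread. The helper name `physical_gpu` is hypothetical and only models the mapping; it does not query any driver.

```python
def physical_gpu(visible_devices, logical_id):
    """Map a logical CUDA device index to the physical GPU id
    listed at that position in a CUDA_VISIBLE_DEVICES string."""
    physical_ids = [int(tok) for tok in visible_devices.split(",") if tok.strip()]
    return physical_ids[logical_id]


# Value taken from the question; logical ids are what device_ids refers to.
visible = "3,4,5,6,7"
mapping = [physical_gpu(visible, i) for i in (1, 2, 3)]
print(mapping)  # [4, 5, 6]
```

If training lands on physical GPUs 1, 2, 3 instead, the process is most likely seeing all GPUs, which suggests the environment variable was not in effect when CUDA started.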

All 7 comments

Hi,

Regarding "CUDA_VISIBLE_DEVICES", it is rather a question for the PyTorch community and not related to DALI - I see some thread about it here.
Regarding the data pipeline, I can share a couple of observations:

* Your original pipeline uses `transforms.RandomResizedCrop` for training but `OpencvResize` for validation. I would strongly recommend using `transforms.Resize` for validation. The problem might be that OpenCV and torchvision may have a different definition of the pixel center and, most importantly, torchvision uses a triangular window for interpolation when it scales down (even if you ask for bilinear), while OpenCV uses plain bilinear. As a result, different resize methods are applied during training and validation, and the network is sensitive to the anomalies that come from resize artifacts (like aliasing).

* When you use DALI, you use INTERP_TRIANGULAR in the validation pipeline (which resembles the default behavior of `transforms.Resize`), while for training you use the default value, which is INTERP_LINEAR.

* Small nitpick - ColorTwist can cover saturation, contrast, and brightness in one go, so there is no need for separate operators for saturation and contrast augmentation in this case.

* A similar question was raised in https://github.com/NVIDIA/DALI/issues/400 - you can see the discussion there.

JanuszL on 28 Nov 2019
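
The difference between the two resize methods mentioned above can be illustrated with a toy 1D sketch. It is not the actual OpenCV, torchvision, or DALI implementation; the function names `resize_linear` and `resize_triangular` are hypothetical, and half-pixel sample centers are assumed. The first function always blends the two nearest source samples, while the second uses a triangular window whose width grows with the downscale factor, which is the behavior the answer attributes to torchvision when scaling down.

```python
import math


def resize_linear(signal, out_len):
    """Plain 2-tap linear interpolation: no antialiasing when shrinking."""
    scale = len(signal) / out_len
    out = []
    for i in range(out_len):
        # Source coordinate of output sample i (half-pixel centers), clamped.
        x = min(max((i + 0.5) * scale - 0.5, 0.0), len(signal) - 1.0)
        lo = int(math.floor(x))
        hi = min(lo + 1, len(signal) - 1)
        t = x - lo
        out.append((1.0 - t) * signal[lo] + t * signal[hi])
    return out


def resize_triangular(signal, out_len):
    """Triangular window whose radius widens with the downscale factor."""
    scale = len(signal) / out_len
    radius = max(scale, 1.0)
    out = []
    for i in range(out_len):
        center = (i + 0.5) * scale  # window center in source coordinates
        acc, wsum = 0.0, 0.0
        for j, value in enumerate(signal):
            d = abs((j + 0.5) - center) / radius
            if d < 1.0:
                acc += (1.0 - d) * value
                wsum += 1.0 - d
        out.append(acc / wsum)
    return out


# One-pixel stripes: detail too fine to survive a 3x downscale.
stripes = [float(j % 2) for j in range(15)]
linear = resize_linear(stripes, 5)
triangular = resize_triangular(stripes, 5)
print(linear)      # [1.0, 0.0, 1.0, 0.0, 1.0]
print(triangular)  # every value close to 0.5
```

The 2-tap version keeps full-contrast stripes that are pure aliasing, while the triangular window averages them to mid-gray. A network trained on one kind of output and validated on the other sees systematically different image statistics.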

Thanks for your timely reply! I will follow your suggestions and train my network again right now.

Hou9612 on 28 Nov 2019

Don't worry. In my case, the results of both ShuffleNet V1 and V2 with DALI are higher than in the original paper. The reason is that OpencvResize is used only on the val set, so it differs from the resize used on the train set.

PoonKinWang on 28 Nov 2019

👍2

Yeah, now my accuracy is 69.20%, higher than the official code.

Hou9612 on 11 Dec 2019

I guess the main reason is that the color jitter was not random. I referred to https://github.com/NVIDIA/DALI/issues/336 and edited my code. Of course, the different resize between the train and val datasets is also a reason.

Hou9612 on 11 Dec 2019
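
The "not random" color jitter can be sketched without DALI. The sketch assumes that `uniform` in the original pipeline is Python's `random.uniform`, which the thread does not confirm. Under that assumption, `brightness=uniform(0.6, 1.4)` draws a single number when the operator is constructed and reuses it for every image, whereas torchvision's `ColorJitter(brightness=0.4)` draws a new factor from [0.6, 1.4] on each call. The class names `FixedJitter` and `PerSampleJitter` are hypothetical.

```python
import random


class FixedJitter:
    """Mimics passing uniform(0.6, 1.4) as a constructor argument:
    the factor is drawn once and applied to every sample."""

    def __init__(self, rng):
        self.factor = rng.uniform(0.6, 1.4)

    def __call__(self, pixel):
        return pixel * self.factor


class PerSampleJitter:
    """Draws a fresh factor for every sample, as a random augmentation should."""

    def __init__(self, rng, low=0.6, high=1.4):
        self.rng, self.low, self.high = rng, low, high

    def __call__(self, pixel):
        return pixel * self.rng.uniform(self.low, self.high)


rng = random.Random(12)
fixed = FixedJitter(rng)
per_sample = PerSampleJitter(rng)

fixed_out = [fixed(100.0) for _ in range(5)]
random_out = [per_sample(100.0) for _ in range(5)]
print(len(set(fixed_out)))  # 1: the same brightness factor every time
print(random_out)           # values vary from sample to sample
```

In DALI the per-sample behavior is usually obtained by feeding the output of a random number operator (such as `ops.Uniform`, analogous to the `ops.CoinFlip` already present in the pipeline) into the augmentation operator inside `define_graph`, rather than passing a Python float to the constructor.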

Could you post your data augmentation code right here? Thanks in advance.

PoonKinWang on 22 Jan 2020

I will post it after I come back to school. But it will be a few days, maybe 15 days or more.

Hou9612 on 27 Jan 2020
