DALI: VideoReader uses too much device memory. What should I do?

Created on 14 Oct 2019 · 5 Comments · Source: NVIDIA/DALI

Hi,

I wrote a simple script that just reads a 720p video from disk and outputs it.
The batch size is 8 and the sequence length is 32 frames. I find that it uses 8GB of device memory, which is so much that I don't have enough CUDA memory left to train a model.
I have no idea whether this is normal. What should I do to decrease the memory usage?
I am using version 0.11 because I want to use DALI without C++14.

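import os
import random
import time

import torch.distributed as dist  # assumed source of `dist`; the import is not shown in the original post
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.plugin import pytorch

class VideoReaderPipeline(Pipeline):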
    def __init__(self, batch_size, sequence_length, num_threads, device_id, crop_size, files):
        super(VideoReaderPipeline, self).__init__(batch_size, num_threads, device_id = device_id, seed=random.randint(1, 100000))
        self.reader = ops.VideoReader(device="gpu", filenames=files, sequence_length=sequence_length, normalized=False,
                                                             random_shuffle=False, image_type=types.RGB, dtype=types.UINT8,
                                                             initial_fill=1, shard_id=device_id, num_shards=dist.get_world_size(),
                                                             prefetch_queue_depth = 1)

        # Note: this crop operator is created but never applied in define_graph below.
        self.crop = ops.Crop(device="gpu", crop=crop_size, output_dtype=types.FLOAT)

    def define_graph(self):
        input = self.reader(name="Reader")
        return input

class DALILoader():
    def __init__(self, batch_size, sequence_length, crop_size):
        file_root = "path to a directory containing several 720p videos"
        container_files = os.listdir(file_root)
        container_files = [file_root + '/' + f for f in container_files]
        self.pipeline = VideoReaderPipeline(batch_size=batch_size, sequence_length=sequence_length, num_threads=4, device_id=0, crop_size=crop_size, files = container_files)
        self.pipeline.build()
        self.epoch_size = self.pipeline.epoch_size("Reader")
        self.dali_iterator = pytorch.DALIGenericIterator(self.pipeline, ["data"], self.epoch_size, auto_reset=True)
    def __len__(self):
        return int(self.epoch_size)
    def __iter__(self):
        return self.dali_iterator.__iter__()

def test_speed(loader):
    count = 0
    sum2 = 0
    count2 = 0
    end = time.time()
    for i, inputs in enumerate(loader):
        count = count + 1
        print("i:{}, use_time:{} ".format(i, time.time() - end))
        if (i > 4):
            sum2 = sum2 + time.time() - end
            count2 = count2 + 1
        end = time.time()
    print("average_time:{}  count:{}".format(sum2 / count2, count))

batch_size = 8
frame = 32
loader = DALILoader(batch_size, frame, 224)
test_speed(loader)

question

kfhe00

Most helpful comment

@JanuszL it's related to this line, based on previous discussions with the nvcuvid team.
https://github.com/NVIDIA/DALI/blob/8c577e1cf920452ebe7e903cfd19658b22f3acae/dali/pipeline/operators/reader/nvdecoder/cuvideodecoder.cc#L232
We allocate the maximum number of decode surfaces possible in all cases, which is 20.
It's possible to optimize this and reduce memory usage. This is one of the reasons, although there might be other reasons too.
I'll discuss tomorrow and try to send a PR soon.

a-sansanwal on 14 Oct 2019

👍2

All 5 comments

Even if I crop the output images to 224*224, it still uses a large amount of memory.

kfhe00 on 14 Oct 2019

@a-sansanwal - do you have any idea why it may be like that?
8 * 32 * 1280 * 720 should not consume 8GB, and the other operators should not consume that much either.

JanuszL on 14 Oct 2019

a-sansanwal on 14 Oct 2019

👍2

Thanks a lot for your reply.
There are 3 channels, and 8 * 32 * 1280 * 720 * 3 bytes is 675MB, so the PyTorch double-buffering uses 2*675MB.
After I set ulNumDecodeSurfaces = 10, the memory usage dropped to 6GB.
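
The arithmetic above can be checked with a short script. This is only a rough sketch: the helper names `batch_bytes` and `linear_fit` are made up for illustration, the 8GB and 6GB figures are the rounded observations reported in this thread, and the linear relationship between memory usage and ulNumDecodeSurfaces is an assumption, not something confirmed by DALI or nvcuvid.

```python
# Back-of-the-envelope estimate using only the numbers quoted in this thread.
MIB = 1024 ** 2

def batch_bytes(batch_size, sequence_length, width, height, channels, bytes_per_value=1):
    """Size in bytes of one decoded batch; UINT8 output means 1 byte per value."""
    return batch_size * sequence_length * width * height * channels * bytes_per_value

one_batch = batch_bytes(8, 32, 1280, 720, 3)
double_buffered = 2 * one_batch

print(one_batch / MIB)        # 675.0
print(double_buffered / MIB)  # 1350.0

def linear_fit(surfaces_a, gb_a, surfaces_b, gb_b):
    """ASSUMPTION: memory = fixed + per_surface * number_of_decode_surfaces."""
    per_surface = (gb_a - gb_b) / (surfaces_a - surfaces_b)
    fixed = gb_a - per_surface * surfaces_a
    return per_surface, fixed

# Reported observations: 20 surfaces -> about 8GB, 10 surfaces -> about 6GB.
per_surface_gb, fixed_gb = linear_fit(20, 8.0, 10, 6.0)
print(per_surface_gb, fixed_gb)
```

If the linear assumption holds, each decode surface setting accounts for roughly 0.2GB in total and about 4GB does not depend on the surface count, of which the double-buffered output explains only about 1.35GB. Since the inputs are rounded, these numbers indicate scale and nothing more.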

kfhe00 on 15 Oct 2019

https://github.com/NVIDIA/DALI/pull/1643 should resolve your problem. If it still doesn't work please reopen.

JanuszL on 21 Jan 2020
