Hi,
I wrote a simple script that just reads 720p videos from disk and outputs the frames.
The batch size is 8 and the sequence length is 32 frames, and I find that it uses 8GB of device memory. That is too much: I don't have enough CUDA memory left to train the model.
I have no idea whether this is normal. What should I do to decrease the memory usage?
The version I use is 0.11, because I want to use DALI without C++14.
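# Imports assumed from the names used below (not shown in the original post).
import os
import random
import time
import torch.distributed as dist
import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline
from nvidia.dali.plugin import pytorch

class VideoReaderPipeline(Pipeline):
    def __init__(self, batch_size, sequence_length, num_threads, device_id, crop_size, files):
        super(VideoReaderPipeline, self).__init__(batch_size, num_threads, device_id=device_id, seed=random.randint(1, 100000))
        self.reader = ops.VideoReader(device="gpu", filenames=files, sequence_length=sequence_length, normalized=False,
                                      random_shuffle=False, image_type=types.RGB, dtype=types.UINT8,
                                      initial_fill=1, shard_id=device_id, num_shards=dist.get_world_size(),
                                      prefetch_queue_depth=1)
        self.crop = ops.Crop(device="gpu", crop=crop_size, output_dtype=types.FLOAT)

    def define_graph(self):
        # self.crop is defined above but not applied here; the raw frames are returned.
        input = self.reader(name="Reader")
        return input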
class DALILoader():
    def __init__(self, batch_size, sequence_length, crop_size):
        file_root = "Just a path that contains several 720p videos"
        container_files = os.listdir(file_root)
        container_files = [file_root + '/' + f for f in container_files]
        self.pipeline = VideoReaderPipeline(batch_size=batch_size, sequence_length=sequence_length, num_threads=4, device_id=0, crop_size=crop_size, files=container_files)
        self.pipeline.build()
        self.epoch_size = self.pipeline.epoch_size("Reader")
        self.dali_iterator = pytorch.DALIGenericIterator(self.pipeline, ["data"], self.epoch_size, auto_reset=True)

    def __len__(self):
        return int(self.epoch_size)

    def __iter__(self):
        return self.dali_iterator.__iter__()
def test_speed(loader):
    count = 0
    sum2 = 0
    count2 = 0
    end = time.time()
    for i, inputs in enumerate(loader):
        count = count + 1
        print("i:{}, use_time:{} ".format(i, time.time() - end))
        if (i > 4):
            sum2 = sum2 + time.time() - end
            count2 = count2 + 1
        end = time.time()
    print("average_time:{} count:{}".format(sum2 / count2, count))
batch_size = 8
frame = 32
loader = DALILoader(batch_size, frame, 224)
test_speed(loader)
If I crop the output to 224*224, it still uses a large amount of memory.
@a-sansanwal - do you have any idea why this might be?
8 * 32 * 1280 * 720 should not consume 8GB, and the other operators should not consume that much either.
@JanuszL it's related to this line, based on previous discussions with the nvcuvid team.
https://github.com/NVIDIA/DALI/blob/8c577e1cf920452ebe7e903cfd19658b22f3acae/dali/pipeline/operators/reader/nvdecoder/cuvideodecoder.cc#L232
We allocate the maximum number of decode surfaces possible in all cases, which is 20.
It's possible to optimize this and reduce memory usage. This is one of the reasons, although there might be other reasons too.
I'll discuss tomorrow and try to send a PR soon.
Thanks a lot for your reply.
There are 3 channels, and 8 * 32 * 1280 * 720 * 3 bytes is 675MB, so the PyTorch double-buffering uses 2 * 675MB.
After I set ulNumDecodeSurfaces = 10, the memory usage dropped to 6GB.
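For anyone who wants to check the arithmetic, here is a minimal sketch of the output-buffer estimate. The helper name batch_bytes is hypothetical, and it assumes 1 byte per UINT8 element, 4 bytes per FLOAT element, and that "MB" here means 1024 * 1024 bytes. It only covers the output batch itself, not the decoder's internal surfaces or any other DALI buffers.

```python
MIB = 1024 ** 2  # assumption: "MB" in this thread means 1024 * 1024 bytes


def batch_bytes(batch_size, frames, height, width, channels=3, bytes_per_element=1):
    """Size in bytes of one dense output batch of video sequences."""
    return batch_size * frames * height * width * channels * bytes_per_element


# Uncropped 720p output, UINT8 (1 byte per element), as in the pipeline above.
full = batch_bytes(8, 32, 720, 1280)

# Cropped 224x224 output, FLOAT (assumed 4 bytes per element).
cropped = batch_bytes(8, 32, 224, 224, bytes_per_element=4)

print("full batch:    {:.0f} MiB".format(full / MIB))
print("double buffer: {:.0f} MiB".format(2 * full / MIB))
print("cropped batch: {:.0f} MiB".format(cropped / MIB))
```

Both output sizes are small next to the 8GB reported, which fits the observation that cropping to 224*224 barely changes the total and that most of the memory sits elsewhere, for example in the decoder.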
https://github.com/NVIDIA/DALI/pull/1643 should resolve your problem. If it still doesn't work please reopen.