Dali: VideoReader and PyTorch iterator speed issue

Created on 14 Nov 2019 · 25 Comments · Source: NVIDIA/DALI

I am trying to use DALI to speed up the preprocessing part of a PyTorch object detection program. My program keeps receiving short videos (each around 1 min long) and runs inference on every frame. The DALI pipeline is quite simple: it consists of a VideoReader only. I then wrap the pipeline with the PyTorch generic iterator, following the tutorial. I found that initializing and building the pipeline took only 0.1ms and 0.1s respectively, while initializing DALIGenericIterator took around 3s. This could slow down my program because it creates a new pipeline periodically (whenever it receives a new video).

Another thing I would like to ask about is the batch_size and sequence length of VideoReader. For my use case, the program needs to read batches of frames from each mp4 file and then pass the tensor to the model. Say the inference batch size is 25, so each tensor returned by the iterator should contain 25 frames. From my understanding, setting batch_size to 1 and sequence length to 25 should yield the same set of frames as setting batch_size to 25 and sequence length to 1. Yet during my testing I found that the former setting is much faster than the latter (1.9s vs 48s when reading a 10s long 1080p 25fps h264 video). Why is there such a huge speed difference between the two settings? What should I consider when choosing batch_size and sequence length?

Here is part of the code:

import sys
import os
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.plugin.pytorch import DALIGenericIterator

video_files = [
    '../10sec.mp4',
]

class PreprocessPipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, data, sequence_length):
        super().__init__(batch_size, num_threads, device_id)
        self.input = ops.VideoReader(device="gpu", filenames=data, sequence_length=sequence_length,
                                            shard_id=0, num_shards=1,
                                            random_shuffle=False)

    def define_graph(self):
        raw_input = self.input(name="Reader")
        return raw_input

pipe = PreprocessPipeline(1, 1, 0, video_files, 25)
pipe.build()
dali_iter = DALIGenericIterator([pipe], ['frame'], pipe.epoch_size("Reader"), fill_last_batch=False)

for i, data in enumerate(dali_iter):
    for d in data:
        frames = d['frame']

bug question

conraddd

Most helpful comment

@conraddd - one more thing that I have missed earlier. When you issue:

out, _ = model(frames)

It is executed asynchronously, and the actual synchronization happens inside DALI https://github.com/NVIDIA/DALI/blob/master/dali/python/nvidia/dali/plugin/pytorch.py#L55. So you are timing the model execution as well, not just the loading time.
To make the measurement meaningful, try:

with torch.no_grad():
    t_start = perf_counter()
    start = perf_counter()
    for i, data in enumerate(dali_iter):
        decode_time.append(perf_counter() - start)
        print(perf_counter() - start)
        for d in data:
            frames = d['frame']
            frame_index += 1
            if DO_DETECTION:
                frames = torch.nn.functional.interpolate(frames, size=(384,384,3), mode='nearest')
                frames = torch.squeeze(frames)
                frames = frames.permute(0, 3, 1, 2)
                frames /= 255.0
                out, _ = model(frames)
                torch.cuda.current_stream().synchronize()
        start = perf_counter()
    torch.cuda.synchronize()
    t_end = perf_counter()

Then you can be sure that the output of the model has already been computed.
With that code, in my case decoding overlaps with the inference and the total decode time is an order of magnitude lower than with standalone decoding.
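The measurement pitfall described above can be reproduced without a GPU. The following is a minimal sketch, assuming a single-worker thread pool as a stand-in for an asynchronous CUDA stream; `timed`, `launch` and `wait_all` are hypothetical helpers written for this illustration and are not part of DALI or PyTorch.

```python
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager
from time import perf_counter, sleep


@contextmanager
def timed(totals, label, sync=None):
    """Accumulate wall-clock time for a block.

    If `sync` is given, it is called before the clock is started and again
    before it is stopped, so queued asynchronous work is included.
    """
    if sync is not None:
        sync()  # drain work that was queued before this block
    start = perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()  # wait for work launched inside this block
        totals[label] = totals.get(label, 0.0) + (perf_counter() - start)


# Stand-in for an asynchronous device queue (this is NOT CUDA).
pool = ThreadPoolExecutor(max_workers=1)
pending = []


def launch(seconds):
    """Queue some work and return immediately, like an async kernel launch."""
    pending.append(pool.submit(sleep, seconds))


def wait_all():
    """Block until all queued work has finished."""
    while pending:
        pending.pop().result()


totals = {}
with timed(totals, "without_sync"):
    launch(0.3)  # only the launch itself is timed
wait_all()
with timed(totals, "with_sync", sync=wait_all):
    launch(0.3)  # launch and completion are both timed
pool.shutdown()
print(totals)
```

For real GPU work, passing `sync=torch.cuda.synchronize` to the same helper should have the equivalent effect, at the cost of preventing overlap between decoding and inference while measuring.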

JanuszL on 25 Nov 2019

👍2

All 25 comments

Hi,
Regarding the first question: the pipeline build and initialization are expected to be fast, as they mostly do allocation and setup. But when you create the DALIGenericIterator, the first batch of data is prepared, so the pipeline is effectively run once during construction - https://github.com/NVIDIA/DALI/blob/master/dali/python/nvidia/dali/plugin/pytorch.py#L147.
Regarding the second question, 1x25 vs 25x1: it is expected that the first one is faster. The reason is that producing 1 sequence of 25 frames is a single decoder call that decodes everything at once, while producing 25 sequences of 1 frame takes 25 separate calls to the decoder.
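The difference can be made concrete with a little arithmetic. This is a rough sketch that takes the statement above at face value (one decoder request per sequence); `decode_plan` is a hypothetical helper, not a DALI function, and it ignores any partial last batch.

```python
def decode_plan(total_frames, batch_size, sequence_length):
    """Count batches and per-sequence decoder requests for one pass over a video."""
    frames_per_batch = batch_size * sequence_length
    batches = total_frames // frames_per_batch
    return {
        "batches": batches,
        # Leading dimensions of each returned tensor: (batch, sequence).
        "leading_shape": (batch_size, sequence_length),
        # Assumption: one decoder request per sequence in the batch.
        "sequence_requests": batches * batch_size,
    }


total_frames = 10 * 25  # the 10s, 25fps clip from the question

one_by_25 = decode_plan(total_frames, batch_size=1, sequence_length=25)
twenty_five_by_1 = decode_plan(total_frames, batch_size=25, sequence_length=1)

print(one_by_25)
print(twenty_five_by_1)
```

Both settings deliver 25 frames per iteration over 10 iterations, but under this assumption the second one issues 25 times as many decoder requests.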

JanuszL on 14 Nov 2019

@JanuszL
Thanks for your fast response.
For Q1, is there something wrong with the creation of DALIGenericIterator? It took 3s, which is even longer than reading one epoch (all frames) of the 10s h264 video.

conraddd on 14 Nov 2019

@conraddd - can you check how long just pipe.run() takes if you call it instead of dali_iter = DALIGenericIterator([pipe], ['frame'], pipe.epoch_size("Reader"), fill_last_batch=False).
The first iteration can be slower because GPU memory gets allocated, which can take some time.

JanuszL on 14 Nov 2019

@JanuszL
it took 0.2s to call

pipe_out = pipe.run()

instead of

dali_iter = DALIGenericIterator([pipe], ['frame'], pipe.epoch_size("Reader"), fill_last_batch=False)

conraddd on 14 Nov 2019

Strange. Will check it with https://github.com/NVIDIA/DALI_extra/blob/master/db/video/sintel/sintel_trailer-720p.mp4 and get back to you with the result.

JanuszL on 14 Nov 2019

I think I found the reason. Profiling with line_profiler showed that this line https://github.com/NVIDIA/DALI/blob/master/dali/python/nvidia/dali/plugin/pytorch.py#L195 accounted for most of the 3 seconds.
After reading some posts, I believe this long loading time is due to the fact that the first call to CUDA is usually slow in PyTorch. I confirmed this by calling the DALIGenericIterator constructor multiple times: only the first call took 3s.
I think this delay in the first initialization is acceptable.
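The check described above (call the same thing several times and compare the first call with the rest) can be sketched generically. `time_calls` and `lazy_init_call` are hypothetical names; the sleep merely simulates a one-time initialization cost and is not a measurement of CUDA or DALI.

```python
from time import perf_counter, sleep


def time_calls(fn, repeats):
    """Call fn() `repeats` times and return the duration of each call."""
    durations = []
    for _ in range(repeats):
        start = perf_counter()
        fn()
        durations.append(perf_counter() - start)
    return durations


_state = {"initialized": False}


def lazy_init_call():
    """Stand-in for an API whose first call pays a one-time setup cost."""
    if not _state["initialized"]:
        sleep(0.2)  # simulated one-time initialization
        _state["initialized"] = True


durations = time_calls(lazy_init_call, 5)
first, rest = durations[0], durations[1:]
print(f"first call: {first:.3f}s, slowest later call: {max(rest):.6f}s")
```

If only the first duration stands out, the cost is a one-time warm-up rather than a per-call overhead, and it can be paid once at program start instead of on every new video.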

conraddd on 14 Nov 2019

👍1

After solving the above issues, I found that the video decoding is a bit slow compared to using bare OpenCV. Decoding a 1 min 1080p h264 video using VideoReader took 13s (batch_size=1, sequence length = 25), while OpenCV took around 4s. I think this should not happen given that DALI is GPU-accelerated. I had already removed the DALIGenericIterator wrapper when measuring the time. I found that decoding gets faster when I increase the sequence length, but device memory usage is too high for me to raise it: sequence length 25 already occupies 2.3GB of device memory.
The memory issue was mentioned in this issue:
https://github.com/NVIDIA/DALI/issues/1372#issue-506498956

conraddd on 15 Nov 2019

@a-sansanwal can you look into this?

JanuszL on 15 Nov 2019

@a-sansanwal I'll take a look
@conraddd can you post the video you used?

a-sansanwal on 15 Nov 2019

👍1

@a-sansanwal
https://drive.google.com/open?id=1QXVOcEthJKVT3W3GZEKs5Stc5NlO_-Bp
I tried different videos and the same problem occurred.

My testing environment:
CPU: Ryzen 3600
GPU: RTX 2070
RAM: 32GB DDR4
OS: ubuntu 18.04
Python version: 3.7
CUDA version: 10.1
Dali version: 0.15.0

conraddd on 15 Nov 2019

Hi @conraddd

The video you posted has 1500 frames. On my PC, I see around 100 fps using DALI, without re-encoding the file to help DALI. That seems close to the values you reported.

Using OpenCV, I was able to decode the entire video in a minute, which is around 25 fps.
Can you please also post the OpenCV script that you used to decode the video?
I used the following script.

import cv2 as cv
cap = cv.VideoCapture('test_video.mp4')
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        print("Can't receive frame (stream end?). Exiting ...")
        break
    gray = cv.cvtColor(frame, cv.COLOR_BGR2GRAY)
cap.release()

Also, after I re-encoded the video you posted with the following command to reduce the GOP length, I was able to decode at around 500 fps.
ffmpeg -i test_video.mp4 -map v:0 -c:v libx264 -crf 18 -pix_fmt yuv420p -g 5 -profile:v high output.mp4
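When videos arrive continuously, the same re-encode can be scripted. Below is a small sketch that only builds the argument list for the command above; `build_reencode_cmd` is a hypothetical helper, and the flags are copied from the command in this comment rather than tuned in any way.

```python
import shlex


def build_reencode_cmd(src, dst, gop_length, crf=18):
    """Return the ffmpeg argument list for an H.264 re-encode with a fixed -g value."""
    return [
        "ffmpeg", "-i", src,
        "-map", "v:0",            # first video stream only
        "-c:v", "libx264",
        "-crf", str(crf),
        "-pix_fmt", "yuv420p",
        "-g", str(gop_length),    # maximum keyframe interval (GOP length)
        "-profile:v", "high",
        dst,
    ]


cmd = build_reencode_cmd("test_video.mp4", "output.mp4", gop_length=5)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` to actually run the re-encode, provided ffmpeg is installed and on the PATH.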

a-sansanwal on 18 Nov 2019

Hi @a-sansanwal, here is the OpenCV code. It took 4.06s to decode the video.

import cv2
from time import perf_counter


cap = cv2.VideoCapture('test_video.mp4')

frame_index = 0

start = perf_counter()
while cap.isOpened():
    ret, frame = cap.read()
    if ret:
        shape = frame.shape
        print(frame_index, shape)
        frame_index += 1
    else:
        break
print(perf_counter() - start)
cap.release()

Remark: Decoding is done in cap.read() according to OpenCV doc https://docs.opencv.org/2.4/modules/highgui/doc/reading_and_writing_images_and_video.html#videocapture-read

conraddd on 18 Nov 2019

@conraddd
On a different machine, I was able to replicate your numbers.

DALI works differently in order to provide a richer feature set.
For example, the OpenCV script you used just decodes frames from start to end.
DALI breaks the video down into sequences based on the sequence_length, stride, step, random_shuffle, and other parameters provided. These chunks are then decoded to produce batches.
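To illustrate what "breaking the video into sequences" can look like, here is a simplified model in plain Python. It is not DALI's implementation: `enumerate_sequences` is a hypothetical function, and it assumes that `stride` is the distance between frames inside a sequence, that `step` is the distance between the first frames of consecutive sequences, and that `step` defaults to `sequence_length`.

```python
def enumerate_sequences(frame_count, sequence_length, stride=1, step=None):
    """Return one list of frame indices per sequence (simplified model)."""
    if step is None:
        step = sequence_length  # assumption: back-to-back sequences by default
    # Number of source frames spanned by one sequence, gaps included.
    span = (sequence_length - 1) * stride + 1
    return [
        [start + i * stride for i in range(sequence_length)]
        for start in range(0, frame_count - span + 1, step)
    ]


# The 1500-frame video from this thread, read as sequences of 25 frames.
sequences = enumerate_sequences(1500, sequence_length=25)
print(len(sequences), sequences[0][:3], sequences[-1][-1])

# A strided example: every second frame, three frames per sequence.
print(enumerate_sequences(10, sequence_length=3, stride=2))
```

Each of these index lists is an independent decode request, which is why a sequence may begin at a frame that is not a keyframe.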

Supporting the many features that VideoReader provides leads to some inefficiencies: for example, we might have to start decoding anywhere in the middle of a stream from any of the hundreds of files that might be passed to it. The sequences do not necessarily need to be in order, be related to the previous sequence, or start at a keyframe.

As I mentioned above, despite all of this DALI can still outperform CPU decoding if the video is re-encoded with a smaller GOP size.

With some work it should also be possible to hint VideoReader to reduce features in exchange for more performance.

a-sansanwal on 18 Nov 2019

👍1

@conraddd The video you posted has this GOP structure:

GOP: IBBBPBBBPBBBPBBBPBBBPBP 23 CLOSED
GOP: iBBBPBPBBPBPBPPPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPPBBBPBBBPBBBPBBBPBPPPPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBPBBBPBBBPBPBBBPPBBBPBBBPBPBBBPBBPBBBPBBBPBBBPBP 227 OPEN
GOP: IBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPP 250 CLOSED
GOP: IBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPP 250 CLOSED
GOP: IBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPP 250 CLOSED
GOP: IBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPP 250 CLOSED
GOP: IBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPBBBPP 250 CLOSED

There are 250 frames between two keyframes, so even if we only need a frame somewhere in between, we must start decoding from the previous keyframe.

Re-encoding the file with
ffmpeg -i test_video.mp4 -map v:0 -c:v libx264 -crf 18 -pix_fmt yuv420p -g 5 -profile:v high output.mp4
introduces more keyframes, so DALI has to decode fewer frames as a result.
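The cost of long GOPs can be estimated with a back-of-the-envelope model. This sketch assumes a fixed GOP length, back-to-back sequences, and that every sequence is decoded independently starting from the keyframe at or before its first frame; it ignores B-frame reordering and open GOPs, so it is an upper-bound illustration and not a description of what DALI actually does internally. `frames_decoded` is a hypothetical helper.

```python
def frames_decoded(frame_count, gop_length, sequence_length):
    """Total frames the decoder touches if each sequence restarts at a keyframe."""
    total = 0
    for start in range(0, frame_count - sequence_length + 1, sequence_length):
        # Nearest keyframe at or before the first frame of the sequence.
        keyframe = (start // gop_length) * gop_length
        # Decode from that keyframe through the last frame of the sequence.
        total += start + sequence_length - keyframe
    return total


needed = 1500  # frames in the posted video
long_gop = frames_decoded(needed, gop_length=250, sequence_length=25)
short_gop = frames_decoded(needed, gop_length=5, sequence_length=25)

print(long_gop, long_gop / needed)    # 8250 5.5
print(short_gop, short_gop / needed)  # 1500 1.0
```

Under these assumptions, a 250-frame GOP makes the decoder touch 5.5 times as many frames as are actually returned, while a GOP of 5 removes the waste entirely.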

a-sansanwal on 18 Nov 2019

@a-sansanwal
Thanks for the great explanation. I have a few more questions regarding the performance issue.

1) Could you explain this point in more detail?

With some work it should also be possible to hint VideoReader to reduce features in exchange for more performance.

2) Is it true that we should set the GOP size equal to the sequence length to obtain the best decoding performance? My testing suggests it is.

3) Is there any way to limit the GPU memory used by VideoReader? Setting the sequence length to 5 already uses 1455MiB of GPU memory, which makes it difficult for me to raise the sequence length (so that I could use a larger GOP size for better video compression). Each sequence returned by VideoReader should occupy roughly 5 * 1080 * 1920 * 3 / 1024 / 1024 = 29.6MiB?

4) After changing the GOP size (5) and sequence length (5), decoding with the DALI VideoReader is faster than with OpenCV. However, when I put the PyTorch inference part (yolov3) into the loop, VideoReader decoding slowed down a lot (from 3.6s to 14.4s, excluding inference time), which makes the benefit of DALI preprocessing unnoticeable. At that point GPU usage was at 100%, but I did not expect the decoding speed to drop that much.

conraddd on 20 Nov 2019

@conraddd

Could you explain this point in more detail?
With some work it should also be possible to hint VideoReader to reduce features in exchange for more performance.

The idea I had was that if all the user wants is to decode all frames from start to end, then a lot of seeking and wasteful decoding can be avoided. That would require many changes in VideoReader.

Is it true that we should set the GOP size equal to the sequence length to obtain the best decoding performance? My testing suggests it is.

Yes. When the first frame in the sequence is a keyframe, DALI does not decode any extra frames that it does not need. It is best to re-encode the video with a fixed GOP length and use a sequence length equal to, or a multiple of, the GOP length; that way each sequence starts at a keyframe.

Is there any way to limit the GPU memory used by VideoReader?

Try setting additional_decode_surfaces to 1 or 0.
Also, different codecs have different memory usage. I haven't experimented enough to tell you which takes the least memory.

Each sequence returned by VideoReader should occupy roughly 5 * 1080 * 1920 * 3 / 1024 / 1024 = 29.6MiB

Video memory is also used to allocate the NVDEC decode surfaces, which is another factor in memory consumption.
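For reference, the size of the output tensor alone can be computed as follows. This is only the arithmetic from the question turned into a function; `sequence_mib` is a hypothetical helper, and the float32 case is included because dtype=types.FLOAT is used in a script later in this thread.

```python
def sequence_mib(frames, height, width, channels=3, bytes_per_element=1):
    """Size in MiB of one decoded sequence stored as a dense tensor."""
    total_bytes = frames * height * width * channels * bytes_per_element
    return total_bytes / (1024 * 1024)


uint8_mib = sequence_mib(5, 1080, 1920)                         # 8-bit output
float32_mib = sequence_mib(5, 1080, 1920, bytes_per_element=4)  # float32 output

print(f"uint8:   {uint8_mib:.2f} MiB")
print(f"float32: {float32_mib:.2f} MiB")
```

Either way, the output buffer is a small fraction of the 1455MiB reported, which is consistent with most of the memory going to other allocations such as the decode surfaces mentioned above.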

However, when I put the PyTorch inference part (yolov3) into the loop, VideoReader decoding slowed down a lot (from 3.6s to 14.4s, excluding inference time), which makes the benefit of DALI preprocessing unnoticeable.

Not sure, please post the code snippet that you're running.

a-sansanwal on 20 Nov 2019

@a-sansanwal
Here are the code and procedures to run the test.

cd ~/YOUR_WORKING_DIR
git clone https://github.com/ultralytics/yolov3
cd yolov3/

# Please download the new test_video.mp4 (GOP size 5) and test_dali.py from Google Drive and put them in the project root directory

pip3 install --user -r requirements.txt
bash weights/download_yolov3_weights.sh
python3 test_dali.py

test_video.mp4
test_dali.py

conraddd on 21 Nov 2019

@JanuszL maybe you have some idea about this?
I was able to verify that without the PyTorch inference code in the loop, enumerating over the iterator is faster.

a-sansanwal on 21 Nov 2019

@conraddd when you set DO_DETECTION to True you put more work on the GPU, so DALI needs to compete for compute time (I see in the profiler that the kernels that convert from YUV to RGB and the D2D memcopy from DALI to PyTorch tensors wait a bit for the GPU).
Have you checked the end-to-end performance? Have you measured the performance with the synthetic pipeline and with some other CPU-based approach? Even if DALI slows down a bit when overlapped with other computations, I would guess that a CPU-based approach is inferior, because you need to copy data from CPU to GPU, which would be much slower than what DALI does.
I have run a couple of tests with the following code:

decode_time = []
frame_index = 0
data = [{'frame': torch.zeros([1, 5, 1080, 1920, 3],
                    dtype=torch.float32,
                    device='cuda:0')}]
with torch.no_grad():
    t_start = perf_counter()
    start = perf_counter()
    #for i, data in enumerate(dali_iter):
    for i in range(300):
        decode_time.append(perf_counter() - start)
        #print(perf_counter() - start)
        for d in data:
            frames = d['frame']
            frame_index += 1
            if DO_DETECTION:
                frames = torch.nn.functional.interpolate(frames, size=(384,384,3), mode='nearest')
                frames = torch.squeeze(frames)
                frames = frames.permute(0, 3, 1, 2)
                frames /= 255.0
                out, _ = model(frames)
                #ev = torch.cuda.Event()
                #ev.record()
                #ev.synchronize()
        start = perf_counter()
    torch.cuda.synchronize()
    t_end = perf_counter()
print("frame_index:", frame_index)
print("Total decode time:", sum(decode_time))
print("Total inference time:", t_end - t_start)

JanuszL on 21 Nov 2019

@JanuszL

| | Dali + Torch | OpenCV + Torch |
|----------------|--------------|----------------|
| Decode | 13.7s(GPU) | 3.9s(CPU) |
| Preprocess | 0.027s(GPU) | 0.062s(GPU) |
| Host to device | N.A | 15.1s |
| Infer | 2.22s(GPU) | 2.32s(GPU) |
| End to end | 16.0s | 21.4s |

So decoding on the CPU does not slow down when we run inference, but there is a huge overhead for copying the tensor from host to device. Overall DALI is still faster, but not by a very significant margin (~30%).
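As a sanity check on the table, the per-stage numbers can be summed and the two end-to-end times compared. The values below are copied from the table; the variable names are only for illustration.

```python
# Per-stage times in seconds, copied from the table above.
dali = {"decode": 13.7, "preprocess": 0.027, "infer": 2.22}
opencv = {"decode": 3.9, "preprocess": 0.062, "host_to_device": 15.1, "infer": 2.32}

dali_sum = sum(dali.values())
opencv_sum = sum(opencv.values())

# Reported end-to-end times in seconds.
dali_e2e, opencv_e2e = 16.0, 21.4

time_saved = (opencv_e2e - dali_e2e) / opencv_e2e  # fraction of wall time saved
speedup = opencv_e2e / dali_e2e                    # throughput ratio

print(f"stage sums: DALI {dali_sum:.3f}s, OpenCV {opencv_sum:.3f}s")
print(f"time saved: {time_saved:.1%}, speedup: {speedup:.2f}x")
```

The stage sums match the reported end-to-end times to within rounding. Depending on how it is expressed, the gain is about 25% less wall time or about 34% higher throughput, which brackets the ~30% quoted above. Note that the individual stage timings were taken without GPU synchronization, so the end-to-end figures are the more reliable ones.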

Using DALI (GPU) to decode - test_dali.py

import sys
import os
import torch
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.plugin.pytorch import DALIGenericIterator, TorchPythonFunction

from time import perf_counter

from models import Darknet, load_darknet_weights

DO_DETECTION = True

video_files = [
    'test_video.mp4',
]

class PreprocessPipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, data, sequence_length):
        super().__init__(batch_size, num_threads, device_id, exec_async=False, exec_pipelined=False)
        self.input = ops.VideoReader(device="gpu", filenames=data, sequence_length=sequence_length,
                                            shard_id=0, num_shards=1,
                                            random_shuffle=False, dtype=types.FLOAT)

    def define_graph(self):
        raw_input = self.input(name="Reader")
        return raw_input

pipe = PreprocessPipeline(1, 1, 0, video_files, 5)
pipe.build()
dali_iter = DALIGenericIterator([pipe], ['frame'], pipe.epoch_size("Reader"), fill_last_batch=False)

if DO_DETECTION:
    model = Darknet('cfg/yolov3-spp.cfg', 384)
    load_darknet_weights(model, 'weights/yolov3-spp.weights')
    model.to(torch.device('cuda', 0)).eval()

decode_time = []
preprocess_time = []
infer_time = []
frame_index = 0
end_to_end_start = perf_counter()
with torch.no_grad():
    start = perf_counter()
    for i, data in enumerate(dali_iter):
        decode_time.append(perf_counter() - start)
        for d in data:
            frames = d['frame']
            frame_index += 1
            if DO_DETECTION:
                start = perf_counter()
                frames = torch.squeeze(frames)
                frames = frames.permute(0, 3, 1, 2)
                frames = torch.nn.functional.interpolate(frames, size=(384,384), mode='nearest')
                frames /= 255.0
                preprocess_time.append(perf_counter() - start)
                start = perf_counter()
                out, _ = model(frames)
                infer_time.append(perf_counter() - start)
        start = perf_counter()

print("frame_index:", frame_index)
print("Total decode time:", sum(decode_time))
print("Total preprocess time:", sum(preprocess_time))
print("Total infer time:", sum(infer_time))
print("End to end time: ", perf_counter() - end_to_end_start)

Using OpenCV (CPU) to decode - test_cpu.py

import sys
import os
import torch
import cv2
import numpy as np

from time import perf_counter

from models import Darknet, load_darknet_weights

DO_DETECTION = True
batch_size = 5
video_files = [
    'test_video.mp4',
]

cap = cv2.VideoCapture(video_files[0])
length = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

if DO_DETECTION:
    model = Darknet('cfg/yolov3-spp.cfg', 384)
    load_darknet_weights(model, 'weights/yolov3-spp.weights')
    model.to(torch.device('cuda', 0)).eval()

frames = []
decode_time = []
h2d_time = []
preprocess_time = []
infer_time = []
frame_index = 0
end_to_end_start = perf_counter()
with torch.no_grad():
    while cap.isOpened():
        start = perf_counter()
        ret, frame = cap.read()
        if ret:
            decode_time.append(perf_counter() - start)
            frames.append(frame)
            frame_index += 1
            if len(frames) == batch_size or (len(frames) and frame_index == length):
                if DO_DETECTION:
                    start = perf_counter()
                    frames = torch.from_numpy(np.array(frames)).type(torch.cuda.FloatTensor)
                    h2d_time.append(perf_counter() - start)
                    start = perf_counter()
                    frames = frames.permute(0, 3, 1, 2)
                    frames = torch.nn.functional.interpolate(frames, size=(384,384), mode='nearest')
                    frames /= 255.0
                    preprocess_time.append(perf_counter() - start)
                    start = perf_counter()
                    out, _ = model(frames)
                    infer_time.append(perf_counter() - start)
                frames = []
        else:
            break

print("frame_index:", frame_index)
print("Total decode time:", sum(decode_time))
print("Total host to device time:", sum(h2d_time))
print("Total preprocess time:", sum(preprocess_time))
print("Total infer time:", sum(infer_time))
print("End to end time: ", perf_counter() - end_to_end_start)

conraddd on 22 Nov 2019

@conraddd
The kernel that converts YCbCr to RGB can be avoided if you don't need RGB.
image_type can be set to DALI_YCbCr, which might help squeeze out a little more performance.

a-sansanwal on 22 Nov 2019

@a-sansanwal
Unfortunately, I need all 3 channels to do object detection.

conraddd on 22 Nov 2019

@conraddd - long term, when resize becomes available for sequences, there will be less overhead for copying memory from DALI to the torch tensor. Also, long term we are considering using DLPack to get zero-copy transfers.
Regarding utilizing the CPU - yes, that is the trade-off. DALI shines when the CPU is the bottleneck. If you have plenty of free CPU cycles and the GPU is already well utilized, then DALI won't help you much.
Still, I see it is a bit faster with DALI anyway.

JanuszL on 22 Nov 2019

👍1

@JanuszL @a-sansanwal
Agreed, in my case the CPU is not the bottleneck, so the performance boost from DALI is not very significant.
Besides supporting the sequence resize operation, I hope the video decoding part can be improved in the future, perhaps by using less device memory in the whole process and somehow making the decoding speed less sensitive to the GOP size (I am not very familiar with the decoding mechanism, but could we keep the keyframe in memory to avoid seeking in the sequential reading case?).

conraddd on 25 Nov 2019
