DALI: Videos with various lengths and fps

Created on 4 Apr 2019 · 14 Comments · Source: NVIDIA/DALI

Hi,
Thanks for the nice library. I found DALI while looking for a video loader for action recognition. I found that DALI cannot yet handle videos of varying resolution, as discussed in issue #725, which is necessary for public datasets such as Kinetics.

Another necessary component might be handling videos of varying length and fps.
It seems VideoReader only supports splitting a whole video into a batch of sequences, each sequence_length frames long. I'm not sure, because I've only tested video_test.py.

I wonder if it is possible to randomly extract one short "clip" (a sequence defined by sequence_length and step) from each video. This approach is commonly used in the training phase on the Kinetics dataset.
For evaluation, people often extract several clips at equal intervals along the whole video.
Additionally, it would be nice to be able to set all videos to the same fps, since videos vary in fps. I often use ffmpeg with the fps filter when I extract frames manually.
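To make the two clip-selection schemes concrete, here is a minimal sketch in plain Python that only computes frame indices. It does not use DALI; the function names (`random_clip`, `uniform_clips`) are hypothetical, and `sequence_length` and `step` are assumed to mean "frames per clip" and "distance between consecutive frames in a clip".

```python
import random


def clip_indices(start, sequence_length, step):
    """Frame indices of one clip that begins at frame `start`."""
    return [start + i * step for i in range(sequence_length)]


def clip_span(sequence_length, step):
    """Number of source frames covered by one clip, first to last inclusive."""
    return (sequence_length - 1) * step + 1


def random_clip(num_frames, sequence_length, step=1, rng=random):
    """Training: pick one clip at a random position inside the video."""
    span = clip_span(sequence_length, step)
    if num_frames < span:
        raise ValueError("video is shorter than the requested clip")
    start = rng.randrange(num_frames - span + 1)
    return clip_indices(start, sequence_length, step)


def uniform_clips(num_frames, sequence_length, num_clips, step=1):
    """Evaluation: pick `num_clips` clips spread at equal intervals."""
    span = clip_span(sequence_length, step)
    if num_frames < span:
        raise ValueError("video is shorter than the requested clip")
    last_start = num_frames - span
    if num_clips == 1:
        starts = [last_start // 2]  # a single clip is taken from the middle
    else:
        starts = [i * last_start // (num_clips - 1) for i in range(num_clips)]
    return [clip_indices(s, sequence_length, step) for s in starts]
```

For example, `uniform_clips(100, 8, 3, step=2)` yields three clips whose first frames are spread between the start of the video and the last position where a full clip still fits.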

Are these already possible, or do you have any plans to support these features?

Video question

kkjh0723

👍1

All 14 comments

Hi,
Currently, it is not possible to do all the things you requested (am I right, @Kh4L?). Nevertheless, we are putting them in our backlog.
Do you have a particular dataset and network in mind that you want to plug DALI into?
Tracked as DALI-694 and DALI-695.

JanuszL on 8 Apr 2019

Hi @kkjh0723 ,

If I understand correctly, these are the features you are asking for:

Support for video files with different spatial dimensions (height, width). This is currently not supported, but it is in our backlog with higher priority.
Support for various lengths: this is currently not supported, but it is in our backlog.
FPS: currently we extract all the frames from the video. Are you asking for an option where you could decide to get only a subset of the frames, with a given stride (like every n-th frame)?

Kh4L on 8 Apr 2019

Thanks, @JanuszL,
I want to use DALI with PyTorch for the Kinetics and AVA datasets. I want to follow the pre-processing of the Non-local network and the SlowFast network. The Non-local network code is open source but written in Caffe2 (link). It seems they read the videos directly (from LMDB) and preprocess them on GPUs.

Currently, I'm using the 3D-ResNet based code. For data loading, it extracts the frames from the videos and saves them as JPEG files, then loads and processes the frames using multiple workers in a PyTorch DataLoader.
I ran into a problem when the data is stored on NFS and multiple training programs try to access it: GPU utilization drops from around 70% to 20%. I guess that reading thousands of small JPEG files causes an I/O bottleneck. I thought a video reader could help (although it would also be better to use a database like LMDB).

Thanks @Kh4L ,
Exactly for the first two.
As for the FPS, what I had in mind is this:
when the video's fps is higher than the target fps, it works as you said. But when the video's fps is lower than the target fps, frames are duplicated, just like in ffmpeg's fps filter.
I'm not sure this is a typical setting in video research, but it seems reasonable to me.
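A rough sketch of that drop-or-duplicate behaviour, expressed as a mapping from output frames to source frame indices. This is an illustration only, not a reimplementation of ffmpeg's fps filter (which has its own rounding options); the function name `resample_indices` is hypothetical, and each output frame is assumed to take the latest source frame at or before its timestamp.

```python
from fractions import Fraction


def resample_indices(num_frames, src_fps, dst_fps):
    """Map each output frame to a source frame index.

    Frames are dropped when dst_fps < src_fps and duplicated when
    dst_fps > src_fps, so the output plays at a constant dst_fps.
    """
    src = Fraction(src_fps)
    dst = Fraction(dst_fps)
    # Number of output frames that fit in the same duration (rounded down).
    num_out = int(num_frames * dst / src)
    # Output frame k is shown at time k / dst; take the source frame
    # that is on screen at that time, clamped to the last frame.
    return [min(int(k * src / dst), num_frames - 1) for k in range(num_out)]
```

With these assumptions, going from 30 fps to 15 fps keeps every second frame, and going from 15 fps to 30 fps repeats each frame twice.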

kkjh0723 on 8 Apr 2019

I am also looking for a feature that supports loading videos of different lengths. I want to mention that an alternative way of dealing with this kind of data is to load full videos into a mini-batch and then pad the shorter ones to the length of the longest video. This is also very common in video tasks such as captioning. It would be much appreciated if this could be supported in DALI one day.

aBlueDragon on 30 May 2019

👍1

@aBlueDragon - what kind of padding do you have in mind: dummy values, replication of the last frame, or something else?
Also, by loading a full video, do you mean creating a sequence with all frames from the file?

JanuszL on 30 May 2019

@JanuszL Padding with zeros would be fine, but the dataloader has to return the actual length of each sample so that we know the real length of the padded videos.
Yes, loading a full video means loading all frames from the file, which is generally fewer than 300 frames.
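A minimal sketch of the requested behaviour, independent of DALI. Frames are treated as opaque Python objects here (in practice they would be image arrays), and the helper name `pad_batch` and its parameters are hypothetical. The `repeat_last` option covers the "replication of the last frame" alternative raised in the question above.

```python
def pad_batch(videos, pad_frame=0, repeat_last=False):
    """Pad every video in the batch to the length of the longest one.

    Returns (padded_videos, lengths), where lengths[i] is the number of
    real (unpadded) frames in video i.
    """
    if not videos:
        return [], []
    lengths = [len(v) for v in videos]
    max_len = max(lengths)
    padded = []
    for v in videos:
        frames = list(v)
        # Use the last real frame as filler if requested and available,
        # otherwise fall back to the constant pad_frame (e.g. zeros).
        filler = frames[-1] if (repeat_last and frames) else pad_frame
        padded.append(frames + [filler] * (max_len - len(frames)))
    return padded, lengths
```

Returning `lengths` next to the padded batch lets the model mask out the padded positions.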

aBlueDragon on 30 May 2019

@aBlueDragon - thanks for the explanation.

JanuszL on 30 May 2019

I vote for various-length support.

We could encode videos to the same resolution and the same fps, but there is NO way to make them the same length.
I don't know of any situation where videos have a fixed length.

frankrao on 21 Aug 2019

👍2

@raofengyun Usually you need to sample N frames from the video using one of the following approaches:

Splitting the video into fixed-length clips of N frames (the last clip is padded with the last frame to reach the fixed size). Each time, you pass one clip to the network. C3D and two-stream (K. Simonyan, 2014) used this method.
Sampling N frames randomly from the entire video. In this case, each video is split into N segments, and one frame is selected randomly from each segment. This technique is used in some recent papers such as TSN, ECO, and TSM.
Sampling N frames from a time interval [start_frame/time, end_frame/time]. This is used for instructional videos or sequential actions, where each video contains multiple consecutive actions.
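The three approaches can be sketched as index-selection functions. This is a plain-Python illustration, not code from any of the cited papers or from DALI; all function names are hypothetical, and the interval variant assumes evenly spaced frames over an inclusive range, which the description above does not specify.

```python
import random


def fixed_length_clips(num_frames, n):
    """Approach 1: consecutive clips of n frames; the last clip is
    padded by repeating the final frame index."""
    clips = []
    for start in range(0, num_frames, n):
        idx = list(range(start, min(start + n, num_frames)))
        idx += [num_frames - 1] * (n - len(idx))
        clips.append(idx)
    return clips


def segment_sample(num_frames, n, rng=random):
    """Approach 2: split the video into n segments and draw one random
    frame index from each segment."""
    if num_frames < n:
        raise ValueError("need at least one frame per segment")
    bounds = [i * num_frames // n for i in range(n + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(n)]


def interval_sample(start_frame, end_frame, n):
    """Approach 3: n evenly spaced frame indices in [start_frame, end_frame]."""
    if n == 1:
        return [(start_frame + end_frame) // 2]
    length = end_frame - start_frame
    return [start_frame + i * length // (n - 1) for i in range(n)]
```

For a 10-frame video and N = 4, the first function returns three clips, the last of which repeats frame 9 to fill the remaining slots.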

@JanuszL I think adding these three sampling techniques to your framework would be quite helpful for the video understanding community.
Sample datasets:

Kinetics
UCF101
EpicKitchen
SomethingSomething
YouCook2

mzolfaghari on 26 Aug 2019

👍10 ❤2

@mzolfaghari I agree with your idea. For action recognition, we need different frame samplers.

cvnovice95 on 3 Nov 2019

Hi,
Some updates on the request.

videos with different resolutions are supported
videos with different lengths are supported (as long as the length is greater than the requested sequence length; there is no padding yet)
using the file_list argument (like in this example) you can specify the allowed ranges of sequences for each video file
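As an illustration of the last point, the sketch below builds the text of such a list file. It assumes each line has the form `file label start end`, which should be checked against the VideoReader documentation for your DALI version (including whether start/end are read as timestamps or frame numbers). The helper name `make_file_list` and the example paths are hypothetical, and no DALI call is made here.

```python
def make_file_list(entries):
    """Build the contents of a file_list text file.

    `entries` is an iterable of (path, label, start, end) tuples.
    Assumed line format: "<path> <label> <start> <end>".
    """
    lines = []
    for path, label, start, end in entries:
        if end <= start:
            raise ValueError(f"empty range for {path}: {start}..{end}")
        lines.append(f"{path} {label} {start} {end}")
    return "\n".join(lines) + "\n"


# Hypothetical example: two ranges from one video, one range from another.
example = make_file_list([
    ("videos/a.mp4", 0, 0, 100),
    ("videos/a.mp4", 1, 150, 300),
    ("videos/b.mp4", 2, 20, 80),
])
```

The resulting string would be written to a text file whose path is then passed as the file_list argument.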

JanuszL on 6 May 2020

😄1

Regarding the three sampling approaches @mzolfaghari listed above: the second one (one random frame from each of N segments) is implemented in https://github.com/SunGaofeng/DALI
For a use case, please refer to https://github.com/PaddlePaddle/models/tree/release/1.8/PaddleCV/video/models/tsn

huangjun12 on 14 Jul 2020

@huangjun12 - if you think that is useful for the rest of the community, you can file a PR using the code from https://github.com/SunGaofeng/DALI.

JanuszL on 14 Jul 2020

The VideoReader for the TSN model was developed by modifying the code of the original DALI repo. I did this in such a hurry that little thought was given to compatibility with the original code or to supporting more models. Maybe @huangjun12 can spend more time on this to support more classic models.