Hi,
Thanks for the nice library. I found DALI while looking for a video loader for action recognition. It seems DALI cannot yet handle videos of varying resolution (see issue #725), which is necessary for public datasets such as Kinetics.
Another necessary component might be processing videos of varying length and FPS.
It seems VideoReader only supports splitting a whole video into a batch of sequences of length sequence_length. I'm not sure, because I have only tested video_test.py.
sequence_length and step) from one video. It seems that this approach is commonly used in the training phase on the Kinetics dataset.
ffmpeg with fps filter when I extract the frames manually.
Hopefully these things are already possible; if not, do you have any plans to support these features?
Hi,
Currently, it is not possible to do all the things you requested (am I right @Kh4L?). Nevertheless, we are putting them in our backlog.
Do you have a particular dataset and network in mind that you are using and want to plug DALI into?
Tracked as DALI-694 and DALI-695.
Hi @kkjh0723 ,
If I understand correctly, these are the features you are asking for:
Thanks, @JanuszL,
I want to use DALI with PyTorch for the Kinetics and AVA datasets. I want to follow the pre-processing of the Non-local network and the SlowFast network. The Non-local network code is open source but written in Caffe2 (link). It seems they read the video directly (from lmdb) and preprocess it on GPUs.
Currently, I'm using the 3D-ResNet based code. For data loading, they extract the frames from the videos and save them as JPEG files. The frames are then loaded and processed by multiple workers through the PyTorch dataloader.
I ran into a problem when the data is stored on NFS and multiple training programs access it at the same time: GPU utilization drops from around 70% to 20%. I guess that reading thousands of small JPEG files causes an I/O bottleneck. I thought a video reader could help (though using a database such as lmdb would be even better).
Thanks @Kh4L ,
Exactly for the first two.
As for the FPS, what I had in mind is this:
when the video's FPS is higher than the target FPS, it works the same as what you said. When the video's FPS is lower than the target FPS, frames are duplicated, just like in ffmpeg's fps filter.
I'm not sure this is a typical setting in video research, but it seems reasonable to me.
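A minimal sketch of the FPS behaviour described above, assuming each output frame simply takes the most recent source frame at its timestamp. The function name `resample_indices` is hypothetical, and this is a simplified mapping, not ffmpeg's exact rounding logic and not anything DALI provides.

```python
def resample_indices(num_frames, src_fps, dst_fps):
    """Map output frames at dst_fps to source frame indices at src_fps.

    If dst_fps < src_fps, source frames are skipped.
    If dst_fps > src_fps, source frames are duplicated.
    """
    if num_frames <= 0 or src_fps <= 0 or dst_fps <= 0:
        raise ValueError("num_frames, src_fps and dst_fps must be positive")
    # Keep the clip duration roughly constant after resampling.
    num_out = max(1, round(num_frames * dst_fps / src_fps))
    # Output frame i is shown at time i / dst_fps; pick the source frame
    # that is on screen at that time, clamped to the last valid index.
    return [min(num_frames - 1, int(i * src_fps / dst_fps)) for i in range(num_out)]
```

For example, `resample_indices(4, 15, 30)` repeats every source frame twice, while `resample_indices(6, 30, 15)` keeps every second frame.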
I am also looking for a feature that supports loading videos of different lengths. I want to mention that an alternative way of dealing with this kind of data is to load full videos in a mini-batch and then pad the shorter ones to the length of the longest video. This is also very common in video tasks such as captioning. It would be much appreciated if this could one day be supported in DALI.
@aBlueDragon - what kind of padding do you have in mind: dummy frames, replication of the last frame, or something else?
Also, by loading a full video, do you mean creating a sequence with all frames from this file?
@JanuszL Padding with zeros would be fine, but the dataloader has to return the actual length of each sample so that we know the real length of the padded videos.
Yes, loading a full video means loading all frames from the file, which is generally within 300 frames.
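A rough sketch of the padding scheme described above, in plain Python. It assumes each video is a list of frames and that the caller supplies the padding frame (for example an all-zero frame); `pad_videos` is a hypothetical helper, not a DALI or PyTorch function. In practice the frames would be arrays or tensors rather than list items.

```python
def pad_videos(videos, pad_frame):
    """Pad every video (a list of frames) to the longest video in the batch.

    Returns the padded batch together with the original length of each
    video, so the real length of a padded sample can still be recovered.
    """
    if not videos:
        return [], []
    lengths = [len(video) for video in videos]
    max_len = max(lengths)
    padded = [
        list(video) + [pad_frame] * (max_len - len(video))
        for video in videos
    ]
    return padded, lengths
```

The returned `lengths` list is what lets the model mask out the padded frames later.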
@aBlueDragon - thanks for the explanation.
I vote for variable length.
We can encode videos to the same resolution and the same FPS, but there is no way to make them the same length.
I don't know of any situation where the length is fixed.
@raofengyun Usually you need to sample N frames from the video using one of the following approaches:
- Splitting the video into fixed-length clips of N frames (the last clip is padded with the last frame to reach the fixed size). Each time, one clip is passed to the network. C3D and two-stream (K. Simonyan, 2014) used this method.
- Sampling N frames randomly from the entire video. In this case, each video is split into N segments and one frame is selected randomly from each segment. This technique is used in some recent papers such as TSN, ECO, and TSM.
- Sampling N frames from a time interval [start_frame/time, end_frame/time]. This is used for instructional videos or sequential actions, where each video contains multiple consecutive actions.
@JanuszL I think adding these three sampling techniques to your framework would be quite helpful for the video understanding community.
Sample datasets:
- Kinetics
- UCF101
- EpicKitchen
- SomethingSomething
- YouCook2
@mzolfaghari I agree with your idea. For action recognition, we need different frame samplers.
Hi,
Some updates on the request.
file_list argument (like in this example) you can specify allowed ranges of sequences for each video file
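The three sampling approaches listed earlier in this thread could be sketched roughly as follows. This is only an illustration that works on frame indices; all function names are hypothetical and none of this is DALI API. For the interval case, evenly spaced indices are an assumption, since the description above does not say how the N frames are chosen within the interval.

```python
import random


def split_into_clips(num_frames, n):
    """Approach 1: consecutive clips of n frame indices.

    The last clip is padded by repeating the last frame index.
    """
    clips = []
    for start in range(0, num_frames, n):
        clip = list(range(start, min(start + n, num_frames)))
        clip += [num_frames - 1] * (n - len(clip))
        clips.append(clip)
    return clips


def sample_from_segments(num_frames, n, rng=random):
    """Approach 2: split the video into n segments, pick one random frame from each."""
    if n <= 0 or num_frames < n:
        raise ValueError("need at least one frame per segment")
    # Segment i covers the half-open index range [bounds[i], bounds[i + 1]).
    bounds = [i * num_frames // n for i in range(n + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(n)]


def sample_from_interval(start_frame, end_frame, n):
    """Approach 3: n indices from [start_frame, end_frame] (evenly spaced, by assumption)."""
    if n == 1:
        return [start_frame]
    span = end_frame - start_frame
    return [start_frame + round(i * span / (n - 1)) for i in range(n)]
```

Passing a seeded `random.Random` instance as `rng` makes the segment sampling reproducible, which can be handy for validation.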
The second one is implemented in https://github.com/SunGaofeng/DALI
For a use case, please refer to https://github.com/PaddlePaddle/models/tree/release/1.8/PaddleCV/video/models/tsn
@huangjun12 - if you think that is useful for the rest of the community, you can file a PR using the code from https://github.com/SunGaofeng/DALI.
The VideoReader for the TSN model was developed by modifying the code of the original DALI repo. I did this in such a hurry that I gave little thought to staying compatible with the original code or to supporting more models. Maybe @huangjun12 can spend more time on this to support more classic models.