First of all, thank you for the great YOLOv5.
I would like to ask about batch inference for video using YOLOv5.
Currently, I've tested yolov5s successfully on a specific 1920x1080 Full HD video.
However, the FPS I got is approximately 40, which is quite slow compared with the reported numbers. Some articles mention batch inference (batching images together) when running inference.
However, in detect.py it looks to me like each image is passed into the model one at a time.
Could you give me some suggestions for further improvement?
Thank you so much,
Hung
System Information
CPU: Intel Core i7-8700 (6 cores / 12 threads)
RAM: 32 GB
GPU: Nvidia GTX 1660
Hello @leviethung2103, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Jupyter Notebook, Docker Image, and Google Cloud Quickstart Guide for example environments.
If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.
If this is a custom model or data training question, please note Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients.
For more information please visit https://www.ultralytics.com.
@leviethung2103 we do not have availability to work on this ourselves, but feel free to implement your proposed feature additions and submit a PR!
@glenn-jocher how would you want this implemented? In detect.py? I'd be interested in helping get this in!
@Ownmarc detect.py is already natively batch-capable: it runs batched inference on RTSP/HTTP streams using the LoadStreams dataloader, e.g. inference on 32 simultaneous video streams at batch-size 32.
I suppose what we are missing is intelligence on typical use-cases/features that most people would say add value to the product.
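For reference, a rough sketch of how that existing path is exercised. detect.py selects the LoadStreams dataloader when --source points at a .txt file listing one stream URL per line; the exact signature below may differ between versions:

```python
# streams.txt (one source per line), e.g.:
#   rtsp://user:pass@192.168.1.10/stream1
#   rtsp://user:pass@192.168.1.11/stream1
from utils.datasets import LoadStreams

dataset = LoadStreams('streams.txt', img_size=640)
for sources, img, im0s, vid_cap in dataset:
    # img has shape (num_streams, 3, H, W): one batched forward pass per iteration
    ...
```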
I guess we could batch frames together using OpenCV; it shouldn't be too hard that way, but I doubt it would be the most efficient approach. I looked at the PyTorch DataLoader to see if it supports generating batches from videos directly, but I don't think it does, so it would require some customization of the dataloader.
@Ownmarc yes, this would require modification to the LoadImages() dataloader, and also coordination with detect.py to make sure everything works well with the updated dataloader.
LoadImages() builds a list of assets to run inference on, separates them into images and videos, and then steps through each asset one at a time. The updated dataloader would need to create dynamically sized batches, and could accept an argument for the maximum possible batch size, but it would need to be smart enough not to mix asset classes (images with videos), nor multiple videos in a single batch, due to the way detect.py expects sources to arrive.
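As a starting point for such a dataloader change, here is a minimal sketch of batching consecutive frames from a single video with OpenCV. The function is hypothetical (it is not the actual LoadImages() API) and skips the letterboxing and normalization the real pipeline performs:

```python
import cv2
import numpy as np

def video_batches(path, batch_size=8, img_size=640):
    """Yield batches of up to batch_size consecutive frames from one video.
    Hypothetical sketch: naive square resize, no letterbox or normalization."""
    cap = cv2.VideoCapture(path)
    batch = []
    while True:
        ok, frame = cap.read()
        if not ok:  # end of video
            break
        batch.append(cv2.resize(frame, (img_size, img_size)))
        if len(batch) == batch_size:
            yield np.stack(batch)  # (batch_size, H, W, 3), BGR
            batch = []
    if batch:
        yield np.stack(batch)  # final partial batch
    cap.release()
```

Each yielded batch can then be transposed to NCHW, converted to a float tensor, and run through the model in a single forward pass, which is where the batching speedup comes from.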
Hello guys, if you just want to run object detection on an already-saved video, you can feed the video through virtual webcam software and set n == 1 on line 262 of datasets.py for a smooth input stream. This works like a charm and further improved FPS from 30 to 60 for me.
@AditTuf that's a good trick!
> Hello guys, if you just want to run object detection on an already-saved video, you can feed the video through virtual webcam software and set n == 1 on line 262 of datasets.py for a smooth input stream. This works like a charm and further improved FPS from 30 to 60 for me.
Hi @AditTuf
could you please paste the code snippet where n == 1 should be used? I'm not sure your line 262 still has the same number...
Thanks!
Hi @scamianbas, you can find this function inside the LoadStreams class in datasets.py:
```python
def update(self, index, cap):
    # Read next stream frame in a daemon thread
    n = 0
    while cap.isOpened():
        n += 1
        # _, self.imgs[index] = cap.read()
        cap.grab()
        if n == 1:  # originally n == 4 (read every 4th frame); changed to read every frame
            _, self.imgs[index] = cap.retrieve()
            n = 0
        time.sleep(0.01)  # wait time between reads
```
The FPS increase seems to come from not fetching all the frames when the condition was n == 4.
Thank you very much @AditTuf
To sum up:
until now I was detecting from a video file at about 23 FPS;
then, thanks to a virtual webcam, I was able to stream and detect this very same file at about 72 FPS, but jerkily (3 out of every 4 frames skipped);
and finally, thanks to your patch, I'm still able to detect at 72 FPS but smoothly :+1: (all the frames are used).
Thanks again!
@glenn-jocher Maybe you could use a dynamic stack to hold all frames read from the stream, and clear it each time you pop the latest frame for inference. This should prevent the cache from overflowing and make better use of the CPU/GPU.
@SiyangXie
The most common scenario is that frames arrive faster than they can be processed, so we can't afford to accumulate a large queue; we want to drop frames as required to keep the streams operating in realtime.
In the contrary case, say a poor internet connection, we don't want to run inference on the same frame twice, so ideally we'd have some change detection to decide whether a frame should join the inference batch.
Also keep in mind we want a solution that can process large batches, say 8 simultaneous streams, where some may be arriving faster than others.
I'm open to ideas here.
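One possible shape for this, sketched under stated assumptions (the class and its names are hypothetical, not part of YOLOv5): give each stream a daemon reader thread that keeps only its newest frame, so a slow GPU drops frames naturally instead of building a backlog, and snapshot all streams into a batch whenever the model is free.

```python
import threading

import cv2
import numpy as np

class LatestFrameReader:
    """Hypothetical sketch: a daemon thread that keeps only the newest frame,
    so slow inference drops stale frames instead of accumulating a queue."""

    def __init__(self, source):
        self.cap = cv2.VideoCapture(source)
        self.frame = None
        self.lock = threading.Lock()
        threading.Thread(target=self._update, daemon=True).start()

    def _update(self):
        while self.cap.isOpened():
            ok, frame = self.cap.read()
            if ok:
                with self.lock:
                    self.frame = frame  # overwrite: older frames are dropped

    def latest(self):
        with self.lock:
            return None if self.frame is None else self.frame.copy()

# Whenever the model is free, snapshot the streams into one batch:
# readers = [LatestFrameReader(s) for s in sources]
# frames = [r.latest() for r in readers]
# batch = np.stack([f for f in frames if f is not None])
```

A slow stream simply contributes its last-known frame (or nothing, with change detection on top), so fast and slow sources can share one batch.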
Hi! I would like to apply YOLOv5 to multiple RTSP cameras. Right now I have 400 RTSP cameras to deal with, but the cost of starting a thread for each of them is too high. I may just have to run inference on 32 channels at a time and process the groups in sequence. I was wondering what suggestions you might have?
System Information
CPU: Intel Xeon E5-2680 v4 × 2
RAM: 128 GB
GPU: Nvidia RTX 2080 Ti × 4
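Not an official answer, but one way to sketch the 32-channels-in-sequence idea: reuse per-camera latest-frame readers like the sketch above and sweep through the 400 cameras in fixed groups, running one batched forward pass per group (all names here are hypothetical):

```python
import numpy as np

def round_robin_batches(readers, group_size=32):
    """Hypothetical sketch: cycle through cameras in fixed groups instead of
    running one inference thread per camera. `readers` are per-camera objects
    with a .latest() method returning the newest decoded frame or None."""
    groups = [readers[i:i + group_size] for i in range(0, len(readers), group_size)]
    while True:
        for group in groups:  # 400 cameras / 32 -> 13 groups per sweep
            frames = [r.latest() for r in group]
            frames = [f for f in frames if f is not None]  # skip stalled cameras
            if frames:
                yield np.stack(frames)  # one batched forward pass per group
```

Decoding 400 RTSP streams is usually the real bottleneck, so the readers would likely need to live in a small pool of processes rather than 400 Python threads.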
@LIYHUI thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Jupyter Notebook, Docker Image, and Google Cloud Quickstart Guide for example environments.
If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.
If this is a custom model or data training question, please note Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients.
For more information please visit https://www.ultralytics.com.
> @SiyangXie
> The most common scenario is that frames arrive faster than they can be processed, so we can't afford to accumulate a large queue; we want to drop frames as required to keep the streams operating in realtime.
> In the contrary case, say a poor internet connection, we don't want to run inference on the same frame twice, so ideally we'd have some change detection to decide whether a frame should join the inference batch.
> Also keep in mind we want a solution that can process large batches, say 8 simultaneous streams, where some may be arriving faster than others.
> I'm open to ideas here.
The problem is that the rate of the incoming streams is uncertain. It's possible a stream is still slightly faster than you can process even after dropping some frames. In that case memory will overload, though only over a very long detection run (e.g. a 24-hour surveillance camera).
@SiyangXie yes, exactly: we want a solution that is robust to both slower and faster frame rates, and to dropped frames, from multiple simultaneous sources.
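For the change-detection idea above, a minimal sketch based on the mean absolute difference of downscaled grayscale frames (the threshold is an arbitrary assumption to tune per stream):

```python
import cv2
import numpy as np

def frame_changed(prev, curr, threshold=2.0):
    """Hypothetical sketch: return True if curr differs enough from prev to be
    worth re-inferring. Compares downscaled grayscale frames by mean absolute
    difference; the 2.0 threshold is an arbitrary, per-stream tuning value."""
    if prev is None:
        return True  # nothing to compare against yet
    def small(f):
        return cv2.cvtColor(cv2.resize(f, (64, 64)), cv2.COLOR_BGR2GRAY).astype(np.float32)
    return float(np.abs(small(curr) - small(prev)).mean()) > threshold
```

A frame would only join the inference batch when frame_changed() returns True, so a stalled stream doesn't waste GPU time on identical frames.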
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.