Darknet: ELI5: How does YOLO detect objects in the input image?

Created on 12 Feb 2019 · 5 Comments · Source: AlexeyAB/darknet

For those of you who don't know, Explain Like I'm 5 (ELI5) is a famous subreddit where people will ask questions and the answers should be "understandable" by 5-year-olds.

How would you explain how YOLO detects objects in an input image to someone who knows nothing about YOLO or object detection in general?

OndoyManing

Most helpful comment

To put it in very childish terms:

In general, a Deep Neural Network (DNN) just remembers all the images it saw during training, along with the coordinates and class_id of the objects in them, using a kind of optimal lossy compression, so it only remembers the most important details. That is why it can predict objects that look like the ones it saw in the training and augmented images (within roughly ±15% in size, color, etc.).

Trained weights:
YOLO uses convolutional layers, so you can think of the trained weights as:

1st convolutional layer (with params size=3 filters=16) collects 16 puzzles, each consisting of 3x3 black or white points
2nd convolutional layer (with params size=3 filters=32) collects 32 puzzles, each consisting of 3x3 puzzles collected by the 1st layer (the filters from the 1st layer)
....
Last convolutional layer collects several puzzles (which look like the desired objects), each consisting of 3x3 puzzles from the previous layer, where each of those puzzles consists of 3x3 puzzles from the layer before that ....
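
To make the "puzzles made of puzzles" picture a bit more concrete, here is a rough sketch of how the weight shapes stack up for the two layers mentioned above (size=3 filters=16, then size=3 filters=32). The function names are hypothetical helpers, not darknet code, and the sketch assumes a 1-channel grayscale input and ignores biases.

```python
# Hypothetical helpers (not part of darknet): describe the weights of
# stacked convolutional layers as (filters, in_channels, size, size).

def conv_weight_shapes(input_channels, filters_per_layer, size=3):
    """Return one weight shape per layer."""
    shapes = []
    in_channels = input_channels
    for filters in filters_per_layer:
        shapes.append((filters, in_channels, size, size))
        # The outputs of this layer become the input channels of the next,
        # so each later "puzzle" is built from all the earlier ones.
        in_channels = filters
    return shapes


def weight_count(shape):
    """Number of weights in one layer (biases ignored)."""
    filters, in_channels, height, width = shape
    return filters * in_channels * height * width


# Assumption: grayscale input, matching the "black or white points" above.
shapes = conv_weight_shapes(input_channels=1, filters_per_layer=[16, 32])
for index, shape in enumerate(shapes, start=1):
    print(f"layer {index}: shape={shape}, weights={weight_count(shape)}")
```

Under these assumptions the first layer holds 16 puzzles of 3x3 points (144 weights), and the second holds 32 puzzles that each look at a 3x3 patch across all 16 first-layer outputs (4608 weights).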

Forward inference:
During detection, each convolutional layer just compares these puzzles with every place on the image and outputs the degree of coincidence (the predicted probability that this object is present).
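
As a toy illustration of "comparing a puzzle with every place on the image", the sketch below slides one 3x3 puzzle over a tiny black-and-white image and records a match score at each position. It is plain Python with made-up names and data, not how darknet implements it; a real layer also adds a bias and an activation, and works on many channels at once.

```python
def match_scores(image, puzzle):
    """Slide `puzzle` over `image` and return the match score at each position."""
    size = len(puzzle)
    rows = len(image) - size + 1
    cols = len(image[0]) - size + 1
    scores = []
    for top in range(rows):
        row_scores = []
        for left in range(cols):
            # Multiply overlapping points and add them up:
            # a larger sum means the patch looks more like the puzzle.
            total = 0
            for dy in range(size):
                for dx in range(size):
                    total += image[top + dy][left + dx] * puzzle[dy][dx]
            row_scores.append(total)
        scores.append(row_scores)
    return scores


# A 3x3 puzzle that looks for a vertical line
# (+1 where the line should be, -1 where it should not be).
vertical_line = [
    [-1, 1, -1],
    [-1, 1, -1],
    [-1, 1, -1],
]

# A 5x5 image (1 = white, 0 = black) with a vertical line at column index 3.
image = [
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
]

scores = match_scores(image, vertical_line)
for row in scores:
    print(row)  # each row prints [0, -3, 3]
```

The score is highest (3) where the puzzle is centered on the line, negative where the line sits under a -1, and zero where there is nothing to match. That grid of scores is the "degree of coincidence" for this one puzzle.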

More about it: https://github.com/AlexeyAB/darknet/issues/796#issuecomment-388553709

AlexeyAB on 12 Feb 2019

👍2 ❤1

All 5 comments

It's a magic box that uses math. There are a bunch of great articles on neural nets, but you have to understand some math for it to be more than magic.

PeterQuinn925 on 12 Feb 2019

❤1 👍1

Yes, it is a lot of math. The closest thing I have seen to an explanation at that level (probably still not 5-year-old level, but close) is Prof. Ng's lecture on it:
https://www.youtube.com/watch?v=3Pv66biqc1E

aniketvartak on 12 Feb 2019

❤1 👍1


> It's a magic box that uses math. There are a bunch of great articles on neural nets, but you have to understand some math for it to be more than magic.

Love that! Hahaha I usually say "An input image goes through a network and poof! objects will be detected."

OndoyManing on 12 Feb 2019

Awesome! I think my nephew will be able to get the concept with that kind of explanation.

OndoyManing on 12 Feb 2019
