Darknet: ELI5: How does YOLO detect objects in the input image?

Created on 12 Feb 2019  ·  5 Comments  ·  Source: AlexeyAB/darknet

For those of you who don't know, Explain Like I'm 5 (ELI5) is a famous subreddit where people ask questions and the answers should be "understandable" by 5-year-olds.

How would you explain how YOLO detects the object(s) in the input image to someone who has no idea about YOLO or object detection in general?

Explanations

Most helpful comment

To put it in very childish terms:

In general, a Deep Neural Network (DNN) just remembers all the images it saw during training, along with the coordinates and class_id of the objects in them, using optimal lossy compression, so it only remembers the most important details. As a result, it can predict objects that are similar (within roughly ±15% in size, color, and so on) to those it saw in the training and augmented images.

Trained weights:
YOLO uses convolutional layers, so you can think of the trained weights like this:

  • The 1st convolutional layer (with params size=3 filters=16) collects 16 puzzles, each consisting of 3x3 black or white points
  • The 2nd convolutional layer (with params size=3 filters=32) collects 32 puzzles, each consisting of 3x3 of the puzzles collected by the 1st layer (the filters from the 1st layer)
    ....
  • The last convolutional layer collects several puzzles (which look like the desired objects), each consisting of 3x3 puzzles from the previous layer, where each of those consists of 3x3 puzzles from the layer before that ....
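
To make the "puzzles made of puzzles" idea a bit more concrete, here is a minimal sketch that counts how much of the image one puzzle covers at each layer, and how many weights each layer stores. It assumes stride-1 convolutions with no pooling and a single-channel (black or white) input, which is a simplification: real YOLO configs take color input and downsample between layers, so the covered area grows faster there. The function name `describe_conv_stack` is made up for this illustration.

```python
def describe_conv_stack(filters_per_layer, in_channels=1, size=3):
    """Return (pixels_covered_per_side, weight_count) for each layer.

    Assumes stride 1 and no pooling (a simplification, not the real YOLO cfg).
    """
    summary = []
    receptive_field = 1          # one input pixel before any convolution
    channels = in_channels       # 1 = black-or-white image (assumption)
    for filters in filters_per_layer:
        # Each size x size puzzle widens the covered area by (size - 1) pixels.
        receptive_field += size - 1
        # Every puzzle looks at all puzzles (channels) of the previous layer.
        weight_count = filters * channels * size * size
        summary.append((receptive_field, weight_count))
        channels = filters
    return summary


# The two layers from the list above: filters=16, then filters=32.
for index, (covered, weights) in enumerate(describe_conv_stack([16, 32]), start=1):
    print(f"layer {index}: each puzzle covers {covered}x{covered} pixels, {weights} weights")
```

Under these assumptions, a puzzle in the 2nd layer already covers 5x5 pixels of the original image, which is why the puzzles in later layers can start to look like whole objects.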

Forward inference:
During detection, each convolutional layer just compares these puzzles with each place on the image and outputs the degree of coincidence (the predicted probability that this object is present).
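
The comparison step can be sketched in a few lines of plain Python. This is only an illustration of sliding one 3x3 puzzle over a tiny image and scoring how well it matches at each place; the image, the puzzle values, and the name `match_scores` are all invented for the example. A real network also adds biases and activation functions, and the final YOLO layer outputs box coordinates and class scores rather than a single match value.

```python
def match_scores(image, puzzle):
    """Slide `puzzle` over `image` and return a grid of match scores."""
    puzzle_h, puzzle_w = len(puzzle), len(puzzle[0])
    rows = len(image) - puzzle_h + 1
    cols = len(image[0]) - puzzle_w + 1
    scores = []
    for top in range(rows):
        row_scores = []
        for left in range(cols):
            # Multiply the puzzle with the patch under it and add everything up.
            score = sum(
                image[top + dy][left + dx] * puzzle[dy][dx]
                for dy in range(puzzle_h)
                for dx in range(puzzle_w)
            )
            row_scores.append(score)
        scores.append(row_scores)
    return scores


# A 4x4 image of black (0) and white (1) points containing a vertical bar.
image = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
# A puzzle that "likes" a white column with black on both sides.
puzzle = [
    [-1, 1, -1],
    [-1, 1, -1],
    [-1, 1, -1],
]

scores = match_scores(image, puzzle)
print(scores)  # [[3, -3], [2, -2]]
```

The highest score (3) is at the top-left position, where the bar lines up exactly with the puzzle. The score drops to 2 where only part of the bar is under it and goes negative where the white points fall on places the puzzle expects to be black.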

More about it: https://github.com/AlexeyAB/darknet/issues/796#issuecomment-388553709

All 5 comments

It's a magic box that uses math. There are a bunch of great articles on neural nets, but you have to understand some math for it to be more than magic.

Yes, it is a lot of math. The closest I have seen to an explanation at that level (probably still not 5-year-old level, but close) is Prof. Ng's lecture on it:
https://www.youtube.com/watch?v=3Pv66biqc1E

It's a magic box that uses math. There are a bunch of great articles on neural nets, but you have to understand some math for it to be more than magic.

Love that! Hahaha I usually say "An input image goes through a network and poof! objects will be detected."

Awesome! I think my nephew will be able to get the concept with that kind of explanation.
