Orb_slam2: Why do we need initialization

Created on 30 Sep 2018 · 7 Comments · Source: raulmur/ORB_SLAM2

I do not quite understand why initialization is needed for visual SLAM. Can we just assume the first frame's coordinate system is the world coordinate system and go from there? Besides the scale, are there any other parameters that need to be estimated by the initialization?

ArtlyStyles

Most helpful comment

Hi @ArtlyStyles ,

I know I am repeating things you already know; this is for other people reading this post.

SLAM uses a map in order to localize itself, and from that localization it maps new areas. But it needs a map to start it all. Initialization is the process that builds that initial map, before SLAM gets working.

Each SLAM method has a different sensitivity to initial map errors; that's why each method chooses its own way to initialize.

ORB-SLAM2 computes two models in parallel: a fundamental matrix and a homography. Neither of them is visual odometry. The best initialization wins, and which one that is depends heavily on the scene. Both methods use matched features from two frames. Those are the green lines on screen, growing and disappearing. Initialization needs two frames with enough parallax and more than a minimum number of matched features in order to build the initial map with enough precision.
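
To make "the best initialization wins" a bit more concrete: in the ORB-SLAM paper, the two candidate models are a homography and a fundamental matrix, each one gets a score, and the choice is made from the ratio of the two scores. Here is a minimal sketch of that selection step only. The function name `select_model`, the idea that the scores are already computed, and the default threshold are my assumptions for illustration; this is not ORB-SLAM2's actual code.

```python
def select_model(score_h, score_f, threshold=0.45):
    """Pick 'homography' or 'fundamental' from two model scores.

    score_h, score_f: non-negative scores of the homography and of the
    fundamental matrix (assumed to be computed elsewhere from the matches).
    threshold: cut-off on the homography's share of the total score.
    The default is an assumption; treat it as a tunable value.
    """
    total = score_h + score_f
    if total <= 0:
        return None  # no usable model, wait for a better pair of frames
    ratio_h = score_h / total
    return "homography" if ratio_h > threshold else "fundamental"

# Planar or low-parallax scene: the homography explains the matches well.
print(select_model(score_h=900.0, score_f=600.0))
# General scene with parallax: the fundamental matrix fits better.
print(select_model(score_h=200.0, score_f=800.0))
```

The winning model is then used to recover the relative motion between the two frames and to triangulate the first map points.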

As usual, neither of these initialization methods is as precise as SLAM itself: the initial map improves a lot after the first bundle adjustment. But there's more. Brute-force matching between two frames is too slow. You can blindly narrow the search, but it's still slower than SLAM. Visual SLAM doesn't apply brute force; it applies geometry. It projects known map points onto the image and looks for a match nearby. The exact opposite of brute force.
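
A rough sketch of what "project known map points onto the image and look for a match nearby" means. It assumes a pinhole camera and map points already expressed in the camera frame using a predicted pose; the names `project`, `guided_candidates` and the intrinsics values are made up for the example. A real system would then compare descriptors among the candidates, which is left out here.

```python
import math

def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point given in camera coordinates."""
    x, y, z = point_cam
    if z <= 0:
        return None  # behind the camera, cannot be observed
    return (fx * x / z + cx, fy * y / z + cy)

def guided_candidates(map_points, keypoints, intrinsics, radius):
    """For each map point, list the keypoints within `radius` pixels
    of its projection, instead of comparing against every keypoint."""
    fx, fy, cx, cy = intrinsics
    result = {}
    for point_id, point_cam in map_points.items():
        uv = project(point_cam, fx, fy, cx, cy)
        if uv is None:
            continue
        result[point_id] = [
            i for i, (u, v) in enumerate(keypoints)
            if math.hypot(u - uv[0], v - uv[1]) <= radius
        ]
    return result

# Hypothetical data: three map points (one behind the camera), three keypoints.
map_points = {0: (0.0, 0.0, 2.0), 1: (0.4, 0.0, 2.0), 2: (0.0, 0.0, -1.0)}
keypoints = [(321.0, 239.0), (418.0, 242.0), (100.0, 100.0)]
intrinsics = (500.0, 500.0, 320.0, 240.0)  # fx, fy, cx, cy (assumed values)
print(guided_candidates(map_points, keypoints, intrinsics, radius=5.0))
```

Each map point ends up with a handful of nearby candidates rather than the whole keypoint list, which is why this only works once a map (and a pose guess) already exists.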

About Visual Odometry (VO): it is commonly viewed as a visual SLAM that forgets the map. It only remembers a local map, a map that vanishes when you walk away. VO needs initialization too.

AlejandroSilvestri on 18 Oct 2018

👍3

All 7 comments

In order for SLAM to localize itself, it needs a map.
Initialization is the process where ORBSLAM2 builds its initial map of the environment.
Now that it has a map, it can continue mapping and localizing itself, adding to this initial map and calculating its pose and trajectory as it moves.

I believe ORBSLAM needs about 100 points/landmarks to initialize.
When shooting my own sample videos, I get the best results when facing an area with many features and slowly walking parallel to it, allowing ORBSLAM to take it all in and make that initial map.

Simply holding the camera or using the first frame often won't work, as it needs some movement so it can see more points and build the map.
Conversely, turning the camera often changes the camera's view too fast for ORBSLAM2 to initialize.

Still, I'm fairly new to SLAM myself, so you may want to double check some of this.

BW25 on 4 Oct 2018

Thank you for your post. It is still not very clear to me: if we can build the initial map from, say, the 1st and 2nd frames, what prevents us from building an additional map using frame 2 and frame 3, and so on? What's the benefit of having a map and tracking at the same time?

ArtlyStyles on 5 Oct 2018

I'm not sure I understand the question, but here goes.
The initialization just makes the starting map. As the camera moves around, it will update the map, adding new information to it. By the end, we end up with a map of everywhere the camera has been and much of what it has seen.
So it does build an additional map as it goes, and adds it to the original one.
We just need the initialization to give us a starting point.

As for your question, you cannot track your own location without a map, because your location is always defined relative to your surroundings.
So the machine looks at its surroundings, and uses them as landmarks to understand how it is moving. At the same time, it updates its map of the surroundings to make it larger and more accurate.

This is the essence of SLAM: both mapping the environment and locating yourself in it are done constantly and simultaneously to get an accurate view of your surroundings and your location in them.

BW25 on 5 Oct 2018

Thank you for your answer and I am sorry for not making myself clear.

Usually, the front end of SLAM is VO, which just uses two frames to reconstruct the relative motion, R and t. During this process, it computes some map points. When a new frame comes in, it again uses just the new frame and the frame before it to compute a new R and t. Then BA is used to do a global optimization involving R, t and the map points. And then loop closure.

It seems ORB-SLAM works this way: from frame 1 and frame 2, it calculates the motion R, t between them and some map points. When frame 3 comes in, it uses the existing map points to calculate the motion from frame 2 to frame 3. And so on. Is my understanding correct?

If yes, I do not see why this approach is better than traditional VO...

ArtlyStyles on 5 Oct 2018

My understanding is that SLAM does use Visual Odometry in a sense, or rather, it uses many of the same techniques.
It does continue to calculate relative motion (R and t) and do bundle adjustment.
The locating of the points and the building of the map is just the means by which it does this.

However, VO only uses the map information it has right now; it doesn't remember the entire map and everywhere it has been. SLAM does.
As a result, VO can't always do things like loop closing, because it may not remember that it has been somewhere before, as the map is not saved long term.
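
A toy illustration of that difference, not how any real system stores its map. It assumes landmarks arrive with known integer IDs (a real system has to recognise places by appearance), and the class names `LocalMapVO` and `PersistentMapSLAM` are invented for the example.

```python
from collections import deque

class LocalMapVO:
    """Keeps only the landmarks seen in the last `window` frames."""
    def __init__(self, window):
        self.recent = deque(maxlen=window)

    def observe(self, landmark_ids):
        known = set().union(*self.recent)     # what is still remembered
        revisited = known & set(landmark_ids)
        self.recent.append(set(landmark_ids)) # oldest frame drops out
        return revisited

class PersistentMapSLAM:
    """Keeps every landmark ever seen."""
    def __init__(self):
        self.landmarks = set()

    def observe(self, landmark_ids):
        revisited = self.landmarks & set(landmark_ids)
        self.landmarks |= set(landmark_ids)
        return revisited

# The camera sees landmarks 1 and 2, walks away, then comes back to them.
frames = [{1, 2}, {3, 4}, {5, 6}, {1, 2}]
vo, slam = LocalMapVO(window=2), PersistentMapSLAM()
for ids in frames:
    vo_seen = vo.observe(ids)
    slam_seen = slam.observe(ids)
print(vo_seen, slam_seen)
```

On the last frame the windowed version reports nothing revisited, while the persistent version recognises landmarks 1 and 2 again, which is the information a loop closure needs.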

But you may find more comprehensive answers here.
https://github.com/raulmur/ORB_SLAM2/issues/256

BW25 on 5 Oct 2018

Hi @ArtlyStyles, when you start to see the environment, the 1st frame alone cannot tell you anything meaningful. You lack the scale, you lack any relative reference, and you cannot estimate depth. With a single frame there is no baseline, so you cannot begin to create your map. You cannot evaluate the relative distance of features without triangulation. There are different methods in the literature. You can also check this paper.
You can imagine yourself trying to stand in the middle of a circle. You can't stand in the middle of it in one try. You can say "yes, this is the center of the circle" after doing small movements back and forth or right and left a few times.
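
A small numeric sketch of both points (no depth without a baseline, and no scale from images alone), under simplifying assumptions: a 2D world, a camera that only translates sideways, and a made-up focal length. The helper names `project_u` and `depth_from_two_views` are hypothetical.

```python
def project_u(point, cam_x, f=500.0, cx=320.0):
    """Horizontal pixel of a point (x, z) seen by a camera placed at
    (cam_x, 0) and looking along +z. f and cx are assumed intrinsics."""
    x, z = point
    return f * (x - cam_x) / z + cx

def depth_from_two_views(u1, u2, baseline, f=500.0):
    """Triangulate depth from the pixel shift between two views."""
    disparity = u1 - u2
    if disparity == 0:
        return None  # no parallax: depth cannot be determined
    return f * baseline / disparity

point = (1.0, 4.0)

# Two views 0.5 apart: the point shifts in the image and depth is recovered.
u1, u2 = project_u(point, 0.0), project_u(point, 0.5)
print(depth_from_two_views(u1, u2, baseline=0.5))

# Same position twice (no baseline): identical pixels, no depth.
print(depth_from_two_views(u1, project_u(point, 0.0), baseline=0.0))

# Scale everything by 3 (scene and camera motion): the pixels do not change,
# so the images alone cannot tell the two worlds apart.
big_point = (3.0, 12.0)
print((u1, u2) == (project_u(big_point, 0.0), project_u(big_point, 1.5)))
```

The depth only comes out because the baseline is passed in as a known number. A monocular system does not know it, so the initial map is fixed only up to an arbitrary scale.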