Models: struct2depth training data prep request for documentation

Created on 8 Feb 2019 · 44 comments · Source: tensorflow/models

Describe the problem

I did not find enough documentation on preparing the data for training. I am trying to replicate the results of struct2depth on the KITTI or Cityscapes datasets.
However, I do not know exactly how to generate the data in the correct format.
Mainly, I would like to know how to do the following:

  1. How to generate the train.txt and valid.txt files for the dataset (currently I am using the script in the vid2depth subdirectory to do that).
  2. How to generate the segmentation results, and what naming convention is used? (Is it using a Mask R-CNN model? Should the files be named -fseg.png?)
  3. It is mentioned that "It is assumed that motion masks are already generated and stored as images". Can you explain how to do this? What is the naming convention, and how are these generated?

Thank you for sharing results of your work.
This is a really impressive paper and your response is appreciated.
@aneliaangelova @VincentCa

System information

  • Top-level directory of the model: models/research/struct2depth
  • Have I written custom code: No (stock code)
  • OS Platform and Distribution: CentOS 7
  • TensorFlow installed from: binary (pip)
  • TensorFlow version: v1.11.0-0-gc19e29306c 1.11.0
  • CUDA/cuDNN version: CUDA 9 / cuDNN 7
  • GPU model and memory: V100, 32 GB
  • Exact command to reproduce: python gen_data_kitti.py

Most helpful comment

Hi, you can't pass the masks in that format - note that your current input is full RGB with half-transparent segmentation overlays and thus can't be parsed correctly. The input to the script needs to be a simplified mask, where background is entirely black (0, 0, 0) and every different object in the image has a different shade of grey that is consistent across all channels - e.g. car1 has (255, 255, 255), car2 (254, 254, 254) and pedestrian1 (253, 253, 253).

Please also refer to other github issues covering this. Hope this helps!

All 44 comments

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
What is the top-level directory of the model you are using
Have I written custom code
OS Platform and Distribution
TensorFlow installed from
TensorFlow version
Bazel version
CUDA/cuDNN version
GPU model and memory
Exact command to reproduce

Have I written custom code: No
OS Platform and Distribution: Centos 7
TensorFlow installed from: Binary
TensorFlow version: v1.11.0-0-gc19e29306c 1.11.0
Bazel version: N/A
CUDA/cuDNN version: 9.0/7.0
GPU model and memory: Volta V100 32GB
Exact command to reproduce: python gen_data_kitti.py or python gen_data_city.py

Hi,

  1. Please refer to the method compile_file_list in reader.py to understand the structure of the input text files listing individual frame triplets. This function is equivalent to vid2depth's compile_file_list in reader.py. Basically, each line contains a folder path and then a filename (without extension; the extension is passed as a flag), separated by a single space. For every input image you can have accompanying supplemental files, like this:

Input file: 0000.png
Related aligned segmentation map: 0000-fseg.png (only needed if you train a motion model)
Related camera intrinsics: 0000_cam.txt

The related line in your input text file would look like "some/sub/folder 0000", and your file_extension flag would be "png" in order to read it properly. (A rough sketch of generating such a list file follows this comment.)

  2. We used a pre-trained Mask-RCNN (on a different dataset). Note that running it frame by frame gives you instance-level labels for each frame, but they are not temporally consistent, i.e. the same object will almost never have the same instance ID assigned across frames. Use alignment.py to align them, or (preferably) work on a nicer method to make the instance IDs temporally consistent (a rough illustration of the idea follows this list). We call the Mask-RCNN raw output X-seg.png, and the aligned ones X-fseg.png. You need to use the latter for the model; it definitely expects aligned labels.

  3. Refer also to 1) and 2); you need to run an instance segmentation model to obtain masks and then perform the alignment. Make sure to save the masks in a lossless image format to avoid compression artifacts compromising the label IDs.
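For illustration only, here is a rough sketch of one way to make instance IDs temporally consistent between two neighboring frames via greedy IoU matching. The function name and threshold are made up for this example, and this is not the logic of the repository's alignment.py:

import numpy as np

def align_ids(prev_mask, cur_mask, iou_threshold=0.3):
    """Relabel instance IDs in cur_mask so they match overlapping IDs in prev_mask.

    Both inputs are 2-D integer arrays where 0 is background and every
    positive value marks one object instance.
    """
    aligned = np.zeros_like(cur_mask)
    next_free = int(prev_mask.max()) + 1
    for cur_id in np.unique(cur_mask):
        if cur_id == 0:
            continue
        cur_region = cur_mask == cur_id
        best_iou, best_id = 0.0, None
        for prev_id in np.unique(prev_mask):
            if prev_id == 0:
                continue
            prev_region = prev_mask == prev_id
            inter = np.logical_and(cur_region, prev_region).sum()
            union = np.logical_or(cur_region, prev_region).sum()
            iou = inter / float(union) if union else 0.0
            if iou > best_iou:
                best_iou, best_id = iou, prev_id
        if best_iou >= iou_threshold:
            aligned[cur_region] = best_id    # reuse the ID of the matched object
        else:
            aligned[cur_region] = next_free  # unmatched object gets a fresh ID
            next_free += 1
    return aligned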

Hope this helps!
Vincent
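As a supplement to point 1 above, here is a minimal sketch of how such a list file could be generated. The root folder kitti_processed, the per-sequence layout, and the output name train.txt are assumptions for this example, not part of the repo:

import os

data_dir = "kitti_processed"   # assumed root of the processed data
file_extension = "png"         # must match the --file_extension flag

lines = []
for seq in sorted(os.listdir(data_dir)):
    seq_dir = os.path.join(data_dir, seq)
    if not os.path.isdir(seq_dir):
        continue
    for fname in sorted(os.listdir(seq_dir)):
        # Skip segmentation masks and anything that is not an image.
        if not fname.endswith("." + file_extension) or "-fseg" in fname:
            continue
        frame_id = fname[:-len("." + file_extension)]
        # Each line is "<folder> <frame_id>", separated by a single space.
        lines.append("%s %s" % (seq, frame_id))

with open(os.path.join(data_dir, "train.txt"), "w") as f:
    f.write("\n".join(lines) + "\n")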

Hi,

@VincentCa

  2. We used a pre-trained Mask-RCNN (on a different dataset)
    Which dataset? Which Mask-RCNN (backbone)?

Is it possible somehow to reproduce quantitative results from the paper using the released pretrained models? As I understand, it is necessary to use instance segmentation masks in this case.

Thanks!

@VincentCa Thank you for your response. It was very helpful.

It seems that regardless of whether the "handle_motion" flag is set to true or false, the script still looks for the segmentation masks. So I will put in a workaround to avoid that for now.

For generating the motion masks, are there any requirements on the color codes? Do we need to encode the colors in a specific way, or can we just use an RGB code for every semantic label?

@ramanishka For inference, instance segmentation masks are not needed if you are only interested in depth and/or odometry prediction. They are only needed if you want to specifically look at object motion prediction, in order to feed the input properly to the object motion estimator network.

@saeed68gm Yes, feel free to just work around that issue. There is no specific requirement on the color codes you use. Just make sure you save the masks in a lossless format (PNG), as I described earlier. Simply assign the same value across all channels. 0 always stands for background, so you can use 1-255 for different object instances across all channels.
As I also mentioned, alignment is crucial. For each triplet you provide to the network, it expects to find every instance ID present in each subframe also in every neighboring one, and it further expects these to actually correspond to the same object. If within the same triplet an instance ID appears in mask t but not in mask t+1, this would be a problem. However, given that segmentation models are not perfect, don't worry if there are some triplets where masks are missing for an object throughout.
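To illustrate the encoding described above, here is a small sketch that turns a list of per-instance boolean masks (the typical per-object output of a Mask-RCNN implementation) into a single -fseg image. The helper name save_fseg and the use of Pillow are assumptions for this example:

import numpy as np
from PIL import Image

def save_fseg(instance_masks, out_path):
    # instance_masks: list of HxW boolean arrays, one per detected object
    # (at most 255 objects, since IDs are stored as uint8).
    h, w = instance_masks[0].shape
    label = np.zeros((h, w), dtype=np.uint8)   # 0 = background
    for i, mask in enumerate(instance_masks, start=1):
        label[mask] = i                        # instance IDs 1..255
    rgb = np.stack([label] * 3, axis=-1)       # same value in every channel
    Image.fromarray(rgb).save(out_path)        # save as PNG to keep it lossless

# e.g. save_fseg(masks_for_frame, "0000-fseg.png")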

@VincentCa Can you share the standard training parameters that can be used to replicate the results from the paper by training the model from scratch? Thanks for your help. I have been able to successfully run the training with the object motion masks, and I am now interested in replicating the paper's results and then trying my own ideas.

@amanrajdce Please refer to the paper to find the parameter settings we used. If you think some info is missing just let me know.
@ferdyandannes You can simply disable using the joint encoder. If you pass --joint_encoder false as a flag, this error shouldn't occur.

@VincentCa

  1. Can you please provide the evaluation script that is used in your repo?
  2. For the online refinement process, how do you generate the triplets? Can you share the script for this?
  3. My understanding is that the checkpoint model provided with the repo on the project website is the final model after online-refinement learning, right? Could you please confirm this?

Thank you very much.

The evaluation script is from the SfMLearner work of T. Zhou et al.:
https://github.com/tinghuiz/SfMLearner

@amanrajdce Please refer to the paper to find the parameter settings we used. If you think some info is missing just let me know.

I am interested in the number of iterations for model training; I can't find this information in the paper.

@VincentCa
Can you give some info about:
triplet_list_file="$data_dir/test_files_eigen_triplets.txt" triplet_list_file_remains="$data_dir/test_files_eigen_triplets_remains.txt"
in the online-refinement stage? It looks like you use test_files_eigen.txt and generate triplets from it. Is my understanding correct? Also, if you have a script for this, could you share it if possible?

@amanrajdce Have you managed to train this?

@VincentCa
How can Mask-RCNN segment dynamic objects? It can only produce instance masks, and an instance may not be a dynamic object. Have you considered this issue?

Yes, definitely. We use all masks (of moving and non-moving objects) and estimate (meaning learn to estimate) each individual object's motion. If an object is not moving, the estimated motion will be 0, which is exactly as desired.
During inference no object masks are needed, as you can directly run the depth network only.

@amanrajdce Have you managed to train this?

Yes, I was able to train this model as well as run prediction on a set of images. However, I wasn't able to run the online refinement stage, due to some missing information in the paper on how to do it. I have asked the authors the same question on this thread, but I haven't gotten any response yet.

@VincentCa

  1. Can you please provide the evaluation script that is used in your repo?
  2. For the online refinement process, how do you generate the triplets? Can you share the script for this?
  3. My understanding is that the checkpoint model provided with the repo on the project website is the final model after online-refinement learning, right? Could you please confirm this?

Thank you very much.

@aneliaangelova Could you please provide some response for 2. and 3.? I understand it is not possible to share the code/script, but if you can at least describe the contents of these triplets that are supposed to be there, that would be great.

Hi,

  1. You can refer to the evaluation script of SfMLearner (https://github.com/tinghuiz/SfMLearner) for inspiration on how to evaluate on KITTI. Make sure to replicate normalization steps, crops etc. as described in the paper. The vast majority of related work uses the exact same evaluation procedure and parameters (e.g. cut-off at 50/80m, mean alignment), so their implementation might be helpful, too.
  2. See reader.py for reference; triplets can be stored in ordinary image files by stitching seq_length many frames together horizontally. They will be split up and stacked in the channel axis; you can find this in unpack_images() (a rough sketch of this layout follows this list). If you are trying your own dataset, it can certainly be helpful to try different degrees of temporal subsampling. Ideally it matches the subsampling you applied during training, of course. If, however, the movements are slower or faster during inference, it might be a good idea to adjust, or to implement an adaptive frame rate, because if no movement (or only very subtle movement) is present in a triplet, (almost) no training signal will be provided to the network.
  3. No, it is important to note that online refinement is applied during inference only. We do not ever save checkpoints after running refinement, as the goal is only to allow online adaptation to produce higher-quality inference results, and not to improve the network weights persistently.
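To make the stitching in point 2 concrete, here is a small sketch (using Pillow/numpy rather than the repo's unpack_images()) of how a horizontally stitched triplet file breaks down into its individual frames:

import numpy as np
from PIL import Image

SEQ_LENGTH = 3   # struct2depth uses frame triplets

def split_triplet(path, seq_length=SEQ_LENGTH):
    # The stitched image is H x (seq_length * W) x 3; slice it into frames.
    stitched = np.array(Image.open(path))
    width = stitched.shape[1] // seq_length
    return [stitched[:, i * width:(i + 1) * width] for i in range(seq_length)]

# frames = split_triplet("some/sub/folder/0000.png")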

@saeed68gm I am also trying to create a workaround for the "handle_motion" flag not working. Were you able to find a good workaround you could share?

@alexbarnett12 It was not a very clean workaround. I just went into the code and commented out everywhere it was using masks.

@saeed68gm @alexbarnett12 You can look at my fork of this repository. I have made the necessary changes here: https://github.com/amanrajdce/struct2depth

@amanrajdce I can't find the page you're refering to; the link is not working. I tried to find the repo in your git but it seems that it doesn't exist anymore!

@KawtarM I believe the link has moved to here: https://github.com/amanrajdce/struct2depth_pub

Sorry! Please refer to the link that @tlalexander posted above. If you find something missing, let me know.

Thanks guys!

Sorry! Please refer to the link that @tlalexander posted above. If you find something missing, let me know.

It seems that regardless of whether the "handle_motion" flag is set to true or false, the script still looks for the segmentation masks. Have you solved this issue? It seems that your code still has this issue.

No, we haven't, but it is easy to fix: in that case a mask which is all 0s can be reused for all images.
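If you go that route, here is a minimal sketch of generating such all-zero dummy masks; the kitti_processed/*/*.png pattern is an assumption about your directory layout:

import glob
import numpy as np
from PIL import Image

# Write an all-background (all-zero) -fseg mask next to every training image
# so the reader finds a mask file even when motion handling is disabled.
for img_path in glob.glob("kitti_processed/*/*.png"):
    if img_path.endswith("-fseg.png"):
        continue
    img = Image.open(img_path)
    zeros = np.zeros((img.height, img.width, 3), dtype=np.uint8)
    Image.fromarray(zeros).save(img_path[:-len(".png")] + "-fseg.png")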

@VincentCa @amanrajdce
Hi,
I'm doing the data preprocessing, and I ran the script "alignment.py"; the input (xx-seg.png) and output (xx-fseg.png) are as follows:
https://github.com/nowburn/Show
Is the result right? Or is the script maybe wrong? Can you show me a correct output sample?
Thanks!

Hi, you can't pass the masks in that format - note that your current input is full RGB with half-transparent segmentation overlays and thus can't be parsed correctly. The input to the script needs to be a simplified mask, where background is entirely black (0, 0, 0) and every different object in the image has a different shade of grey that is consistent across all channels - e.g. car1 has (255, 255, 255), car2 (254, 254, 254) and pedestrian1 (253, 253, 253).

Please also refer to other github issues covering this. Hope this helps!
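As a quick way to check whether a mask file already follows this simplified format, here is a rough sketch (the function name is made up for this example):

import numpy as np
from PIL import Image

def looks_like_valid_fseg(path):
    # All three channels must carry the same values, and 0 (background) must appear.
    arr = np.array(Image.open(path).convert("RGB"))
    channels_equal = (np.array_equal(arr[..., 0], arr[..., 1])
                      and np.array_equal(arr[..., 1], arr[..., 2]))
    has_background = (arr[..., 0] == 0).any()
    return channels_equal and has_background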

It seems that regardless of whether the "handle_motion" flag is set to true or false, the script still looks for the segmentation masks. Have you solved this issue? It seems that your code still has this issue.

No, we haven't, but it is easy to fix: in that case a mask which is all 0s can be reused for all images.

I'm not sure whether the segmentation masks are really used in the training process (when _handle_motion_ is _False_), or whether they are just loaded in _reader.py_ and not used at all (in _reader.py_, the parameter _handle_motion_ is not used, unlike in _train.py_). Because, if they are not used, there is a simple solution (not the best, but the simplest).

Here is the part of the code where the process of loading the masks starts. I've managed to get around this problem by simply replacing _-fseg._ with _._, so that instead of the masks, the images themselves are loaded again. The training process started successfully, but I'm not sure whether it is actually a good solution.

@amanrajdce Hi, you have mentioned that you have been able to successfully run the training with the object motion masks. So have you been able to reproduce results similar to what the authors report in the paper?
For me, I am able to run the training with the object motion masks but not able to reproduce similar results (my abs rel is 0.1587, quite far from the authors' 0.1412).

@amanrajdce Hi, you have mentioned that you have been able to successfully run the training with the object motion masks. So have you been able to reproduce results similar to what the authors report in the paper?
For me, I am able to run the training with the object motion masks but not able to reproduce similar results (my abs rel is 0.1587, quite far from the authors' 0.1412).

You can find the numbers on page 5 here: https://github.com/amanrajdce/CSE-291D-Final-Project/blob/master/CSE_291D_Final_Project.pdf

@VincentCa
I'm sorry to disturb you again, but this problem really confuses me a lot.
(1) My current problem is that a 'Tensor had NaN values' error happens during training. I referred to issue #6392, but none of the suggestions there work for me.

(2) My processed seg images are as follows (every moving object is masked as (1, 1, 1), (2, 2, 2), etc., just like you described before).
Specifically, I use Mask-RCNN to generate xx-seg.png, then use alignment.py to align them (for every 3 xx-seg.png images), and the final xx-fseg.png looks like this:
[example -fseg.png image]
Here is the Mask-RCNN code segment for saving the masked images:

init_color = (1, 1, 1)

for i in range(N):
    color = init_color
    init_color = tuple(x + 1 for x in init_color)
    # `color` is assigned to instance i when drawing the mask (rest of the loop omitted)

(3) I can run the training by ignoring the 'motion constraint loss' at model.py line 336, but the trained model can't predict the depth of moving objects.
model.py, line 336:

# losses = tf.map_fn(
#     get_losses, object_masks, dtype=tf.float32)
# self.inf_loss += tf.reduce_mean(losses)

So how can I solve this? Is the xx-fseg.png correct?
Thank you again!

Hi, I have the same issue as the comment above. I tried all the suggestions from issue #6392, but I still get the "LossTensor is inf or nan: Tensor had NaN values" error. Can someone please help? Thanks.

I have now tried many things, like inputting black images, which worked! The next step was to generate images with square objects, which also worked. But if I change the ID of just one square in the sequence, or if I make the height of a square 1 pixel, it fails. So what exactly are the requirements for the labels?

  • No object with a pixel height of 1?
  • All object IDs need to exist in each image of a sequence?
  • What else?

Hope someone can help.
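For reference, building on the alignment requirement described earlier in this thread (every instance ID should appear in every subframe of a triplet), here is a rough sketch of such a check; the function name is hypothetical:

import numpy as np
from PIL import Image

def ids_consistent(mask_paths):
    # mask_paths: the three -fseg mask files of one triplet.
    id_sets = []
    for p in mask_paths:
        arr = np.array(Image.open(p).convert("L"))
        id_sets.append(set(np.unique(arr)) - {0})   # drop background
    return id_sets[0] == id_sets[1] == id_sets[2]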

Hi, you can't pass the masks in that format - note that your current input is full RGB with half-transparent segmentation overlays and thus can't be parsed correctly. The input to the script needs to be a simplified mask, where background is entirely black (0, 0, 0) and every different object in the image has a different shade of grey that is consistent across all channels - e.g. car1 has (255, 255, 255), car2 (254, 254, 254) and pedestrian1 (253, 253, 253).

Please also refer to other github issues covering this. Hope this helps!

I have generated the segmentation masks as below. Is this the correct format for the input? Could someone please guide me on this?

[Screenshot of the generated segmentation mask]

Looks OK; it's hard to say from the image whether the IDs are reasonable.

You can run the code and follow along within it to check whether they are OK.

Thanks for the quick response! I am supposed to stack three of these images in sequence as the input, right?

Yes that's right - the masks should follow the same format as the raw images.
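For illustration, a minimal sketch of stitching three frames (or three masks) side by side into one input image; the file names are made up:

import numpy as np
from PIL import Image

def stitch_triplet(paths, out_path):
    # All frames must have the same height; they are concatenated along the width axis.
    frames = [np.array(Image.open(p)) for p in paths]
    Image.fromarray(np.concatenate(frames, axis=1)).save(out_path)

# stitch_triplet(["0000.png", "0001.png", "0002.png"], "0001-stitched.png")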

Hello,
Great work, @VincentCa, @aneliaangelova and team!
I am trying to run inference for object motion and depth using the following command:
python3 inference.py --depth --egomotion true --use_masks=true --input_dir ./intest2withseg/ --output_dir ./out/ --batch_size=1 --model_ckpt ./model-35620
(intest2withseg contains images like the following:)
[example input image]
Corresponding mask:
[example mask image]
I get this error while running the above command:

ValueError: operands could not be broadcast together with shapes (1,128,416,9) (3,128,416,27)

I tried just giving a single color image (not stacked) instead of the first image shown, together with the corresponding mask (same as the one shown above), but I still get the same error.
Can someone let me know what is going wrong?
Also, is there anything else to change, other than adding the motion masks, to get motion inference compared to depth-only inference?

Any suggestion is greatly appreciated!
Thanks

Hi,
Are you still having trouble with this? The error basically says that you are feeding an image tensor of the wrong shape.

Thanks for the reply!
Yes, there was a problem with the input, and it is now solved.
Thank you

Hi,

  1. You can refer to the evaluation script of SfMLearner (https://github.com/tinghuiz/SfMLearner) for inspiration on how to evaluate on KITTI. Make sure to replicate normalization steps, crops etc. as described in the paper. The vast majority of related work uses the exact same evaluation procedure and parameters (e.g. cut-off at 50/80m, mean alignment), so their implementation might be helpful, too.
  2. See reader.py for reference; triplets can be stored in ordinary image files by stitching seq_length many frames together horizontally. They will be split up and stacked in the channel axis; you can find this in unpack_images(). If you are trying your own dataset, it can certainly be helpful to try different degrees of temporal subsampling. Ideally it matches the subsampling you applied during training, of course. If, however, the movements are slower or faster during inference, it might be a good idea to adjust, or to implement an adaptive frame rate, because if no movement (or only very subtle movement) is present in a triplet, (almost) no training signal will be provided to the network.
  3. No, it is important to note that online refinement is applied during inference only. We do not ever save checkpoints after running refinement, as the goal is only to allow online adaptation to produce higher-quality inference results, and not to improve the network weights persistently.

Thank you for your excellent work @VincentCa @aneliaangelova. I wonder which dataset you used to train the segmentation network? I plan to conduct depth estimation using instance segmentation results, but I'm not sure which trained Mask-RCNN model to choose for this task. Could you please provide a link to the pretrained Mask-RCNN model you use in struct2depth? Thanks a lot!

The dataset Mask-RCNN was previously trained on was MS-COCO, so it can be any model that works reasonably well for common objects.
We have a follow-up work which does not even need good instance segmentations, only a crude box over them:
http://openaccess.thecvf.com/content_ICCV_2019/papers/Gordon_Depth_From_Videos_in_the_Wild_Unsupervised_Monocular_Depth_Learning_ICCV_2019_paper.pdf
Code is also open sourced here:
https://github.com/google-research/google-research/tree/master/depth_from_video_in_the_wild
