Models: Training object detection model with multiple training sets

Created on 20 Dec 2017  ·  18 Comments  ·  Source: tensorflow/models

I didn't find anything in the documentation showing that I can assign multiple input paths. Is there any way to train a model with two or more datasets without converting them into one big tfrecord file?

Most helpful comment

You can simply assign a list of file paths by changing the config file

from

train_input_reader: {
  tf_record_input_reader {
    input_path: "PATH_TO_BE_CONFIGURED/train.record"
  }
  label_map_path: "PATH_TO_BE_CONFIGURED/label_map.pbtxt"
}

to

train_input_reader: {
  tf_record_input_reader {
    input_path: ["PATH_TO_BE_CONFIGURED/train_a.record", 
                 "PATH_TO_BE_CONFIGURED/train_b.record"]
  }
  label_map_path: "PATH_TO_BE_CONFIGURED/label_map.pbtxt"
}

This change may only work when the multiple tfrecord files use the same label_map.

All 18 comments

+1, I was planning on building a script for this, but didn't think to ask if it's built-in already.

+1, I want to train a detection model with more than one tfrecord.

_, string_tensor = parallel_reader.parallel_read(
    config.input_path,
    reader_class=tf.TFRecordReader,
    num_epochs=(input_reader_config.num_epochs
                if input_reader_config.num_epochs else None),
    num_readers=input_reader_config.num_readers,
    shuffle=input_reader_config.shuffle,
    dtypes=[tf.string, tf.string],
    capacity=input_reader_config.queue_capacity,
    min_after_dequeue=input_reader_config.min_after_dequeue)

This line of code seems to read only one tfrecord; maybe adding a loop to read multiple tfrecords would help.

@FightForCS I know there is a way to read a series of tfrecords; you can take a look at the examples in the slim folder. It is possible to load a list of file paths by using slim.dataset.Dataset, but you may need to rewrite the script you are using.
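
As an aside, here is a minimal sketch of the same idea using tf.data rather than slim. This is not what the Object Detection API's train script does internally; the file names are placeholders and the feature key is only an assumption about what your records contain.

import tensorflow as tf

# Placeholder file names; any list of TFRecord paths works here.
train_files = ["my_dataset/train_a.record", "my_dataset/train_b.record"]

# TFRecordDataset accepts a list of filenames, so there is no need to
# merge everything into one big file; records from all files are read
# as a single stream.
dataset = tf.data.TFRecordDataset(train_files)

# Parse each serialized tf.train.Example. "image/encoded" follows the
# Object Detection API's record format; adjust to your own records.
features = {"image/encoded": tf.io.FixedLenFeature([], tf.string)}
dataset = dataset.map(lambda x: tf.io.parse_single_example(x, features))
dataset = dataset.shuffle(1000).batch(8)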

You can simply assign a list of file paths by changing the config file

from

train_input_reader: {
  tf_record_input_reader {
    input_path: "PATH_TO_BE_CONFIGURED/train.record"
  }
  label_map_path: "PATH_TO_BE_CONFIGURED/label_map.pbtxt"
}

to

train_input_reader: {
  tf_record_input_reader {
    input_path: ["PATH_TO_BE_CONFIGURED/train_a.record", 
                 "PATH_TO_BE_CONFIGURED/train_b.record"]
  }
  label_map_path: "PATH_TO_BE_CONFIGURED/label_map.pbtxt"
}

This change may only work when the multiple tfrecord files use the same label_map.

This issue is closed since I found the answer in the code.

The object detection API uses a parallel reader to import your dataset; here are the comments from the developer:

Usage:
      data_sources = ['path_to/train*']
      key, value = parallel_read(data_sources, tf.CSVReader, num_readers=4)

  Args:
    data_sources: a list/tuple of files or the location of the data, i.e.
      /path/to/train@128, /path/to/train* or /tmp/.../train*

So basically you can define a list of input paths as @byungjae89 mentioned above, or simply provide a pattern for the input directory like

input_path: "my_dataset/train/*"

The reader will read the entire folder for you.

Hello @izzrak and @byungjae89 ,

Thank you for sharing the approach to train with multiple tfrecords.

However,

Do you confirm that the model really trains with those tfrecords, not just the first one?

I followed @byungjae89's approach and added three tfrecords in the config file, intentionally putting incorrect names for the 2nd and 3rd tfrecords (I used the correct name for the 1st tfrecord).

The whole training (200k iterations) completes without any problem.

But an error pops up immediately if I put an incorrect name for the 1st tfrecord.

It seems that only the 1st tfrecord is used in training.

I am working on visualizing the training input image to confirm my suspicion.

Please let me know if you have had a similar experience.

Thank you.
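
If it helps, one rough way to sanity-check the configured tfrecords (and catch the kind of silent typo discussed below) is to count the records in each file. The paths here are placeholders, and this only verifies that the files exist and are non-empty, not that the training loop actually consumes them:

import tensorflow as tf

# Placeholder paths; use the input_path entries from your config here.
record_files = ["train_a.record", "train_b.record", "train_c.record"]

for path in record_files:
    # tf_record_iterator yields the raw serialized records of one file.
    count = sum(1 for _ in tf.python_io.tf_record_iterator(path))
    print("%s: %d records" % (path, count))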

If using protoc 3.5, specify multiple tfrecord files in the following format:

train_input_reader: {
  tf_record_input_reader {
    input_path: "/path/to/tfrecord.1"
    input_path: "/path/to/tfrecord.2"
  }
  label_map_path: "/path/to/label_map.pbtxt"
}

@willSapgreen, "surprisingly", those input_path are not exact paths, they are PATTERNS. Having scrutinized the tf.gfile.Glob function at https://github.com/tensorflow/models/blob/master/research/object_detection/builders/dataset_builder.py#L61 revealed that you will get NotFound error only if you put something incorrect in the folder's part of the path. Something incorrect in the file's part of the path will be accepted and an empty value will be returned. I.e.:
incorrect_folder/correct_file -> NotFound
correct_folder/incorrect_file -> OK
Of course if you have none of right input_path then you will anyway end up with some other error down the road.
Howhever, if you have at least one right input_path, chances are some mistypes in file's parts of others will be silently skipped!
Be careful!
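
To see what a given pattern actually expands to, a quick check along the lines of the behaviour described above (paths are placeholders for illustration) could look like:

import tensorflow as tf

# Placeholder patterns illustrating the two failure modes described above.
patterns = [
    "correct_folder/incorrect_file.record",    # expected: [] (silently empty)
    "incorrect_folder/correct_file.record",    # expected: NotFoundError
]

for pattern in patterns:
    try:
        print(pattern, "->", tf.gfile.Glob(pattern))
    except tf.errors.NotFoundError as e:
        print(pattern, "-> NotFoundError:", e)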

Hi @izzrak,

Could you specify where you found this documentation please?

Thanks,
Kevin


When filling out the tf_record_input_reader parameter, can you specify a directory plus a file pattern? Filling in each sub-file one by one feels too time consuming.

@CasiaFan @izzrak

If you want to read all the record files under a directory, and these files end with the suffix .tfrecord, the input configuration can be written like:

tf_record_input_reader {
  input_path: "/path/to/*.tfrecord"
}

@gzchenjiajun

Generating more tfrecords and configuring the loading of multiple tfrecords has been solved, but I found that it does not help with memory (still OOM; I thought that more tfrecords could reduce memory usage so that I could increase the batch size parameter). How should I handle this?

@CasiaFan @kevin-apl @failure-to-thrive @willSapgreen

OOM is mainly caused by a large input batch, rather than the number of tfrecords. The tfrecords only provide the data source for training and evaluating. If you don't want to reduce the batch size, add more GPU cards or try a smaller input size.
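
For reference, both of those knobs live in the pipeline config rather than in the input reader. A sketch with placeholder values; the resizer block below assumes an SSD-style model (Faster R-CNN configs use keep_aspect_ratio_resizer instead):

model {
  ssd {
    image_resizer {
      fixed_shape_resizer {
        height: 300   # a smaller input size reduces memory usage
        width: 300
      }
    }
    ...
  }
}

train_config: {
  batch_size: 8       # reduce this first if you hit OOM
  ...
}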



Does that mean that once the hardware reaches its limit, the batch size can no longer be increased, whether there is one tfrecord or multiple tfrecords?
Or is there a way for multiple tfrecords to achieve a larger batch size?
For example, does loading tfrecords in batches not help with memory? Is this officially supported by tensorflow/tensorflow? I have not found much information.

@CasiaFan


Yep, the tf.data API operates on the input data stream from the tfrecord files. Even if multiple tfrecords are provided, the input batch size is set during reading based on the hardware and model configuration, and has nothing to do with the number of tfrecords. As for the reason to split datasets into shards, I think this post may help you. @gzchenjiajun
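
To illustrate that point, a rough tf.data sketch (the file pattern, cycle_length and batch size are placeholders): the number of shard files only affects how the input stream is assembled, while the batch size is a separate knob chosen at reading time.

import tensorflow as tf

# Placeholder shard pattern; it may match 1 file or 100 files.
files = tf.data.Dataset.list_files("my_dataset/train-*.record")

# interleave pulls records from several shard files at once, but memory
# per training step is driven by batch(), not by the shard count.
dataset = (files
           .interleave(tf.data.TFRecordDataset, cycle_length=4)
           .shuffle(1000)
           .batch(8))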

OK, that means splitting into multiple tfrecords to increase batch_size doesn't help.
Then I have two more questions:

  1. Apart from directly upgrading the hardware, or something like half-precision inference, how can I increase batch_size?
  2. If splitting the tfrecords does not help, can the tfrecords instead be read in batches (read one, finish training on it, then discard it and start the next one)? Would that help with memory?

@CasiaFan

Try grouped convolution. @gzchenjiajun
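
Grouped convolution is a model change rather than a config switch in the TF1 Object Detection API, so purely as an illustration of the idea: in recent tf.keras versions, Conv2D exposes a groups argument that splits the channels into independent groups and cuts the weight count roughly by the number of groups.

import tensorflow as tf  # the groups argument needs a recent TF2/Keras

regular = tf.keras.layers.Conv2D(filters=64, kernel_size=3, padding="same")
grouped = tf.keras.layers.Conv2D(filters=64, kernel_size=3, padding="same",
                                 groups=4)

# Build both layers on the same 64-channel input, then compare sizes.
x = tf.random.normal([1, 32, 32, 64])
_ = regular(x), grouped(x)
print(regular.count_params())  # 36928 = 3*3*64*64 + 64 biases
print(grouped.count_params())  # 9280  = 4 * (3*3*16*16) + 64 biases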

