Bert: How to handle create_pretraining_data.py for extremely large input data?

Created on 14 Nov 2018 · 1 Comment · Source: google-research/bert

I ran create_pretraining_data.py on a large input file built from the entire text of an enwiki dump, inside a Docker container limited to 70G of memory.

However, the process seems to have run out of memory: it was killed by the OOM killer when it hit the 70G limit.
(There was no error message; the process was simply killed.)

How should I handle this?

P.S.
I checked that the "*** Writing to output files ***" message was printed, so the problem may be in the write_instance_to_example_files function.

sugiyamath

Most helpful comment

You should shard the input data (text.txt_00000, text.txt_00001), run the script once per shard so that each run writes its own output file (tf_examples.tfrecord_00000, tf_examples.tfrecord_00001), and then pass a file glob (e.g., tf_examples.tfrecord*) to run_pretraining.py.
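
The sharding step could be sketched as follows. This is only an illustration, not code from the BERT repository: the function name shard_by_documents and the docs_per_shard parameter are made up for this example. It assumes the input follows the format the BERT README describes (one sentence per line, documents separated by blank lines) and splits only at blank lines, so no document is cut across two shards. The file is streamed line by line, so the sharding itself should need very little memory.

```python
def shard_by_documents(input_path, output_prefix, docs_per_shard):
    """Split a blank-line-delimited text file into numbered shards.

    Shards are named <output_prefix>_00000, <output_prefix>_00001, ...
    (hypothetical helper; the naming mirrors the answer above).
    Returns the list of shard paths that were written.
    """
    shard_paths = []
    out = None            # currently open shard, or None
    docs_in_shard = 0     # completed documents in the open shard
    in_document = False   # True while inside a non-empty document

    with open(input_path, encoding="utf-8") as src:
        for line in src:
            if not line.strip():
                # A blank line closes the current document, if any.
                if in_document:
                    out.write("\n")
                    docs_in_shard += 1
                    in_document = False
                    if docs_in_shard >= docs_per_shard:
                        out.close()
                        out = None
                        docs_in_shard = 0
                continue

            if out is None:
                # Open the next shard lazily, so no empty shard is created.
                path = "%s_%05d" % (output_prefix, len(shard_paths))
                out = open(path, "w", encoding="utf-8")
                shard_paths.append(path)

            out.write(line if line.endswith("\n") else line + "\n")
            in_document = True

    if out is not None:
        out.close()
    return shard_paths
```

For example, shard_by_documents("text.txt", "text.txt", 100000) would write text.txt_00000, text.txt_00001, and so on; the value 100000 is an arbitrary placeholder, to be tuned until a single run fits in the available memory. Each shard is then given to a separate create_pretraining_data.py run through --input_file, with a matching --output_file such as tf_examples.tfrecord_00000, and the resulting glob is passed to run_pretraining.py as its --input_file.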