I plan to pre-train BERT on my own dataset (130M financial news, 3.3 GB, Chinese) for event extraction and sentiment detection. I used Google's SentencePiece model to build my own vocabulary and tokenize the whole dataset. However, it has already been running for two days and still has not generated the corresponding tfrecords.

Are you running it on a GPU?
I just use create_pretraining_data.py with SentencePiece for tokenization, on CPU, to create the tfrecords. It worked for a small dataset...
I mention this in the README, but for large files I would recommend splitting the file into a number of smaller chunks and then calling create_pretraining_data.py once per chunk, because memory use can be very large for large files. So you could generate tf_examples.tfrecord_000, tf_examples.tfrecord_001, ... and then pass in a glob like tf_examples.tfrecord* to run_pretraining.py.
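For anyone wanting to try the chunking approach, here is a minimal sketch in Python. It assumes the input follows the format described in the BERT README (one sentence per line, documents separated by blank lines), so it only splits at document boundaries. The helper names `split_corpus` and `pretraining_commands`, the chunk file names, and the `docs_per_chunk` default are made up for illustration; the command builder only fills in `--input_file`, `--output_file`, and `--vocab_file`, and a SentencePiece-modified copy of the script may need different flags.

```python
import os


def split_corpus(input_path, out_dir, docs_per_chunk=100000):
    """Split a corpus into chunk files without cutting a document in half.

    Assumes one sentence per line and blank lines between documents.
    Returns the list of chunk file paths that were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    out = None
    docs_in_chunk = 0
    in_doc = False  # True while the current line belongs to an open document

    with open(input_path, encoding="utf-8") as src:
        for line in src:
            if line.strip():
                if not in_doc:
                    # A new document starts here; roll over to a new chunk if full.
                    if out is None or docs_in_chunk >= docs_per_chunk:
                        if out is not None:
                            out.close()
                        path = os.path.join(
                            out_dir, "chunk_%03d.txt" % len(chunk_paths))
                        out = open(path, "w", encoding="utf-8")
                        chunk_paths.append(path)
                        docs_in_chunk = 0
                    in_doc = True
                    docs_in_chunk += 1
                out.write(line)
            elif in_doc:
                # First blank line after a document: keep one separator line.
                out.write("\n")
                in_doc = False
    if out is not None:
        out.close()
    return chunk_paths


def pretraining_commands(chunk_paths, vocab_file,
                         out_prefix="tf_examples.tfrecord"):
    """Build one create_pretraining_data.py command line per chunk.

    Only the three flags below are set; everything else is left at the
    script's defaults. The commands are returned, not executed.
    """
    commands = []
    for i, path in enumerate(chunk_paths):
        commands.append([
            "python", "create_pretraining_data.py",
            "--input_file=" + path,
            "--output_file=%s_%03d" % (out_prefix, i),
            "--vocab_file=" + vocab_file,
        ])
    return commands
```

Each returned command can be run with `subprocess.run(cmd, check=True)`, one after another or a few in parallel if memory allows. Because every process only loads one chunk, peak memory is bounded by the chunk size rather than by the full 3.3 GB file.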