Bert: It has taken several days to create pretraining data and it still hasn't finished.

Created on 9 Nov 2018 · 3 Comments · Source: google-research/bert

I planned to pre-train BERT on my own dataset (130M financial news, 3.3GB, Chinese) for event extraction and sentiment detection. I applied the SentencePiece model from Google to build my own vocabulary and tokenize the whole dataset. However, it has already taken two days and still hasn't generated the tfrecords.

All 3 comments

no ilike

On Nov 9, 2018 11:19 AM, Colanim wrote:

Are you running it on GPU?

I just use create_pretraining_data.py, with SentencePiece for tokenization, on CPU to create the tfrecords. It worked for a small dataset...

I mention this in the README, but for large files I would recommend splitting the file into a number of smaller chunks and then calling create_pretraining_data.py for each chunk, because memory use can be very high for large files. So you could generate tf_examples.tfrecord_000, tf_examples.tfrecord_001, ... and then pass a glob like tf_examples.tfrecord* to run_pretraining.py.
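
The chunking suggested above could be sketched roughly as follows. This is a minimal illustration, not code from the BERT repository: `split_corpus` and `build_commands` are hypothetical helper names, the chunk size and file names are arbitrary, and it assumes the input follows the format described in the README (one sentence per line, documents separated by a blank line), so that no document is cut in half. If your copy of create_pretraining_data.py has been modified for SentencePiece, the flags it needs may differ from the ones shown.

```python
import os


def split_corpus(input_path, output_dir, docs_per_chunk=100000,
                 prefix="corpus_chunk_"):
    """Split a large corpus into smaller files on document boundaries.

    Assumes one sentence per line and a blank line between documents.
    Returns the list of chunk file paths that were written.
    """
    os.makedirs(output_dir, exist_ok=True)
    chunk_paths = []
    out = None            # currently open chunk file, if any
    docs_in_chunk = 0     # completed documents in the open chunk
    in_document = False   # True while reading lines of a document

    with open(input_path, encoding="utf-8") as src:
        for line in src:
            if not line.strip():
                # A blank line ends the current document (if there is one).
                if in_document:
                    in_document = False
                    docs_in_chunk += 1
                    out.write("\n")
                    if docs_in_chunk >= docs_per_chunk:
                        out.close()
                        out = None
                        docs_in_chunk = 0
                continue
            if out is None:
                # Start a new chunk lazily, only when there is text to write.
                path = os.path.join(
                    output_dir, "%s%03d.txt" % (prefix, len(chunk_paths)))
                out = open(path, "w", encoding="utf-8")
                chunk_paths.append(path)
            out.write(line if line.endswith("\n") else line + "\n")
            in_document = True

    if out is not None:
        out.close()
    return chunk_paths


def build_commands(chunk_paths, vocab_file,
                   output_prefix="tf_examples.tfrecord_"):
    """Build one create_pretraining_data.py command per chunk.

    Only the three flags below are set; the script's other flags keep
    their defaults unless you append them.
    """
    commands = []
    for i, path in enumerate(chunk_paths):
        commands.append([
            "python", "create_pretraining_data.py",
            "--input_file=" + path,
            "--output_file=%s%03d" % (output_prefix, i),
            "--vocab_file=" + vocab_file,
        ])
    return commands
```

Each command can then be run one at a time, for example with `subprocess.run(cmd, check=True)`, or a few at a time if the machine has enough memory. Once all chunks are done, pass `--input_file=tf_examples.tfrecord*` to run_pretraining.py.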
