Bert: Creating pretraining data has taken several days and still hasn't finished.

Created on 9 Nov 2018 · 3 Comments · Source: google-research/bert

I plan to pre-train BERT on my own dataset (130M financial news, 3.3 GB, Chinese) for event extraction and sentiment detection. I applied Google's SentencePiece model to build my own vocabulary and tokenize the whole dataset. However, create_pretraining_data.py has already been running for two days and still hasn't generated the tfrecords.

yyht

All 3 comments

no ilike

On Nov 9, 2018, 11:19 AM, Colanim wrote:

Are you running it on GPU?

IntelOSt on 9 Nov 2018

I just use create_pretraining_data.py with SentencePiece for tokenization, on CPU, to create the tfrecords. It worked for a small dataset...

yyht on 9 Nov 2018

I mention this in the README, but for large files I would recommend splitting the file into a number of smaller chunks and then calling create_pretraining_data.py for each chunk, because memory use can be very high for large files. So you could generate tf_examples.tfrecord_000, tf_examples.tfrecord_001, ... and then pass in a glob like tf_examples.tfrecord* to run_pretraining.py.

jacobdevlin-google on 9 Nov 2018

👍7
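
For anyone landing here later, the chunking suggested above could be sketched roughly as follows. This is only a minimal illustration, not code from the BERT repo: the helper names `split_into_shards` and `build_commands`, the shard file names, and the `docs_per_shard` default are all made up for the example. It assumes the corpus is in the input format described in the README (one sentence per line, blank lines between documents) and cuts only at blank lines so no document is split across shards. The `--input_file`, `--output_file` and `--vocab_file` flags are the ones create_pretraining_data.py takes; any other flags you need (sequence length, masking settings, a custom tokenizer) would have to be added.

```python
import os


def split_into_shards(input_path, output_dir, docs_per_shard=10000):
    """Split a corpus into shard files, cutting only at blank lines.

    Hypothetical helper: assumes one sentence per line and blank lines
    between documents. Returns the list of shard paths written.
    """
    os.makedirs(output_dir, exist_ok=True)
    shard_paths = []
    out = None            # currently open shard file, if any
    docs_in_shard = 0     # completed documents in the open shard
    in_document = False   # True while inside a run of non-blank lines

    with open(input_path, encoding="utf-8") as src:
        for line in src:
            if line.strip():
                if out is None:
                    # Open the next shard lazily so no empty shard is created.
                    path = os.path.join(
                        output_dir, "shard_%03d.txt" % len(shard_paths))
                    shard_paths.append(path)
                    out = open(path, "w", encoding="utf-8")
                out.write(line)
                in_document = True
            elif in_document:
                # First blank line after text closes the current document.
                out.write("\n")
                in_document = False
                docs_in_shard += 1
                if docs_in_shard >= docs_per_shard:
                    out.close()
                    out = None
                    docs_in_shard = 0
    if out is not None:
        out.close()
    return shard_paths


def build_commands(shard_paths, vocab_file, output_prefix="tf_examples.tfrecord"):
    """Build one create_pretraining_data.py command line per shard.

    The output names follow the tf_examples.tfrecord_000, _001, ... pattern,
    so a glob like tf_examples.tfrecord* matches all of them.
    """
    commands = []
    for i, shard in enumerate(shard_paths):
        commands.append([
            "python", "create_pretraining_data.py",
            "--input_file=" + shard,
            "--output_file=%s_%03d" % (output_prefix, i),
            "--vocab_file=" + vocab_file,
        ])
    return commands
```

Each command list can then be run one at a time (for example with `subprocess.run(cmd, check=True)`) or spread over several CPU cores, since the shards are independent. Smaller `docs_per_shard` values should keep the memory use of each run lower, at the cost of more output files.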
