Bert: Need clarification for pre-training

Created on 1 Nov 2018 · 5 Comments · Source: google-research/bert

In the README.md, it says for the pre-training:

It is important that these be actual sentences 
for the "next sentence prediction" task

and each line of the example sample_text.txt does end with either . or ;.

Whereas in the BERT paper, it says

... we sample two spans of text from the corpus, which we refer to as "sentences" 
even though they are typically much longer than single sentences 
(but can be shorter also)

So it is unclear whether this implementation expects one actual sentence per line, or whether documents can simply be broken into lines arbitrarily.

winston-zillow

Most helpful comment

The thing that the paper refers to happens inside create_pretraining_data.py. See here
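
To illustrate the point, here is a rough sketch of how a data-creation script can pack sentence-per-line input into the longer "sentence" spans the paper describes. It is a simplified illustration, not the repository's code: the function name `make_sentence_pairs`, its parameters, and the handling of leftover sentences are assumptions made for this example.

```python
import random


def make_sentence_pairs(documents, max_tokens=128, seed=0):
    """Pack sentences into (tokens_a, tokens_b, is_random_next) triples.

    documents: list of documents; each document is a list of sentences;
    each sentence is a list of tokens. Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    budget = max_tokens - 3  # leave room for [CLS], [SEP], [SEP]
    pairs = []
    for doc_index, doc in enumerate(documents):
        other_docs = [d for j, d in enumerate(documents) if j != doc_index and d]
        chunk, chunk_len = [], 0
        for i, sentence in enumerate(doc):
            chunk.append(sentence)
            chunk_len += len(sentence)
            # Keep accumulating whole sentences until the budget is reached
            # or the document ends.
            if chunk_len < budget and i < len(doc) - 1:
                continue
            # Span A is a random-length prefix of the accumulated sentences.
            a_end = rng.randint(1, len(chunk) - 1) if len(chunk) > 1 else 1
            tokens_a = [tok for s in chunk[:a_end] for tok in s]
            use_random = len(chunk) == 1 or rng.random() < 0.5
            if use_random and other_docs:
                # Span B comes from a random position in another document.
                # (This sketch discards the unused tail of the chunk.)
                other = rng.choice(other_docs)
                start = rng.randrange(len(other))
                tokens_b = [tok for s in other[start:] for tok in s]
                is_random_next = True
            elif len(chunk) > 1:
                # Span B is the actual continuation of span A.
                tokens_b = [tok for s in chunk[a_end:] for tok in s]
                is_random_next = False
            else:
                # A single sentence and no other document: no pair possible.
                chunk, chunk_len = [], 0
                continue
            # Trim the longer span until the pair fits the budget.
            while len(tokens_a) + len(tokens_b) > budget:
                longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
                longer.pop()
            pairs.append((tokens_a, tokens_b, is_random_next))
            chunk, chunk_len = [], 0
    return pairs
```

Because span A and span B are each built from several whole sentences, they are typically longer than a single sentence, which matches the paper's wording while still requiring real sentence boundaries in the input.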

So the input text file should contain actual sentences, although feel free to add some noise if you want to make things more robust for fine-tuning. For example, if your sentence segmenter always splits on ".", this may create a weird bias, so arbitrarily truncating or concatenating 5% of the training data may make fine-tuning more robust to non-sentential data. We don't have any hard numbers on this.
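
A minimal sketch of that kind of noise, applied to the sentences of one document, might look like the following. The function name `add_segmentation_noise`, the even split between truncating and concatenating, and the choice to drop the truncated tail are all assumptions for illustration; the 5% rate is only the figure suggested above, not a tuned value.

```python
import random


def add_segmentation_noise(sentences, noise_rate=0.05, seed=0):
    """Randomly truncate or concatenate a fraction of the sentences.

    sentences: list of strings belonging to a single document.
    Hypothetical helper; not part of the BERT repository.
    """
    rng = random.Random(seed)
    noisy = []
    i = 0
    while i < len(sentences):
        sentence = sentences[i]
        if rng.random() < noise_rate:
            if rng.random() < 0.5 and i + 1 < len(sentences):
                # Concatenate this sentence with the next one on a single line.
                noisy.append(sentence + " " + sentences[i + 1])
                i += 2
                continue
            words = sentence.split()
            if len(words) > 1:
                # Truncate at an arbitrary word boundary (the tail is dropped).
                cut = rng.randint(1, len(words) - 1)
                noisy.append(" ".join(words[:cut]))
                i += 1
                continue
        # Most sentences pass through unchanged.
        noisy.append(sentence)
        i += 1
    return noisy
```

Running this per document keeps the noise from merging sentences across document boundaries.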

For our sentence segmenter I just used some Google-internal library I found, but anything off the shelf like spaCy (https://spacy.io/usage/spacy-101) should work.

jacobdevlin-google on 1 Nov 2018
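
For anyone preparing their own corpus, the expected input layout (one sentence per line, with a blank line between documents) can be sketched as below. The regex splitter is only a naive stand-in so the example runs without dependencies; in practice an off-the-shelf segmenter would replace `split_sentences`. Both function names are made up for this example.

```python
import re

# Naive stand-in for a real sentence segmenter: split on whitespace
# that follows ., !, ? or ;.
_SENTENCE_END = re.compile(r"(?<=[.!?;])\s+")


def split_sentences(text):
    """Split raw text into sentences (hypothetical, deliberately simple)."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def to_pretraining_text(documents):
    """Render documents as one sentence per line, blank line between documents."""
    blocks = ["\n".join(split_sentences(doc)) for doc in documents]
    return "\n\n".join(block for block in blocks if block) + "\n"
```

A splitter this simple always breaks on the same punctuation, which is exactly the kind of bias the noise suggestion above is meant to soften.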

I see, especially given the explanation in create_pretraining_data.py. It would be nice to mention your comments in the README, or even update the paper.

winston-zillow on 1 Nov 2018

I added a paragraph in the README about this, thanks.

jacobdevlin-google on 1 Nov 2018

Hello @jacobdevlin-google,
I really appreciate your work :) Would you mind sharing the size of the pre-training set in terms of instances? You mentioned that the corpus is about 3.3 billion words, but when preprocessing the data before feeding it to the model, I found that the duplication factor is set to 10 by default. In my current experiment, preprocessing enlarges the training data to about 20x its original size. I am not sure whether I made a mistake. Could you share the actual training set size and confirm whether my findings are correct?
Thanks!

xgk on 8 Nov 2018

@xgk are you using Chinese or some other language where one token corresponds to one character rather than one word? That would explain this increase in size.
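
As a back-of-the-envelope check, the number of training instances can be estimated from the token count rather than the raw file size. The sketch below assumes every instance is packed full, ignores document boundaries and deliberately shortened sequences, and assumes a maximum sequence length of 128; the function name and parameters are made up for this example, and the duplication factor of 10 is the default mentioned above.

```python
import math


def estimate_instances(num_tokens, dupe_factor=10, max_seq_length=128):
    """Rough estimate of the number of pre-training instances.

    Assumes each instance holds max_seq_length - 3 corpus tokens
    (three slots go to [CLS] and two [SEP] markers) and that the whole
    corpus is processed dupe_factor times.
    """
    tokens_per_instance = max_seq_length - 3
    return dupe_factor * math.ceil(num_tokens / tokens_per_instance)
```

Since the estimate scales with tokens, a corpus tokenized roughly one token per character will yield far more instances per byte of input than one tokenized per word.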