BERT: why does tokenization change words into "##*" pieces?

Created on 5 Nov 2018 · 4 Comments · Source: google-research/bert

Hi, I don't understand why tokenization changes words into "##*" pieces.
E.g., john johanson ' s , → john johan ##son ' s ,

waallf

Most helpful comment

This is the WordPiece tokenization. "Rare" words are split up into pieces. We use the ## to delimit tokens that have been split off. So john is common enough to not be split, and johanson is split in two pieces, johan and ##son.
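To make the split concrete, here is a minimal sketch of greedy longest-match-first lookup, which is the matching strategy BERT's WordPiece tokenizer applies to each whitespace-separated word. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions made up for illustration; the real vocabulary has roughly 30,000 entries learned from a corpus.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        pieces.append(current)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen only to reproduce the example above.
toy_vocab = {"john", "johan", "##son", "'", "s", ","}

tokens = []
for word in "john johanson ' s ,".split():
    tokens.extend(wordpiece_tokenize(word, toy_vocab))
print(" ".join(tokens))  # john johan ##son ' s ,
```

Because "johanson" is not in the vocabulary as a whole word, the lookup falls back to the longest prefix that is ("johan") and then matches the remainder as a continuation piece ("##son").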

jacobdevlin-google on 5 Nov 2018

👍12

All 4 comments

THX

guotong1988 on 6 Nov 2018

Hi, why do we need to split rare words into subwords? Also, please explain how WordPiece tokenization is useful in the context of BERT.

telukuntla on 20 Mar 2019

Hi, why do we need to split rare words into subwords? Also, please explain how WordPiece tokenization is useful in the context of BERT.

The idea is that you can reduce the size of your vocabulary. For example, run, running, and runner are closely related words. With a whole-word vocabulary, you would need to store an entry for each of the three and learn its meaning independently.

With WordPiece, each of the three words can in principle be split into 'run' plus a '##SUFFIX' piece (if there is a suffix at all). In practice, words that are frequent enough stay whole, as with john above. When the split does happen, all three words share the 'run' piece, so what the model learns from their contexts is pooled, which makes sense since the words are similar. The rest of the meaning is carried by the suffix piece, which is in turn learned from other words with the same suffix.

The benefit is a smaller vocabulary in which each piece is seen more often, which makes for better training.
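A small sketch can show the vocabulary saving. Everything here is an assumption for illustration: the word list, the hand-picked `piece_vocab`, and the helper `split_word` (a compact greedy longest-match-first split). A real WordPiece vocabulary is learned from a corpus and may keep frequent words such as "running" whole.

```python
from collections import Counter


def split_word(word, vocab):
    """Greedy longest-match-first split; returns None if no split exists."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shorter ones.
        for end in range(len(word), start, -1):
            piece = ("##" if start > 0 else "") + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            return None  # nothing in the vocabulary matched at this position
    return pieces


# Hypothetical data: nine surface forms built from three stems.
words = ["run", "running", "runner",
         "win", "winning", "winner",
         "spin", "spinning", "spinner"]
piece_vocab = {"run", "win", "spin", "##ning", "##ner"}

# Count how often each piece is used across all words.
piece_counts = Counter()
for w in words:
    piece_counts.update(split_word(w, piece_vocab))

print(len(words), "whole-word entries vs", len(piece_counts), "pieces")
print(piece_counts["##ning"], "words share the ##ning piece")
```

In this toy setup, nine whole-word entries shrink to five pieces, and each suffix piece is used by three different words, so it receives training signal from all of them.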