Hi, I couldn't understand why tokenization changes some words into "##*" pieces.
E.g., john johanson ' s , → john johan ##son ' s ,
This is the WordPiece tokenization. "Rare" words are split up into pieces. We use the ## to delimit tokens that have been split off. So john is common enough to not be split, and johanson is split in two pieces, johan and ##son.
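To make the split concrete, here is a minimal sketch of greedy longest-match-first lookup, which is how WordPiece-style tokenizers typically pick the pieces of a word. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions made up for illustration; the real BERT vocabulary has roughly 30,000 entries and the real tokenizer also handles lowercasing and punctuation splitting first.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily split one word into the longest pieces found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that do not start the word carry the ## prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits, so the whole word becomes the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, just large enough for the example above.
toy_vocab = {"john", "johan", "##son", "'", "s", ","}

tokens = []
for word in "john johanson ' s ,".split():
    tokens.extend(wordpiece_tokenize(word, toy_vocab))

print(" ".join(tokens))
```

With this toy vocabulary, `john` is found whole, while `johanson` is not in the vocabulary, so the longest matching prefix `johan` is taken and the remainder is looked up as `##son`.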
THX
Hi, why do we need to split rare words into subwords? Also, please explain how WordPiece tokenization is useful in the context of BERT.
The idea is that you can reduce the size of your vocabulary. For example, run, running, and runner are all very similar words. Without WordPiece you would need to store and learn the meaning of all three independently.
With WordPiece, each of the three words could be split into 'run' and the related '##SUFFIX' (if there is any suffix at all; whether a given word is actually split depends on the learned vocabulary). So the context of all three words would intermingle, which makes sense, since they are all similar words. The rest of the meaning would be encoded in the suffix, which would be learned from other words with similar suffixes.
The benefit is a smaller vocabulary, which also makes for better training, and rare or unseen words can still be represented by pieces the model already knows.
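As a rough illustration of the vocabulary saving, the sketch below compares a whole-word vocabulary with a piece vocabulary. The word list, the `piece_vocab` set, and the helper `split_into_pieces` are all hypothetical and chosen so the suffixes line up cleanly; a real WordPiece vocabulary is learned from a corpus rather than written by hand.

```python
def split_into_pieces(word, vocab):
    """Greedy longest-match-first split; non-initial pieces get a ## prefix."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            # No piece matched at this position.
            return ["[UNK]"]
    return pieces


# Hypothetical word list: three stems, each with two suffixed forms.
words = ["walk", "walking", "walker",
         "talk", "talking", "talker",
         "jump", "jumping", "jumper"]

# Hypothetical piece vocabulary: the stems plus two shared suffixes.
piece_vocab = {"walk", "talk", "jump", "##ing", "##er"}

splits = {w: split_into_pieces(w, piece_vocab) for w in words}
pieces_used = {p for ps in splits.values() for p in ps}

print("whole-word vocabulary size:", len(set(words)))
print("piece vocabulary size:", len(pieces_used))
for w, ps in splits.items():
    print(w, "->", " ".join(ps))
```

Here nine distinct words are covered by five pieces, and `##ing` is shared by walking, talking, and jumping, so whatever the model learns about that suffix from one word carries over to the others.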