Transformers: training a new BERT tokenizer model

Created on 18 Dec 2019  ·  3 Comments  ·  Source: huggingface/transformers

❓ Questions & Help

I would like to train a new BERT model.
Is there a way to train a BERT tokenizer (a.k.a. WordPiece tokenizer)?

Most helpful comment

Check out the tokenizers repo.

There's an example of how to train a WordPiece tokenizer: https://github.com/huggingface/tokenizers/blob/master/bindings/python/examples/train_bert_wordpiece.py
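Following that linked script, a minimal sketch of training looks like the block below. It assumes the `tokenizers` package is installed (`pip install tokenizers`); the tiny generated corpus and the small `vocab_size` are stand-ins for your own data and settings.

```python
# Minimal sketch of training a WordPiece tokenizer with the
# huggingface/tokenizers library. The corpus written here is a tiny
# placeholder so the example runs; point `files` at your real text files.
import os
import tempfile

from tokenizers import BertWordPieceTokenizer

corpus_dir = tempfile.mkdtemp()
corpus_path = os.path.join(corpus_dir, "corpus.txt")
with open(corpus_path, "w", encoding="utf-8") as f:
    f.write("I would like to train a new BERT model.\n" * 100)

tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=[corpus_path],
    vocab_size=1000,  # placeholder; BERT-base uses ~30000
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Writes vocab.txt, the format BERT tokenizers in Transformers expect.
tokenizer.save_model(corpus_dir)

vocab = tokenizer.get_vocab()
print(sorted(t for t in vocab if t.startswith("[")))
```

After training, `vocab.txt` in `corpus_dir` can be loaded by a BERT tokenizer class.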

All 3 comments

Follow the SentencePiece GitHub repo or the BERT TensorFlow GitHub repo. You may get some feedback there.

On Wed, Dec 18, 2019 at 07:52 Younggyun Hahm notifications@github.com
wrote:

❓ Questions & Help

I would like to train a new BERT model.
Is there a way to train a BERT tokenizer (a.k.a. WordPiece tokenizer)?



If you want to see an example of a custom tokenizer implementation in the Transformers library, you can look at how the Japanese tokenizer is implemented.

In general, you can read more about adding a new model to Transformers here.
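Once you have a trained `vocab.txt`, one way to use it with Transformers (a sketch, assuming the `transformers` package is installed) is to pass it to `BertTokenizer`. The hand-written vocabulary below is a hypothetical stand-in for a real trained one.

```python
# Hedged sketch: load a WordPiece vocab file into Transformers'
# BertTokenizer. The tiny vocab here stands in for a trained vocab.txt.
import os
import tempfile

from transformers import BertTokenizer

vocab_dir = tempfile.mkdtemp()
vocab_path = os.path.join(vocab_dir, "vocab.txt")
tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
          "train", "a", "new", "bert", "model"]
with open(vocab_path, "w", encoding="utf-8") as f:
    f.write("\n".join(tokens) + "\n")

tokenizer = BertTokenizer(vocab_file=vocab_path, do_lower_case=True)
print(tokenizer.tokenize("train a new BERT model"))
# prints ['train', 'a', 'new', 'bert', 'model']
```

Words not covered by the vocabulary are split into `##`-prefixed subword pieces, or mapped to `[UNK]` when no pieces match.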


