Transformers: training a new BERT tokenizer model

Created on 18 Dec 2019  ·  3 Comments  ·  Source: huggingface/transformers

❓ Questions & Help

I would like to train a new BERT model.
Is there a way to train a BERT tokenizer (a.k.a. WordPiece tokenizer)?

Most helpful comment

Check out the tokenizers repo.

There's an example of how to train a WordPiece tokenizer: https://github.com/huggingface/tokenizers/blob/master/bindings/python/examples/train_bert_wordpiece.py
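Following that linked script, a minimal sketch of training looks like the block below. It assumes the `tokenizers` package is installed (`pip install tokenizers`); the tiny generated corpus and the small `vocab_size` are stand-ins for your own data and settings.

```python
# Minimal sketch of training a WordPiece tokenizer with the
# huggingface/tokenizers library. The corpus written here is a tiny
# placeholder so the example runs; point `files` at your real text files.
import os
import tempfile

from tokenizers import BertWordPieceTokenizer

corpus_dir = tempfile.mkdtemp()
corpus_path = os.path.join(corpus_dir, "corpus.txt")
with open(corpus_path, "w", encoding="utf-8") as f:
    f.write("I would like to train a new BERT model.\n" * 100)

tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=[corpus_path],
    vocab_size=1000,  # placeholder; BERT-base uses ~30000
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Writes vocab.txt, the format BERT tokenizers in Transformers expect.
tokenizer.save_model(corpus_dir)

vocab = tokenizer.get_vocab()
print(sorted(t for t in vocab if t.startswith("[")))
```

After training, `vocab.txt` in `corpus_dir` can be loaded by a BERT tokenizer class.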

All 3 comments

Follow the SentencePiece GitHub repo or the BERT TensorFlow GitHub repo. You may get some feedback there.

On Wed, Dec 18, 2019 at 07:52 Younggyun Hahm notifications@github.com
wrote:

❓ Questions & Help

I would like to train a new BERT model.
Is there a way to train a BERT tokenizer (a.k.a. WordPiece tokenizer)?



If you want to see an example of a custom tokenizer implementation in the Transformers library, you can look at how the Japanese tokenizer is implemented.

In general, you can read more about adding a new model to Transformers here.
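Once you have a trained `vocab.txt`, one way to use it with Transformers (a sketch, assuming the `transformers` package is installed) is to pass it to `BertTokenizer`. The hand-written vocabulary below is a hypothetical stand-in for a real trained one.

```python
# Hedged sketch: load a WordPiece vocab file into Transformers'
# BertTokenizer. The tiny vocab here stands in for a trained vocab.txt.
import os
import tempfile

from transformers import BertTokenizer

vocab_dir = tempfile.mkdtemp()
vocab_path = os.path.join(vocab_dir, "vocab.txt")
tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
          "train", "a", "new", "bert", "model"]
with open(vocab_path, "w", encoding="utf-8") as f:
    f.write("\n".join(tokens) + "\n")

tokenizer = BertTokenizer(vocab_file=vocab_path, do_lower_case=True)
print(tokenizer.tokenize("train a new BERT model"))
# prints ['train', 'a', 'new', 'bert', 'model']
```

Words not covered by the vocabulary are split into `##`-prefixed subword pieces, or mapped to `[UNK]` when no pieces match.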


