Transformers: training a new BERT tokenizer model

Created on 18 Dec 2019  ·  3 Comments  ·  Source: huggingface/transformers

❓ Questions & Help

I would like to train a new BERT model.
Is there a way to train a BERT tokenizer (a.k.a. a WordPiece tokenizer)?

Most helpful comment

Check out the tokenizers repo.

There's an example of how to train a WordPiece tokenizer: https://github.com/huggingface/tokenizers/blob/master/bindings/python/examples/train_bert_wordpiece.py

All 3 comments

Follow the SentencePiece GitHub repo or the BERT TensorFlow GitHub repo. You will get some feedback there.


If you want to see examples of custom tokenizer implementations in the Transformers library, you can look at how the Japanese tokenizer is implemented.

In general, you can read more about adding a new model to Transformers here.


