BERT: pretrained Chinese language model request, please.

Created on 26 Oct 2018 · 13 Comments · Source: google-research/bert

It would be very nice if you could release models for some other languages, such as German and Chinese. Then we could experiment with them in other language domains.

Thanks a lot. :)
John.

Jorigorn

👍30

Most helpful comment

I am in the process of training models on the top 60 languages which have the largest Wikipedias. This includes both simplified and traditional Chinese, Korean, Japanese, German, Spanish, and many others. I will likely release a single BERT-Base model and a single BERT-Large model, each trained on all of these languages (including English). Hopefully this will be released within the next few weeks, but I have to actually test it and make sure it works reasonably well.

To handle the spacing issue in Chinese, we simply character-tokenize any characters in the CJK Unicode set, and perform our same recipe for everything else. There is one big shared 100k WordPiece vocabulary. This recipe seems to work well for almost all languages (unfortunately, the one exception is Thai, since there are few spaces but too many characters-per-word to use character tokenization).

jacobdevlin-google on 31 Oct 2018

👍46 🎉18 ❤17 😄4 🚀2 👀1
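
The character-tokenization step described in the comment above can be illustrated with a minimal sketch. It is not the released implementation: the helper names `is_cjk_char` and `pre_tokenize` are hypothetical, and the code point ranges are an assumption about what "the CJK Unicode set" covers (the comment does not list them). The sketch only shows the pre-tokenization idea, in which each CJK character becomes its own token and everything else is split on whitespace before WordPiece would be applied.

```python
# Assumed code point ranges for "the CJK Unicode set" (CJK Unified Ideographs
# plus some extension and compatibility blocks). The original comment does not
# specify the exact ranges, so treat this tuple as illustrative.
CJK_RANGES = (
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x3400, 0x4DBF),    # Extension A
    (0x20000, 0x2A6DF),  # Extension B
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x2F800, 0x2FA1F),  # CJK Compatibility Ideographs Supplement
)


def is_cjk_char(ch):
    """Return True if the character falls in one of the assumed CJK ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def pre_tokenize(text):
    """Isolate every CJK character as its own token, then split on whitespace.

    Non-CJK text is left as whitespace-delimited words, which a WordPiece
    step (not shown here) would then break into subword units.
    """
    pieces = []
    for ch in text:
        if is_cjk_char(ch):
            # Surround the character with spaces so it becomes a separate token.
            pieces.append(" " + ch + " ")
        else:
            pieces.append(ch)
    return "".join(pieces).split()


print(pre_tokenize("BERT 模型 works"))
```

With these assumed ranges, Hangul and Japanese kana are not treated as CJK characters, so Korean text and the kana portions of Japanese text would go through the ordinary whitespace-plus-WordPiece path.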

All 13 comments

We have a similar request, because we don't have enough computing power to train a model ourselves. If you could provide a Chinese pretrained model, that would be very helpful. Thanks for your time and help.

Best Wishes
Rico

ghost on 29 Oct 2018

👍16

A Japanese version too, please.
It would break the barrier.

Thanks

kuni-kuni on 29 Oct 2018

👍5

Both simplified and traditional Chinese, please.

hauturier on 30 Oct 2018

👍12

Simplified Chinese pretrained model +1! It would be very kind of you to release that.

yaleimeng on 30 Oct 2018

Please don't forget Korean. 🐙

nakosung on 31 Oct 2018

👍14

Please don't forget Korean. +1

ehrudxo on 31 Oct 2018

Developers in China badly need a Chinese pretrained model.
Thanks a lot!

KevinChen1994 on 31 Oct 2018

Really looking forward to this.

chendengshuai on 31 Oct 2018

@jacobdevlin-google Could you also include a model for Macedonian? It is ranked as number 63 :) Would be awesome!

stefan-it on 31 Oct 2018

If I re-generate the data (which looks likely at this point to better weight the low-resource languages) then I'll try to include more from the 60-100 range.

jacobdevlin-google on 31 Oct 2018

👍3

Thanks for all your hard work!