BERT: pretrained Chinese language model request, please.

Created on 26 Oct 2018  ·  13 Comments  ·  Source: google-research/bert

It would be very nice if you could release models for some other languages, such as German and Chinese. Then we could experiment with them in other language domains.

Thanks a lot. :)
John.

All 13 comments

We have a similar request, because we don't have enough computing power to train a model ourselves. If you could provide a Chinese pretrained model, that would be very helpful. Thanks for your time and help.

Best Wishes
Rico

Please also release a Japanese version.
It would break down the barrier.

Thanks

Both simplified and traditional Chinese, please.

+1 for a Simplified Chinese pretrained model! It would be very kind of you to release that.

Please don't forget Korean. 🐙

Please don't forget Korean. +1

Developers in China badly need a Chinese pretrained model.
Thanks a lot!

Really looking forward to it.

I am in the process of training models on the 60 languages that have the largest Wikipedias. This includes both simplified and traditional Chinese, Korean, Japanese, German, Spanish, and many others. I will likely release a single BERT-Base model and a single BERT-Large model, each trained on all of these languages (including English). Hopefully these will be released within the next few weeks, but I have to actually test them and make sure they work reasonably well.

To handle the spacing issue in Chinese, we simply character-tokenize any characters in the CJK Unicode set, and apply our usual recipe to everything else. There is one big shared 100k WordPiece vocabulary. This recipe seems to work well for almost all languages (unfortunately, the one exception is Thai, since it has few spaces but too many characters per word to use character tokenization).

@jacobdevlin-google Could you also include a model for Macedonian? It is ranked as number 63 :) Would be awesome!

If I regenerate the data (which looks likely at this point, in order to better weight the low-resource languages), then I'll try to include more languages from the 60-100 range.

Thanks for all your hard work!

Will the large Chinese model be released any time soon? Thanks for the hard work, by the way.
