Bert: What are the requirements for a language to be included in BERT?

Created on 5 Nov 2018 · 3 Comments · Source: google-research/bert

I saw that BERT supports 102 languages.
Unfortunately, my native language (Mongolian) is not among them.

Just wondering: is that because of the availability or size of the corpus for a given language?
If so, how can we help?

All 3 comments

On this page, the authors mention this:

The languages chosen were the top 100 languages with the largest Wikipedias.

Yes, unfortunately Mongolian is ranked 114th, so it was not included.

We are probably not going to release any more languages or datasets. You may be able to use the existing multilingual model to train a Mongolian system relatively inexpensively, but we haven't tried adding a new language after the fact.

To do this, you can either use the existing WordPiece vocabulary, which will not have many Mongolian-specific WordPieces but does support Cyrillic WordPieces from Russian/Ukrainian/etc., or you can use a new embedding table but keep the same Transformer. We haven't tried either one, but it's an interesting thing to try.
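For anyone weighing the first option, here is a rough sketch of how one might check how well an existing WordPiece vocabulary covers Mongolian text. It uses the greedy longest-match-first splitting that WordPiece tokenizers apply, with `##` marking continuation pieces. The vocabulary below is a made-up toy set, not entries from the released multilingual vocab file, and the helper names `wordpiece` and `unk_rate` are just for illustration. The real BERT tokenizer also does extra preprocessing (punctuation splitting, normalization) that is skipped here.

```python
# Toy vocabulary: hypothetical entries for illustration only,
# NOT taken from the released multilingual vocab file.
TOY_VOCAB = {"[UNK]", "мон", "##гол", "##ын", "хэл", "##ний"}


def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into WordPieces using greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def unk_rate(text, vocab):
    """Fraction of whitespace-separated words that map to [UNK]."""
    words = text.lower().split()
    if not words:
        return 0.0
    unknown = sum(1 for w in words if wordpiece(w, vocab) == ["[UNK]"])
    return unknown / len(words)


if __name__ == "__main__":
    sample = "монголын хэлний үсэг"
    for w in sample.split():
        print(w, wordpiece(w, TOY_VOCAB))
    print("unk rate:", unk_rate(sample, TOY_VOCAB))
```

To try this against a real vocabulary, you could load the vocab file (one token per line) into a set and run `unk_rate` over a sample of Mongolian text. A high `[UNK]` rate or very long piece sequences would suggest that the second option, a new embedding table, is worth the extra effort.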

I generated a new version of the multilingual model to fix normalization and include Thai, and I also included Mongolian in the new version. See the README for a pointer.
