Flair: Can I do NER with Chinese corpus by flair?

Created on 6 May 2019 · 6 comments · Source: flairNLP/flair

Hi,
Flair looks great, and I would like to do NER tasks on a Chinese corpus. My questions are:

  1. Is it possible to do NER with Chinese text currently?
  2. If so, could you provide some brief guidance (or some links) about this?

I understand that there is no Chinese Flair embedding yet, but a Chinese BERT model is available, so I hope there is a solution.

Any enlightenment would be appreciated!

Xiaoguang

question

All 6 comments

Hello @yangxg we do not have a pre-trained model for Chinese NER yet, but you can use Flair to train your own model for Chinese. To do this, you can follow these instructions on how to train your own sequence labeling model.

There are a number of word embeddings you can try, such as

from flair.embeddings import WordEmbeddings, BytePairEmbeddings, BertEmbeddings

fasttext_wiki_embedding = WordEmbeddings('zh')
fasttext_crawl_embedding = WordEmbeddings('zh-crawl')
byte_pair_embedding = BytePairEmbeddings('zh')
bert_embedding = BertEmbeddings('bert-base-chinese')

For best performance, we generally recommend using FlairEmbeddings. We don't yet have them pre-trained for Chinese, but here is a tutorial on how to train your own. Should you train an NER model or FlairEmbeddings for Chinese, we would appreciate a contribution :)

Many thanks @alanakbik!
I will give it a try and get back to you with any updates.
Best

@yangxg I am wondering whether it makes sense to use Flair for Chinese.
The core idea of Flair is to use a character-level language model. The word representation is a simple concatenation of the hidden states before and after a word. However, Chinese has no obvious word boundaries. Even if you segment the text into words, words usually have a much shorter average length than in Indo-European languages such as English. Besides, the Chinese alphabet is much larger.
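To make that word-representation idea concrete, here is a toy, pure-Python illustration (not Flair's actual implementation): a stand-in character LM yields one hidden-state vector per character, and a word's embedding is the concatenation of the forward state after its last character and the backward state before its first character, so everything hinges on the boundary offsets a segmenter provides:

```python
# Toy illustration of Flair-style word representations (dummy states,
# not a real language model).
def char_lm_states(text):
    # stand-in for a character-level LM: one dummy 2-dim state per char
    return [[float(i), float(ord(c) % 7)] for i, c in enumerate(text)]

def word_embedding(text, start, end):
    forward = char_lm_states(text)               # left-to-right states
    backward = char_lm_states(text[::-1])[::-1]  # right-to-left states, realigned
    # forward state after the word's last char + backward state before its first char
    return forward[end - 1] + backward[start]

sentence = "今天天气很好"
# suppose a segmenter gives these (start, end) word offsets: 今天 / 天气 / 很 / 好
words = [(0, 2), (2, 4), (4, 5), (5, 6)]
vectors = [word_embedding(sentence, s, e) for s, e in words]
```

With such short words, the forward and backward boundary states sit only one or two characters apart, which is the concern raised above about applying this scheme to Chinese.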

There are modeling alternatives, though. For example, we could model Chinese characters using sub-character units, say, strokes. I did run experiments following this idea, but without much success.

Hi @qiuwei, actually I didn't work with Flair much because our project was interrupted in July. However, we found that Flair performs well on NER tasks for medical chief-complaint data (we used the Chinese BERT embedding). So I think character-level models are also sensible, ideally with the support of a glossary list.

Hope this helps, although it is still very superficial.


I am confused. So are you using the BERT embedding or the Flair embedding?
To my knowledge, the vanilla Flair embedding doesn't support Chinese.
Maybe you actually mean that using the BERT embedding within the Flair framework helps with the Chinese NER task?


Yes, that is exactly what we were doing.
There is no Chinese Flair embedding yet, but the framework is very supportive.
I really hope someone will come up with a Chinese Flair embedding, although we don't have the capacity...
