Dear Expert,
I downloaded the Chinese model from http://download.tensorflow.org/models/parsey_universal/Chinese.zip
When I run the following, it outputs just "_ NUM CD fPOS=PUNCT++. 0 ROOT _ _" after the input sentence:
MODEL_DIRECTORY=/where/you/unzipped/the/model/files
cat sentences.txt | syntaxnet/models/parsey_universal/parse.sh $MODEL_DIRECTORY > output.conll
It seems the parser is treating the whole input sentence as a single token.
But when I run tokenize_zh.sh as follows, it successfully segments the sentence into space-separated tokens:
cat sentences.txt | syntaxnet/models/parsey_universal/tokenize_zh.sh $MODEL_DIRECTORY > output.conll
I saved sentences.txt in UTF-8 encoding in both cases.
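For reference, the tokenizer's output is one sentence per line with tokens separated by spaces, so it looks roughly like this (an illustrative sketch, not my actual data):
然而 , 這樣 的 處理 也 衍生 了 一些 問題 .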
@calberti, could you please take a look?
To run tokenization and parsing on Chinese text, you can pipe together the tokenization and parsing example scripts as follows, e.g.:
echo '然而,這樣的處理也衍生了一些問題.' | \
syntaxnet/models/parsey_universal/tokenize_zh.sh $MODEL_DIRECTORY | \
syntaxnet/models/parsey_universal/parse.sh $MODEL_DIRECTORY
should output
1 然而 _ ADV RB fPOS=ADV++RB 7 mark _ _
2 , _ PUNCT , fPOS=PUNCT++, 7 punct _ _
3 這樣 _ PRON PRD fPOS=PRON++PRD 5 det _ _
4 的 _ PART DEC Case=Gen|fPOS=PART++DEC 3 case:dec _ _
5 處理 _ NOUN NN fPOS=NOUN++NN 7 nsubj _ _
6 也 _ ADV RB fPOS=ADV++RB 7 mark _ _
7 衍生 _ VERB VV fPOS=VERB++VV 0 ROOT _ _
8 了 _ PART AS Aspect=Perf|fPOS=PART++AS 7 case:aspect _ _
9 一些 _ ADJ JJ fPOS=ADJ++JJ 10 amod _ _
10 問題 _ NOUN NN fPOS=NOUN++NN 7 dobj _ _
11 . _ PUNCT . fPOS=PUNCT++. 7 punct _ _
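The same pipeline also works over a file, combining this with the earlier command (a sketch assuming the same $MODEL_DIRECTORY setup):
cat sentences.txt | \
syntaxnet/models/parsey_universal/tokenize_zh.sh $MODEL_DIRECTORY | \
syntaxnet/models/parsey_universal/parse.sh $MODEL_DIRECTORY > output.conll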
Thanks for taking a look @calberti!
@calberti Thanks a lot! It works!
BTW, does the current parser support Simplified Chinese? Was the parser's model trained on Simplified or Traditional Chinese data? I can now run both, but I get slightly different outputs, and the Traditional Chinese output seems better.
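If the model does turn out to be trained on Traditional Chinese, one possible workaround (untested; assumes the OpenCC command-line tool and its stock s2t.json config are installed) would be to convert Simplified input to Traditional before tokenizing:
# Convert Simplified Chinese to Traditional with OpenCC, then tokenize and parse
cat sentences.txt | opencc -c s2t.json | \
syntaxnet/models/parsey_universal/tokenize_zh.sh $MODEL_DIRECTORY | \
syntaxnet/models/parsey_universal/parse.sh $MODEL_DIRECTORY > output.conll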
I would also like to know whether the parser's model was trained on Simplified or Traditional Chinese data.