Dear Expert,
I downloaded the Chinese model from http://download.tensorflow.org/models/parsey_universal/Chinese.zip
When I run the following, it outputs just "_ NUM CD fPOS=PUNCT++. 0 ROOT _ _" after the input sentence:
MODEL_DIRECTORY=/where/you/unzipped/the/model/files
cat sentences.txt | syntaxnet/models/parsey_universal/parse.sh $MODEL_DIRECTORY > output.conll
It seems the parser is treating the whole input sentence as a single token.
But when I run tokenize_zh.sh as follows, it successfully segments the sentence into space-separated tokens:
cat sentences.txt | syntaxnet/models/parsey_universal/tokenize_zh.sh $MODEL_DIRECTORY > output.conll
I saved sentences.txt in UTF-8 encoding in both cases.
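For reference, the tokenizer's output is one sentence per line with tokens separated by spaces, so it looks roughly like this (an illustrative sketch, not my actual data):
然而 , 這樣 的 處理 也 衍生 了 一些 問題 .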
@calberti, could you please take a look?
To run tokenization and parsing on Chinese text, you can pipe together the tokenization and parsing example scripts as follows, e.g.:
echo '然而,這樣的處理也衍生了一些問題.' | \
syntaxnet/models/parsey_universal/tokenize_zh.sh $MODEL_DIRECTORY | \
syntaxnet/models/parsey_universal/parse.sh $MODEL_DIRECTORY
should output
1 然而 _ ADV RB fPOS=ADV++RB 7 mark _ _
2 , _ PUNCT , fPOS=PUNCT++, 7 punct _ _
3 這樣 _ PRON PRD fPOS=PRON++PRD 5 det _ _
4 的 _ PART DEC Case=Gen|fPOS=PART++DEC 3 case:dec _ _
5 處理 _ NOUN NN fPOS=NOUN++NN 7 nsubj _ _
6 也 _ ADV RB fPOS=ADV++RB 7 mark _ _
7 衍生 _ VERB VV fPOS=VERB++VV 0 ROOT _ _
8 了 _ PART AS Aspect=Perf|fPOS=PART++AS 7 case:aspect _ _
9 一些 _ ADJ JJ fPOS=ADJ++JJ 10 amod _ _
10 問題 _ NOUN NN fPOS=NOUN++NN 7 dobj _ _
11 . _ PUNCT . fPOS=PUNCT++. 7 punct _ _
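The same pipeline also works over a file, combining this with the earlier command (a sketch assuming the same $MODEL_DIRECTORY setup):
cat sentences.txt | \
syntaxnet/models/parsey_universal/tokenize_zh.sh $MODEL_DIRECTORY | \
syntaxnet/models/parsey_universal/parse.sh $MODEL_DIRECTORY > output.conll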
Thanks for taking a look @calberti!
@calberti Thanks a lot! It works!
BTW, does the current parser support Simplified Chinese? Was the parser's model trained on Simplified or Traditional Chinese data? I can now run both, but I get slightly different outputs, and the Traditional Chinese output seems better.
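If the model does turn out to be trained on Traditional Chinese, one possible workaround (untested; assumes the OpenCC command-line tool and its stock s2t.json config are installed) would be to convert Simplified input to Traditional before tokenizing:
# Convert Simplified Chinese to Traditional with OpenCC, then tokenize and parse
cat sentences.txt | opencc -c s2t.json | \
syntaxnet/models/parsey_universal/tokenize_zh.sh $MODEL_DIRECTORY | \
syntaxnet/models/parsey_universal/parse.sh $MODEL_DIRECTORY > output.conll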
I would also like to know whether the parser's model was trained on Simplified or Traditional Chinese data.