Models: SyntaxNet failed to parse Chinese

Created on 22 Aug 2016  ·  5 comments  ·  Source: tensorflow/models

Dear Expert,
I downloaded the Chinese model from http://download.tensorflow.org/models/parsey_universal/Chinese.zip
When I run the following, it outputs only "_ NUM CD fPOS=PUNCT++. 0 ROOT _ _" after the input sentence:

  MODEL_DIRECTORY=/where/you/unzipped/the/model/files
  cat sentences.txt | syntaxnet/models/parsey_universal/parse.sh $MODEL_DIRECTORY > output.conll

It seems the parser is treating the whole input sentence as a single token.
However, when I run tokenize_zh.sh as follows, it successfully segments the sentence into space-separated tokens:
cat sentences.txt | syntaxnet/models/parsey_universal/tokenize_zh.sh $MODEL_DIRECTORY > output.conll
I saved sentences.txt as UTF-8 in both cases.
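When the input encoding is in doubt, it can be checked and fixed with the standard Unix tools `file` and `iconv`. A minimal sketch (the paths and the GBK source encoding are illustrative assumptions, not from this thread):

```shell
# Write a sample sentence to a scratch file (path is illustrative).
printf '然而,這樣的處理也衍生了一些問題.\n' > /tmp/sentences.txt

# `file -bi` reports the MIME type and charset; expect charset=utf-8.
file -bi /tmp/sentences.txt

# If a file was saved by a GBK or Big5 editor, convert it to UTF-8 first:
# iconv -f GBK -t UTF-8 original.txt > sentences.txt
```

Feeding non-UTF-8 bytes to the model scripts is a common cause of garbage single-token output, so verifying the charset first rules that out cheaply.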

awaiting model gardener


All 5 comments

@calberti, could you please take a look?

To run tokenization and parsing on Chinese text, you can pipe together the tokenization and parsing example scripts, e.g.:

echo '然而,這樣的處理也衍生了一些問題.' | \
  syntaxnet/models/parsey_universal/tokenize_zh.sh $MODEL_DIRECTORY | \
  syntaxnet/models/parsey_universal/parse.sh $MODEL_DIRECTORY

should output

1   然而  _   ADV RB  fPOS=ADV++RB    7   mark    _   _
2   ,   _   PUNCT   ,   fPOS=PUNCT++,   7   punct   _   _
3   這樣  _   PRON    PRD fPOS=PRON++PRD  5   det _   _
4   的 _   PART    DEC Case=Gen|fPOS=PART++DEC 3   case:dec    _   _
5   處理  _   NOUN    NN  fPOS=NOUN++NN   7   nsubj   _   _
6   也 _   ADV RB  fPOS=ADV++RB    7   mark    _   _
7   衍生  _   VERB    VV  fPOS=VERB++VV   0   ROOT    _   _
8   了 _   PART    AS  Aspect=Perf|fPOS=PART++AS   7   case:aspect _   _
9   一些  _   ADJ JJ  fPOS=ADJ++JJ    10  amod    _   _
10  問題  _   NOUN    NN  fPOS=NOUN++NN   7   dobj    _   _
11  .   _   PUNCT   .   fPOS=PUNCT++.   7   punct   _   _
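Each row of this output is one token with ten CoNLL-style columns (ID, FORM, LEMMA, UPOSTAG, XPOSTAG, FEATS, HEAD, DEPREL, DEPS, MISC). Assuming the columns are tab-separated, as in standard CoNLL files, individual columns can be pulled out with `cut`; a small self-contained sketch (the sample row and `/tmp` path are illustrative):

```shell
# One CoNLL row like the parser emits above (fields separated by tabs).
printf '7\t衍生\t_\tVERB\tVV\tfPOS=VERB++VV\t0\tROOT\t_\t_\n' > /tmp/output.conll

# Select just the token (column 2) and its coarse POS tag (column 4).
cut -f2,4 /tmp/output.conll   # prints: 衍生	VERB
```

The same `cut` invocation works on a full parse, yielding one token/tag pair per line, which is handy for quickly eyeballing the segmentation.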

Thanks for taking a look @calberti!

@calberti Thanks a lot! It works!
By the way, does the current parser support simplified Chinese? Was the model trained on simplified or traditional Chinese data? I can run both, but they give slightly different outputs, and the traditional Chinese output seems better.

I would also like to know whether the model was trained on simplified or traditional Chinese data.
