spaCy: Add a way to use SyntaxNet's engine as a parser

Created on 21 Jun 2016 · 8 comments · Source: explosion/spaCy

I've already installed SyntaxNet, but found that it only provides POS tagging and a syntactic parser; for example, there is no way to find the lemma for a word. Is there a way to connect spaCy with SyntaxNet, as you mention in your blog post?
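
For example, spaCy gives me lemmas directly on each token, which I can't get from SyntaxNet's CoNLL output (a minimal sketch using the 2016-era spacy.en API; the sentence is just an illustration):

```python
from spacy.en import English

nlp = English()
doc = nlp(u"The mice were running.")
for token in doc:
    # spaCy exposes a lemma per token; SyntaxNet's output has no lemma column
    print(token.orth_, token.lemma_)  # e.g. mice -> mouse, were -> be
```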

Labels: enhancement, help wanted


All 8 comments

I tried to do this, but it's quite hard. I spent days trying to read SyntaxNet's code, and concluded I'd have to write some new C++ just to feed SyntaxNet a string from Python instead of a filename from which to read the text. Has this changed at all?

Hi, I haven't read the complete source code yet. I'm piping input into parser_eval.py from stdin, as in the provided demo.

This would be incredible; I'm just wondering how viable it actually is, though, since I have yet to see anyone use SyntaxNet without piping between stdin and subprocesses.
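
For reference, the stdin/subprocess approach looks roughly like this (a sketch assuming SyntaxNet's bundled syntaxnet/demo.sh, which reads one sentence per line from stdin and prints CoNLL-formatted parses to stdout; the checkout path is hypothetical):

```python
import subprocess

def parse_with_syntaxnet(text):
    # demo.sh reads sentences from stdin and writes CoNLL-formatted parses;
    # cwd points at a (hypothetical) checkout of the models/syntaxnet repo.
    proc = subprocess.run(
        ["syntaxnet/demo.sh"],
        input=text.encode("utf8"),
        stdout=subprocess.PIPE,
        cwd="/path/to/models/syntaxnet",
        check=True,
    )
    return proc.stdout.decode("utf8")

print(parse_with_syntaxnet(u"Google released SyntaxNet in 2016."))
```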

Even more tantalising now that they released models for 40 more languages.

Honestly, the BIST parser is much better. Yes, it's less mature, but the accuracy is the same and there are almost no hyper-parameters. Training on the other Universal Dependencies corpora should be no problem.

In the meantime, it's probably easiest to pipe through SyntaxNet and set the parses onto spaCy Doc objects with the doc.from_array method. A deserialiser that reads SyntaxNet's protobuf format would also be super nice.
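
A rough sketch of that doc.from_array route (assuming the 2016-era spacy.en API, that SyntaxNet's tokens have been aligned with spaCy's, and that its absolute head indices have been converted to spaCy's relative offsets; the sentence and values below are made up for illustration):

```python
import numpy
from spacy.attrs import HEAD, DEP
from spacy.en import English

nlp = English(parser=False)  # tokenise and tag with spaCy, skip its parser
doc = nlp(u"SyntaxNet parsed this sentence.")

# Hypothetical converted SyntaxNet output: HEAD values are offsets relative
# to each token (0 marks the root), DEP values are IDs from the StringStore.
heads = [1, 0, 1, -2, -3]
deps = [doc.vocab.strings[label]
        for label in [u"nsubj", u"ROOT", u"det", u"dobj", u"punct"]]

doc.from_array([HEAD, DEP],
               numpy.array(list(zip(heads, deps)), dtype="int32"))
print([(w.orth_, w.dep_, w.head.orth_) for w in doc])
```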


PS: Anyone feeling a little exploratory can also find an implementation of a feed-forward neural CRF parser in the july16 branch. This requires the parser_nn_2016 branch of thinc. The script to train it is at bin/parser/conll_train.py.

This model is broadly similar to the most accurate SyntaxNet model. I've been using it to train very low-memory models. So far I've managed to get a parser down to around 60MB while beating the accuracy of the current linear model. This would cut spaCy's memory use down to under 500MB.

The model naturally supports a cache, so users can control the space/speed trade-off at runtime. I haven't benchmarked for efficiency yet, but the model trains in a few hours on a single CPU thread, so it can't be too terrible.

There are really only a few outstanding issues. The main one is a memory error: it's crashing on exit, so I've stuffed something up somewhere. Windows support will also be a huge mess. I've also found a couple of bugs, and there are hyper-parameters to tune...

As I mentioned in another thread, I'll be mostly AFK next week, as I'm out of town, and I've been making promises about this damn neural network model for too long :). There will be tonnes to do when I get back, and some annoying resource constraints to juggle, so I'm reluctant to promise a release date. But I'll definitely be saying more about it when I get back :).


Thanks for the guidance, as always.

Speaking for my own project, languages and convenience matter more than accuracy or the underlying implementation, and reliable guidance matters as much as actual updates.

My team has tried SyntaxNet, and it leaves much to be desired on these fronts (ease of use, plus training on and parsing additional languages; we decided to give up on reducing the footprint size), so there is some opportunity here.

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
