spaCy: Chunking with rule-based grammar in spaCy

Created on 18 Apr 2016 · 4 comments · Source: explosion/spaCy

Hi,
Please correct me if this is not the place to ask this question (I did post this on SO but got no answer).
Link to SO post: HERE.

In summary my question is this:

How can I define a rule-based chunker after POS tagging as I would do in nltk?

NLTK way:

import nltk

cfg_1 = "CUSTOMCHUNK: {<VB><.*>*?<NNP>}"
chunker = nltk.RegexpParser(cfg_1)
data_chunked = chunker.parse(data_pos)  # data_pos is a list of (token, tag) pairs

I'm stuck after doing the coarse POS tagging in spaCy and getting the token-tag pairs.

Any help is greatly appreciated.


All 4 comments

We sometimes miss SO posts, so no worries about posting here. There's also a user group on Reddit: http://reddit.com/r/spacynlp

There are a few ways to go about this. The closest functionality to that RegexpParser class is spaCy's Matcher. But for syntactic chunking, I would typically use the dependency parse. For instance, for NP chunking you have the doc.noun_chunks iterator:

doc = nlp(text)
for np in doc.noun_chunks:
    print(np.text)
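To illustrate the Matcher route mentioned above, here is a minimal sketch of a pattern roughly equivalent to the NLTK regexp <VB><.*>*?<NNP>. It uses the modern (v3) Matcher API, and it builds a Doc with hand-assigned tags purely so the example runs without a trained model; with a real pipeline you would match over nlp(text) instead. The example sentence and tags are invented for illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
# Hand-assigned fine-grained tags, so no model download is needed here
doc = Doc(nlp.vocab, words=["visit", "beautiful", "Paris"],
          tags=["VB", "JJ", "NNP"])

matcher = Matcher(nlp.vocab)
# Rough analogue of <VB><.*>*?<NNP>: a base-form verb,
# any tokens in between, then a proper noun
pattern = [{"TAG": "VB"}, {"OP": "*"}, {"TAG": "NNP"}]
matcher.add("CUSTOMCHUNK", [pattern])

for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # "visit beautiful Paris"
```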

The basic way this works is something like this:

def iter_chunks(doc):
    for token in doc:
        if is_head_of_chunk(token):
            chunk_start = token.left_edge.i
            chunk_end = token.right_edge.i + 1
            yield doc[chunk_start : chunk_end]

You can define the hypothetical is_head_of_chunk function however you like. You can play around with the dependency parse visualizer to see the syntactic annotation scheme and figure out which labels to use: http://spacy.io/demos/displacy
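As a concrete, purely illustrative example, here is one way such an is_head_of_chunk function could be written, mirroring the spirit of the NLTK pattern: treat a base-form verb whose subtree contains a proper noun as a chunk head. The sentence, tags, and parse below are hand-annotated so the sketch runs without a trained model; with a real pipeline you would simply call nlp(text).

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
# Hand-annotated parse (invented example); a trained pipeline would supply this
words = ["She", "wants", "to", "visit", "sunny", "Paris"]
tags = ["PRP", "VBZ", "TO", "VB", "JJ", "NNP"]
heads = [1, 1, 3, 1, 5, 3]          # absolute index of each token's head
deps = ["nsubj", "ROOT", "aux", "xcomp", "amod", "dobj"]
doc = Doc(nlp.vocab, words=words, tags=tags, heads=heads, deps=deps)

def is_head_of_chunk(token):
    # One possible rule, echoing <VB><.*>*?<NNP>:
    # a base-form verb whose subtree contains a proper noun
    return token.tag_ == "VB" and any(t.tag_ == "NNP" for t in token.subtree)

def iter_chunks(doc):
    for token in doc:
        if is_head_of_chunk(token):
            yield doc[token.left_edge.i : token.right_edge.i + 1]

print([span.text for span in iter_chunks(doc)])  # ['to visit sunny Paris']
```

Note that the chunk is the head's whole subtree (left_edge to right_edge), so it can start before the verb itself, e.g. it picks up the "to" here.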

That's very helpful to me. Thanks a lot!

This would have been more helpful if the answer had shown how the is_head_of_chunk function works. Instead we're just left to wonder how the chunking would be done (which was the whole reason for giving the specific example regexp).

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
