spaCy: Chunking with rule-based grammar in spaCy

Created on 18 Apr 2016 · 4 comments · Source: explosion/spaCy

Hi,
Please correct me if this is not the place to ask this question (I did post this on SO but got no answer).
Link to SO post: HERE.

In summary my question is this:

How can I define a rule-based chunker after POS tagging as I would do in nltk?

NLTK way:

import nltk

cfg_1 = "CUSTOMCHUNK: {<VB><.*>*?<NNP>}"
chunker = nltk.RegexpParser(cfg_1)
data_chunked = chunker.parse(data_pos)  # data_pos is a list of (token, tag) pairs

I'm stuck after doing the coarse POS tagging in spaCy and getting the token-tag pairs.

Any help is greatly appreciated.


All 4 comments

We sometimes miss SO posts, so no worries about posting here. There's also a user group on Reddit: http://reddit.com/r/spacynlp

There are a few ways to go about this. The closest functionality to that RegexpParser class is spaCy's Matcher. But for syntactic chunking, I would typically use the dependency parse. For instance, for NP chunking you have the doc.noun_chunks iterator:

doc = nlp(text)
for np in doc.noun_chunks:
    print(np.text)
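To illustrate the Matcher route mentioned above, here is a minimal sketch of a pattern roughly equivalent to the NLTK regexp <VB><.*>*?<NNP>. It uses the modern (v3) Matcher API, and it builds a Doc with hand-assigned tags purely so the example runs without a trained model; with a real pipeline you would match over nlp(text) instead. The example sentence and tags are invented for illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
# Hand-assigned fine-grained tags, so no model download is needed here
doc = Doc(nlp.vocab, words=["visit", "beautiful", "Paris"],
          tags=["VB", "JJ", "NNP"])

matcher = Matcher(nlp.vocab)
# Rough analogue of <VB><.*>*?<NNP>: a base-form verb,
# any tokens in between, then a proper noun
pattern = [{"TAG": "VB"}, {"OP": "*"}, {"TAG": "NNP"}]
matcher.add("CUSTOMCHUNK", [pattern])

for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # "visit beautiful Paris"
```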

The basic way this works is something like this:

def iter_chunks(doc):
    for token in doc:
        if is_head_of_chunk(token):
            chunk_start = token.left_edge.i
            chunk_end = token.right_edge.i + 1
            yield doc[chunk_start : chunk_end]

You can define the hypothetical is_head_of_chunk function however you like. You can play around with the dependency parse visualizer to see the syntactic annotation scheme and figure out which labels to use: http://spacy.io/demos/displacy
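As a concrete, purely illustrative example, here is one way such an is_head_of_chunk function could be written, mirroring the spirit of the NLTK pattern: treat a base-form verb whose subtree contains a proper noun as a chunk head. The sentence, tags, and parse below are hand-annotated so the sketch runs without a trained model; with a real pipeline you would simply call nlp(text).

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
# Hand-annotated parse (invented example); a trained pipeline would supply this
words = ["She", "wants", "to", "visit", "sunny", "Paris"]
tags = ["PRP", "VBZ", "TO", "VB", "JJ", "NNP"]
heads = [1, 1, 3, 1, 5, 3]          # absolute index of each token's head
deps = ["nsubj", "ROOT", "aux", "xcomp", "amod", "dobj"]
doc = Doc(nlp.vocab, words=words, tags=tags, heads=heads, deps=deps)

def is_head_of_chunk(token):
    # One possible rule, echoing <VB><.*>*?<NNP>:
    # a base-form verb whose subtree contains a proper noun
    return token.tag_ == "VB" and any(t.tag_ == "NNP" for t in token.subtree)

def iter_chunks(doc):
    for token in doc:
        if is_head_of_chunk(token):
            yield doc[token.left_edge.i : token.right_edge.i + 1]

print([span.text for span in iter_chunks(doc)])  # ['to visit sunny Paris']
```

Note that the chunk is the head's whole subtree (left_edge to right_edge), so it can start before the verb itself, e.g. it picks up the "to" here.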

That's very helpful to me. Thanks a lot!

This would have been more helpful if the answer had shown how the is_head_of_chunk function works. Instead we're just left to wonder how the chunking would be done (which was the whole reason for giving the specific example regexp).

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
