Hi,
Please correct me if this is not the place to ask this question (I did post this on SO but got no answer).
Link to SO post: HERE.
In summary my question is this:
How can I define a rule-based chunker after POS tagging as I would do in nltk?
NLTK way:
cfg_1 = "CUSTOMCHUNK: {<VB><.*>*?<NNP>}"
chunker = nltk.RegexpParser(cfg_1)
data_chunked = chunker.parse(data_pos)
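For context, the snippet above expects data_pos to be a list of (word, tag) pairs. A minimal runnable version with hand-written tags (the sentence and tags are just an illustration, so no tagger download is needed):

```python
import nltk

# data_pos would normally come from a POS tagger such as nltk.pos_tag;
# written by hand here to keep the example self-contained
data_pos = [("Please", "UH"), ("call", "VB"), ("John", "NNP")]

cfg_1 = "CUSTOMCHUNK: {<VB><.*>*?<NNP>}"
chunker = nltk.RegexpParser(cfg_1)
data_chunked = chunker.parse(data_pos)

# The VB..NNP span ("call", "John") should be grouped under CUSTOMCHUNK
print(data_chunked)
```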
I'm stuck after doing the coarse POS tagging in spaCy and getting the token-tag pairs.
Any help is greatly appreciated.
We sometimes miss SO posts, so no worries about posting here. There's also a user group on Reddit: http://reddit.com/r/spacynlp
There are a few ways to go about this. The closest functionality to that RegexpParser class is spaCy's Matcher. But for syntactic chunking, I would typically use the dependency parse. For instance, for NP chunking you have the doc.noun_chunks iterator:
doc = nlp(text)
for np in doc.noun_chunks:
    print(np.text)
The basic way that this works is something like this:
for token in doc:
    if is_head_of_chunk(token):
        chunk_start = token.left_edge.i
        chunk_end = token.right_edge.i + 1
        yield doc[chunk_start : chunk_end]
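As one illustration, here's a minimal sketch of the hypothetical is_head_of_chunk predicate that treats nominal-argument dependency labels as chunk heads. The label set is an assumption for the example, not spaCy's actual noun_chunks logic, and the chunks helper is likewise illustrative:

```python
# Assumed label set: treat subjects and objects as chunk heads.
# This is an illustration, not spaCy's built-in noun_chunks rules.
CHUNK_HEAD_DEPS = {"nsubj", "nsubjpass", "dobj", "iobj", "pobj"}

def is_head_of_chunk(token):
    """Return True if the token's dependency label marks a chunk head."""
    return token.dep_ in CHUNK_HEAD_DEPS

def chunks(doc):
    # Yield the full subtree span around each chunk head
    for token in doc:
        if is_head_of_chunk(token):
            yield doc[token.left_edge.i : token.right_edge.i + 1]
```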
You can define the hypothetical is_head_of_chunk function however you like. You can play around with the dependency parse visualizer to see the syntactic annotation scheme, and figure out what labels to use: http://spacy.io/demos/displacy
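As for the Matcher route mentioned above, a rough translation of the original NLTK pattern might look like this. This sketch uses the current Matcher API (which postdates parts of this thread), assigns fine-grained tags by hand to a blank pipeline so no trained model is needed, and note that the Matcher's `*` wildcard is not non-greedy like NLTK's `*?` (it simply returns every candidate span):

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline: tags are set by hand for this sketch. With a trained
# pipeline, nlp(text) would fill in token.tag_ itself.
nlp = spacy.blank("en")
doc = nlp("Please call John tomorrow")
for token, tag in zip(doc, ["UH", "VB", "NNP", "NN"]):
    token.tag_ = tag

matcher = Matcher(nlp.vocab)
# Rough equivalent of {<VB><.*>*?<NNP>}: a VB, any tokens, then an NNP
pattern = [{"TAG": "VB"}, {"OP": "*"}, {"TAG": "NNP"}]
matcher.add("CUSTOMCHUNK", [pattern])

for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```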
That's very helpful to me. Thanks a lot!
This would have been more helpful if the answer had shown how the is_head_of_chunk function works. Instead we're left to wonder how the chunking would be done (which was the whole reason for giving the specific example regexp).
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.