In doc.pyx at line 590:
if not self.is_parsed:
raise ValueError(Errors.E029)
I can still do a good job of chunking using only tokenization and POS tagging, without the full parse. Also, in some languages a parse isn't available. Lifting this restriction would give users more flexibility. I can comment this check out in my copy of spaCy, but every time I update spaCy to a new release, I have to change it again.
It would be great if this check could be lifted.
I think it would be fine to move this check into each individual noun chunks iterator rather than having it in Doc.noun_chunks.
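The suggested change can be sketched as follows. This is a hedged illustration, not spaCy's actual implementation: the chunking rule, the dependency labels, and the fake `Doc`/`Token` shapes are all simplified placeholders; real per-language iterators live in `lang/*/syntax_iterators.py` and build spans from `left_edge`/`right_edge`.

```python
def noun_chunks(doclike):
    # Sketch: the parse check moves into each language's iterator,
    # so Doc.noun_chunks no longer has to raise E029 globally.
    if not getattr(doclike, "is_parsed", False):
        raise ValueError(
            "noun_chunks requires the dependency parse, which "
            "requires a statistical model to be installed and loaded."
        )
    # Illustrative rule only: yield (start, end) token offsets for
    # tokens whose dependency label marks a nominal.
    np_deps = {"nsubj", "dobj", "iobj", "pobj"}
    for token in doclike:
        if token.dep_ in np_deps:
            yield token.i, token.i + 1
```

A language that cannot provide a parse could instead implement its iterator with a tag-based heuristic and skip the check entirely, which is the flexibility being asked for here.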
Would you like to submit a PR to make this change? It looks like there are 9 languages that would need to be modified.
How to submit a PR? Have never done that and thanks.
You can find some basic information here: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md#contributing-to-the-code-base
Basically, you'll have to fork (copy) the repo and build it from source. Set up a new virtual environment and use git to do this. Then you can make changes on a new branch, test them, commit them, and if you're satisfied with the final result, you can go to your branches (for me, this is https://github.com/svlandeg/spaCy/branches) and hit the "New pull request" button on the right next to your local branch (you will only see this button if you own that specific fork). This will create a PR against spaCy's master branch which we can then review.
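The workflow above roughly translates to the commands below. This is only a sketch: `<your-user>` and the branch name are placeholders, and the exact build steps may differ between spaCy versions (check the contributing guide for the current ones).

```shell
# Fork spaCy on GitHub first, then clone your fork
git clone https://github.com/<your-user>/spaCy
cd spaCy

# New virtual environment, then build from source
python -m venv .env && source .env/bin/activate
pip install -r requirements.txt
pip install -e .

# Work on a dedicated branch, commit, and push to your fork
git checkout -b fix/noun-chunks-check
# ... edit, run the test suite ...
git commit -am "Move parse check into noun chunks iterators"
git push origin fix/noun-chunks-check
# Then open the "New pull request" button on your branch page
```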
My implementation now actually depends on this recently updated release:
https://github.com/howl-anderson/Chinese_models_for_SpaCy
I am talking to the author, and we think it would be nice to make that model official in spaCy for Chinese. Then we can improve the quality of some of its models once we start using it. So I would rather wait until it's merged into the codebase. My previous version depends on the Jieba POS tagger, but I would prefer using the full-fledged Chinese model.
Hi @svlandeg and @adrianeboyd! I would like to pick this up if no one is actively working on it.
Sure, I don't think anyone is currently working on this. Moving the check to the individual noun chunks iterators, as Adriane suggested, should be straightforward to do, irrespective of potential other changes to Chinese.
I think Adriane is working on a full Chinese model release, and it would be better to work on this after that release. Jieba's POS tagging is shaky.
This change doesn't really interact with the Chinese model development, so it would be totally fine to start working on it now.
Thanks, I am working on it and will keep you posted.
Hi @adrianeboyd and @svlandeg, I have created a PR #5396 to enable noun_chunks for specific languages.
Please review and share the feedback when you get a chance. :)
Hi @adrianeboyd, Thanks for reviewing and merging the PR! 👍 Is this issue good to be closed?
Yep, this can be closed, as the PR is merged :-)
By the way @vishnupriyavr, as a small tip for next time, if you put something like Fixes #5526 in the description of the PR, the corresponding issue would close automatically when the PR gets merged ;-)
Hi @svlandeg, that's a very informative tip! Will keep it in mind for next time 😊