I'd like to implement functionality that can correct tokenization errors (both boundaries and tags) using the parser. With this error correction, our Japanese language model will be able to resolve ambiguous POS tags (such as サ変名詞, which can be NOUN or VERB) and merge over-segmented tokens.
I found a related mention in the v2.1 release notes:
Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
Could you please give me links to the source code that does this? @honnibal
Can we apply "joint word segmentation and parsing" to a single (and possibly root) token span?
Have you had a look at retokenization in spaCy?
That allows you to update the attributes of tokens, such as POS, using retokenizer.merge.
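For example, here is a minimal sketch of that API on a hand-built, pre-tokenized Doc (the words and POS values are only illustrative):

```python
from spacy.vocab import Vocab
from spacy.tokens import Doc

# A small pre-tokenized Doc; in practice this would come from the Japanese tokenizer.
words = ["外国", "人", "参政", "権"]
doc = Doc(Vocab(), words=words, spaces=[False] * len(words))

# Merge over-segmented tokens and set attributes such as POS on the merged tokens.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2], attrs={"POS": "NOUN"})
    retokenizer.merge(doc[2:4], attrs={"POS": "NOUN"})

print([(t.text, t.pos_) for t in doc])  # [('外国人', 'NOUN'), ('参政権', 'NOUN')]
```

Both spans recorded inside the with block refer to the original token indices, so the two merges don't interfere with each other.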
Sure. I've been using retokenization APIs.
In GiNZA, I'm using logic based on extended dependency labels, e.g. "obj_as_NOUN", to disambiguate POS, and it also appends a virtual root token after the last token of the sentence to distinguish the POS of the real root token, e.g. "root_as_VERB" (a small decoding sketch follows the links below).
https://github.com/megagonlabs/ginza/blob/develop/ja_ginza/parse_tree.py#L445
https://github.com/megagonlabs/ginza/blob/feature/apply_spacy_v2.1/ja_ginza/parse_tree.py#L433
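For clarity, a tiny sketch of how such an extended label could be decoded back into a plain dependency relation plus a POS hint; split_extended_label is a hypothetical helper written for illustration, not GiNZA's actual code:

```python
def split_extended_label(dep_label):
    # Hypothetical helper: "obj_as_NOUN" -> ("obj", "NOUN"); plain labels pass through.
    if "_as_" in dep_label:
        dep, pos = dep_label.rsplit("_as_", 1)
        return dep, pos
    return dep_label, None

print(split_extended_label("obj_as_NOUN"))   # ('obj', 'NOUN')
print(split_extended_label("root_as_VERB"))  # ('root', 'VERB')
print(split_extended_label("nsubj"))         # ('nsubj', None)
```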
This tricky logic is quite complicated and also hurts performance.
I'd like to refactor it.
Thanks, @BreakBB
@BreakBB Actually this refers to the parser-based mechanism, which uses the subtok label. This is a bit different from the retokenization.
@hiroshi-matsuda-rit In the command-line interface, it should be as simple as adding --learn-tokens. The mechanism works like this:
1. In the GoldParse class, we receive a pair (doc, annotations), where the annotations include the gold-standard segmentation, and the doc object contains the predicted tokenization. We then do a Levenshtein alignment between the two. The alignment is called in spacy/gold.pyx, and the main logic is in spacy/_align.pyx.
2. Where the tokenizer has over-segmented, the gold-standard dependencies are set so that the extra tokens attach to the following token with the label subtok. The head for these subtok tokens will be the next word. This occurs in spacy/gold.pyx.
3. The parser is trained to predict these subtok labels. Additional constraints on this label ensure that the parser can only predict subtok for length-1 arcs, and that subtokens cannot cross sentence boundaries.
4. After parsing, tokens connected by the subtok relation are merged, using doc.retokenize(). This should be occurring in the merge_subtokens pipeline component in v2.1.4. In the next release, this will be moved into parser.postprocesses, to make the system more self-contained.
It sounds to me like your system would benefit from having several ROOT labels, which could be interpreted with different meanings. Currently the ROOT label is hard-coded, which prevents this.
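To make point 4 concrete, here is a rough, hand-rolled approximation of that merge step (merge_subtok_spans is a hypothetical helper, not spaCy's actual merge_subtokens implementation; it only assumes that subtok arcs always attach a token to the next one):

```python
def merge_subtok_spans(doc, label="subtok"):
    # Collect maximal runs of tokens connected by the subtok relation.
    # Each subtok arc points to the next token, so a run of consecutive
    # subtok tokens plus the following head forms one surface word.
    spans = []
    i = 0
    while i < len(doc):
        if doc[i].dep_ == label:
            start = i
            while i < len(doc) and doc[i].dep_ == label:
                i += 1
            if i < len(doc):
                spans.append(doc[start : i + 1])  # include the final head token
        i += 1
    # Merge the collected spans back into single tokens.
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    return doc
```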
@honnibal I simply shared what I have found in the docs. Thanks for the clarification!
@honnibal Thank you so much for your precise description of the subtok concatenation procedure.
I decided to replace GiNZA's POS disambiguation and retokenization procedures with spaCy's POS tagger and --learn-tokens, respectively.
spaCy's train command works well with the -G option but does not work with SudachiTokenizer (without -G).
It seems that we should retokenize the dataset with the tokenizer in advance, to avoid inconsistent situations.
I encountered an error at the beginning of the first evaluation phase (just after the first training phase).
python -m spacy train ja ja_gsd-ud ja_gsd-ud-train.json ja_gsd-ud-dev.json -p tagger,parser -ne 2 -V 1.2.2 -pt dep,tag -v models/ja_gsd-1.2.1/ -VV
...
✔ Saved model to output directory
ja_gsd-ud/model-final
⠙ Creating best model...
Traceback (most recent call last):
File "/home/matsuda/.pyenv/versions/3.7.2/lib/python3.7/site-packages/spacy/cli/train.py", line 257, in train
losses=losses,
File "/home/matsuda/.pyenv/versions/3.7.2/lib/python3.7/site-packages/spacy/language.py", line 457, in update
proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
File "nn_parser.pyx", line 413, in spacy.syntax.nn_parser.Parser.update
File "nn_parser.pyx", line 519, in spacy.syntax.nn_parser.Parser._init_gold_batch
File "transition_system.pyx", line 86, in spacy.syntax.transition_system.TransitionSystem.get_oracle_sequence
File "arc_eager.pyx", line 592, in spacy.syntax.arc_eager.ArcEager.set_costs
ValueError: [E020] Could not find a gold-standard action to supervise the dependency parser. The tree is non-projective (i.e. it has crossing arcs - see spacy/syntax/nonproj.pyx for definitions). The ArcEager transition system only supports projective trees. To learn non-projective representations, transform the data before training and after parsing. Either pass `make_projective=True` to the GoldParse class, or use spacy.syntax.nonproj.preprocess_training_data.
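For reference, a minimal sketch of the first workaround the error message suggests, passing make_projective=True to GoldParse; the toy words, heads, and labels below are purely illustrative, not taken from the dataset:

```python
from spacy.vocab import Vocab
from spacy.tokens import Doc
from spacy.gold import GoldParse

# Toy sentence with a crossing arc; real data would come from ja_gsd-ud-train.json.
words = ["A", "B", "C", "D"]
doc = Doc(Vocab(), words=words)

gold = GoldParse(
    doc,
    words=words,
    heads=[2, 3, 3, 3],          # arcs 0->2 and 1->3 cross, so the tree is non-projective
    deps=["dep", "dep", "dep", "ROOT"],
    make_projective=True,        # lift crossing arcs so ArcEager gets a valid oracle
)
print(gold.heads)  # head indices after projectivization
```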
I'd like to report how I solve this problem soon.
Anyway, I think a lot of applications around the world would benefit from being able to use customized root labels.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.