Allennlp: Can PretrainedTransformerTokenizer track character offset like WordTokenizer?

Created on 16 Nov 2019  ·  11 Comments  ·  Source: allenai/allennlp

Question

  • Can PretrainedTransformerTokenizer track character offsets the way WordTokenizer does?
    Character offsets are needed to compute the answer span after wordpiece tokenization.
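
To make the need concrete, here is a minimal sketch of how per-token character offsets turn a character-level answer span into a token-level one. The function name `char_span_to_token_span` and the example offsets are hypothetical and are not part of AllenNLP; the only assumption is that each token carries a `(start, end)` character range with an exclusive end.

```python
from typing import List, Optional, Tuple


def char_span_to_token_span(
    offsets: List[Tuple[int, int]],
    char_start: int,
    char_end: int,
) -> Optional[Tuple[int, int]]:
    """Map the character span [char_start, char_end) to an inclusive token span.

    offsets[i] is the (start, end) character range of token i, end exclusive.
    Returns None when no token overlaps the character span.
    """
    token_start = None
    token_end = None
    for i, (start, end) in enumerate(offsets):
        # Skip tokens that do not overlap the requested character range.
        if end <= char_start or start >= char_end:
            continue
        if token_start is None:
            token_start = i
        token_end = i
    if token_start is None:
        return None
    return token_start, token_end


# Hypothetical wordpieces for "Paris is beautiful":
# ["paris", "is", "beau", "##tiful"]
offsets = [(0, 5), (6, 8), (9, 13), (13, 18)]

# The answer "beautiful" covers characters 9 to 18 (end exclusive).
span = char_span_to_token_span(offsets, 9, 18)
print(span)
```

Without offsets, the answer "beautiful" cannot be located reliably once it has been split into several wordpieces.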
Under Development

All 11 comments

This is a TODO in the code. I know that the huggingface repo has code to train SQuAD models, so there must be a way to do this calculation in that repo, but I haven't looked at the code to figure it out. Contributions welcome!

@matt-gardner if this issue is still pending, I would love to take it up. I might need your assistance, as I am relatively new to the code base. If you could provide a list of TODOs, that would really help.

The new tokenizers library from huggingface tracks character offsets, so we don't need to add this ourselves. We have someone here who's going to be fixing this very soon. I'd recommend against picking up this issue.

@matt-gardner Oh, thanks for the update.

A few weeks ago I added a parameter to PretrainedTransformerTokenizer that attempts to calculate offsets after the fact. It does so imperfectly, but it might get you going if you need this right away.
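
For readers wondering what "after the fact" can look like, here is a rough sketch of the general idea: scan the original text left to right and search for each wordpiece's surface form. This is not the AllenNLP implementation. The function name `estimate_offsets` and the `##` continuation prefix are assumptions for illustration, and the approach is imperfect in the same way the comment describes (it fails on normalized or unknown tokens).

```python
from typing import List, Optional, Tuple


def estimate_offsets(
    text: str,
    wordpieces: List[str],
    continuation_prefix: str = "##",
) -> List[Optional[Tuple[int, int]]]:
    """Guess (start, end) character offsets for each wordpiece, end exclusive.

    Pieces that cannot be found in the text (special tokens, unknown tokens,
    pieces altered by normalization) get None.
    """
    # Lowercasing assumes an uncased vocabulary; for a few Unicode characters
    # lower() changes the string length, which would shift the offsets.
    lowered = text.lower()
    offsets: List[Optional[Tuple[int, int]]] = []
    cursor = 0
    for piece in wordpieces:
        surface = piece
        if surface.startswith(continuation_prefix):
            surface = surface[len(continuation_prefix):]
        surface = surface.lower()
        # Search only to the right of the previously matched piece.
        found = lowered.find(surface, cursor) if surface else -1
        if found == -1:
            offsets.append(None)
            continue
        offsets.append((found, found + len(surface)))
        cursor = found + len(surface)
    return offsets


text = "Paris is beautiful"
pieces = ["[CLS]", "paris", "is", "beau", "##tiful", "[SEP]"]
estimated = estimate_offsets(text, pieces)
print(estimated)
```

A tokenizer that records offsets while it tokenizes avoids this guesswork, which is why the thread leans toward the huggingface tokenizers library instead.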

I have no context or intuition about what time label this one should get; @dirkgr, any ideas?

We already have the code for this (https://github.com/allenai/allennlp/pull/3868), so this task is to integrate the new huggingface tokenizers whenever those remaining bugs are fixed, and bring that PR up to date. I'll say that's a day's worth of work.

Just noting that #4018 integrated new huggingface tokenizers, so updating #3868 should be unblocked at this point.

I'm aware.

New huggingface tokenizers are still broken. I'm moving this to the bottom of the stack for 1.0. Maybe we'll bump it to 1.1.

Finally done!
