When modeling with BERT / other wordpiece-based contextualizers, you can have a _matched_ tokenization and a _mismatched_ tokenization.
Matched tokenization: your initial tokenizer produces a sequence of wordpieces, and your contextualizer operates on a sequence of wordpieces.
Mismatched tokenization: your initial tokenizer produces a sequence of words (like spacy's tokenizer), but your contextualizer operates on a sequence of wordpieces. Another way of saying this is that your overall _model_ operates on words, but your _contextualizer_ operates on wordpieces.
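To make the distinction concrete, here is a minimal sketch of both cases. It uses a toy greedy longest-match wordpiece splitter and a made-up vocabulary, not the real BERT tokenizer. The names `wordpiece`, `matched`, and `mismatched` are hypothetical, and recording the index of each word's first wordpiece as its offset is an assumption made for illustration.

```python
from typing import List, Set, Tuple

# Toy vocabulary, purely illustrative (not BERT's real vocabulary).
TOY_VOCAB: Set[str] = {"the", "token", "##izer", "split", "##s", "word", "[UNK]"}


def wordpiece(word: str, vocab: Set[str] = TOY_VOCAB) -> List[str]:
    """Greedy longest-match-first split of a single word into wordpieces."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get a prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def matched(text: str) -> List[str]:
    """Matched case: the tokenizer's output is already a wordpiece sequence."""
    return [piece for word in text.split() for piece in wordpiece(word)]


def mismatched(words: List[str]) -> Tuple[List[str], List[int]]:
    """Mismatched case: the model sees words, the contextualizer sees wordpieces.

    Returns the wordpieces plus, for each word, the index of its first wordpiece.
    """
    pieces: List[str] = []
    offsets: List[int] = []
    for word in words:
        offsets.append(len(pieces))
        pieces.extend(wordpiece(word))
    return pieces, offsets


pieces, offsets = mismatched(["the", "tokenizer", "splits", "words"])
```

In the mismatched case the offsets let the model select one contextualized vector per word, so downstream layers still see a sequence of word length.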
We only handle the mismatched case. We should add a WordpieceTokenizer (probably just a wrapper around the one available in pytorch-pretrained-bert), and corresponding matched Indexers, to more easily handle the matched case. Naming of the indexers is a bit unfortunate at this point, because you'd really want the matched case to have the simpler name, but our WordpieceIndexer is handling the mismatched case.
Why do these need different token indexers? Isn't the only difference that one returns the offsets and one doesn't?
(Relatedly, I was wondering whether it would be possible to create (say) a single SubwordIndexer that replaces the WordpieceIndexer, the OpenAITransformer byte pair indexer, and the forthcoming (?) XLNet SentencePiece indexer, pushing the differences into the tokenizers.)
You're right, maybe there's a way to just use one, and pass a flag that says whether to return offsets and mask and such.
But also the existing one does wordpiece tokenization itself, so it's more than just the return value.
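Something like the following is what I have in mind for the single-indexer idea. This is only a rough sketch: `SubwordIndexer`, its `tokens_are_subwords` flag, and the output keys are all hypothetical, not existing AllenNLP API, and the splitter is passed in as a plain callable so the example stays self-contained.

```python
from typing import Callable, Dict, List


class SubwordIndexer:
    """Hypothetical indexer that covers both the matched and mismatched cases."""

    def __init__(
        self,
        vocab: Dict[str, int],
        splitter: Callable[[str], List[str]],
        tokens_are_subwords: bool,
        unk_token: str = "[UNK]",
    ) -> None:
        self.vocab = vocab
        self.splitter = splitter
        self.tokens_are_subwords = tokens_are_subwords
        self.unk_id = vocab[unk_token]

    def _lookup(self, piece: str) -> int:
        return self.vocab.get(piece, self.unk_id)

    def tokens_to_indices(self, tokens: List[str]) -> Dict[str, List[int]]:
        if self.tokens_are_subwords:
            # Matched case: tokens are already subwords, so just look up ids.
            return {"ids": [self._lookup(token) for token in tokens]}
        # Mismatched case: split each word here and remember where it starts.
        ids: List[int] = []
        offsets: List[int] = []
        for token in tokens:
            offsets.append(len(ids))
            ids.extend(self._lookup(piece) for piece in self.splitter(token))
        return {"ids": ids, "offsets": offsets, "mask": [1] * len(offsets)}


# Toy splitter and vocabulary, for illustration only.
def toy_splitter(word: str) -> List[str]:
    return [word] if len(word) <= 3 else [word[:3], "##" + word[3:]]


toy_vocab = {"[UNK]": 0, "the": 1, "tok": 2, "##en": 3}

mismatched_out = SubwordIndexer(toy_vocab, toy_splitter, False).tokens_to_indices(["the", "token"])
matched_out = SubwordIndexer(toy_vocab, toy_splitter, True).tokens_to_indices(["the", "tok", "##en"])
```

Note that the mismatched branch still has to do the subword splitting itself, which is the point above: the two cases differ in more than just the return value.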
Couldn't you push those differences into the tokenizers themselves? I'd have to think about it.
We should also rethink whether there's a cleaner way to do the embedder_to_indexer_map stuff. For example, if you're using BERT you always have to specify it, and there has to be a way to make that happen automatically.
I can imagine ways of doing this that push a bunch of things to the tokenizer, but they imply that you should also push character tokenization into the tokenizer for TokenCharactersIndexer. I'd guess that you would end up with a monstrously large "tokenizer" abstraction that tries to do too many things.
For the mapping, the issue is that the model stuff needs to know about the data stuff. There is currently no object that is passed between the two, other than the vocabulary and the tensors themselves. So, either you put some extra information into the tensors, which might work, or you add another point of coordination between the model and the data. Another point of coordination would come at significant cost that I'm not sure would be worth it.
For putting the extra info into the tensors: the mapping could just be a key that TextField adds to the tensor dict, and the TextFieldEmbedder pulls it out and uses it.
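A rough sketch of that idea, using plain lists in place of tensors. The key name `MAP_KEY` and the functions `field_as_tensor_dict` and `embed` are hypothetical stand-ins for what TextField and the TextFieldEmbedder would do, not existing code.

```python
from typing import Any, Callable, Dict, List

# Hypothetical reserved key under which the field stores the mapping.
MAP_KEY = "__embedder_to_indexer_map__"


def field_as_tensor_dict(
    indexer_outputs: Dict[str, List[int]],
    embedder_to_indexer_map: Dict[str, List[str]],
) -> Dict[str, Any]:
    """Stand-in for TextField: add the mapping alongside the indexer outputs."""
    tensors: Dict[str, Any] = dict(indexer_outputs)
    tensors[MAP_KEY] = embedder_to_indexer_map
    return tensors


def embed(
    tensor_dict: Dict[str, Any],
    embedders: Dict[str, Callable[..., Any]],
) -> Dict[str, Any]:
    """Stand-in for TextFieldEmbedder: pull the mapping out and use it."""
    tensor_dict = dict(tensor_dict)  # avoid mutating the caller's dict
    mapping = tensor_dict.pop(MAP_KEY, {})
    outputs: Dict[str, Any] = {}
    for name, embedder in embedders.items():
        # Default to the indexer key with the same name as the embedder.
        keys = mapping.get(name, [name])
        outputs[name] = embedder(*[tensor_dict[key] for key in keys])
    return outputs


# Toy "BERT" embedder: selects one wordpiece id per word using the offsets.
def toy_bert(ids: List[int], offsets: List[int]) -> List[int]:
    return [ids[offset] for offset in offsets]


tensor_dict = field_as_tensor_dict(
    {"bert": [1, 2, 3], "bert-offsets": [0, 1]},
    {"bert": ["bert", "bert-offsets"]},
)
embedded = embed(tensor_dict, {"bert": toy_bert})
```

One open question this sketch skips over is batching: the mapping is not a tensor, so whatever collates instances into a batch would need to pass it through untouched.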
Self-assigned, because I'll be implementing the XLNet Tokenizer / Indexer / Embedder soon (whenever they make it into a pytorch-pretrained-bert release), and I'll think about this when I do that.
The PretrainedTransformer classes implement this, and we're planning on adding analogous classes (or extended functionality to the existing classes) to handle the mismatched case. Closing this issue.