I get an AssertionError when running extract_features_aligned_to_words on XLMRModel. Is this a bug, or is there a difference between RoBERTa and XLMRModel?
Here is my code:
from fairseq.models.roberta import XLMRModel
xlmr = XLMRModel.from_pretrained('xlmr.large.v0.tar.gz', checkpoint_file='model.pt')
xlmr.eval()
doc = xlmr.extract_features_aligned_to_words("hello RoBERTa")
And the error:
AssertionError Traceback (most recent call last)
<ipython-input-72-37f3a81eb496> in <module>
----> 1 doc = xlmr.extract_features_aligned_to_words("hello RoBERTa")
~/workspace/fairseq/fairseq/models/roberta/hub_interface.py in extract_features_aligned_to_words(self, sentence, return_all_hiddens)
120 spacy_toks = tokenizer(sentence)
121 spacy_toks_ws = [t.text_with_ws for t in tokenizer(sentence)]
--> 122 alignment = alignment_utils.align_bpe_to_words(self, bpe_toks, spacy_toks_ws)
123
124 # extract features and align them
~/workspace/fairseq/fairseq/models/roberta/alignment_utils.py in align_bpe_to_words(roberta, bpe_tokens, other_tokens)
33
34 # strip leading <s>
---> 35 assert bpe_tokens[0] == '<s>'
36 bpe_tokens = bpe_tokens[1:]
37 assert ''.join(bpe_tokens) == ''.join(other_tokens)
AssertionError:
I also tested RobertaModel, and it fails in extract_features_aligned_to_words as well.
My current version is fairseq 0.8.0.
Both models were loaded with from_pretrained from the '.gz' files linked in the README.
It's not supported. XLMR uses sentencepiece BPE, whereas RoBERTa uses the GPT-2 BPE. Unfortunately, extract_features_aligned_to_words doesn't support sentencepiece BPE yet.
cc @ngoyal2707
@myleott Actually, commit https://github.com/fairinternal/fairseq-py/commit/e8c0196e4927f77e980e4a15375bc6872066fb42#diff-c3ae106584251b0d35cc504bc481482e seems to have added stripping of the bos token in the string() call of dictionary.py, so this is effectively broken for both RoBERTa and XLM-R.
I will send out a fix.
Fix is merged to master
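For anyone who needs word-level alignment in the meantime, here is a minimal sketch of the idea in plain Python. It is not fairseq's implementation: the function name align_pieces_to_words, the SPECIAL set, and the example pieces are all hypothetical, and it assumes sentencepiece-style pieces that mark word starts with "▁" (U+2581). It skips `<s>` and `</s>` if they are present instead of asserting on them, which covers both problems discussed above.

```python
# Hypothetical helper, not part of fairseq: align subword pieces to words.
SPECIAL = {"<s>", "</s>"}  # assumed special tokens; skipped if present


def align_pieces_to_words(pieces, words, word_marker="\u2581"):
    """Return, for each word, the indices of the pieces that spell it."""
    # Keep (original index, cleaned text) for every piece that carries text.
    content = []
    for idx, piece in enumerate(pieces):
        if piece in SPECIAL:
            continue
        text = piece.replace(word_marker, "").strip()
        if text:
            content.append((idx, text))

    targets = [w.strip() for w in words]
    if "".join(t for _, t in content) != "".join(targets):
        raise ValueError("pieces and words do not spell the same text")

    alignment = []
    pos = 0
    for word in targets:
        covered = []
        remaining = word
        # Consume pieces until the current word is fully spelled out.
        while remaining:
            idx, text = content[pos]
            if not remaining.startswith(text):
                raise ValueError("a piece crosses a word boundary")
            remaining = remaining[len(text):]
            covered.append(idx)
            pos += 1
        alignment.append(covered)
    return alignment


# Made-up pieces for illustration; a real model may split the text differently.
pieces = ["<s>", "\u2581hello", "\u2581Ro", "BER", "Ta", "</s>"]
words = ["hello ", "RoBERTa"]
print(align_pieces_to_words(pieces, words))
```

The returned indices refer to positions in the original pieces list, so they can be used to pool the corresponding rows of a feature tensor (for example by averaging) to get one vector per word.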