I get an error when running the following commands to extract features aligned to words:
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval()
ss = 'There were 28 apples in the house. There are 54 apples in the garden.'
roberta.extract_features_aligned_to_words(ss)
The error message is as follows:
~/.cache/torch/hub/pytorch_fairseq_master/fairseq/models/roberta/hub_interface.py in extract_features_aligned_to_words(self, sentence, return_all_hiddens)
125 features = self.extract_features(bpe_toks, return_all_hiddens=return_all_hiddens)
126 features = features.squeeze(0)
--> 127 aligned_feats = alignment_utils.align_features_to_words(self, features, alignment)
128
129 # wrap in spaCy Doc
~/.cache/torch/hub/pytorch_fairseq_master/fairseq/models/roberta/alignment_utils.py in align_features_to_words(roberta, features, alignment)
92 output.append(weighted_features[j])
93 output = torch.stack(output)
---> 94 assert torch.all(torch.abs(output.sum(dim=0) - features.sum(dim=0)) < 1e-4)
95 return output
I am getting the same error, but I think it is related to the length of the input sentence, because the error disappears when I shorten it.
Is there any way to solve this? Thanks in advance.
The problem is the multiple spaces in the input:
# works
roberta.extract_features_aligned_to_words('There were 28 apples in the house. There are 54 apples in the garden.')
# doesn't work (note the double space after "house.")
roberta.extract_features_aligned_to_words('There were 28 apples in the house.  There are 54 apples in the garden.')
The problem is that we assert that the sum of the "aligned" features matches the sum of the original BPE features. Since the BPE encoding models spaces explicitly, this assert fails when the input contains extra spaces. You can probably remove the assert if you don't care about the alignment matching exactly. Alternatively, you can remove the extra spaces from your input.
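If you go the second route, something like the sketch below strips the extra whitespace before calling the hub interface (the collapse_spaces helper is just an illustration, not part of fairseq); note that it changes the input string, so character offsets into the original text will no longer line up:
import re

def collapse_spaces(text: str) -> str:
    """Collapse runs of whitespace into single spaces and strip the ends."""
    return re.sub(r'\s+', ' ', text).strip()

sentence = 'There were 28 apples in the house.  There are 54 apples in the garden.'
doc = roberta.extract_features_aligned_to_words(collapse_spaces(sentence))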
I have the same issue, and it's really hard to figure out which spaces need to be removed. In my case, I do care about the alignment, since I'm extracting embeddings for specific tokens.
I created a custom function because I don't want to use spaCy tokens and I already have gold tokens available.
Consider the below code:
import torch
from typing import List

from fairseq.models.roberta import alignment_utils


def extract_aligned_roberta(roberta, sentence: str, tokens: List[str],
                            return_all_hiddens: bool = False):
    """Aligns RoBERTa features to a given word tokenization of a sentence.

    Code inspired by:
    https://github.com/pytorch/fairseq/blob/master/fairseq/models/roberta/hub_interface.py

    Inputs:
        1. roberta: RoBERTa fairseq hub interface
        2. sentence: the sentence as a string
        3. tokens: the word tokens to which the features should be aligned
    Outputs: aligned RoBERTa features
    """
    # tokenize with the GPT-2 BPE and get the alignment to the given tokens
    bpe_toks = roberta.encode(sentence)
    alignment = alignment_utils.align_bpe_to_words(roberta, bpe_toks, tokens)

    # extract features and align them to the word tokens
    features = roberta.extract_features(bpe_toks, return_all_hiddens=return_all_hiddens)
    features = features.squeeze(0)  # batch size = 1
    aligned_feats = alignment_utils.align_features_to_words(roberta, features, alignment)

    return aligned_feats[1:-1]  # exclude <s> and </s> tokens
This code works for simple sentences:
sentence = 'There were 28 apples in the house. There are 54 apples in the garden.'
tokens = ['There','were', '28', 'apples', 'in', 'the', 'house', '.',
'There','are','54','apples','in','the','garden', '.']
print(extract_aligned_roberta(roberta, sentence, tokens).shape)
Outputs:
torch.Size([16, 1024])
But when I use another sentence such as:
sentence1 = "DPA : Iraqi authorities announced that they had busted up 3 terrorist cells operating in Baghdad. Two of them were being run by 2 officials of the Ministry of the Interior! The MoI in Iraq is equivalent to the US FBI, so this would be like having J. Edgar Hoover unwittingly employ at a high level members of the Weathermen bombers back in the 1960s."
tokens1 = ['DPA', ':', 'Iraqi', 'authorities', 'announced', 'that', 'they', 'had', 'busted', 'up', '3', 'terrorist', 'cells', 'operating', 'in', 'Baghdad', '.', 'Two', 'of', 'them', 'were', 'being', 'run', 'by', '2', 'officials', 'of', 'the', 'Ministry', 'of', 'the', 'Interior', '!', 'The', 'MoI', 'in', 'Iraq', 'is', 'equivalent', 'to', 'the', 'US', 'FBI', ',', 'so', 'this', 'would', 'be', 'like', 'having', 'J.', 'Edgar', 'Hoover', 'unwittingly', 'employ', 'at', 'a', 'high', 'level', 'members', 'of', 'the', 'Weathermen', 'bombers', 'back', 'in', 'the', '1960s', '.']
print(extract_aligned_roberta(roberta, sentence1, tokens1).shape)
Then I get the same error:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-23-a1b5dbdbcc62> in <module>
----> 1 extract_aligned_roberta(roberta, sentence, tokens).shape
<ipython-input-1-093c6979ba7b> in extract_aligned_roberta(roberta, sentence, tokens, return_all_hiddens)
28 features = roberta.extract_features(bpe_toks, return_all_hiddens=return_all_hiddens)
29 features = features.squeeze(0) #Batch-size = 1
---> 30 aligned_feats = alignment_utils.align_features_to_words(roberta, features, alignment)
31
32 return aligned_feats[1:-1] #exclude <s> and </s> tokens
~/anaconda3/envs/allennlp/lib/python3.6/site-packages/fairseq/models/roberta/alignment_utils.py in align_features_to_words(roberta, features, alignment)
92 output.append(weighted_features[j])
93 output = torch.stack(output)
---> 94 assert torch.all(torch.abs(output.sum(dim=0) - features.sum(dim=0)) < 1e-4)
95 return output
96
AssertionError:
And it's not clear to me whether there are any extra spaces in the sentence. Any help here?
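For reference, a quick way to look for runs of repeated whitespace would be something like the sketch below (find_extra_spaces is just an illustrative helper, and the regex is only a rough guess at what counts as an "extra" space):
import re

def find_extra_spaces(text: str):
    """Return (start, end) spans of runs of two or more whitespace characters."""
    return [m.span() for m in re.finditer(r'\s{2,}', text)]

print(find_extra_spaces(sentence1))  # an empty list means no repeated whitespace was found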
I face the same problem in my usage. I think the assertion is there to make sure the weighted sum works correctly, but there can be some numerical error after all the calculations, and 1e-4 is just a threshold to ensure that this error is not too big.
In my case, I just enlarged the threshold from 1e-4 to 1e-3, and that fixed my problem.
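If editing the installed fairseq file is not an option, a drop-in copy of the alignment step with a configurable tolerance is another way to get the same effect. The sketch below (the name align_features_to_words_tolerant and the warning-instead-of-assert behaviour are mine, and it only loosely mirrors fairseq's align_features_to_words) can be called in place of alignment_utils.align_features_to_words in the extract_aligned_roberta function above:
from collections import Counter

import torch


def align_features_to_words_tolerant(roberta, features, alignment, atol=1e-3):
    """Distribute BPE features over word tokens with a configurable tolerance.

    Loosely mirrors fairseq's alignment_utils.align_features_to_words: each
    BPE feature is divided by the number of words it is aligned to, then the
    weighted features are summed per word. The roberta argument is kept only
    so the signature matches the library function.

    features: tensor of shape (num_bpe_tokens, hidden_size)
    alignment: output of alignment_utils.align_bpe_to_words, i.e. one list of
        BPE indices per word token.
    """
    # how many words each BPE token is aligned to
    bpe_counts = Counter(j for bpe_indices in alignment for j in bpe_indices)
    denom = features.new([bpe_counts.get(j, 1) for j in range(len(features))])
    weighted = features / denom.unsqueeze(-1)

    output = [weighted[0]]  # <s>
    largest_j = -1
    for bpe_indices in alignment:
        output.append(weighted[bpe_indices].sum(dim=0))
        largest_j = max(largest_j, *bpe_indices)
    for j in range(largest_j + 1, len(features)):  # trailing tokens, e.g. </s>
        output.append(weighted[j])
    output = torch.stack(output)

    # warn instead of crashing when the two sums drift apart numerically
    if not torch.allclose(output.sum(dim=0), features.sum(dim=0), atol=atol):
        print('warning: aligned features deviate from the BPE features '
              'by more than atol={}'.format(atol))
    return output
With that in place, extract_aligned_roberta would call align_features_to_words_tolerant(roberta, features, alignment, atol=1e-3) instead of the library function.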