Fairseq: Error in roberta.extract_features_aligned_to_words()

Created on 3 Sep 2019 · 4 comments · Source: pytorch/fairseq

Running the following commands to extract features aligned to words throws an error:

roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval()

ss = 'There were 28 apples in the house.  There are 54 apples in the garden.'
roberta.extract_features_aligned_to_words(ss)

The error message is as follows:

~/.cache/torch/hub/pytorch_fairseq_master/fairseq/models/roberta/hub_interface.py in extract_features_aligned_to_words(self, sentence, return_all_hiddens)
    125         features = self.extract_features(bpe_toks, return_all_hiddens=return_all_hiddens)
    126         features = features.squeeze(0)
--> 127         aligned_feats = alignment_utils.align_features_to_words(self, features, alignment)
    128 
    129         # wrap in spaCy Doc

~/.cache/torch/hub/pytorch_fairseq_master/fairseq/models/roberta/alignment_utils.py in align_features_to_words(roberta, features, alignment)
     92         output.append(weighted_features[j])
     93     output = torch.stack(output)
---> 94     assert torch.all(torch.abs(output.sum(dim=0) - features.sum(dim=0)) < 1e-4)
     95     return output
Labels: help wanted, question

All 4 comments

I am getting the same error, but I think it is related to the length of the input sentence, because when I shorten it, the error disappears.

Is there any way to solve this? Thanks in advance.

The problem is the multiple spaces in the input:

# works
roberta.extract_features_aligned_to_words('There were 28 apples in the house. There are 54 apples in the garden.')

# doesn't work
roberta.extract_features_aligned_to_words('There were 28 apples in the house.  There are 54 apples in the garden.')

The problem is that we assert that the sum of the "aligned" version matches the sum of the original BPE version. Since the BPE code models spaces explicitly, this assert fails. You can probably remove the assert if you don't care about the alignment matching exactly. Alternatively you can try to remove these extra spaces from your input.
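If you don't control the spacing of your inputs, collapsing whitespace before encoding is a simple way to apply the second suggestion. A minimal sketch (`normalize_ws` is a hypothetical helper, not part of fairseq; note that it will shift character offsets if you rely on them elsewhere):

```python
import re

def normalize_ws(text: str) -> str:
    """Collapse every whitespace run to a single space and trim the ends,
    so the GPT-2 BPE never sees explicit space tokens."""
    return re.sub(r"\s+", " ", text).strip()

sentence = 'There were 28 apples in the house.  There are 54 apples in the garden.'
cleaned = normalize_ws(sentence)
# cleaned is now 'There were 28 apples in the house. There are 54 apples in the garden.'
# roberta.extract_features_aligned_to_words(cleaned)  # no longer trips the assert
```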

I have the same issue, and it's really hard to figure out which spaces need to be removed. And in my case, I do care about alignments, as I'm looking to extract embeddings for some specific tokens.

I created a custom function because I don't want to use spaCy tokens and I already have gold tokens available.

Consider the below code:

from typing import List

from fairseq.models.roberta import alignment_utils

def extract_aligned_roberta(roberta, sentence: str,
                            tokens: List[str],
                            return_all_hiddens: bool = False):
    '''Aligns RoBERTa features to a given word tokenization of a sentence.

    Inspired by:
    https://github.com/pytorch/fairseq/blob/master/fairseq/models/roberta/hub_interface.py

    Inputs:
    1. roberta: fairseq RoBERTa hub interface
    2. sentence: the sentence as a string
    3. tokens: the word-level tokens to align the features to

    Output: aligned RoBERTa features
    '''
    # tokenize with the GPT-2 BPE and align the BPE tokens to the given words
    bpe_toks = roberta.encode(sentence)
    alignment = alignment_utils.align_bpe_to_words(roberta, bpe_toks, tokens)

    # extract features and align them to the words
    features = roberta.extract_features(bpe_toks, return_all_hiddens=return_all_hiddens)
    features = features.squeeze(0)  # batch size is 1
    aligned_feats = alignment_utils.align_features_to_words(roberta, features, alignment)

    return aligned_feats[1:-1]  # exclude the <s> and </s> tokens

This code works for simple sentences:

sentence = 'There were 28 apples in the house. There are 54 apples in the garden.'
tokens = ['There','were', '28', 'apples', 'in', 'the', 'house', '.',  
              'There','are','54','apples','in','the','garden', '.']
print(extract_aligned_roberta(roberta, sentence, tokens).shape)

Outputs:
torch.Size([16, 1024])

But when I use another sentence such as:

sentence1 = "DPA : Iraqi authorities announced that they had busted up 3 terrorist cells operating in Baghdad. Two of them were being run by 2 officials of the Ministry of the Interior! The MoI in Iraq is equivalent to the US FBI, so this would be like having J. Edgar Hoover unwittingly employ at a high level members of the Weathermen bombers back in the 1960s."

tokens1 = ['DPA', ':', 'Iraqi', 'authorities', 'announced', 'that', 'they', 'had', 'busted', 'up', '3', 'terrorist', 'cells', 'operating', 'in', 'Baghdad', '.', 'Two', 'of', 'them', 'were', 'being', 'run', 'by', '2', 'officials', 'of', 'the', 'Ministry', 'of', 'the', 'Interior', '!', 'The', 'MoI', 'in', 'Iraq', 'is', 'equivalent', 'to', 'the', 'US', 'FBI', ',', 'so', 'this', 'would', 'be', 'like', 'having', 'J.', 'Edgar', 'Hoover', 'unwittingly', 'employ', 'at', 'a', 'high', 'level', 'members', 'of', 'the', 'Weathermen', 'bombers', 'back', 'in', 'the', '1960s', '.']

print(extract_aligned_roberta(roberta, sentence1, tokens1).shape)

Then I get the same error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-23-a1b5dbdbcc62> in <module>
----> 1 extract_aligned_roberta(roberta, sentence, tokens).shape

<ipython-input-1-093c6979ba7b> in extract_aligned_roberta(roberta, sentence, tokens, return_all_hiddens)
     28     features = roberta.extract_features(bpe_toks, return_all_hiddens=return_all_hiddens)
     29     features = features.squeeze(0)   #Batch-size = 1
---> 30     aligned_feats = alignment_utils.align_features_to_words(roberta, features, alignment)
     31 
     32     return aligned_feats[1:-1]  #exclude <s> and </s> tokens

~/anaconda3/envs/allennlp/lib/python3.6/site-packages/fairseq/models/roberta/alignment_utils.py in align_features_to_words(roberta, features, alignment)
     92         output.append(weighted_features[j])
     93     output = torch.stack(output)
---> 94     assert torch.all(torch.abs(output.sum(dim=0) - features.sum(dim=0)) < 1e-4)
     95     return output
     96 

AssertionError: 

And it's not clear to me if there are any extra spaces in the sentence.

Any help here?
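In case it helps with debugging, a quick way to check whether a sentence contains the extra spaces mentioned above is to scan it for whitespace runs longer than one character. A small sketch (`extra_space_spans` is a hypothetical helper, not part of fairseq):

```python
import re

def extra_space_spans(sentence: str):
    """Return the character spans of any whitespace runs longer than one
    space; these make the GPT-2 BPE emit explicit space tokens, which is
    one known trigger of the alignment assertion."""
    return [m.span() for m in re.finditer(r"\s{2,}", sentence)]

print(extra_space_spans("There were 28 apples in the house.  There are 54 apples."))
# a non-empty list means the sentence has extra spaces to clean up
```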

I faced the same problem in my usage. I think the assertion is there to make sure the weighted sum works correctly, but there can be some numerical error after all the calculations, and 1e-4 is just a threshold to ensure that error is not too big.

In my case, I simply enlarged the threshold from 1e-4 to 1e-3, and that fixed the problem.
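If you would rather not edit the installed package, the same workaround can be applied via a local copy of the function with the tolerance exposed as a parameter. This is a sketch reconstructed from the lines visible in the traceback and the function's documented contract (each word's feature is the sum of its BPE features, downweighted by how many words share each BPE token), so double-check it against your installed `alignment_utils.py`:

```python
from collections import Counter

import torch

def align_features_to_words_relaxed(features, alignment, tolerance=1e-3):
    """Relaxed local variant of fairseq's align_features_to_words.

    features:  (num_bpe_tokens, dim) tensor, including the <s> and </s> rows
    alignment: one list of BPE token indices per word
    tolerance: replaces the hard-coded 1e-4 in the upstream sanity check
    """
    assert features.dim() == 2

    # how many words each BPE token is shared by
    bpe_counts = Counter(j for bpe_indices in alignment for j in bpe_indices)
    assert bpe_counts[0] == 0, "<s> should not be aligned to any word"
    denom = features.new_tensor([bpe_counts.get(j, 1) for j in range(len(features))])
    weighted_features = features / denom.unsqueeze(-1)

    output = [weighted_features[0]]  # keep <s>
    largest_j = -1
    for bpe_indices in alignment:
        # each word's feature is the sum of its (downweighted) BPE features
        output.append(weighted_features[bpe_indices].sum(dim=0))
        largest_j = max(bpe_indices + [largest_j])
    for j in range(largest_j + 1, len(features)):
        output.append(weighted_features[j])  # keep trailing tokens (</s>)
    output = torch.stack(output)

    # same sanity check as upstream, but with a configurable tolerance
    assert torch.all(torch.abs(output.sum(dim=0) - features.sum(dim=0)) < tolerance)
    return output
```

You could then call this instead of `alignment_utils.align_features_to_words` inside a custom extraction function like the one in the comment above.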
