Transformers: Tokenization issue with RoBERTa and DistilRoBERTa.

Created on 20 Apr 2020 · 2 comments · Source: huggingface/transformers

🐛 Bug

Information

Model I am using (Bert, XLNet ...):
RoBERTa (roberta-base), DistilRoBERTa (distilroberta-base)
Language I am using the model on (English, Chinese ...):
English
The problem arises when using:

  • [ ] the official example scripts: (give details below)
  • [x] my own modified scripts: (give details below)

I am trying to encode embeddings for sentences, and I found a tokenization issue with a certain type of sentence that ends with ').'. The tokenizer does not separate ')' from '.', which in turn causes issues with the resulting sentence length.

The task I am working on is:

  • [ ] an official GLUE/SQuAD task: (give the name)
  • [x] my own task or dataset: (give details below)

Dataset: SemEval 2016 Task 5, SB1 EN-REST

To reproduce

Steps to reproduce the behavior:

See the following code:

import torch
from transformers import AutoModel, AutoTokenizer

# Sentence ending with ").": the tokenizer keeps ')' and '.' together.
text = '(Besides that there should be more restaurants like it around the city).'
for model_name in ['roberta-base', 'distilroberta-base']:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    token_dict = tokenizer.encode_plus(text, None, return_tensors='pt')

    print('model_name: {}'.format(model_name))
    print("Token (str): {}".format(
        tokenizer.convert_ids_to_tokens(token_dict['input_ids'][0])))
    print("Token (int): {}".format(token_dict['input_ids']))
    print("Type: {}".format(
        token_dict['token_type_ids']))
    print('Output Embeddings: {}\n'.format(
        model(token_dict['input_ids'])[0].shape))

Expected behavior

Expected output:

model_name: roberta-base
Token (str): ['<s>', 'Ġ(', 'Besides', 'Ġthat', 'Ġthere', 'Ġshould', 'Ġbe', 'Ġmore', 'Ġrestaurants', 'Ġlike', 'Ġit', 'Ġaround', 'Ġthe', 'Ġcity', ')', 'Ġ.', '</s>']
Token (int): tensor([[    0,    36, 41107,    14,    89,   197,    28,    55,  4329,   101,
            24,   198,     5,   343,    43,   479,     2]])
Type: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Output Embeddings: torch.Size([1, 17, 768])

model_name: distilroberta-base
Token (str): ['<s>', 'Ġ(', 'Besides', 'Ġthat', 'Ġthere', 'Ġshould', 'Ġbe', 'Ġmore', 'Ġrestaurants', 'Ġlike', 'Ġit', 'Ġaround', 'Ġthe', 'Ġcity', ')', 'Ġ.', '</s>']
Token (int): tensor([[    0,    36, 41107,    14,    89,   197,    28,    55,  4329,   101,
            24,   198,     5,   343,    43,   479,     2]])
Type: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Output Embeddings: torch.Size([1, 17, 768])


Basically, the expected behavior is to tokenize ')' and '.' separately. Furthermore, I am also curious about what these 'Ġ' characters in the RoBERTa encoding are. I checked the vocabulary and found both the plain words and words starting with this 'Ġ' character, so I am a bit confused.
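
For reference, a minimal sketch of the kind of vocabulary check I mean (it assumes a transformers version that exposes get_vocab(); the printed values are illustrative, not output from this thread):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
vocab = tokenizer.get_vocab()  # maps token string -> id

# The byte-level BPE vocabulary stores words both with and without the 'Ġ' prefix.
print('city' in vocab, 'Ġcity' in vocab)

# If ').' is itself a vocabulary entry, the tokenizer may keep ')' and '.'
# merged into one token instead of splitting them.
print(').' in vocab)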

Environment info

  • transformers version: 2.5.1
  • Platform: Windows-10-10.0.18362-SP0
  • Python version: 3.7.6
  • PyTorch version (GPU?): 1.4.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: False
  • Using distributed or parallel set-up in script?: False
Labels: Tokenization, wontfix


All 2 comments

Furthermore, I am also curious about what these 'Ġ' characters are in the RoBERTa encoding?

It's a feature of byte-level BPE (an encoded space character)
Ref-bart-fairseq, Ref-openai-gpt
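
As a small illustration of that point, the 'Ġ' prefix round-trips back to an ordinary space (a sketch assuming roberta-base; the example sentence and outputs are illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base')

# Byte-level BPE encodes the space in front of a word as the single byte 'Ġ',
# so 'Ġcity' is just ' city' with its leading space made visible.
tokens = tokenizer.tokenize('around the city')
print(tokens)                                      # e.g. ['around', 'Ġthe', 'Ġcity']
print(tokenizer.convert_tokens_to_string(tokens))  # 'around the city'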

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
