Transformers: Tokenization issue with RoBERTa and DistilRoBERTa.

Created on 20 Apr 2020 · 2 comments · Source: huggingface/transformers

🐛 Bug

Information

Model I am using (Bert, XLNet ...):
RoBERTa (roberta-base), DistilRoBERTa (distilroberta-base)
Language I am using the model on (English, Chinese ...):
English
The problem arises when using:

  • [ ] the official example scripts: (give details below)
  • [x] my own modified scripts: (give details below)

I am trying to encode embeddings for sentences, and I found a tokenization issue with a certain type of sentence that ends with ').'. The tokenizer does not separate ')' from '.', which in turn causes issues with the resulting sentence length.

The task I am working on is:

  • [ ] an official GLUE/SQuAD task: (give the name)
  • [x] my own task or dataset: (give details below)

Dataset: SemEval 2016 Task 5, SB1 EN-REST

To reproduce

Steps to reproduce the behavior:

See the following code:

import torch
from transformers import AutoModel, AutoTokenizer

# Sentence ending with ").": the tokenizer keeps ')' and '.' together.
text = '(Besides that there should be more restaurants like it around the city).'
for model_name in ['roberta-base', 'distilroberta-base']:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    token_dict = tokenizer.encode_plus(text, None, return_tensors='pt')

    print('model_name: {}'.format(model_name))
    print("Token (str): {}".format(
        tokenizer.convert_ids_to_tokens(token_dict['input_ids'][0])))
    print("Token (int): {}".format(token_dict['input_ids']))
    print("Type: {}".format(
        token_dict['token_type_ids']))
    print('Output Embeddings: {}\n'.format(
        model(token_dict['input_ids'])[0].shape))

Expected behavior

Expected output:

model_name: roberta-base
Token (str): ['<s>', 'Ġ(', 'Besides', 'Ġthat', 'Ġthere', 'Ġshould', 'Ġbe', 'Ġmore', 'Ġrestaurants', 'Ġlike', 'Ġit', 'Ġaround', 'Ġthe', 'Ġcity', ')', 'Ġ.', '</s>']
Token (int): tensor([[    0,    36, 41107,    14,    89,   197,    28,    55,  4329,   101,
            24,   198,     5,   343,    43,   479,     2]])
Type: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Output Embeddings: torch.Size([1, 17, 768])

model_name: distilroberta-base
Token (str): ['<s>', 'Ġ(', 'Besides', 'Ġthat', 'Ġthere', 'Ġshould', 'Ġbe', 'Ġmore', 'Ġrestaurants', 'Ġlike', 'Ġit', 'Ġaround', 'Ġthe', 'Ġcity', ')', 'Ġ.', '</s>']
Token (int): tensor([[    0,    36, 41107,    14,    89,   197,    28,    55,  4329,   101,
            24,   198,     5,   343,    43,   479,     2]])
Type: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Output Embeddings: torch.Size([1, 17, 768])


Basically, the expected behavior is to tokenize ')' and '.' separately. Furthermore, I am also curious about what these 'Ġ' characters in the RoBERTa encoding are. I checked the vocabulary and found both the plain words and words starting with this 'Ġ' character, so I am a bit confused.
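
For reference, a minimal sketch of the kind of vocabulary check I mean (it assumes a transformers version that exposes get_vocab(); the printed values are illustrative, not output from this thread):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
vocab = tokenizer.get_vocab()  # maps token string -> id

# The byte-level BPE vocabulary stores words both with and without the 'Ġ' prefix.
print('city' in vocab, 'Ġcity' in vocab)

# If ').' is itself a vocabulary entry, the tokenizer may keep ')' and '.'
# merged into one token instead of splitting them.
print(').' in vocab)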

Environment info

  • transformers version: 2.5.1
  • Platform: Windows-10-10.0.18362-SP0
  • Python version: 3.7.6
  • PyTorch version (GPU?): 1.4.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: False
  • Using distributed or parallel set-up in script?: False
Labels: Tokenization, wontfix


All 2 comments

Furthermore, I am also curious about what these 'Ġ' characters are in the RoBERTa encoding?

It's a feature of byte-level BPE (an encoded space character)
Ref-bart-fairseq, Ref-openai-gpt
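
As a small illustration of that point, the 'Ġ' prefix round-trips back to an ordinary space (a sketch assuming roberta-base; the example sentence and outputs are illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base')

# Byte-level BPE encodes the space in front of a word as the single byte 'Ġ',
# so 'Ġcity' is just ' city' with its leading space made visible.
tokens = tokenizer.tokenize('around the city')
print(tokens)                                      # e.g. ['around', 'Ġthe', 'Ġcity']
print(tokenizer.convert_tokens_to_string(tokens))  # 'around the city'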

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
