Model I am using (Bert, XLNet ...):
RoBERTa (roberta-base), DistilRoBERTa (distilroberta-base)
Language I am using the model on (English, Chinese ...):
English
The problem arises when using:
I am trying to encode sentence embeddings, and I found a tokenization issue with a certain type of sentence, namely one that ends with ").". The tokenizer does not split ')' from '.', which in turn causes issues with the sentence length.
The task I am working on is:
Dataset: SemEval 2016 Task 5, SB1 EN-REST
Steps to reproduce the behavior:
See the following code:
import torch
from transformers import AutoModel, AutoTokenizer

text = '(Besides that there should be more restaurants like it around the city).'

for model_name in ['roberta-base', 'distilroberta-base']:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    # Encode the sentence and inspect the resulting tokens, ids, and shapes.
    token_dict = tokenizer.encode_plus(text, None, return_tensors='pt')
    print('model_name: {}'.format(model_name))
    print('Token (str): {}'.format(
        tokenizer.convert_ids_to_tokens(token_dict['input_ids'][0])))
    print('Token (int): {}'.format(token_dict['input_ids']))
    print('Type: {}'.format(token_dict['token_type_ids']))
    print('Output Embeddings: {}\n'.format(
        model(token_dict['input_ids'])[0].shape))
Expected output:
model_name: roberta-base
Token (str): ['<s>', 'Ġ(', 'Besides', 'Ġthat', 'Ġthere', 'Ġshould', 'Ġbe', 'Ġmore', 'Ġrestaurants', 'Ġlike', 'Ġit', 'Ġaround', 'Ġthe', 'Ġcity', ')', 'Ġ.', '</s>']
Token (int): tensor([[ 0, 36, 41107, 14, 89, 197, 28, 55, 4329, 101,
24, 198, 5, 343, 43, 479, 2]])
Type: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Output Embeddings: torch.Size([1, 17, 768])
model_name: distilroberta-base
Token (str): ['<s>', 'Ġ(', 'Besides', 'Ġthat', 'Ġthere', 'Ġshould', 'Ġbe', 'Ġmore', 'Ġrestaurants', 'Ġlike', 'Ġit', 'Ġaround', 'Ġthe', 'Ġcity', ')', 'Ġ.', '</s>']
Token (int): tensor([[ 0, 36, 41107, 14, 89, 197, 28, 55, 4329, 101,
24, 198, 5, 343, 43, 479, 2]])
Type: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Output Embeddings: torch.Size([1, 17, 768])
Basically, the expected behavior is that ')' and '.' are tokenized separately. Furthermore, I am also curious about what these 'Ġ' characters are in the RoBERTa encoding. I checked the vocabulary and found both plain words and words prefixed with this 'Ġ' character, so I am a bit confused.
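To look at the problematic suffix in isolation (a minimal sketch, assuming the same roberta-base checkpoint as above; only the tokenizer is needed, no model):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base')

# Tokenize just the ")." ending, with and without a preceding word,
# to check whether ')' and '.' come out as separate tokens.
print(tokenizer.tokenize('city).'))
print(tokenizer.tokenize('(around the city).'))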
transformers version: 2.5.1

> Furthermore, I am also curious about what these 'Ġ' characters are in the RoBERTa encoding?
It's a feature of byte-level BPE (an encoded space character).
Refs: bart-fairseq, openai-gpt
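A quick way to see this (a minimal sketch, again assuming the roberta-base checkpoint from above; GPT-2's byte-to-unicode table shifts non-printable bytes into a printable range, and the space byte 0x20 happens to map to chr(0x20 + 0x100), which is 'Ġ'):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base')

# 'Ġworld' means "world preceded by a space"; a word at the start of the
# string has no leading space and thus no 'Ġ' prefix. That is why the
# vocabulary contains both plain and 'Ġ'-prefixed variants of the same word.
print(tokenizer.tokenize('Hello world'))  # ['Hello', 'Ġworld']

# The marker is the space byte (0x20) shifted into the printable range.
print(ord('Ġ') - 0x100 == ord(' '))  # True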