Describe the bug
While running a sequence tagger with stacked embeddings (BytePairEmbeddings and FlairEmbeddings), an error occurs:
>>>CUDA_VISIBLE_DEVICES=1 python train_seq_tagger.py
PyTorch version 1.3.1 available.
TensorFlow version 2.0.0 available.
2020-10-07 11:58:39,499 Reading data from data
2020-10-07 11:58:39,499 Train: data/train.txt
2020-10-07 11:58:39,499 Dev: data/dev.txt
2020-10-07 11:58:39,500 Test: data/test.txt
2020-10-07 12:00:20,735 ----------------------------------------------------------------------------------------------------
2020-10-07 12:00:20,736 Model: "SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.5, inplace=False)
        (encoder): Embedding(275, 100)
        (rnn): LSTM(100, 1024)
        (decoder): Linear(in_features=1024, out_features=275, bias=True)
      )
    )
    (list_embedding_1): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.5, inplace=False)
        (encoder): Embedding(275, 100)
        (rnn): LSTM(100, 1024)
        (decoder): Linear(in_features=1024, out_features=275, bias=True)
      )
    )
    (list_embedding_2): BytePairEmbeddings(model=2-bpe-custom-100000-200)
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=2448, out_features=2448, bias=True)
  (rnn): LSTM(2448, 256, num_layers=2, batch_first=True, dropout=0.5, bidirectional=True)
  (linear): Linear(in_features=512, out_features=47, bias=True)
  (beta): 1.0
  (weights): None
  (weight_tensor) None
)"
2020-10-07 12:00:20,736 ----------------------------------------------------------------------------------------------------
2020-10-07 12:00:20,736 Corpus: "Corpus: 183145 train + 24944 dev + 21721 test sentences"
2020-10-07 12:00:20,736 ----------------------------------------------------------------------------------------------------
2020-10-07 12:00:20,737 Parameters:
2020-10-07 12:00:20,737 - learning_rate: "0.1"
2020-10-07 12:00:20,737 - mini_batch_size: "32"
2020-10-07 12:00:20,737 - patience: "2"
2020-10-07 12:00:20,737 - anneal_factor: "0.5"
2020-10-07 12:00:20,737 - max_epochs: "100"
2020-10-07 12:00:20,737 - shuffle: "True"
2020-10-07 12:00:20,737 - train_with_dev: "False"
2020-10-07 12:00:20,737 - batch_growth_annealing: "False"
2020-10-07 12:00:20,737 ----------------------------------------------------------------------------------------------------
2020-10-07 12:00:20,737 Model training base path: "outputs/models/bpemb_flair"
2020-10-07 12:00:20,737 ----------------------------------------------------------------------------------------------------
2020-10-07 12:00:20,738 Device: cuda:0
2020-10-07 12:00:20,738 ----------------------------------------------------------------------------------------------------
2020-10-07 12:00:20,738 Embeddings storage mode: gpu
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.
Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
2020-10-07 12:00:20,744 ----------------------------------------------------------------------------------------------------
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0
2020-10-07 12:05:33,604 epoch 1 - iter 572/5724 - loss 3.94088415 - samples/sec: 58.52 - lr: 0.100000
Traceback (most recent call last):
  File "train_seq_tagger.py", line 104, in <module>
    trainer.train(**params["train"])
  File "/home/ccass/anaconda3/lib/python3.7/site-packages/flair/trainers/trainer.py", line 371, in train
    loss = self.model.forward_loss(batch_step)
  File "/home/ccass/anaconda3/lib/python3.7/site-packages/flair/models/sequence_tagger_model.py", line 603, in forward_loss
    features = self.forward(data_points)
  File "/home/ccass/anaconda3/lib/python3.7/site-packages/flair/models/sequence_tagger_model.py", line 608, in forward
    self.embeddings.embed(sentences)
  File "/home/ccass/anaconda3/lib/python3.7/site-packages/flair/embeddings/token.py", line 71, in embed
    embedding.embed(sentences)
  File "/home/ccass/anaconda3/lib/python3.7/site-packages/flair/embeddings/base.py", line 60, in embed
    self._add_embeddings_internal(sentences)
  File "/home/ccass/anaconda3/lib/python3.7/site-packages/flair/embeddings/token.py", line 1580, in _add_embeddings_internal
    (embeddings[0], embeddings[len(embeddings) - 1])
IndexError: index 0 is out of bounds for axis 0 with size 0
ERROR: CUDA_VISIBLE_DEVICES=1 python train_seq_tagger.py, exited with 1
To Reproduce
Run the sequence tagger trainer with stacked FlairEmbeddings and custom BytePairEmbeddings, as in the sketch below.
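A minimal reproduction sketch; the column format, tag type, and the custom Flair LM checkpoint paths are assumptions, while the corpus layout, hyperparameters, and BytePairEmbeddings arguments mirror the log above:

from flair.datasets import ColumnCorpus
from flair.embeddings import BytePairEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# corpus layout as in the log: data/train.txt, data/dev.txt, data/test.txt
corpus = ColumnCorpus("data", {0: "text", 1: "ner"},
                      train_file="train.txt", dev_file="dev.txt", test_file="test.txt")
tag_dictionary = corpus.make_tag_dictionary(tag_type="ner")

# two custom Flair language models stacked with the custom byte-pair embeddings
embeddings = StackedEmbeddings([
    FlairEmbeddings("path/to/lm-forward.pt"),   # placeholder checkpoint
    FlairEmbeddings("path/to/lm-backward.pt"),  # placeholder checkpoint
    BytePairEmbeddings(model_file_path="./m.model",
                       embedding_file_path="bpemb-vocab100k-dim200.w2v"),
])

tagger = SequenceTagger(hidden_size=256, embeddings=embeddings,
                        tag_dictionary=tag_dictionary, tag_type="ner")

trainer = ModelTrainer(tagger, corpus)
trainer.train("outputs/models/bpemb_flair",
              learning_rate=0.1, mini_batch_size=32, max_epochs=100,
              embeddings_storage_mode="gpu",  # matches the log above
              use_amp=True)                   # apex O1 output in the log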
Expected behavior
The sequence tagger trains to completion without errors.
Environment (from the log above): Python 3.7 (Anaconda), PyTorch 1.3.1, TensorFlow 2.0.0, NVIDIA apex mixed precision (opt level O1), single CUDA GPU.
Additional context
from flair.data import Sentence
from flair.embeddings import BytePairEmbeddings

embedding = BytePairEmbeddings(model_file_path='./m.model', embedding_file_path='bpemb-vocab100k-dim200.w2v')

# create a sentence
sentence = Sentence('M. Tutu a été présent le 31 octobre 2020 à la séance de yoga.')

# embed words in sentence
embedding.embed(sentence)
What could be wrong with the custom BytePairEmbeddings?
Hello Urszula,
I encountered the same issue while using custom BytePairEmbeddings and found some insights about it; see below.
Bug
https://github.com/flairNLP/flair/blob/master/flair/embeddings/token.py
l. 1745
For some tokens, self.embedder.embed(word.lower()) returns an empty array, which then raises the IndexError.
Additional context
The likely reason is the normalization rule of the underlying sentencepiece model used for subword tokenization: nmt_nfkc. It deletes some whitespace/invalid characters while tokenizing. More details can be found at https://github.com/google/sentencepiece/blob/master/doc/normalization.md
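As a side note, you can check which normalization rule a sentencepiece model was trained with by parsing its serialized proto. This is a hedged sketch: the model path is a placeholder and it assumes a sentencepiece build that ships sentencepiece_model_pb2:

import sentencepiece.sentencepiece_model_pb2 as model_pb2

# Read the serialized model and print its normalizer spec.
# nmt_nfkc (the sentencepiece training default) deletes some
# whitespace/control characters; plain nfkc keeps them.
proto = model_pb2.ModelProto()
with open("sentencepiece.model", "rb") as f:
    proto.ParseFromString(f.read())
print(proto.normalizer_spec.name)  # e.g. "nmt_nfkc" or "nfkc"

These two different schemes therefore give different results for some tokens: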
# This embeds a BPEmb model whose underlying sentencepiece tokenization uses nmt_nfkc normalization
>>> bpe_custom = BytePairEmbeddings(model_file_path='sentencepiece.model', embedding_file_path='embeddings.bin')
>>> bpe_custom.embedder.spm.encode("�", out_type=str)
[]
>>> bpe_custom.embedder.embed("�")
array([], shape=(0, 50), dtype=float32)
>>> bpe_custom.embedder.spm.encode("\n", out_type=str)
[]
>>> bpe_custom.embedder.embed("\n")
array([], shape=(0, 50), dtype=float32)
VS
# This embeds a BPEmb model whose underlying sentencepiece tokenization uses nfkc normalization
>>> bpe_fr = BytePairEmbeddings('fr')
>>> bpe_fr.embedder.spm.encode("�", out_type=str)
['▁', '�']
>>> bpe_fr.embedder.embed("�")
array([[ 0.863682, 0.623915, -0.255492, 1.228884, -0.246349, -0.235584,
0.924933, 1.468551, -1.046001, -0.313229, 0.924974, -0.26374 ,
-0.215517, 0.310154, -0.281002, 0.127435, 0.297852, -1.035336,
0.656995, 0.740548, 0.324117, 0.571423, -0.735685, 0.262373,
0.174549, -0.070397, -0.137978, 0.774121, -0.859513, 0.846455,
-0.30908 , -0.048569, 0.431066, 0.530602, 0.025365, 0.018068,
-0.215856, 0.038948, -0.724266, 0.74875 , 0.269831, -0.273661,
0.426436, 0.597654, 0.568705, -0.111608, -0.125169, 0.067656,
0.385495, 0.18757 ],
[ 0.979594, 0.57784 , -0.222435, 1.486768, -0.380972, -0.35193 ,
0.901553, 2.116044, -1.18345 , -0.272132, 0.808096, -0.297339,
-0.288387, 0.523385, -0.516331, 0.409378, -0.363651, -0.650074,
0.860095, 0.524136, 0.130684, 0.801779, -0.371839, 0.486923,
-0.213825, 0.155632, 0.054518, 1.182699, -0.681333, 0.921612,
-0.430549, -0.413449, 0.555705, 0.517503, 0.166901, 0.01226 ,
-0.426171, 0.016401, -1.095436, 0.761773, 0.123491, -0.225711,
0.342072, 0.871307, 0.517205, -0.289836, -0.101698, -0.039496,
0.589295, 0.276277]], dtype=float32)
>>> bpe_fr.embedder.spm.encode("\n", out_type=str)
['▁', '\n']
>>> bpe_fr.embedder.embed("\n")
array([[ 0.863682, 0.623915, -0.255492, 1.228884, -0.246349, -0.235584,
0.924933, 1.468551, -1.046001, -0.313229, 0.924974, -0.26374 ,
-0.215517, 0.310154, -0.281002, 0.127435, 0.297852, -1.035336,
0.656995, 0.740548, 0.324117, 0.571423, -0.735685, 0.262373,
0.174549, -0.070397, -0.137978, 0.774121, -0.859513, 0.846455,
-0.30908 , -0.048569, 0.431066, 0.530602, 0.025365, 0.018068,
-0.215856, 0.038948, -0.724266, 0.74875 , 0.269831, -0.273661,
0.426436, 0.597654, 0.568705, -0.111608, -0.125169, 0.067656,
0.385495, 0.18757 ],
[ 0.979594, 0.57784 , -0.222435, 1.486768, -0.380972, -0.35193 ,
0.901553, 2.116044, -1.18345 , -0.272132, 0.808096, -0.297339,
-0.288387, 0.523385, -0.516331, 0.409378, -0.363651, -0.650074,
0.860095, 0.524136, 0.130684, 0.801779, -0.371839, 0.486923,
-0.213825, 0.155632, 0.054518, 1.182699, -0.681333, 0.921612,
-0.430549, -0.413449, 0.555705, 0.517503, 0.166901, 0.01226 ,
-0.426171, 0.016401, -1.095436, 0.761773, 0.123491, -0.225711,
0.342072, 0.871307, 0.517205, -0.289836, -0.101698, -0.039496,
0.589295, 0.276277]], dtype=float32)
As you can see, bpe_custom.embedder.embed can return an empty embedding array.
I haven't tested the behavior with other characters and tokens.
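Here is a minimal sketch of the failing pattern; this is not the flair code itself, just the same indexing applied to an empty array:

import numpy as np

# An empty embedding result, as returned by embedder.embed for a token
# that normalizes away under nmt_nfkc:
embeddings = np.zeros((0, 50), dtype=np.float32)

# flair then concatenates the first and last subword vectors, which
# reproduces the IndexError from the traceback above:
np.concatenate((embeddings[0], embeddings[len(embeddings) - 1]))
# IndexError: index 0 is out of bounds for axis 0 with size 0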
Temporary fix
l. 1738
To set the embeddings to zero for these tokens, you can replace:
if word.strip() == "":
with
if word.strip() == "" or self.embedder.encode(word) == []:
Thank you @elliotbart, I will check it out!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.