Flair: BPEmbeddings with sequence tagger

Created on 7 Oct 2020  ·  3 Comments  ·  Source: flairNLP/flair

Describe the bug
While running the sequence tagger with stacked embeddings (BytePairEmbeddings and FlairEmbeddings), the following error occurs:

>>>CUDA_VISIBLE_DEVICES=1 python train_seq_tagger.py

PyTorch version 1.3.1 available.
TensorFlow version 2.0.0 available.
2020-10-07 11:58:39,499 Reading data from data
2020-10-07 11:58:39,499 Train: data/train.txt
2020-10-07 11:58:39,499 Dev: data/dev.txt
2020-10-07 11:58:39,500 Test: data/test.txt
2020-10-07 12:00:20,735 ----------------------------------------------------------------------------------------------------
2020-10-07 12:00:20,736 Model: "SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.5, inplace=False)
        (encoder): Embedding(275, 100)
        (rnn): LSTM(100, 1024)
        (decoder): Linear(in_features=1024, out_features=275, bias=True)
      )
    )
    (list_embedding_1): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.5, inplace=False)
        (encoder): Embedding(275, 100)
        (rnn): LSTM(100, 1024)
        (decoder): Linear(in_features=1024, out_features=275, bias=True)
      )
    )
    (list_embedding_2): BytePairEmbeddings(model=2-bpe-custom-100000-200)
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=2448, out_features=2448, bias=True)
  (rnn): LSTM(2448, 256, num_layers=2, batch_first=True, dropout=0.5, bidirectional=True)
  (linear): Linear(in_features=512, out_features=47, bias=True)
  (beta): 1.0
  (weights): None
  (weight_tensor) None
)"
2020-10-07 12:00:20,736 ----------------------------------------------------------------------------------------------------
2020-10-07 12:00:20,736 Corpus: "Corpus: 183145 train + 24944 dev + 21721 test sentences"
2020-10-07 12:00:20,736 ----------------------------------------------------------------------------------------------------
2020-10-07 12:00:20,737 Parameters:
2020-10-07 12:00:20,737  - learning_rate: "0.1"
2020-10-07 12:00:20,737  - mini_batch_size: "32"
2020-10-07 12:00:20,737  - patience: "2"
2020-10-07 12:00:20,737  - anneal_factor: "0.5"
2020-10-07 12:00:20,737  - max_epochs: "100"
2020-10-07 12:00:20,737  - shuffle: "True"
2020-10-07 12:00:20,737  - train_with_dev: "False"
2020-10-07 12:00:20,737  - batch_growth_annealing: "False"
2020-10-07 12:00:20,737 ----------------------------------------------------------------------------------------------------
2020-10-07 12:00:20,737 Model training base path: "outputs/models/bpemb_flair"
2020-10-07 12:00:20,737 ----------------------------------------------------------------------------------------------------
2020-10-07 12:00:20,738 Device: cuda:0
2020-10-07 12:00:20,738 ----------------------------------------------------------------------------------------------------
2020-10-07 12:00:20,738 Embeddings storage mode: gpu
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
2020-10-07 12:00:20,744 ----------------------------------------------------------------------------------------------------
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2048.0
2020-10-07 12:05:33,604 epoch 1 - iter 572/5724 - loss 3.94088415 - samples/sec: 58.52 - lr: 0.100000
Traceback (most recent call last):
  File "train_seq_tagger.py", line 104, in <module>
    trainer.train(**params["train"])
  File "/home/ccass/anaconda3/lib/python3.7/site-packages/flair/trainers/trainer.py", line 371, in train
    loss = self.model.forward_loss(batch_step)
  File "/home/ccass/anaconda3/lib/python3.7/site-packages/flair/models/sequence_tagger_model.py", line 603, in forward_loss
    features = self.forward(data_points)
  File "/home/ccass/anaconda3/lib/python3.7/site-packages/flair/models/sequence_tagger_model.py", line 608, in forward
    self.embeddings.embed(sentences)
  File "/home/ccass/anaconda3/lib/python3.7/site-packages/flair/embeddings/token.py", line 71, in embed
    embedding.embed(sentences)
  File "/home/ccass/anaconda3/lib/python3.7/site-packages/flair/embeddings/base.py", line 60, in embed
    self._add_embeddings_internal(sentences)
  File "/home/ccass/anaconda3/lib/python3.7/site-packages/flair/embeddings/token.py", line 1580, in _add_embeddings_internal
    (embeddings[0], embeddings[len(embeddings) - 1])
IndexError: index 0 is out of bounds for axis 0 with size 0
ERROR: CUDA_VISIBLE_DEVICES=1 python train_seq_tagger.py, exited with 1

To Reproduce
Run the sequence tagger trainer with stacked FlairEmbeddings and custom BytePairEmbeddings.

Expected behavior
The sequence tagger trains without errors.

Environment (please complete the following information):

Additional context

  • The custom vectors work correctly when used on their own:
from flair.embeddings import BytePairEmbeddings
embedding = BytePairEmbeddings(model_file_path='./m.model', embedding_file_path='bpemb-vocab100k-dim200.w2v')
from flair.data import Sentence
# create a sentence
sentence = Sentence('M. Tutu a été présent le 31 octobre 2020 à la séance de yoga.')

# embed words in sentence
embedding.embed(sentence)
  • The code for sequence tagger was previously tested with other embeddings (fasttext) and worked correctly.
  • Running the same code with BytePairEmbeddings(language='fr') does not throw any error.

What could be wrong with the tuned BytePairEmbeddings?

bug wontfix

Most helpful comment

Hello Urszula,

I encountered the same issue while using custom BytePairEmbeddings, and found some insights about the issue, see below.

Bug

https://github.com/flairNLP/flair/blob/master/flair/embeddings/token.py

l. 1745
For some tokens, self.embedder.embed(word.lower()) returns an empty array, and indexing into it then raises the IndexError.

Additional context

The likely reason for that is the normalization rules of the underlying sentencepiece model for subword tokenization:

  • In previous versions of sentencepiece (the ones used for pre-defined BPEmb languages offered by Flair), the normalization scheme was nfkc.
  • In the current version of sentencepiece (likely the one you use to train your custom BPEmb), the normalization scheme is nmt_nfkc. It deletes some "whitespace/invalid" characters while tokenizing.
    More details can be found at https://github.com/google/sentencepiece/blob/master/doc/normalization.md
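
As a rough illustration using only Python's standard library (not sentencepiece itself): plain Unicode NFKC leaves both characters tested in this comment untouched, which is consistent with the nfkc-based model still producing pieces for them. This assumes the "�" shown here is U+FFFD; the extra removal rules of nmt_nfkc are sentencepiece-specific and are not reproduced in this sketch.

```python
import unicodedata

# The two characters probed in the sessions in this comment
# (assumption: the "�" is U+FFFD, the Unicode replacement character).
samples = ["\n", "\ufffd"]

for ch in samples:
    normalized = unicodedata.normalize("NFKC", ch)
    # Plain NFKC has no mapping that deletes either character.
    print(repr(ch), "->", repr(normalized), "unchanged:", normalized == ch)
```

So if these characters vanish during tokenization, the cause would have to be the additional rules layered on top of NFKC rather than NFKC itself.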

These two different schemes therefore give different results for some tokens:

# This embeds with a BPEmb model whose underlying sentencepiece tokenizer uses nmt_nfkc normalization
>>> bpe_custom = BytePairEmbeddings(model_file_path='sentencepiece.model', embedding_file_path='embeddings.bin')
>>> bpe_custom.embedder.spm.encode("�", out_type=str)
[]
>>> bpe_custom.embedder.embed("�")
array([], shape=(0, 50), dtype=float32)
>>> bpe_custom.embedder.spm.encode("\n", out_type=str)
[]
>>> bpe_custom.embedder.embed("\n")
array([], shape=(0, 50), dtype=float32)

VS

# This embeds with a BPEmb model whose underlying sentencepiece tokenizer uses nfkc normalization
>>> bpe_fr = BytePairEmbeddings('fr')
>>> bpe_fr.embedder.spm.encode("�", out_type=str)
['▁', '�']
>>> bpe_fr.embedder.embed("�")
array([[ 0.863682,  0.623915, -0.255492,  1.228884, -0.246349, -0.235584,
         0.924933,  1.468551, -1.046001, -0.313229,  0.924974, -0.26374 ,
        -0.215517,  0.310154, -0.281002,  0.127435,  0.297852, -1.035336,
         0.656995,  0.740548,  0.324117,  0.571423, -0.735685,  0.262373,
         0.174549, -0.070397, -0.137978,  0.774121, -0.859513,  0.846455,
        -0.30908 , -0.048569,  0.431066,  0.530602,  0.025365,  0.018068,
        -0.215856,  0.038948, -0.724266,  0.74875 ,  0.269831, -0.273661,
         0.426436,  0.597654,  0.568705, -0.111608, -0.125169,  0.067656,
         0.385495,  0.18757 ],
       [ 0.979594,  0.57784 , -0.222435,  1.486768, -0.380972, -0.35193 ,
         0.901553,  2.116044, -1.18345 , -0.272132,  0.808096, -0.297339,
        -0.288387,  0.523385, -0.516331,  0.409378, -0.363651, -0.650074,
         0.860095,  0.524136,  0.130684,  0.801779, -0.371839,  0.486923,
        -0.213825,  0.155632,  0.054518,  1.182699, -0.681333,  0.921612,
        -0.430549, -0.413449,  0.555705,  0.517503,  0.166901,  0.01226 ,
        -0.426171,  0.016401, -1.095436,  0.761773,  0.123491, -0.225711,
         0.342072,  0.871307,  0.517205, -0.289836, -0.101698, -0.039496,
         0.589295,  0.276277]], dtype=float32)
>>> bpe_fr.embedder.spm.encode("\n", out_type=str)
['▁', '\n']
>>> bpe_fr.embedder.embed("\n")
array([[ 0.863682,  0.623915, -0.255492,  1.228884, -0.246349, -0.235584,
         0.924933,  1.468551, -1.046001, -0.313229,  0.924974, -0.26374 ,
        -0.215517,  0.310154, -0.281002,  0.127435,  0.297852, -1.035336,
         0.656995,  0.740548,  0.324117,  0.571423, -0.735685,  0.262373,
         0.174549, -0.070397, -0.137978,  0.774121, -0.859513,  0.846455,
        -0.30908 , -0.048569,  0.431066,  0.530602,  0.025365,  0.018068,
        -0.215856,  0.038948, -0.724266,  0.74875 ,  0.269831, -0.273661,
         0.426436,  0.597654,  0.568705, -0.111608, -0.125169,  0.067656,
         0.385495,  0.18757 ],
       [ 0.979594,  0.57784 , -0.222435,  1.486768, -0.380972, -0.35193 ,
         0.901553,  2.116044, -1.18345 , -0.272132,  0.808096, -0.297339,
        -0.288387,  0.523385, -0.516331,  0.409378, -0.363651, -0.650074,
         0.860095,  0.524136,  0.130684,  0.801779, -0.371839,  0.486923,
        -0.213825,  0.155632,  0.054518,  1.182699, -0.681333,  0.921612,
        -0.430549, -0.413449,  0.555705,  0.517503,  0.166901,  0.01226 ,
        -0.426171,  0.016401, -1.095436,  0.761773,  0.123491, -0.225711,
         0.342072,  0.871307,  0.517205, -0.289836, -0.101698, -0.039496,
         0.589295,  0.276277]], dtype=float32)

As you can see, bpe_custom.embedder.embed can return an empty array of embeddings.

I haven't tested the behavior with other characters and tokens.
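
To find other affected tokens before training, one option is to scan the corpus vocabulary for tokens that encode to zero subword pieces. The sketch below is only an illustration: find_unencodable and toy_encode are hypothetical names, and toy_encode is a stand-in that mimics a tokenizer dropping U+FFFD and whitespace, not the real sentencepiece behavior.

```python
from typing import Callable, Iterable, List, Set


def find_unencodable(tokens: Iterable[str],
                     encode: Callable[[str], List[str]]) -> Set[str]:
    """Return non-blank tokens for which `encode` yields no subword pieces."""
    return {
        tok for tok in tokens
        # Blank tokens are skipped: the existing word.strip() check covers them.
        if tok.strip() != "" and len(encode(tok.lower())) == 0
    }


def toy_encode(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: drops U+FFFD and whitespace characters."""
    return [ch for ch in text if ch != "\ufffd" and not ch.isspace()]


suspects = find_unencodable(["Tutu", "\ufffd", "yoga", " "], toy_encode)
print(suspects)
```

With a real model, the encode callable could presumably be bpe_custom.embedder.encode, the same method used in the temporary fix in this comment.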

Temporary fix:

l. 1738

To set the embeddings to zero for these tokens, you can replace:

if word.strip() == "":

with

if word.strip() == "" or self.embedder.encode(word) == []:
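
The same idea can be sketched outside Flair. Judging by the traceback line (embeddings[0], embeddings[len(embeddings) - 1]), the token vector appears to be built from the first and last subword vectors, so an empty result needs a fallback. The function below is a hypothetical, simplified illustration that uses plain Python lists in place of numpy arrays; it is not Flair's actual code.

```python
from typing import List, Sequence


def first_last_or_zeros(subword_vectors: Sequence[Sequence[float]],
                        dim: int) -> List[float]:
    """Concatenate the first and last subword vectors.

    Falls back to a zero vector of length 2 * dim when no subword
    pieces survived tokenization, instead of raising IndexError.
    """
    if len(subword_vectors) == 0:
        return [0.0] * (2 * dim)
    return list(subword_vectors[0]) + list(subword_vectors[-1])


# Empty input (the failing case) now yields zeros.
print(first_last_or_zeros([], dim=3))
# Normal input: first and last vectors are concatenated.
print(first_last_or_zeros([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], dim=3))
```

Checking the length of the embedder's output, rather than the token text, has the advantage of covering any character the tokenizer drops, not only the ones identified so far.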

Environment

  • Flair version: 0.7
  • Python version: 3.7.9
  • PyTorch version: 1.7.0 + CUDA 9.2
  • Using GPU in script?: True
  • Using distributed or parallel set-up in script?: No

All 3 comments

Thank you @elliotbart, I will check it out !

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
