Flair: Is Glove embedding going to be updated

Created on 27 Mar 2019 · 11 comments · Source: flairNLP/flair

From the following code, I'm not sure whether the GloVe embedding is going to be updated during training or simply stay as it is.

from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings

# create a StackedEmbedding object that combines glove and forward/backward flair embeddings
stacked_embeddings = StackedEmbeddings([
                                        WordEmbeddings('glove'), 
                                        FlairEmbeddings('news-forward'), 
                                        FlairEmbeddings('news-backward'),
                                       ])
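(A quick way to check this directly, assuming the stacked_embeddings object built above: flair embedding classes are torch.nn.Module subclasses, so you can list whatever is registered as a trainable parameter. As explained in the answers below, the GloVe vectors are kept in a plain gensim lookup table rather than a registered parameter.)

# List every registered parameter and whether it requires gradients.
# If the GloVe vectors live only in a gensim lookup table (as described
# in the answers below), nothing GloVe-related will appear here.
for name, param in stacked_embeddings.named_parameters():
    print(name, tuple(param.shape), param.requires_grad)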
question

All 11 comments

No, it is fixed.

Only CharacterEmbeddings (here) are updated, as proposed by Lample et al. :)
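For context, the Lample et al. style character features look roughly like this: a trainable character embedding table plus a small character-level BiLSTM, all of whose weights are updated during training. This is a sketch with illustrative hyper-parameters, not flair's exact implementation.

import torch
import torch.nn as nn

class CharBiLSTMEmbedding(nn.Module):
    """Sketch of Lample-style character features: a trainable char embedding
    table plus a char-level BiLSTM; all weights here are updated during
    training, unlike frozen pre-trained word vectors."""

    def __init__(self, char_vocab_size, char_dim=25, hidden_dim=25):
        super().__init__()
        self.char_embed = nn.Embedding(char_vocab_size, char_dim)
        self.char_lstm = nn.LSTM(char_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, char_ids):                     # char_ids: (num_words, max_word_len)
        embedded = self.char_embed(char_ids)         # (num_words, max_word_len, char_dim)
        _, (h_n, _) = self.char_lstm(embedded)       # h_n: (2, num_words, hidden_dim)
        # concatenate final forward and backward states -> one vector per word
        return torch.cat([h_n[0], h_n[1]], dim=-1)   # (num_words, 2 * hidden_dim)

char_ids = torch.randint(0, 60, (5, 12))             # 5 words, up to 12 characters each
print(CharBiLSTMEmbedding(char_vocab_size=60)(char_ids).shape)   # torch.Size([5, 50])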

So are CharacterEmbeddings also used to reproduce the numbers for the CoNLL-2003 NER task?

Hello @allanj the current best known configuration for CoNLL NER is listed here and uses only pooled flair embeddings and glove embeddings, i.e. no CharacterEmbeddings. In our COLING paper, we evaluated different settings and found that the CharacterEmbeddings are not really necessary when already using FlairEmbeddings.
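For reference, a training sketch along the lines of that configuration (corpus is assumed to be a pre-loaded CoNLL-03 corpus object, and the hyper-parameters here are placeholders; the linked configuration page has the exact values):

from flair.embeddings import WordEmbeddings, PooledFlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# GloVe + pooled contextual string embeddings, as in the configuration above
embeddings = StackedEmbeddings([
    WordEmbeddings('glove'),
    PooledFlairEmbeddings('news-forward'),
    PooledFlairEmbeddings('news-backward'),
])

# 'corpus' is assumed to be a pre-loaded CoNLL-03 corpus object
tag_dictionary = corpus.make_tag_dictionary(tag_type='ner')

tagger = SequenceTagger(hidden_size=256,             # placeholder hyper-parameter
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type='ner',
                        use_crf=True)

trainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/conll03-ner', max_epochs=150)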

W.r.t. updating embeddings: the base GloVe and Flair embeddings never get updated, but by default we put a fully connected layer on top of the embedding layer before passing the embeddings into the RNN. This 'reprojection' layer may function similarly to updating embeddings, since it takes the original embeddings in and outputs a modified version.

What does it mean to have a fully connected layer on top of the embedding?

fully connected layer on top of the embedding layer before passing the embeddings into the RNN

I thought that after we have the contextual embedding of each word, we feed it into the BiLSTM and then a CRF layer?

@allanj I don't speak for the official team, but I have read the code. In my opinion, both the CharacterEmbeddings and the other embeddings are fixed; before we feed the embeddings into the network, they pass through a Linear layer, so we can treat the output of that Linear layer as the representation of the word.

I see. I think I understand what you mean. But it is the same architecture as traditional BiLSTM-CRF, am I right?

@allanj Yeah, the default architecture is embedding layer + BiLSTM (1 layer) + CRF, with the Viterbi algorithm implemented in the decoding step.
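For readers new to that last step, here is a minimal, self-contained Viterbi decoding sketch for a linear-chain CRF. It is illustrative only, not flair's actual implementation; a real CRF also handles start/stop transitions and batching.

import torch

def viterbi_decode(emissions, transitions):
    """emissions:   (seq_len, num_tags) per-token scores from the BiLSTM
    transitions: (num_tags, num_tags) score of moving from tag i to tag j
    Returns the highest-scoring tag sequence."""
    seq_len, num_tags = emissions.shape
    score = emissions[0]                    # best score ending in each tag at step 0
    backpointers = []
    for t in range(1, seq_len):
        # previous score + transition + current emission, broadcast over tag pairs
        total = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, best_prev = total.max(dim=0)
        backpointers.append(best_prev)
    # follow the backpointers from the best final tag
    best_tag = int(score.argmax())
    best_path = [best_tag]
    for best_prev in reversed(backpointers):
        best_tag = int(best_prev[best_tag])
        best_path.append(best_tag)
    return list(reversed(best_path))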

After reading the code, I think I understand what @alanakbik is saying:
the architecture should be:
embedding layer (fixed, with pre-trained contextualized embeddings) + fully connected layer for each word + BiLSTM + CRF

It seems this kind of setting is not mentioned in either the paper or the supplementary material.
I'm trying to reproduce the reported 93 (F1 on CoNLL) using the Flair embeddings offline (e.g., using Flair embeddings with our own BiLSTM-CRF code). I would appreciate it if you could point out any specific configurations I should be aware of.
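For anyone attempting the same, one way to pull the fixed embeddings out of flair and feed them to external BiLSTM-CRF code looks roughly like this (a sketch; the sentence text and the particular stacking are just examples):

from flair.data import Sentence
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
import torch

stacked_embeddings = StackedEmbeddings([
    WordEmbeddings('glove'),
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
])

sentence = Sentence('George Washington went to Washington .')
stacked_embeddings.embed(sentence)

# one fixed vector per token, to be fed into your own (trainable)
# linear reprojection + BiLSTM-CRF
token_vectors = torch.stack([token.embedding for token in sentence])
print(token_vectors.shape)   # (num_tokens, embedding_dim)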

@allanj @Huijun-Cui sorry for the delayed response (still travelling) but you are correct: By default we put a fully connected layer on each embedding. The motivation here is that most implementations use standard word embeddings to initialize the embedding layer, i.e. the linear map that takes a one-hot encoded word and produces an embedding. So in most implementations this embedding layer is fine-tuned on the downstream task. In our implementation, we instead do a simple lookup in Gensim for the word embedding. This means that there is no linear map and so no fine-tuning is possible here. To address this, we add a fully connected layer on top that is trainable to achieve a similar effect. Hope this clarifies!
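To make that contrast concrete, a minimal sketch (not flair's actual code; pretrained_vectors below is a random stand-in for a real GloVe matrix):

import torch
import torch.nn as nn

# stand-in for a real pre-trained matrix (e.g. GloVe), shape (vocab_size, dim)
pretrained_vectors = torch.randn(10000, 100)

# (a) common approach: an embedding layer initialised from pre-trained vectors
#     and fine-tuned together with the rest of the model
finetuned = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)

# (b) approach described above: the vectors stay a fixed lookup (freeze=True here
#     stands in for flair's gensim lookup), and a trainable linear 'reprojection'
#     on top takes over the role of fine-tuning
fixed_lookup = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
reprojection = nn.Linear(100, 100)

word_ids = torch.tensor([1, 42, 7])
vectors = reprojection(fixed_lookup(word_ids))   # only the reprojection receives gradients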

Only CharacterEmbeddings (here) are updated, as proposed by Lample et al. :)

But Lample et al. update/fine-tune word embeddings as well. From the paper:

Embeddings are pretrained using skip-n-gram (Ling et al., 2015a), a variation of word2vec (Mikolov et al., 2013a) that accounts for word order. These embeddings are fine-tuned during training

@stefan-it
