Hi,
A little question about the French model.
Do you plan to include the CamemBERT model (https://camembert-model.fr) in an upcoming release soon?
Thanks
I tested it locally for a French NER dataset and the results are really great (compared to the multilingual BERT model).
I'm going to add CamemBERT to Flair in the next days :)
Great news! Let us know when it's done :)
Great news! I'm eager to use the model with Flair!
Do I need to check out a new branch to get it?
I can't find it in the current release.
@stefan-it, can you confirm that CamemBERT with Flair is ready to be used?
@robinalexandre You just need to install a recent version of transformers, then you can use the camembert-embeddings branch of Flair.
The implementation is working, I just need to add some unit tests and a dependency update of transformers :)
Great work @stefan-it! Could you please show how to get the embedding vector of a sentence with the HF transformers lib? (the equivalent of Flair's embedding.embed([Sentence("Salut, ça va")]))
To get embeddings for the last layer with transformers:
import torch
from transformers import CamembertTokenizer
from transformers import CamembertModel
tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = CamembertModel.from_pretrained("camembert-base")
sent = "J'aime le camembert !"
input_ids = torch.tensor(tokenizer.encode(sent, add_special_tokens=True)).unsqueeze(0)
outputs = model(input_ids)
last_hidden_states = outputs[0]
With:
In [22]: input_ids
Out[22]: tensor([[ 5, 121, 11, 660, 16, 730, 25543, 110, 83, 6]])
In [23]: last_hidden_states.shape
Out[23]: torch.Size([1, 10, 768])
Great thank you very much! I really appreciate your quick reply!
I got the idea: we need to pass the input_ids to the model and take the last hidden states :)
Based on your code, I defined a function
def embed(tokenizer, model, sentence):
input_ids = torch.tensor(tokenizer.encode(sentence, add_special_tokens=True)).unsqueeze(0)
outputs = model(input_ids)
last_hidden_states = outputs[0]
return last_hidden_states
That will return a tensor whose shape depends on the number of tokens in the sentence.
How do I get a fixed-size embedding vector (like Flair does), so that I can compute, e.g., cosine distance between two sentences even if they have different token counts?
By taking the mean over axis=1, for example? (since axis=0 and axis=2 always have the same size)
I've asked the question on Stack Overflow too, in case you'd like to answer there: https://stackoverflow.com/questions/59030907/nlp-transformers-how-to-get-a-fixed-embedding-vector-size :)
Thanks in advance :)
To get a fixed-size sentence embedding vector, you could e.g. perform pooling operations, like mean or max pooling, over all output vectors (one per subtoken).
I think a good reference is the Reimers and Gurevych paper (Sentence-BERT), including their repository on GitHub for Transformer-based sentence embeddings :)
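A minimal sketch of that mean-pooling idea in plain Python, using hypothetical toy 4-dimensional token vectors instead of the real (1, n_tokens, 768) hidden states; in the transformers code above, the same thing is last_hidden_states.mean(dim=1):

```python
import math

def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size sentence vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two "sentences" with different token counts (toy values, not real embeddings).
sent_a = [[1.0, 0.0, 2.0, 1.0], [3.0, 2.0, 0.0, 1.0]]                          # 2 tokens
sent_b = [[2.0, 1.0, 1.0, 0.0], [0.0, 1.0, 3.0, 2.0], [1.0, 1.0, 1.0, 1.0]]    # 3 tokens

vec_a = mean_pool(sent_a)  # [2.0, 1.0, 1.0, 1.0]
vec_b = mean_pool(sent_b)

# Both pooled vectors have the same length, so cosine distance is well defined.
print(len(vec_a), len(vec_b))           # 4 4
print(cosine_similarity(vec_a, vec_b))
```

Because pooling collapses the token axis, the result is comparable across sentences of any length, which is exactly what the axis=1 mean achieves on the transformers output tensor.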