Hi,
A little question about the French model.
Do you plan to include the CamemBERT model (https://camembert-model.fr) in an upcoming release soon?
Thanks
I tested it locally for a French NER dataset and the results are really great (compared to the multilingual BERT model).
I'm going to add CamemBERT to Flair in the next days :)
Great news! Let us know when it's done :)
Great news! I'm eager to use the model with Flair!
Do I need to check out a new branch to get it?
I can't find it in the current release.
@stefan-it, can you confirm that CamemBERT with Flair is ready to be used?
@robinalexandre You just need to install a recent version of transformers, then you can use the camembert-embeddings branch of Flair.
The implementation is working, I just need to add some unit tests and a dependency update of transformers :)
Great work @stefan-it! Could you please show how to get the embedding vector of a sentence with the HF transformers lib? (the equivalent of Flair's embedding.embed([Sentence("Salut, ça va")]))
To get embeddings for the last layer with transformers:
import torch
from transformers import CamembertTokenizer
from transformers import CamembertModel
tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = CamembertModel.from_pretrained("camembert-base")
sent = "J'aime le camembert !"
input_ids = torch.tensor(tokenizer.encode(sent, add_special_tokens=True)).unsqueeze(0)
outputs = model(input_ids)
last_hidden_states = outputs[0]
With:
In [22]: input_ids
Out[22]: tensor([[ 5, 121, 11, 660, 16, 730, 25543, 110, 83, 6]])
In [23]: last_hidden_states.shape
Out[23]: torch.Size([1, 10, 768])
Great thank you very much! I really appreciate your quick reply!
I got the idea: we need to pass the input_ids to the model and take the last hidden states :)
Based on your code, I defined a function
def embed(tokenizer, model, sentence):
input_ids = torch.tensor(tokenizer.encode(sentence, add_special_tokens=True)).unsqueeze(0)
outputs = model(input_ids)
last_hidden_states = outputs[0]
return last_hidden_states
That will return a tensor whose shape depends on the number of tokens in the sentence.
How do I get a fixed-size embedding vector (like Flair does), so that I can compute, e.g., cosine distance between two sentences even if they have different token counts?
By taking the mean over axis=1, for example? (since axis=0 and axis=2 always have the same size)
I've asked the question on Stack Overflow too, in case you'd like to answer there: https://stackoverflow.com/questions/59030907/nlp-transformers-how-to-get-a-fixed-embedding-vector-size :)
Thanks in advance :)
To get a fixed-size sentence embedding vector, you could e.g. perform pooling operations, like mean or max pooling, over all output vectors (one per subtoken).
I think a good reference is the Reimers and Gurevych paper (Sentence-BERT), including their repository on GitHub for Transformer-based sentence embeddings :)
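A minimal sketch of that mean-pooling idea in plain Python, using hypothetical toy 4-dimensional token vectors instead of the real (1, n_tokens, 768) hidden states; in the transformers code above, the same thing is last_hidden_states.mean(dim=1):

```python
import math

def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size sentence vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two "sentences" with different token counts (toy values, not real embeddings).
sent_a = [[1.0, 0.0, 2.0, 1.0], [3.0, 2.0, 0.0, 1.0]]                          # 2 tokens
sent_b = [[2.0, 1.0, 1.0, 0.0], [0.0, 1.0, 3.0, 2.0], [1.0, 1.0, 1.0, 1.0]]    # 3 tokens

vec_a = mean_pool(sent_a)  # [2.0, 1.0, 1.0, 1.0]
vec_b = mean_pool(sent_b)

# Both pooled vectors have the same length, so cosine distance is well defined.
print(len(vec_a), len(vec_b))           # 4 4
print(cosine_similarity(vec_a, vec_b))
```

Because pooling collapses the token axis, the result is comparable across sentences of any length, which is exactly what the axis=1 mean achieves on the transformers output tensor.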