Flair: How to use a list of words as feature vectors along with StackedEmbeddings?

Created on 17 Mar 2021 · 6 comments · Source: flairNLP/flair

In my NLP pipeline, I'm using multiple embeddings like BERT and GloVe and combining them with StackedEmbeddings.

Now I have lists of words, like location names, company names, first names, and last names, which the other embeddings treat as unknown. I want to use feature vectors for these words in my pipeline.

My intuition: I can create a one-hot embedding using the lists of words as the vocabulary and then stack it with the other embeddings.
Do you think this is right? @alanakbik
What is your suggestion?

Please comment if anything needs clarification.
Thanks in advance :)
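That intuition can be sketched in plain Python. This is a minimal, self-contained illustration (the gazetteer contents and names below are made up, not from Flair): each word gets a vector with one entry per word list, set to 1 when the list contains the word.

```python
# Hypothetical gazetteers (word lists); the sorted class order fixes the vector layout.
GAZETTEERS = {
    "LOC": {"berlin", "paris"},
    "ORG": {"siemens", "zalando"},
    "PER": {"alice", "bob"},
}
CLASSES = sorted(GAZETTEERS)  # ["LOC", "ORG", "PER"]

def gazetteer_vector(word):
    """One-hot-style feature vector: 1 for each class whose word list contains the word."""
    w = word.lower()
    return [1 if w in GAZETTEERS[c] else 0 for c in CLASSES]

print(gazetteer_vector("Berlin"))   # in the LOC list only -> [1, 0, 0]
print(gazetteer_vector("unknown"))  # in no list -> [0, 0, 0]
```

Such vectors could then be stacked alongside the other embeddings, exactly as one more per-token feature.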

question

All 6 comments

Hello @KushalVijay, do you mean you want to use these lists like gazetteers? If so, we currently do not have support for gazetteers, but this is something we are thinking of adding in the near future.

Thank you for your response @alanakbik. Is there any workaround I can use in the meantime? Suggestions please.

[image]

Like this, Paper Link

Hi, I've done something similar for a Sequence Tagging task related to NER.

The procedure (for unigrams) can be summed up as follows:

  1. Create a one-hot vector for each of your known named entity classes.
  2. For each word in your text, create an _n_-dimensional zero vector, where _n_ is the number of your known named entity classes. If the word is present in the known terms of a named entity class, set the entry corresponding to that class to 1.
  3. Assign a unique key to each word in your text (or to each named entity class; it really depends on the use case).
  4. Save a Gensim KeyedVectors object containing the vector for each word in your text (or for each named entity class). Something like this:
```python
from gensim.models import KeyedVectors

kv = KeyedVectors(vector_size)

for sentence in sentences:
    for term in sentence:
        # You can add vectors in bulk using add_vectors.
        # Note that different Gensim versions use slightly different names for this function (e.g., add).
        kv.add_vector(term["key"], term["features"])

kv.save(keyed_vectors_path)
```
  5. Save your dataset in IOB format with a third column for the keys.
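For illustration, a few lines of such a file might look like this (the tokens and key names are made up):

```
John    B-PER   k_john
Smith   I-PER   k_smith
works   O       k_works
at      O       k_at
Google  B-ORG   k_google
```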
  6. Add column names when loading the corpus:
```python
corpus: Corpus = ColumnCorpus(
    data_folder,
    columns={0: "text", 1: "label", 2: "key"},  # THIS
    train_file="train.iob",
    dev_file="dev.iob",
    test_file="test.iob",
)
```
  7. Add the Gensim KeyedVectors when defining your embedding strategy:
```python
embedding_types = [
    WordEmbeddings(embeddings=keyed_vectors_path, field="key"),  # THIS
    ...
]
```

Note that the procedure can be extended to bigrams (or n-grams) by simply taking the bigrams of the known named entities for each class into account when computing the word vectors.
You can actually use the same procedure to add anything you want.
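The bigram extension can be sketched in plain Python. This is a hypothetical, self-contained example (the gazetteer contents and function name are made up); in the full pipeline the resulting per-token features would be written into the KeyedVectors as above.

```python
# Hypothetical bigram gazetteer for a single class, ORG.
ORG_BIGRAMS = {("deutsche", "bank"), ("goldman", "sachs")}

def org_bigram_flags(tokens):
    """For each token, 1 if it participates in a known ORG bigram, else 0."""
    toks = [t.lower() for t in tokens]
    flags = [0] * len(toks)
    for i in range(len(toks) - 1):
        if (toks[i], toks[i + 1]) in ORG_BIGRAMS:
            flags[i] = flags[i + 1] = 1
    return flags

print(org_bigram_flags(["He", "joined", "Deutsche", "Bank", "yesterday"]))
# -> [0, 0, 1, 1, 0]
```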

In my case I only use the embeddings as input to a CRF with no layers in between, as that works better for my task.
I suspect projecting these kinds of embeddings performs poorly because the information gets lost when mixing them with, say, BERT or Flair embeddings.

Hope this can be of help.

Thanks for sharing such amazing stuff @AmenRa. I'm working with unigrams only, so I'll try your approach; hopefully it will give better results.

We are testing a hypothesis, so it doesn't matter much whether it performs well or not.

The thread is still open; please share any further suggestions.

