Hi,
At the moment I'm trying to extract features from the second-to-last layer using the "run_lm_finetuning.py" script in combination with the setting "output_hidden_states=True".
I'm wondering if this new FeatureExtractionPipeline would be a good alternative, and how to get started using it. I've been reading the documentation, and so far I've figured out that I should do something along the lines of:
from transformers import pipeline
nlp = pipeline('feature-extraction', model='', config='', tokenizer='', binary_output=True,)
I'm pretty sure I'm still missing some important parameters and details, however; the input and output parameters, for example. Looking at the code alone leaves me a little puzzled, since I'm not very proficient with Python and PyTorch yet, and the official documentation doesn't have many examples for this new feature.
Can someone please help me get started with this new feature by giving a good example and pointing me toward the most important parameters?
@Stuffooh, the following is based on my understanding and experiments.
By default, nlp = pipeline('feature-extraction') uses distilbert-base-uncased for both the model and the tokenizer.
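If you want to be explicit about it (filling in the empty strings from your snippet above), something like this should work; the checkpoint name here is just an illustrative choice, and any compatible model can be substituted:
from transformers import pipeline
# explicit model/tokenizer instead of the defaults; distilbert-base-uncased
# is just an example checkpoint, not the only option
nlp = pipeline('feature-extraction', model='distilbert-base-uncased', tokenizer='distilbert-base-uncased')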
The nlp object takes a sentence as input and outputs token-level vectors; note that token-level doesn't necessarily equal word-level, since BERT uses WordPiece tokenization. The examples below show this.
sent = nlp("This is a dog.")
# get length of output
print(len(sent[0]))
> 7
# seven because a [CLS] token is added to the start and a [SEP] token to the end of the sentence, and the full stop `.` counts as its own token
sent = nlp("This is a untrained dog.")
# get length of output
print(len(sent[0]))
> 10
# similar to the example above, with the addition of the word `untrained`, which in this case is broken up into three sub-pieces (tokens)
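Each of those positions holds one hidden-state vector whose length equals the model's hidden size (768 for DistilBERT). A quick way to check the full output shape, assuming numpy is installed:
import numpy as np
sent = nlp("This is a dog.")
# outer list is the batch, then one 768-dim vector per token
print(np.array(sent).shape)
> (1, 7, 768)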
@leungi What I'm still wondering, though, is how to fine-tune models when using the feature-extraction pipeline. How would I fine-tune for 3 epochs with a given learning rate, for example?
I feel like I am missing something here. In the run_lm_finetuning.py script, for example, it is easy and clear to pass all of these parameters while outputting the model's hidden states.
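For context, this is roughly how I extract the second-to-last layer at the moment, without the pipeline; a sketch assuming a transformers version where models return their hidden states when output_hidden_states=True is set:
import torch
from transformers import AutoModel, AutoTokenizer

# illustrative checkpoint; any BERT-style model should behave the same way
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModel.from_pretrained('distilbert-base-uncased', output_hidden_states=True)

inputs = tokenizer("This is a dog.", return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple: the embedding output plus one tensor per layer
second_last = outputs.hidden_states[-2]  # shape: (batch, seq_len, hidden_size)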
@leungi How can I visualise which tokens the embeddings have been assigned to?
@gsasikiran, check out spacyface.
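For a quick look without extra tooling, the pipeline's own tokenizer can also list the sub-word token behind each output vector; a minimal sketch, assuming the nlp pipeline from above and a recent transformers version:
# map each output position back to its WordPiece token
ids = nlp.tokenizer("This is a untrained dog.")['input_ids']
print(nlp.tokenizer.convert_ids_to_tokens(ids))
> ['[CLS]', 'this', 'is', 'a', 'un', '##train', '##ed', 'dog', '.', '[SEP]']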