Fairseq: How to load a pretrained model from huggingface and use it in fairseq?

Created on 28 Sep 2020 · 7 comments · Source: pytorch/fairseq

I want to load bert-base-chinese from huggingface (or Google's BERT) and fine-tune it with fairseq. How can I do this? Thanks a lot!

Labels: enhancement, help wanted

All 7 comments

me too, hope for answers

It should be straightforward to wrap huggingface models in the corresponding fairseq abstractions. We've done this for the gpt2 language model implementation in huggingface: https://github.com/pytorch/fairseq/blob/master/fairseq/models/huggingface/hf_gpt2.py

It'd be great to add more wrappers for other model types (e.g., FairseqEncoderModel for BERT-like models) and also to generalize it to load arbitrary pretrained models from huggingface (e.g., using AutoModel).

PRs are welcome! 😄

@myleott With the suggested approach, can we use a pretrained Hugging Face checkpoint?

I feel like we would also need to change the data preprocessing steps:

  1. Tokenization
  2. The fairseq-preprocess step. (Here I don't understand how to create dict.txt.)

Fairseq doesn’t really do any preprocessing. If you want to apply tokenization or BPE, that should happen outside of fairseq, then you can feed the resulting text into fairseq-preprocess/train.

Steps might be:
1) start with raw text training data
2) use huggingface to tokenize and apply BPE. Get back a text file with BPE tokens separated by spaces
3) feed the output of step 2 into fairseq-preprocess, which will tensorize the data and generate dict.txt
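The steps above can be sketched end-to-end. The snippet below is an illustration of the file formats involved, not a substitute for fairseq-preprocess itself: it writes a space-separated token file (step 2's output shape) and then builds a frequency-sorted `dict.txt` in the `<token> <count>` format that fairseq-preprocess emits. The "tokens" here are hand-written stand-ins; a real pipeline would produce them with a Hugging Face tokenizer.

```python
from collections import Counter

# Step 2 (stand-in): pretend these lines are already BPE-tokenized,
# with tokens separated by spaces, as a Hugging Face tokenizer would emit.
tokenized_lines = [
    "hel lo wor ld",
    "hel lo fair seq",
]

with open("train.bpe.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(tokenized_lines) + "\n")

# Step 3 (illustration): fairseq-preprocess builds dict.txt by counting
# token frequencies and writing "<token> <count>" lines, most frequent first.
counts = Counter(tok for line in tokenized_lines for tok in line.split())
with open("dict.txt", "w", encoding="utf-8") as f:
    for tok, n in counts.most_common():
        f.write(f"{tok} {n}\n")

print(open("dict.txt").read())
```

In practice you would simply run `fairseq-preprocess` on the tokenized file and let it produce dict.txt and the binarized data for you.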

@myleott Is it necessary to go through fairseq-preprocess?
How about just using the output of the Hugging Face tokenizer (raw text like "您好,世界" as the tokenizer's input, a dict of tensors as output) as the model's input?

    from transformers import BertModel, BertTokenizer

    # model_path: a local path or hub name, e.g. "bert-base-chinese"
    tokenizer = BertTokenizer.from_pretrained(model_path)
    model = BertModel.from_pretrained(model_path)
    input_texts = ["您好, 世界"]
    inputs = tokenizer(input_texts, padding=True, return_tensors='pt')
    print("inputs:{}".format(inputs))

got:

    inputs:{'input_ids': tensor([[ 101, 2644, 1962,  117,  686, 4518,  102]]),
    'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]),
    'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}

Thank you!

You can do that, but it will slow down your training, especially the data-feeding part.


Is there an example of using the code in https://github.com/pytorch/fairseq/blob/master/fairseq/models/huggingface/hf_gpt2.py ?
@myleott @shamanez

It seems like this is only a wrapper; is there more that needs to be done if we want to load the pretrained GPT-2 model from Hugging Face?

Thank you!
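For what it's worth, hf_gpt2.py registers an architecture named `hf_gpt2` for the language-modeling task, so training would presumably be invoked like an ordinary fairseq LM run. The following is a hedged sketch, not a verified recipe: the data directory and hyperparameter flags are placeholders, and the wrapper constructs the model from a config, so loading the actual pretrained GPT-2 weights may indeed require extra code, as the question above suggests.

```shell
# Illustrative fragment only: data-bin/my-corpus and all hyperparameters
# are placeholders; data must first be prepared with fairseq-preprocess.
fairseq-train data-bin/my-corpus \
    --task language_modeling \
    --arch hf_gpt2 \
    --optimizer adam --lr 0.0001 \
    --max-tokens 2048
```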
