I want to load bert-base-chinese (from HuggingFace, or Google's BERT release) and fine-tune it with fairseq. How can I do this? Thanks a lot!
Me too, hoping for answers.
It should be straightforward to wrap huggingface models in the corresponding fairseq abstractions. We've done this for the gpt2 language model implementation in huggingface: https://github.com/pytorch/fairseq/blob/master/fairseq/models/huggingface/hf_gpt2.py
It'd be great to add more wrappers for other model types (e.g., FairseqEncoderModel for BERT-like models) and also to generalize it to load arbitrary pretrained models from huggingface (e.g., using AutoModel).
PRs are welcome! 😄
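For illustration, a minimal sketch of what such a FairseqEncoderModel wrapper might look like, mirroring the structure of hf_gpt2.py. This is untested and not an existing fairseq API: the class names, the --hf-model-name flag, and the AutoModel usage are all illustrative, and a real wrapper would also need a register_model_architecture entry.

from fairseq.models import (
    FairseqEncoder,
    FairseqEncoderModel,
    register_model,
)
from transformers import AutoModel


class HuggingFaceBertEncoder(FairseqEncoder):
    def __init__(self, args, dictionary):
        super().__init__(dictionary)
        # Load any BERT-like checkpoint by hub name or local path.
        self.model = AutoModel.from_pretrained(args.hf_model_name)

    def forward(self, src_tokens, src_lengths=None, **kwargs):
        # Build HuggingFace's attention_mask from fairseq's pad index.
        attention_mask = src_tokens.ne(self.dictionary.pad()).long()
        out = self.model(input_ids=src_tokens, attention_mask=attention_mask)
        return {'encoder_out': out.last_hidden_state}


@register_model('hf_bert')
class HuggingFaceBertModel(FairseqEncoderModel):
    @staticmethod
    def add_args(parser):
        parser.add_argument('--hf-model-name', type=str,
                            default='bert-base-chinese',
                            help='HuggingFace model name or path')

    @classmethod
    def build_model(cls, args, task):
        return cls(HuggingFaceBertEncoder(args, task.source_dictionary))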
@myleott With the approach you suggest, can we use a pretrained huggingface checkpoint directly?
I feel like the data preprocessing steps would need special changes for it.
Fairseq doesn’t really do any preprocessing. If you want to apply tokenization or BPE, that should happen outside of fairseq, then you can feed the resulting text into fairseq-preprocess/train.
Steps might be:
1) start with raw text training data
2) use huggingface to tokenize and apply BPE. Get back a text file with BPE tokens separated by spaces
3) feed the output of step 2 into fairseq-preprocess, which will tensorize it and generate dict.txt (a sketch of these steps follows)
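A minimal sketch of those three steps, assuming bert-base-chinese; the file names train.raw, train.tok, and data-bin are illustrative:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

# Step 2: WordPiece-tokenize raw text into space-separated tokens,
# e.g. "您好, 世界" -> "您 好 , 世 界"
with open('train.raw', encoding='utf-8') as fin, \
        open('train.tok', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(' '.join(tokenizer.tokenize(line.strip())) + '\n')

# Step 3: binarize with fairseq, which also generates dict.txt:
#   fairseq-preprocess --only-source --trainpref train.tok --destdir data-bin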
@myleott Is it necessary to go through fairseq-preprocess?
What about just using the output of the huggingface tokenizer as the model's input (raw text like "您好, 世界" as the tokenizer's input, a dict of tensors as the output)?
from transformers import BertModel, BertTokenizer

model_path = 'bert-base-chinese'  # local checkpoint dir or hub model name
tokenizer = BertTokenizer.from_pretrained(model_path)
model = BertModel.from_pretrained(model_path)

input_texts = ["您好, 世界"]
inputs = tokenizer(input_texts, padding=True, return_tensors='pt')
print("inputs:{}".format(inputs))
inputs:{
'input_ids': tensor([[ 101, 2644, 1962, 117, 686, 4518, 102]]),
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}
Thank you!
You can do it, but it will slow down your training, especially the data-feeding part.
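One way to keep that data-feeding cost down, as a minimal sketch (the class name and the pad-everything-up-front strategy are illustrative, not fairseq machinery): tokenize the whole corpus once and cache the tensors, so the DataLoader never re-tokenizes per batch.

from torch.utils.data import Dataset
from transformers import BertTokenizer


class CachedTextDataset(Dataset):
    """Tokenize once at construction; every epoch reuses the cached tensors."""

    def __init__(self, texts, model_path='bert-base-chinese'):
        tokenizer = BertTokenizer.from_pretrained(model_path)
        enc = tokenizer(texts, padding=True, return_tensors='pt')
        self.input_ids = enc['input_ids']
        self.attention_mask = enc['attention_mask']

    def __len__(self):
        return self.input_ids.size(0)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attention_mask[idx]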
Is there an example of using the code in https://github.com/pytorch/fairseq/blob/master/fairseq/models/huggingface/hf_gpt2.py ?
@myleott @shamanez
It seems like this is only a wrapper; is there more that needs to be done if we want to load the pretrained gpt2 model from huggingface?
Thank you!
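As a hedged sketch of what using the wrapper might look like (not an official answer): hf_gpt2.py registers an 'hf_gpt2' architecture for the language_modeling task, but it builds the inner GPT2LMHeadModel from a fresh config, so pretrained weights would indeed have to be copied in manually. The helper below is hypothetical and only works if the fairseq dictionary was built with GPT-2's own BPE, so that the embedding shapes match.

from transformers import GPT2LMHeadModel

# Training with the wrapper would look roughly like:
#   fairseq-train data-bin --task language_modeling --arch hf_gpt2 ...


def load_pretrained_gpt2_weights(fairseq_model, hf_name='gpt2'):
    # Hypothetical helper: copy HuggingFace GPT-2 weights into the wrapper's
    # inner model (fairseq_model.decoder.model is the GPT2LMHeadModel built
    # in hf_gpt2.py). Shapes only match if the fairseq dict uses GPT-2's BPE.
    pretrained = GPT2LMHeadModel.from_pretrained(hf_name)
    fairseq_model.decoder.model.load_state_dict(pretrained.state_dict())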