I want to load bert-base-chinese (from HuggingFace, or Google's BERT release) and fine-tune it with fairseq. How can I do this? Thanks a lot!
Me too, hoping for answers.
It should be straightforward to wrap huggingface models in the corresponding fairseq abstractions. We've done this for the gpt2 language model implementation in huggingface: https://github.com/pytorch/fairseq/blob/master/fairseq/models/huggingface/hf_gpt2.py
It'd be great to add more wrappers for other model types (e.g., FairseqEncoderModel for BERT-like models) and also to generalize it to load arbitrary pretrained models from huggingface (e.g., using AutoModel).
PRs are welcome! 😄
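For illustration, a minimal sketch of what such a FairseqEncoderModel wrapper might look like, mirroring the structure of hf_gpt2.py. This is untested and not an existing fairseq API: the class names, the --hf-model-name flag, and the AutoModel usage are all illustrative, and a real wrapper would also need a register_model_architecture entry.

from fairseq.models import (
    FairseqEncoder,
    FairseqEncoderModel,
    register_model,
)
from transformers import AutoModel


class HuggingFaceBertEncoder(FairseqEncoder):
    def __init__(self, args, dictionary):
        super().__init__(dictionary)
        # Load any BERT-like checkpoint by hub name or local path.
        self.model = AutoModel.from_pretrained(args.hf_model_name)

    def forward(self, src_tokens, src_lengths=None, **kwargs):
        # Build HuggingFace's attention_mask from fairseq's pad index.
        attention_mask = src_tokens.ne(self.dictionary.pad()).long()
        out = self.model(input_ids=src_tokens, attention_mask=attention_mask)
        return {'encoder_out': out.last_hidden_state}


@register_model('hf_bert')
class HuggingFaceBertModel(FairseqEncoderModel):
    @staticmethod
    def add_args(parser):
        parser.add_argument('--hf-model-name', type=str,
                            default='bert-base-chinese',
                            help='HuggingFace model name or path')

    @classmethod
    def build_model(cls, args, task):
        return cls(HuggingFaceBertEncoder(args, task.source_dictionary))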
@myleott With the approach you suggest, can we use a pretrained huggingface checkpoint directly?
I feel like the data preprocessing steps would need special changes for it.
Fairseq doesn’t really do any preprocessing. If you want to apply tokenization or BPE, that should happen outside of fairseq, then you can feed the resulting text into fairseq-preprocess/train.
Steps might be:
1) start with raw text training data
2) use huggingface to tokenize and apply BPE. Get back a text file with BPE tokens separated by spaces
3) feed the output of step 2 into fairseq-preprocess, which will tensorize it and generate dict.txt (a sketch of these steps follows)
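A minimal sketch of those three steps, assuming bert-base-chinese; the file names train.raw, train.tok, and data-bin are illustrative:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

# Step 2: WordPiece-tokenize raw text into space-separated tokens,
# e.g. "您好, 世界" -> "您 好 , 世 界"
with open('train.raw', encoding='utf-8') as fin, \
        open('train.tok', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(' '.join(tokenizer.tokenize(line.strip())) + '\n')

# Step 3: binarize with fairseq, which also generates dict.txt:
#   fairseq-preprocess --only-source --trainpref train.tok --destdir data-bin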
@myleott Is it necessary to go through fairseq-preprocess?
What about just using the output of the huggingface tokenizer as the model's input (raw text like "您好, 世界" as the tokenizer's input, a dict of tensors as the output)?
from transformers import BertModel, BertTokenizer

model_path = 'bert-base-chinese'  # local checkpoint dir or hub model name
tokenizer = BertTokenizer.from_pretrained(model_path)
model = BertModel.from_pretrained(model_path)

input_texts = ["您好, 世界"]
inputs = tokenizer(input_texts, padding=True, return_tensors='pt')
print("inputs:{}".format(inputs))
inputs:{
'input_ids': tensor([[ 101, 2644, 1962, 117, 686, 4518, 102]]),
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}
Thank you!
You can do it, but it will slow down your training, especially the data-feeding part.
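One way to keep that data-feeding cost down, as a minimal sketch (the class name and the pad-everything-up-front strategy are illustrative, not fairseq machinery): tokenize the whole corpus once and cache the tensors, so the DataLoader never re-tokenizes per batch.

from torch.utils.data import Dataset
from transformers import BertTokenizer


class CachedTextDataset(Dataset):
    """Tokenize once at construction; every epoch reuses the cached tensors."""

    def __init__(self, texts, model_path='bert-base-chinese'):
        tokenizer = BertTokenizer.from_pretrained(model_path)
        enc = tokenizer(texts, padding=True, return_tensors='pt')
        self.input_ids = enc['input_ids']
        self.attention_mask = enc['attention_mask']

    def __len__(self):
        return self.input_ids.size(0)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attention_mask[idx]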
Is there an example of using the code in https://github.com/pytorch/fairseq/blob/master/fairseq/models/huggingface/hf_gpt2.py ?
@myleott @shamanez
It seems like this is only a wrapper; is there more that needs to be done if we want to load the pretrained gpt2 model from huggingface?
Thank you!
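As a hedged sketch of what using the wrapper might look like (not an official answer): hf_gpt2.py registers an 'hf_gpt2' architecture for the language_modeling task, but it builds the inner GPT2LMHeadModel from a fresh config, so pretrained weights would indeed have to be copied in manually. The helper below is hypothetical and only works if the fairseq dictionary was built with GPT-2's own BPE, so that the embedding shapes match.

from transformers import GPT2LMHeadModel

# Training with the wrapper would look roughly like:
#   fairseq-train data-bin --task language_modeling --arch hf_gpt2 ...


def load_pretrained_gpt2_weights(fairseq_model, hf_name='gpt2'):
    # Hypothetical helper: copy HuggingFace GPT-2 weights into the wrapper's
    # inner model (fairseq_model.decoder.model is the GPT2LMHeadModel built
    # in hf_gpt2.py). Shapes only match if the fairseq dict uses GPT-2's BPE.
    pretrained = GPT2LMHeadModel.from_pretrained(hf_name)
    fairseq_model.decoder.model.load_state_dict(pretrained.state_dict())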