Hi, I noticed the GPT-2 architecture is now available from #1264. But it seems we can only train GPT-2 from scratch. Is there any way to fine-tune the pretrained GPT-2 for the LM task?
Currently, I don't believe that we have any pre-trained GPT-2 models, but you should be able to use RoBERTa instead.
Hi, I am also trying to fine-tune GPT-2 on LM tasks. Have you tried using the GPT-2 model checkpoint from Hugging Face Transformers?
@MichaelZhouwang I don't believe this will work, since fairseq has a different class structure than the PyTorch Transformers library.
Nonetheless, to verify, here's what I got:
In [6]: custom_lm = TransformerLanguageModel.from_pretrained('./checkpoints/gpt2', 'gpt2-pytorch_model.bin', tokenizer='moses', bpe='gpt2')
KeyError                                  Traceback (most recent call last)
<ipython-input-6-8e8df1ba448b> in <module>
----> 1 custom_lm = TransformerLanguageModel.from_pretrained('./checkpoints/gpt2', 'gpt2-pytorch_model.bin', tokenizer='moses', bpe='gpt2')

./fairseq/fairseq/models/fairseq_model.py in from_pretrained(cls, model_name_or_path, checkpoint_file, data_name_or_path, **kwargs)
216 data_name_or_path,
217 archive_map=cls.hub_models(),
--> 218 **kwargs,
219 )
220 logger.info(x["args"])
./fairseq/fairseq/hub_utils.py in from_pretrained(model_name_or_path, checkpoint_file, data_name_or_path, archive_map, **kwargs)
71 models, args, task = checkpoint_utils.load_model_ensemble_and_task(
72 [os.path.join(model_path, cpt) for cpt in checkpoint_file.split(os.pathsep)],
---> 73 arg_overrides=kwargs,
74 )
75
./fairseq/fairseq/checkpoint_utils.py in load_model_ensemble_and_task(filenames, arg_overrides, task, strict)
197 if not PathManager.exists(filename):
198 raise IOError("Model file not found: {}".format(filename))
--> 199 state = load_checkpoint_to_cpu(filename, arg_overrides)
200
201 args = state["args"]
./fairseq/fairseq/checkpoint_utils.py in load_checkpoint_to_cpu(path, arg_overrides)
167 )
168
--> 169 args = state["args"]
170 if arg_overrides is not None:
171 for arg_name, arg_val in arg_overrides.items():
KeyError: 'args'
It seems that someone recently implemented a mapping from the HF class structure to the fairseq one:
https://github.com/pytorch/fairseq/blob/master/fairseq/models/huggingface/hf_gpt2.py
A quick and dirty way to make GPT-2 work with this is to adapt that model to load a pretrained HF model instead of building from a config file, as the fairseq example currently does.
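For example, something along these lines (a rough sketch only; the helper name build_hf_gpt2 is hypothetical, and it assumes a reasonably recent Hugging Face Transformers where GPT2Config accepts these keyword arguments):

from transformers import GPT2Config, GPT2LMHeadModel

def build_hf_gpt2(args, load_pretrained=True):
    # hypothetical replacement for the config-only construction in hf_gpt2.py
    if load_pretrained:
        # load OpenAI's released GPT-2 small weights instead of a random init
        return GPT2LMHeadModel.from_pretrained('gpt2')
    # current behaviour: random init from a config assembled from fairseq args
    config = GPT2Config(
        vocab_size=50257,
        n_positions=args.max_target_positions,
        n_embd=args.embed_dim,
        n_layer=args.num_layers,
        n_head=args.num_attention_heads,
    )
    return GPT2LMHeadModel(config)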
I've been trying to adapt Hugging Face's GPT-2 small model. There are several things I've done to get this to work:
First, the architecture flags passed to fairseq-train:

--arch hf_gpt2 --embed-dim 768 --num-layers 12 --num-attention-heads 12 --max-target-positions 1024 \

Second, copying the pretrained HF weights into the fairseq checkpoint:

for name, param in hf_pretrained_gpt2_model.named_parameters():
    if name == 'wte.weight':
        # fairseq reserves the first 4 vocab entries for its special symbols,
        # so the GPT-2 token embeddings start at row 4
        fairseq_gpt2_small['model']['decoder.model.transformer.' + name][4:, :] = param[:]
        # keep the tied output projection in sync with the shifted embeddings
        fairseq_gpt2_small['model']['decoder.model.lm_head.weight'][:] = fairseq_gpt2_small['model']['decoder.model.transformer.' + name][:]
    elif name == 'wpe.weight':
        # position embeddings get an offset of one row here
        fairseq_gpt2_small['model']['decoder.model.transformer.' + name][1:, :] = param[:]
    elif 'decoder.model.transformer.' + name in fairseq_gpt2_small['model'].keys():
        # everything else (attention, MLP, layer norms) maps over name-for-name
        fairseq_gpt2_small['model']['decoder.model.transformer.' + name][:] = param[:]
    else:
        print(name)  # flag any parameter that didn't find a matching key
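For context, here's roughly how I drive that loop (the paths and the way the two objects are obtained are my own setup, not something prescribed by fairseq; it assumes a checkpoint already written by fairseq-train with --arch hf_gpt2):

import torch
from transformers import GPT2LMHeadModel

# the .transformer submodule yields parameter names like 'wte.weight', 'h.0.attn.c_attn.weight'
hf_pretrained_gpt2_model = GPT2LMHeadModel.from_pretrained('gpt2').transformer

# a checkpoint fairseq saved after a short dummy run with --arch hf_gpt2
fairseq_gpt2_small = torch.load('checkpoints/checkpoint_last.pt', map_location='cpu')

# ... run the copy loop above ...

torch.save(fairseq_gpt2_small, 'checkpoints/checkpoint_gpt2_init.pt')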
Nevertheless, it's still not working (I'm getting starting perplexities of around 10^40, so it's clearly doing something, just not what I want).
Has anyone been able to load Huggingface's checkpoints?
Funnily enough, I have done the same procedure, plus modified dictionary.py to use the same BOS symbols as HF, and ran a dummy epoch to save the model. The code works, but the generations are completely wrong.
Do you mean you're also getting insane perplexity (like 10^40), or is there some other issue?
I think I fixed my issue last night :), so look forward to adaptations of Hugging Face's GPT-2 very soon :)
(The issue was the 4 tokens at the start of the fairseq vocab that I was leaving randomly initialized -- it turns out they were being initialized to relatively extreme values, so GPT-2 was always 99.9999% sure the next token would be one of them.)
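In code, the fix boils down to something like the following, applied after the copy loop above (a sketch under the same checkpoint-dict assumptions; the exact initialization matters less than keeping those rows small):

# the first 4 rows (fairseq's <s>, <pad>, </s>, <unk>) were left at their random
# init by the copy loop; rescale them so they don't dominate the softmax
wte = fairseq_gpt2_small['model']['decoder.model.transformer.wte.weight']
wte[:4, :] = 0.01 * torch.randn_like(wte[:4, :])
# keep the tied output projection consistent
fairseq_gpt2_small['model']['decoder.model.lm_head.weight'][:] = wte[:]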
I'm trying to do the same, but where do I put this code?
# parameter grouping as in the pytorch-pretrained-bert / pytorch-transformers examples:
# no weight decay on biases and LayerNorm parameters
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
num_train_optimization_steps = len(train_dataloader) * args.num_train_epochs

# OpenAIAdam comes from pytorch-pretrained-bert (from pytorch_pretrained_bert import OpenAIAdam)
optimizer = OpenAIAdam(optimizer_grouped_parameters,
                       lr=args.learning_rate,
                       warmup=args.warmup_proportion,
                       max_grad_norm=args.max_grad_norm,
                       weight_decay=args.weight_decay,
                       t_total=num_train_optimization_steps)
@thak123 You don't put it anywhere: you do the training with fairseq through its command-line interface, which sets up its own optimizer. See their language modeling example.
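Roughly, the invocation looks something like this (adapted from fairseq's language modeling example; the data path and hyperparameters are illustrative placeholders, not a tuned recipe):

fairseq-train data-bin/my_lm_corpus \
    --task language_modeling \
    --arch hf_gpt2 --embed-dim 768 --num-layers 12 --num-attention-heads 12 --max-target-positions 1024 \
    --restore-file checkpoints/checkpoint_gpt2_init.pt --reset-optimizer --reset-dataloader --reset-meters \
    --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
    --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --tokens-per-sample 1024 --sample-break-mode none \
    --max-tokens 2048 --update-freq 16 --max-update 50000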
@dawndrain thanks for the pointer. Will check it out.