Fairseq: Any way to finetune GPT2?

Created on 16 Jan 2020 · 10 Comments · Source: pytorch/fairseq

Hi, I noticed the gpt2 arch is now available from #1264. But it seems we can only train GPT2 from scratch. Is there any way to finetune the pretrained GPT2 for an LM task?

needs triage question

All 10 comments

Currently, I don't believe we have any pre-trained GPT-2 models, but you should be able to use RoBERTa instead.

Hi, I am also trying to fine-tune GPT-2 on LM tasks. Have you tried using the GPT-2 model checkpoint from Hugging Face Transformers?

@MichaelZhouwang I don't believe this will work, since Fairseq has a different class structure than PyTorch-Transformers.

Nonetheless, to verify, here's what I got:

In [6]: custom_lm = TransformerLanguageModel.from_pretrained('./checkpoints/gpt2', 'gpt2-pytorch_model.bin', tokenizer='moses', bpe='gpt2')

KeyError                                  Traceback (most recent call last)
<ipython-input-6-8e8df1ba448b> in <module>
----> 1 custom_lm = TransformerLanguageModel.from_pretrained('./checkpoints/gpt2', 'gpt2-pytorch_model.bin', tokenizer='moses', bpe='gpt2')

./fairseq/fairseq/models/fairseq_model.py in from_pretrained(cls, model_name_or_path, checkpoint_file, data_name_or_path, **kwargs)
    216             data_name_or_path,
    217             archive_map=cls.hub_models(),
--> 218             **kwargs,
    219         )
    220         logger.info(x["args"])

./fairseq/fairseq/hub_utils.py in from_pretrained(model_name_or_path, checkpoint_file, data_name_or_path, archive_map, **kwargs)
     71     models, args, task = checkpoint_utils.load_model_ensemble_and_task(
     72         [os.path.join(model_path, cpt) for cpt in checkpoint_file.split(os.pathsep)],
---> 73         arg_overrides=kwargs,
     74     )
     75 

./fairseq/fairseq/checkpoint_utils.py in load_model_ensemble_and_task(filenames, arg_overrides, task, strict)
    197         if not PathManager.exists(filename):
    198             raise IOError("Model file not found: {}".format(filename))
--> 199         state = load_checkpoint_to_cpu(filename, arg_overrides)
    200 
    201         args = state["args"]

./fairseq/fairseq/checkpoint_utils.py in load_checkpoint_to_cpu(path, arg_overrides)
    167         )
    168 
--> 169     args = state["args"]
    170     if arg_overrides is not None:
    171         for arg_name, arg_val in arg_overrides.items():

KeyError: 'args'


It seems that someone recently implemented a conversion of the HF class structure to the Fairseq one:
https://github.com/pytorch/fairseq/blob/master/fairseq/models/huggingface/hf_gpt2.py

A quick and dirty way to make GPT-2 work here is to adapt this model to load a pretrained HF model, instead of building it from a config file as the fairseq example currently does.
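A minimal sketch of that idea (not fairseq's actual code; the function name here is just illustrative): build the wrapped model from OpenAI's released weights rather than from a bare config.

from transformers import GPT2Config, GPT2LMHeadModel

def build_hf_gpt2(pretrained: bool = True):
    # hf_gpt2 currently constructs a randomly initialized GPT2LMHeadModel from a config;
    # swapping that for from_pretrained would start from OpenAI's released weights instead
    if pretrained:
        return GPT2LMHeadModel.from_pretrained('gpt2')
    return GPT2LMHeadModel(GPT2Config())

The fairseq-side details (the 4 special tokens at the start of the vocab and the padding offset in the positional embeddings) would still need to be handled, as discussed below.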

I've been trying to adapt Huggingface's GPT2 small model. There are several things I've done to get this to work:

  • Use a dict.txt where the words are in the same order as their indices in HF's gpt2 tokenizer (i.e. '!' has index 0); a sketch for generating such a file is given after the code below.
  • Train fairseq's hf_gpt2 for one dummy epoch to gain access to its checkpoint1.pt file. The special flags I use here are:
    --arch hf_gpt2 --embed-dim 768 --num-layers 12 --num-attention-heads 12 --max-target-positions 1024
  • Load Huggingface's pretrained gpt2 model
  • Use this code to adapt the parameters (we need to skip the first four words in the vocab because they're reserved for bos, pad, eos, and unk, and we need to skip the first positional embedding due to padding):
# fairseq_gpt2_small is assumed to be the dummy fairseq checkpoint, e.g.
# torch.load('checkpoints/checkpoint1.pt'), and hf_pretrained_gpt2_model is
# Huggingface's pretrained GPT2LMHeadModel.
for name, param in hf_pretrained_gpt2_model.named_parameters():
    if name == 'wte.weight':
        # token embeddings: offset by 4 to skip fairseq's bos/pad/eos/unk rows
        fairseq_gpt2_small['model']['decoder.model.transformer.'+name][4:,:] = param[:]
        # keep the tied output projection in sync with the input embeddings
        fairseq_gpt2_small['model']['decoder.model.lm_head.weight'][:] = fairseq_gpt2_small['model']['decoder.model.transformer.'+name][:]
    elif name == 'wpe.weight':
        # positional embeddings: offset by 1 because of fairseq's padding position
        fairseq_gpt2_small['model']['decoder.model.transformer.'+name][1:,:] = param[:]
    elif 'decoder.model.transformer.'+name in fairseq_gpt2_small['model'].keys():
        # everything else maps over directly
        fairseq_gpt2_small['model']['decoder.model.transformer.'+name][:] = param[:]
    else:
        # report any HF parameter with no counterpart in the fairseq checkpoint
        print(name)
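For the first bullet, something like the following should produce the dict.txt (a rough sketch; the counts are dummies, and fairseq will still prepend its 4 special symbols, which is why the embeddings above are offset by 4):

from transformers import GPT2Tokenizer

# write a fairseq-style dict.txt ("<token> <count>" per line) whose entries follow
# the HF GPT-2 tokenizer's index order, so HF token id i ends up at fairseq index i + 4
tok = GPT2Tokenizer.from_pretrained('gpt2')
with open('dict.txt', 'w', encoding='utf-8') as f:
    for token_id in range(len(tok)):
        token = tok.convert_ids_to_tokens(token_id)
        f.write('{} 1\n'.format(token))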

Nevertheless, it's still not working (I'm getting starting perplexities of like 10^40 -- so it's clearly doing something, but not what I want).

Has anyone been able to load Huggingface's checkpoints?

Funnily enough, I have done the same procedure, plus:

  • Editing dictionary.py to use the same BOS symbols as HF.

I ran a dummy epoch to save the model. Code works but generations are completely wrong. I have put my trials on pause for now until maybe more support is given by the Fairseq folks.

I ran a dummy epoch to save the model. Code works but generations are completely wrong.

Do you mean you're also getting insane perplexity (like 10^40), or is there some other issue?

I think I fixed my issue last night :), so look forward to adaptations of Huggingface's GPT2 very soon :)

(The issue was the 4 tokens at the start of the fairseq vocab that I was leaving randomly initialized -- it turns out they were being initialized to relatively extreme values, so GPT2 was always 99.9999% sure the next token would be one of them.)
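A rough sketch of that kind of fix, reusing the names from the checkpoint-editing snippet above (the exact scale is arbitrary, just something small):

import torch

# re-initialize the 4 fairseq special-token rows (bos, pad, eos, unk) to small
# values so they no longer dominate the softmax
wte = fairseq_gpt2_small['model']['decoder.model.transformer.wte.weight']
wte[:4, :] = 0.01 * torch.randn(4, wte.size(1))
# keep the tied output projection consistent with the input embeddings
fairseq_gpt2_small['model']['decoder.model.lm_head.weight'][:4, :] = wte[:4, :]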

Trying to do the same, but where do I put this code?


param_optimizer = list(model.named_parameters())
# parameters whose names contain these substrings get no weight decay
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
num_train_optimization_steps = len(train_dataloader) * args.num_train_epochs
optimizer = OpenAIAdam(optimizer_grouped_parameters,
                       lr=args.learning_rate,
                       warmup=args.warmup_proportion,
                       max_grad_norm=args.max_grad_norm,
                       weight_decay=args.weight_decay,
                       t_total=num_train_optimization_steps)

@thak123 You do training with fairseq using their command-line interface. See their language modeling example.
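Something along these lines, combining the flags mentioned earlier in this thread with fairseq's standard language modeling recipe (untested sketch; the data directory, save directory, and optimization hyperparameters are placeholders you'd tune yourself):

fairseq-train data-bin/my_corpus \
    --task language_modeling \
    --arch hf_gpt2 --embed-dim 768 --num-layers 12 --num-attention-heads 12 --max-target-positions 1024 \
    --optimizer adam --weight-decay 0.01 --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --tokens-per-sample 1024 --sample-break-mode none \
    --max-tokens 2048 --update-freq 16 \
    --max-update 50000 --save-dir checkpoints/gpt2

The optimizer grouping you pasted is handled by fairseq's own optimizer flags (e.g. --optimizer adam --weight-decay 0.01), not by code you write yourself.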

@dawndrain thanks for the pointer. Will check it out.
