Transformers: Using whole word masking on training LM from scratch

Created on 25 May 2020 · 6 comments · Source: huggingface/transformers

โ“ Questions & Help

Details

Hello everyone,
I wanted to use _whole-word masking_ when training an LM from scratch, but I could not find how to apply this option with the Trainer.
I thought this option would be handled in `DataCollatorForLanguageModeling`, but I could not find any option there for _whole-word masking_.
Am I looking in the wrong place, or is it not implemented yet?
If it is not, is it possible to do this with run_language_modeling.py?
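For reference, a minimal sketch of the standard setup being referred to (assuming `bert-base-uncased`): the collator only exposes `mlm` and `mlm_probability`, and masking is applied independently per sub-word token, with no whole-word option.

```python
# Minimal sketch of the standard MLM collator setup: only `mlm` and
# `mlm_probability` are exposed; there is no whole-word-masking switch.
from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,              # masked language modeling
    mlm_probability=0.15,  # per-token probability, applied at the sub-word level
)
# `data_collator` is then passed to Trainer(..., data_collator=data_collator),
# so pieces of the same word can end up masked separately.
```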

A link to original question on Stack Overflow: https://stackoverflow.com/questions/62061578/how-to-use-whole-word-masking-on-training-lm-from-scratch

Any help is appreciated!
Thanks

All 6 comments

I think it's not implemented yet.

@julien-c any suggestion/thoughts for pretraining with wwm?

NVIDIA/Megatron-LM does wwm on the fly in `__getitem__`

We can do something similar in `DataCollatorForLanguageModeling` or in the dataset.

https://github.com/NVIDIA/Megatron-LM/blob/22c0e300670672e4e0a8604bd6ab89bc28c970a6/megatron/data/bert_dataset.py#L148
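To make the idea concrete, here is a rough sketch of whole-word masking at the collator (or dataset) level, assuming a WordPiece tokenizer where continuation pieces are prefixed with `##`. This follows the same idea as the Megatron code of sampling words and masking every piece of a sampled word; it is illustrative only, not the library implementation.

```python
# Sketch: mask whole words by grouping WordPiece continuation pieces ("##...")
# with the token that starts the word, then masking each group together.
import random
import torch

def whole_word_mask(input_ids, tokenizer, mask_prob=0.15):
    tokens = tokenizer.convert_ids_to_tokens(input_ids.tolist())

    # Group sub-word indices into whole words: a "##" piece belongs to the
    # same word as the previous token; special tokens are never masked.
    words = []
    for i, tok in enumerate(tokens):
        if tok in tokenizer.all_special_tokens:
            continue
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    labels = torch.full_like(input_ids, -100)  # -100 is ignored by the loss
    for word in words:
        if random.random() < mask_prob:
            for i in word:
                labels[i] = input_ids[i]
                input_ids[i] = tokenizer.mask_token_id  # mask every piece of the word
    return input_ids, labels
```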

Thanks for the suggestion, I'll look into it.

@usuyama The Megatron example is for the BERT dataset, which uses WordPiece tokenization. Any suggestions on how to do wwm with the GPT-2 tokenizer?
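(One possible direction, offered as a hedged sketch rather than an answer from the thread: GPT-2's byte-level BPE has no `##` continuation marker, but tokens that start a new word usually carry a leading `Ġ`, i.e. an encoded space, so sub-word pieces can be grouped into words by that prefix before masking. Punctuation and sequence-initial tokens need extra care.)

```python
# Heuristic word grouping for GPT-2's byte-level BPE: a token with a leading
# "Ġ" (encoded space) starts a new word; anything else continues the previous
# word. Sketch only, not a definitive solution.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def group_words(input_ids):
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    words = []
    for i, tok in enumerate(tokens):
        if i == 0 or tok.startswith("Ġ"):
            words.append([i])      # token starts a new word
        else:
            words[-1].append(i)    # continuation piece of the previous word
    return words

ids = tokenizer("unbelievably good tokenizers")["input_ids"]
print(group_words(ids))  # each inner list holds the token indices of one word
```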

related #6491
