Hello everyone,
I wanted to use _whole-word masking_ when training an LM from scratch, but I could not find how to apply this option with Trainer.
I thought this option would be handled in DataCollatorForLanguageModeling, but I could not find anything for _whole-word masking_ there.
Am I looking in the wrong place, or is it not implemented yet?
If it isn't, is it possible to do this with run_language_modeling.py?
A link to original question on Stack Overflow: https://stackoverflow.com/questions/62061578/how-to-use-whole-word-masking-on-training-lm-from-scratch
Any help is appreciated!
Thanks
I think it's not implemented yet.
@julien-c any suggestion/thoughts for pretraining with wwm?
NVIDIA/Megatron-LM does wwm on the fly in `__getitem__`.
We can do something similar in DataCollatorForLanguageModeling or in the dataset.
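For reference, here is a minimal sketch of what masking whole words on the fly could look like in a custom collate function, assuming a BERT-style wordpiece tokenizer where continuation pieces start with "##". The name `wwm_collate` is just for illustration, and the usual 80/10/10 replacement scheme is simplified to always using the mask token:

```python
import random

import torch


def wwm_collate(batch_input_ids, tokenizer, mlm_probability=0.15):
    """batch_input_ids: a list of 1-D LongTensors of token ids."""
    input_ids = torch.nn.utils.rnn.pad_sequence(
        batch_input_ids, batch_first=True, padding_value=tokenizer.pad_token_id
    )
    labels = input_ids.clone()

    for i in range(input_ids.size(0)):
        tokens = tokenizer.convert_ids_to_tokens(input_ids[i].tolist())

        # Group token positions into words: a "##" piece belongs to the previous word.
        words = []
        for pos, tok in enumerate(tokens):
            if tok in tokenizer.all_special_tokens:
                continue
            if tok.startswith("##") and words:
                words[-1].append(pos)
            else:
                words.append([pos])

        # Pick whole words at random until ~mlm_probability of the tokens are covered.
        random.shuffle(words)
        budget = max(1, int(round(len(tokens) * mlm_probability)))
        covered = []
        for word in words:
            if len(covered) >= budget:
                break
            covered.extend(word)

        mask = torch.zeros_like(input_ids[i], dtype=torch.bool)
        mask[covered] = True
        labels[i][~mask] = -100                       # loss only on masked positions
        input_ids[i][mask] = tokenizer.mask_token_id  # mask every piece of the word

    return {"input_ids": input_ids, "labels": labels}
```

To pass something like this as data_collator to Trainer, the tokenizer argument would need to be bound first (e.g. with functools.partial).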
Thanks for the suggestion, I'll look into it.
@usuyama The Megatron example is for the BERT dataset, which uses wordpiece tokenization. Any suggestions on how to do wwm with the GPT-2 tokenizer?
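One possible direction, assuming a recent version of the library with a fast tokenizer: word_ids() maps each subtoken back to the word it came from, which works for byte-level BPE just as well as for wordpiece, so no "##" markers are needed. A rough sketch (the example sentence is made up):

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
encoding = tokenizer("Whole word masking groups subtokens together")

# word_ids() maps each subtoken position to the index of the word it came
# from (None for special tokens), so whole-word spans can be grouped
# without relying on wordpiece "##" markers.
word_to_positions = {}
for pos, word_index in enumerate(encoding.word_ids()):
    if word_index is not None:
        word_to_positions.setdefault(word_index, []).append(pos)

# Each value is the list of subtoken positions that would be masked together.
print(word_to_positions)
```

Note that the GPT-2 tokenizer ships without a mask token, so a masked-LM setup would also need one added to the vocabulary.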
related #6491
In case you're still looking for an answer:
https://github.com/huggingface/transformers/blob/07708793f20ec3a949ccab32cc4fe0c7272dcc4c/src/transformers/data/data_collator.py#L301
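That link points to the whole-word-mask collator (DataCollatorForWholeWordMask). If it fits your setup (it assumes a wordpiece tokenizer such as BERT's), wiring it into Trainer works the same way as DataCollatorForLanguageModeling. A minimal sketch with a toy in-memory dataset, just to show the plumbing:

```python
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForWholeWordMask,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Tiny in-memory dataset just to keep the example self-contained;
# in practice this would be your own tokenized corpus.
texts = [
    "Whole word masking masks every wordpiece of a chosen word.",
    "A second toy sentence for the example dataset.",
]
train_dataset = [tokenizer(t, truncation=True, max_length=64) for t in texts]

# Groups "##" continuation pieces with the piece that starts the word
# and masks them together.
data_collator = DataCollatorForWholeWordMask(
    tokenizer=tokenizer, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="wwm-output", num_train_epochs=1),
    train_dataset=train_dataset,
    data_collator=data_collator,
)
trainer.train()
```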