Hello everyone,
I wanted to use _whole-word masking_ when training an LM from scratch, but I could not find how to apply this option with Trainer.
I thought this option would be handled in DataCollatorForLanguageModeling, but I could not find anything for _whole-word masking_ there.
Am I looking in the wrong place, or is it not implemented yet?
If it isn't, is it possible to do this with run_language_modeling.py?
A link to original question on Stack Overflow: https://stackoverflow.com/questions/62061578/how-to-use-whole-word-masking-on-training-lm-from-scratch
Any help is appreciated!
Thanks
I think it's not implemented yet.
@julien-c any suggestion/thoughts for pretraining with wwm?
NVIDIA/Megatron-LM does wwm on the fly in `__getitem__`.
We can do something similar in DataCollatorForLanguageModeling or in the dataset.
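For reference, here is a minimal sketch of what masking whole words on the fly could look like in a custom collate function, assuming a BERT-style wordpiece tokenizer where continuation pieces start with "##". The name `wwm_collate` is just for illustration, and the usual 80/10/10 replacement scheme is simplified to always using the mask token:

```python
import random

import torch


def wwm_collate(batch_input_ids, tokenizer, mlm_probability=0.15):
    """batch_input_ids: a list of 1-D LongTensors of token ids."""
    input_ids = torch.nn.utils.rnn.pad_sequence(
        batch_input_ids, batch_first=True, padding_value=tokenizer.pad_token_id
    )
    labels = input_ids.clone()

    for i in range(input_ids.size(0)):
        tokens = tokenizer.convert_ids_to_tokens(input_ids[i].tolist())

        # Group token positions into words: a "##" piece belongs to the previous word.
        words = []
        for pos, tok in enumerate(tokens):
            if tok in tokenizer.all_special_tokens:
                continue
            if tok.startswith("##") and words:
                words[-1].append(pos)
            else:
                words.append([pos])

        # Pick whole words at random until ~mlm_probability of the tokens are covered.
        random.shuffle(words)
        budget = max(1, int(round(len(tokens) * mlm_probability)))
        covered = []
        for word in words:
            if len(covered) >= budget:
                break
            covered.extend(word)

        mask = torch.zeros_like(input_ids[i], dtype=torch.bool)
        mask[covered] = True
        labels[i][~mask] = -100                       # loss only on masked positions
        input_ids[i][mask] = tokenizer.mask_token_id  # mask every piece of the word

    return {"input_ids": input_ids, "labels": labels}
```

To pass something like this as data_collator to Trainer, the tokenizer argument would need to be bound first (e.g. with functools.partial).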
Thanks for the suggestion, I'll look into it.
@usuyama The Megatron example is for the BERT dataset, which uses wordpiece tokenization. Any suggestions on how to do wwm with the GPT-2 tokenizer?
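One possible direction, assuming a recent version of the library with a fast tokenizer: word_ids() maps each subtoken back to the word it came from, which works for byte-level BPE just as well as for wordpiece, so no "##" markers are needed. A rough sketch (the example sentence is made up):

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
encoding = tokenizer("Whole word masking groups subtokens together")

# word_ids() maps each subtoken position to the index of the word it came
# from (None for special tokens), so whole-word spans can be grouped
# without relying on wordpiece "##" markers.
word_to_positions = {}
for pos, word_index in enumerate(encoding.word_ids()):
    if word_index is not None:
        word_to_positions.setdefault(word_index, []).append(pos)

# Each value is the list of subtoken positions that would be masked together.
print(word_to_positions)
```

Note that the GPT-2 tokenizer ships without a mask token, so a masked-LM setup would also need one added to the vocabulary.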
related #6491
In case you're still looking for an answer:
https://github.com/huggingface/transformers/blob/07708793f20ec3a949ccab32cc4fe0c7272dcc4c/src/transformers/data/data_collator.py#L301
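That link points to the whole-word-mask collator (DataCollatorForWholeWordMask). If it fits your setup (it assumes a wordpiece tokenizer such as BERT's), wiring it into Trainer works the same way as DataCollatorForLanguageModeling. A minimal sketch with a toy in-memory dataset, just to show the plumbing:

```python
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForWholeWordMask,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Tiny in-memory dataset just to keep the example self-contained;
# in practice this would be your own tokenized corpus.
texts = [
    "Whole word masking masks every wordpiece of a chosen word.",
    "A second toy sentence for the example dataset.",
]
train_dataset = [tokenizer(t, truncation=True, max_length=64) for t in texts]

# Groups "##" continuation pieces with the piece that starts the word
# and masks them together.
data_collator = DataCollatorForWholeWordMask(
    tokenizer=tokenizer, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="wwm-output", num_train_epochs=1),
    train_dataset=train_dataset,
    data_collator=data_collator,
)
trainer.train()
```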