Fairseq: RoBERTa whole word masking?

Created on 3 Jan 2020 · 6 comments · Source: pytorch/fairseq

🚀 Feature Request

A whole word masking version of RoBERTa.

Motivation

BERT whole word masking brings improvements: https://github.com/google-research/bert

Pitch

Pretrain a whole word masking version of RoBERTa.

Alternatives

NA

Additional context

NA

Labels: enhancement, help wanted

Most helpful comment

We did experiment with whole word masking but it didn't seem to help. We would like to explore larger span-based masking (similar to SpanBERT), but haven't been able to train any full models with it yet.

All 6 comments

I think you should be able to use --mask-whole-words

> I think you should be able to use --mask-whole-words

Thanks for the reply. Are there pretrained models available to download?

Unfortunately, I don't believe the pre-trained models used this. CC: @ngoyal2707 and @myleott for confirmation.

> Unfortunately, I don't believe the pre-trained models used this. CC: @ngoyal2707 and @myleott for confirmation.

I believe so. I know it costs a lot, but it would be good to have WWM versions of RoBERTa.

We did experiment with whole word masking but it didn't seem to help. We would like to explore larger span-based masking (similar to SpanBERT), but haven't been able to train any full models with it yet.

> I think you should be able to use --mask-whole-words

@lematt1991 To train RoBERTa with this parameter, should it just be added at the end of the command provided in the documentation, with True as its value?
Like this:

TOTAL_UPDATES=125000    # Total number of training steps
WARMUP_UPDATES=10000    # Warmup the learning rate over this many updates
PEAK_LR=0.0005          # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512   # Max sequence length
MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16        # Number of sequences per batch (batch size)
UPDATE_FREQ=16          # Increase the batch size 16x

DATA_DIR=data-bin/wikitext-103

fairseq-train --fp16 $DATA_DIR \
    --task masked_lm --criterion masked_lm \
    --arch roberta_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
    --max-update $TOTAL_UPDATES --log-format simple --log-interval 1 \
    --mask-whole-words True
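
For reference (this is an editorial note, not a reply from the thread): in fairseq versions from around this time, --mask-whole-words appears to be a boolean store_true flag, so it would be passed on its own rather than followed by a True value, and it likely also needs a --bpe setting so word boundaries can be recovered from the subword tokens. A hedged sketch of the same command under those assumptions:

# Hedged sketch, not confirmed in the thread: assumes --mask-whole-words is a
# boolean (store_true) flag and that --bpe gpt2 lets fairseq recover word
# boundaries from GPT-2 BPE-encoded data.
fairseq-train --fp16 $DATA_DIR \
    --task masked_lm --criterion masked_lm \
    --arch roberta_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
    --max-update $TOTAL_UPDATES --log-format simple --log-interval 1 \
    --bpe gpt2 \
    --mask-whole-words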