Fairseq: RoBERTa whole word masking?

Created on 3 Jan 2020 · 6 comments · Source: pytorch/fairseq

🚀 Feature Request

A whole word masking version of RoBERTa.

Motivation

BERT whole word masking brings improvements: https://github.com/google-research/bert

Pitch

Pretrain a whole word masking version of RoBERTa.

Alternatives

NA

Additional context

NA

Labels: enhancement, help wanted

Most helpful comment

We did experiment with whole word masking but it didn't seem to help. We would like to explore larger span-based masking (similar to SpanBERT), but haven't been able to train any full models with it yet.

All 6 comments

I think you should be able to use --mask-whole-words

> I think you should be able to use --mask-whole-words

Thanks for the reply. Are there pretrained models available to download?

Unfortunately, I don't believe the pre-trained models used this. CC: @ngoyal2707 and @myleott for confirmation.

> Unfortunately, I don't believe the pre-trained models used this. CC: @ngoyal2707 and @myleott for confirmation.

I believe so. I know it costs a lot, but it would be good to have WWM versions of RoBERTa.

We did experiment with whole word masking but it didn't seem to help. We would like to explore larger span-based masking (similar to SpanBERT), but haven't been able to train any full models with it yet.

> I think you should be able to use --mask-whole-words

@lematt1991 To train RoBERTa with this parameter, should it just be added at the end of the command provided in the documentation, with True as its value?
Like this:

TOTAL_UPDATES=125000    # Total number of training steps
WARMUP_UPDATES=10000    # Warmup the learning rate over this many updates
PEAK_LR=0.0005          # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512   # Max sequence length
MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16        # Number of sequences per batch (batch size)
UPDATE_FREQ=16          # Increase the batch size 16x

DATA_DIR=data-bin/wikitext-103

fairseq-train --fp16 $DATA_DIR \
    --task masked_lm --criterion masked_lm \
    --arch roberta_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
    --max-update $TOTAL_UPDATES --log-format simple --log-interval 1 \
    --mask-whole-words True
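
For reference (this is an editorial note, not a reply from the thread): in fairseq versions from around this time, --mask-whole-words appears to be a boolean store_true flag, so it would be passed on its own rather than followed by a True value, and it likely also needs a --bpe setting so word boundaries can be recovered from the subword tokens. A hedged sketch of the same command under those assumptions:

# Hedged sketch, not confirmed in the thread: assumes --mask-whole-words is a
# boolean (store_true) flag and that --bpe gpt2 lets fairseq recover word
# boundaries from GPT-2 BPE-encoded data.
fairseq-train --fp16 $DATA_DIR \
    --task masked_lm --criterion masked_lm \
    --arch roberta_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
    --max-update $TOTAL_UPDATES --log-format simple --log-interval 1 \
    --bpe gpt2 \
    --mask-whole-words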