Hello,
I have a couple of questions about fine-tuning a RoBERTa model. I followed the instructions given in the "RoBERTa on a custom classification task" example, using the IMDB dataset. After running the preprocess-data step with the following code:
fairseq-preprocess \
--only-source \
--trainpref "myfolder/train.input0.bpe" \
--validpref "myfolder/dev.input0.bpe" \
--destdir "myfolder/input0" \
--workers 60 \
--srcdict dict.txt
fairseq-preprocess \
--only-source \
--trainpref "myfolder/train.label" \
--validpref "myfolder/dev.label" \
--destdir "myfolder/label" \
--workers 60
I see that the dict.txt files under both the input0 and label folders include madeupword entries. Is that expected and normal? When I added the --padding-factor 1 flag, I could get rid of the madeupwords, but only in the label dictionary. I don't know if that was the right thing to do. Would you recommend using --padding-factor 1?
Second question:
After I run python train.py, it prints that there are 13 types in the label dictionary:
| [input] dictionary: 50265 types
| [label] dictionary: 13 types
| loaded 1500 examples from: /workspace/fairseq/data/input0/valid
| loaded 1500 examples from: /workspace/fairseq/data/label/valid
I checked the label dictionary in /workspace/fairseq/data/label/dict.txt, and this is what I see:
2 3500
4 3500
5 3500
3 3499
6 3499
7 3499
0 3498
1 3496
If the first column corresponds to the labels in my dataset and the second column to the count of that label, then I'd expect all counts to be 3500 across the 8 classes, but they are not. How should I interpret this dict.txt file? And where do the 13 types come from?
Thanks.
I see that the dict.txt files under both the input0 and label folders include madeupword entries. Is that expected and normal?
When you specify --srcdict dict.txt, fairseq-preprocess encodes your data using that existing dictionary. This is crucial when fine-tuning on additional data, so that the new data gets indices consistent with the dictionary the model was originally trained on. The dictionary you downloaded with the wget command contains these madeupword entries because --padding-factor was used when it was originally built.
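For intuition, here is a minimal sketch (not the actual preprocessing code) of how fairseq pads a dictionary; Dictionary, add_symbol, and finalize are real fairseq APIs, and the eight numeric labels stand in for your classes:

from fairseq.data import Dictionary

d = Dictionary()                 # starts with the 4 specials: <s>, <pad>, </s>, <unk>
for label in range(8):
    d.add_symbol(str(label))     # your 8 class labels
d.finalize(padding_factor=8)     # rounds len(d) up to a multiple of 8
print(len(d))                    # 16 = 4 specials + 8 labels + 4 madeupwords
print(d.symbols[-4:])            # ['madeupword0000', ..., 'madeupword0003']

With --padding-factor 1 the rounding step is a no-op, which is why the madeupwords disappear. That should be safe for the tiny label dictionary: the padding exists only as a hardware-efficiency trick (some hardware prefers sizes that are multiples of 8).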
If the first column corresponds to the labels in my dataset and the second column to the count of that label, then I'd expect all counts to be 3500 across the 8 classes, but they are not.
Your understanding is correct, and the resulting dictionary is unexpected. Are you sure that you have exactly 3,500 instances of each label? Can you upload your file to a gist or something?
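For example, a quick sanity check over the raw label files (paths assumed from your earlier commands) would be:

from collections import Counter

# Count occurrences of each label in the raw (pre-binarization) files.
for path in ("myfolder/train.label", "myfolder/dev.label"):
    with open(path) as f:
        print(path, sorted(Counter(f.read().split()).items()))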
And where do the 13 types come from?
The log line saying the dictionary has 13 types is also counting the special tokens that get added to every dictionary (eos, unk, pad, etc.).
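If you want to see exactly which symbols make up that count, you can load the dictionary the same way the task does (path taken from your log above):

from fairseq.data import Dictionary

d = Dictionary.load("/workspace/fairseq/data/label/dict.txt")
print(len(d))                   # the "types" number reported in the log
print(d.symbols[:d.nspecial])   # special tokens added to every dictionary
print(d.symbols[d.nspecial:])   # your labels, plus any madeupword entries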
@lematt1991 thanks for your quick response.
I also checked the "RoBERTa on a custom classification task" example on the IMDB dataset. I saw similar results in IMDB-bin/label/dict.txt:
1 12497
0 12494
madeupword0000 0
madeupword0001 0
I assume this is also unexpected?
Do you mind uploading your labels file to a gist or somewhere that you can share it? It's hard to say without being able to see it.
@lematt1991 Please visit the link:
I generated the label dict.txt with this command:
!fairseq-preprocess --only-source --trainpref "./data/train.label" --validpref "./data/dev.label" --destdir "./myfolder/label" --workers 60
Note that I did not add the --padding-factor 1 flag this time.
Can you make the gist public, or provide a link directly to the file? It says: "rnyak doesn’t have any public gists yet."
@lematt1991 It is public now. I added the labels and also the label dictionary (dict.txt), so there are 2 files.
This is the output that I got:
(fairseq) mattle-mbp:labels mattle$ fairseq-preprocess --only-source --trainpref labels --validpref labels
Namespace(align_suffix=None, alignfile=None, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_format=None, log_interval=1000, lr_scheduler='fixed', memory_efficient_fp16=False, min_loss_scale=0.0001, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=True, optimizer='nag', padding_factor=8, seed=1, source_lang=None, srcdict=None, target_lang=None, task='translation', tensorboard_logdir='', testpref=None, tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, trainpref='labels', user_dir=None, validpref='labels', workers=1)
| [None] Dictionary: 15 types
| [None] labels: 28000 sents, 56000 tokens, 0.0% replaced by <unk>
| [None] Dictionary: 15 types
| [None] labels: 28000 sents, 56000 tokens, 0.0% replaced by <unk>
| Wrote preprocessed data to data-bin
(fairseq) mattle-mbp:labels mattle$ cat data-bin/dict.txt
0 3500
1 3500
2 3500
3 3500
4 3500
5 3500
6 3500
7 3500
madeupword0000 0
madeupword0001 0
madeupword0002 0
madeupword0003 0
which seems correct to me. I'm not sure what your train.label and dev.label contain; maybe the way they were split is what resulted in the unexpected counts?
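One way to test that hypothesis (file names assumed): the per-label counts of the two splits should sum to 3500 per class if the full set is balanced.

from collections import Counter

def label_counts(path):
    with open(path) as f:
        return Counter(line.strip() for line in f if line.strip())

# Counter addition merges the per-label counts of the two splits.
print(label_counts("data/train.label") + label_counts("data/dev.label"))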
@lematt1991 I am not splitting the data into train and validation myself; they come from different buckets, so I already have dev.tsv and train.tsv before doing the BPE encoding and preprocessing.
My dict.txt looks like this:
2 3500
0 3499
1 3499
4 3499
5 3499
7 3499
3 3498
6 3498
madeupword0000 0
madeupword0001 0
madeupword0002 0
madeupword0003 0
I see your command is slightly different from mine. Would that be the reason?
Is it simply this:
yes
They won't be affected in any way.
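As a quick check that the madeupword entries really are inert (path assumed from the commands above): they are written with a count of 0, meaning they never occur in the data and only round the dictionary size up.

with open("myfolder/label/dict.txt") as f:
    for line in f:
        symbol, count = line.rsplit(" ", 1)
        # madeupword entries only pad the dictionary size; they carry count 0.
        if symbol.startswith("madeupword"):
            assert int(count) == 0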