Hello,
I have a couple of questions about fine-tuning a RoBERTa model. I followed the instructions given in the "RoBERTa on a custom classification task" example, using the IMDB dataset. After running the preprocess-data step with the following code:
fairseq-preprocess \
--only-source \
--trainpref "myfolder/train.input0.bpe" \
--validpref "myfolder/dev.input0.bpe" \
--destdir "myfolder/input0" \
--workers 60 \
--srcdict dict.txt
fairseq-preprocess \
--only-source \
--trainpref "myfolder/train.label" \
--validpref "myfolder/dev.label" \
--destdir "myfolder/label" \
--workers 60
I see that the dict.txt files under both the input0 and label folders include madeupword entries. Is that expected and normal? When I added the --padding-factor 1 flag, I could get rid of the madeupwords, but only in the label dictionary. I don't know if that was the right thing to do. Would you recommend using --padding-factor 1?
Second question:
After I run python train.py, it prints that there are 13 types in the label dictionary:
| [input] dictionary: 50265 types
| [label] dictionary: 13 types
| loaded 1500 examples from: /workspace/fairseq/data/input0/valid
| loaded 1500 examples from: /workspace/fairseq/data/label/valid
I checked the label dictionary in /workspace/fairseq/data/label/dict.txt, and this is what I see:
2 3500
4 3500
5 3500
3 3499
6 3499
7 3499
0 3498
1 3496
If the first column corresponds to the labels in my dataset and the second column to the count of that label, then I'd expect all counts to be 3500 across the 8 classes, but they are not. How should I interpret this dict.txt file? And where do the 13 types come from?
Thanks.
I see that the dict.txt files under both the input0 and label folders include madeupword entries. Is that expected and normal?
When you specify --srcdict dict.txt, fairseq-preprocess encodes your data using that existing dictionary. This is crucial when fine-tuning on additional data, so that the new data gets indices consistent with the dictionary the model was originally trained on. The dictionary you downloaded with the wget command contains these madeupword entries because --padding-factor was used when it was originally built.
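For intuition, here is a minimal sketch (not the actual preprocessing code) of how fairseq pads a dictionary; Dictionary, add_symbol, and finalize are real fairseq APIs, and the eight numeric labels stand in for your classes:

from fairseq.data import Dictionary

d = Dictionary()                 # starts with the 4 specials: <s>, <pad>, </s>, <unk>
for label in range(8):
    d.add_symbol(str(label))     # your 8 class labels
d.finalize(padding_factor=8)     # rounds len(d) up to a multiple of 8
print(len(d))                    # 16 = 4 specials + 8 labels + 4 madeupwords
print(d.symbols[-4:])            # ['madeupword0000', ..., 'madeupword0003']

With --padding-factor 1 the rounding step is a no-op, which is why the madeupwords disappear. That should be safe for the tiny label dictionary: the padding exists only as a hardware-efficiency trick (some hardware prefers sizes that are multiples of 8).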
If the first column corresponds to the labels in my dataset and the second column to the count of that label, then I'd expect all counts to be 3500 across the 8 classes, but they are not.
Your understanding is correct, and the resulting dictionary is unexpected. Are you sure that you have exactly 3,500 instances of each label? Can you upload your file to a gist or something?
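For example, a quick sanity check over the raw label files (paths assumed from your earlier commands) would be:

from collections import Counter

# Count occurrences of each label in the raw (pre-binarization) files.
for path in ("myfolder/train.label", "myfolder/dev.label"):
    with open(path) as f:
        print(path, sorted(Counter(f.read().split()).items()))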
And where do the 13 types come from?
The log line saying the dictionary has 13 types is also counting the special tokens that get added to every dictionary (eos, unk, pad, etc.).
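If you want to see exactly which symbols make up that count, you can load the dictionary the same way the task does (path taken from your log above):

from fairseq.data import Dictionary

d = Dictionary.load("/workspace/fairseq/data/label/dict.txt")
print(len(d))                   # the "types" number reported in the log
print(d.symbols[:d.nspecial])   # special tokens added to every dictionary
print(d.symbols[d.nspecial:])   # your labels, plus any madeupword entries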
@lematt1991 thanks for your quick response.
I also checked the "RoBERTa on a custom classification task" example on the IMDB dataset. I saw similar results in IMDB-bin/label/dict.txt:
1 12497
0 12494
madeupword0000 0
madeupword0001 0
I assume this is also unexpected?
Do you mind uploading your labels file to a gist or somewhere that you can share it? It's hard to say without being able to see it.
@lematt1991 Please visit the link:
I generated the label dict.txt with this command:
!fairseq-preprocess --only-source --trainpref "./data/train.label" --validpref "./data/dev.label" --destdir "./myfolder/label" --workers 60
Note that I did not add the --padding-factor 1 flag this time.
Can you make the gist public, or provide a link directly to the file? It says: "rnyak doesn’t have any public gists yet."
@lematt1991 It is public now. I added the labels and also the label dictionary (dict.txt), so there are 2 files.
This is the output that I got:
(fairseq) mattle-mbp:labels mattle$ fairseq-preprocess --only-source --trainpref labels --validpref labels
Namespace(align_suffix=None, alignfile=None, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_format=None, log_interval=1000, lr_scheduler='fixed', memory_efficient_fp16=False, min_loss_scale=0.0001, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=True, optimizer='nag', padding_factor=8, seed=1, source_lang=None, srcdict=None, target_lang=None, task='translation', tensorboard_logdir='', testpref=None, tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, trainpref='labels', user_dir=None, validpref='labels', workers=1)
| [None] Dictionary: 15 types
| [None] labels: 28000 sents, 56000 tokens, 0.0% replaced by <unk>
| [None] Dictionary: 15 types
| [None] labels: 28000 sents, 56000 tokens, 0.0% replaced by <unk>
| Wrote preprocessed data to data-bin
(fairseq) mattle-mbp:labels mattle$ cat data-bin/dict.txt
0 3500
1 3500
2 3500
3 3500
4 3500
5 3500
6 3500
7 3500
madeupword0000 0
madeupword0001 0
madeupword0002 0
madeupword0003 0
which seems correct to me. I'm not sure what your train.label and dev.label contain; maybe the way they were split is what resulted in the unexpected counts?
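One way to test that hypothesis (file names assumed): the per-label counts of the two splits should sum to 3500 per class if the full set is balanced.

from collections import Counter

def label_counts(path):
    with open(path) as f:
        return Counter(line.strip() for line in f if line.strip())

# Counter addition merges the per-label counts of the two splits.
print(label_counts("data/train.label") + label_counts("data/dev.label"))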
@lematt1991 I am not splitting the data into train and validation myself; they come from different buckets, so I already have dev.tsv and train.tsv before doing the BPE encoding and preprocessing.
My dict.txt looks like this:
2 3500
0 3499
1 3499
4 3499
5 3499
7 3499
3 3498
6 3498
madeupword0000 0
madeupword0001 0
madeupword0002 0
madeupword0003 0
I see your command is slightly different from mine. Would that be the reason?
Is it simply this:
yes
They won't be affected in any way.
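As a quick check that the madeupword entries really are inert (path assumed from the commands above): they are written with a count of 0, meaning they never occur in the data and only round the dictionary size up.

with open("myfolder/label/dict.txt") as f:
    for line in f:
        symbol, count = line.rsplit(" ", 1)
        # madeupword entries only pad the dictionary size; they carry count 0.
        if symbol.startswith("madeupword"):
            assert int(count) == 0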