Transformers: examples/seq2seq/test_bash_script.py :: actually learn something

Created on 27 Jul 2020 · 23 Comments · Source: huggingface/transformers

At the moment validation BLEU barely gets above zero in the tests, so they don't really prove much about our code.

We could use a larger model like sshleifer/student_marian_6_3, more data, and train for 10 minutes. This would allow us to test whether changing default parameters/batch techniques obviously degrades performance.

The GitHub Actions CI reuses its own disk, so this will only run there and hopefully won't have super slow downloads.

Labels: Help wanted, Tests

All 23 comments

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

This is still an issue :)

This command runs in 3 minutes (not including downloads):

# export WANDB_PROJECT=dmar
export MAX_LEN=64
python finetune.py \
  --learning_rate=3e-4 \
  --do_train \
  --do_predict \
  --fp16 \
  --val_check_interval 0.25 --n_train 100000 --n_val 500 --n_test 500 \
  --data_dir wmt_en_ro \
  --max_source_length $MAX_LEN --max_target_length $MAX_LEN --val_max_target_length $MAX_LEN --test_max_target_length $MAX_LEN \
  --freeze_encoder --freeze_embeds \
  --train_batch_size=64 --eval_batch_size=64 \
  --tokenizer_name Helsinki-NLP/opus-mt-en-ro \
  --model_name_or_path sshleifer/mar_enro_6_3_student \
  --warmup_steps 500 --sortish_sampler \
  --gpus 1 --fp16_opt_level=O1 --task translation --num_sanity_val_steps=0 --output_dir dmar_utest_1gpu --num_train_epochs=1 \
  --overwrite_output_dir

Test results

cat dmar_utest_1gpu/test_results.txt
bs: 32.000000
loss: 1.122035
src_pad_frac: 0.322266
src_pad_tok: 330.000000
step_count: 5.000000
test_avg_bleu: 20.660713
test_avg_gen_len: 63.750000
test_avg_gen_time: 0.033025
test_avg_loss: 2.215564
test_bleu: 20.660713
test_loss: 2.215564
tpb: 1475.000000
val_avg_bleu: 20.099000
val_avg_gen_len: 63.125000
val_avg_gen_time: 0.031409
val_avg_loss: 2.265883
val_bleu: 20.099001
val_loss: 2.265883






 cat dmar_utest_1gpu/metrics.json | grep bleu






            "val_avg_bleu": 16.3561625,
            "val_avg_bleu": 19.0204625,
            "val_avg_bleu": 19.704875,
            "val_avg_bleu": 20.099,
            "test_avg_bleu": 20.660712500000002

So this would be a good template for the test.

Spec

  • Convert the command into a unit test (with programmatic download of the right amount of data, possibly via a new S3 .tgz file).
  • Replace the existing test_bash_script Marian test.
  • Try to cut n_train further.
  • Anything less than 15 mins is fine, but the faster the better.
  • Minimum learning requirement 1: BLEU improves over the course of training by more than 2 pts
  • Minimum learning requirement 2: BLEU finishes above 17
  • Minimum learning requirement 3: test BLEU and val BLEU within 1 pt (see the assertion sketch below).

(this command meets all 3 learning requirements).
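
A rough sketch of how those three requirements could be asserted, assuming (as the grep output above suggests) that metrics.json maps each split to a list of per-checkpoint metric dicts; the path and key handling here are illustrative, not the actual test:

import json

# Hypothetical checks for the three minimum learning requirements,
# run against the metrics.json produced by the command above.
with open("dmar_utest_1gpu/metrics.json") as f:
    metrics = json.load(f)

val_bleus = [step["val_avg_bleu"] for step in metrics["val"]]
test_bleu = metrics["test"][-1]["test_avg_bleu"]

assert val_bleus[-1] - val_bleus[0] > 2      # 1: BLEU improves by more than 2 pts
assert val_bleus[-1] > 17                    # 2: BLEU finishes above 17
assert abs(test_bleu - val_bleus[-1]) < 1    # 3: test and val BLEU within 1 pt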

Wdyt @stas00 ?

I will work on that, thank you.

@sshleifer, could you please validate that this is the command you run?

I get very different (bad) results:

bs: 9.000000
loss: 6.701375
src_pad_frac: 0.118056
src_pad_tok: 68.000000
step_count: 2.000000
test_avg_bleu: 0.021700
test_avg_gen_len: 512.000000
test_avg_gen_time: 0.439663
test_avg_loss: 5.679669
test_bleu: 0.021700
test_loss: 5.679669
tpb: 1025.000000
val_avg_bleu: 0.082700
val_avg_gen_len: 512.000000
val_avg_gen_time: 0.483860
val_avg_loss: 5.404536
val_bleu: 0.082700
val_loss: 5.404536

In the OP you mentioned "sshleifer/student_marian_6_3" but here you used "sshleifer/mar_enro_6_3_student" - not sure if that's the difference.

Also, another difference: you use wmt_en_ro instead of test_data/wmt_en_ro - are you using a different dataset?

Your spec on timing could be a small issue, since what takes 3 min on your hardware takes 33 sec for me (unoptimized RTX 3090), so the timing might have to be re-tested on CI. But again, I'm not sure we are testing against the same dataset, since my results are terrible.

Retested with sshleifer/student_marian_en_ro_6_3 and 5 epochs - still under 1 BLEU - so this is probably an issue of insufficient data and you must be using a different dataset.

I am using the full dataset (as in README.md).

Ah, that explains it.

So run the slow test with the full dataset downloaded at runtime, right?

OK, I was able to reproduce your results with the full dataset - slightly under 3 min and slightly better BLEU scores.

Not sure if there is a point to it, but 7zip shaves off about 35% in download size (but CI might not have it installed).

-rw-rw-r--  1 stas stas  58M Nov  4 11:40 wmt_en_ro.tar.gz
-rw-rw-r--  1 stas stas  37M Nov  4 11:39 wmt_en_ro.7z

Another way to save download time would be to only zip up 100k (or fewer) training examples, 500 val examples, and 500 test examples - those are all we use given the --n_train/--n_val/--n_test flags.
I would also check whether 10k/25k/50k meet the learning requirements.
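
For illustration, a minimal subsetting sketch - assuming the usual examples/seq2seq layout of line-aligned {train,val,test}.{source,target} files; the paths, sizes and output name are placeholders:

from pathlib import Path

# Hypothetical sketch: carve a small subset out of the full wmt_en_ro dataset
# so the archive to download stays tiny; tar the result and upload it once.
SIZES = {"train": 100_000, "val": 500, "test": 500}
src_dir, dst_dir = Path("wmt_en_ro"), Path("wmt_en_ro_small")
dst_dir.mkdir(exist_ok=True)

for split, n in SIZES.items():
    for ext in ("source", "target"):
        lines = (src_dir / f"{split}.{ext}").read_text().splitlines()[:n]
        (dst_dir / f"{split}.{ext}").write_text("\n".join(lines) + "\n")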

While trying to match the suggested hparams to the ones in train_mbart_cc25_enro.sh, I've been wondering whether I'm missing the point of this whole thing: if the intention is to test a bash script with specific fixed hparams, but the test replaces half of those presets and adds quite a few new ones, how are we testing that script?

Why do we use "--foo=bar" and "--foo bar" seemingly at random? Half the args are set the first way, the other half the other way.

Question: do you want this as a new test, or should I modify the existing test_train_mbart_cc25_enro_script? I'm trying to do the latter at the moment - perhaps that's why I'm questioning what we are testing here.

The high level goal originally was to test that the bash scripts we check in work.
I have a secondary goal of making sure the training code is actually good at training models.
I am fine with any setup that accomplishes both of those tasks, with bonus points for enough traceback control that a developer could tell that they have made performance or speed worse, or some such.

As I mentioned on Slack, we want a test that detects whether we've regressed the training code. For example, if you set dropout=0.95, or freeze all parameters, or set the LR too low, or mess up the special tokens logic, the test should fail. Does that make sense? I didn't test all of these with my command line, but it would be nice.

Relatedly, we just had a 2x slowdown in the finetune_trainer.py code that was not detected by unit-tests.

I know this is something of a scope expansion, so feel free to break it up/ignore parts as you see fit. I trust you to make reasonable decisions.

Thank you for this useful brain dump. Let's take it point by point.

  1. the bash scripts

    If a test rewrites the script's guts before doing the testing, should we not just modify those scripts themselves? We want to test that the script works, so we should test it as is, with the only changes allowed being some numerical settings that make the tests faster.

    If we want different presets for different purposes - then have a set of scripts rather than do-them-all in one?

  2. The best regression tests are written when an actual regression is discovered, because then you know exactly which side of things to put under the "magnifying glass". When another regression is discovered down the road, a new test should be made that focuses just on that part. Over time one ends up with great coverage and the test suite becomes strong. Trying to accomplish all of this in one test will, over time, lose the very specific setups that exposed very specific sides of things. It also helps to annotate that a test covers a regression from a given git sha, so that future developers know not to refactor it, slip in extra checks, or slightly change things in the existing test.

    It's very difficult to think of all the possible things that could regress in a vacuum, but this is surely a great start.

Relatedly, we just had a 2x slowdown in the finetune_trainer.py code that was not detected by unit-tests.

That's great! Let's find a git sha before and after and write a test that detects that regression.
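
For the slowdown case specifically, one illustrative shape such a guard could take - assuming the test itself times the training run and a baseline has been calibrated for the CI machine; the wrapper script name and the numbers are placeholders:

import subprocess
import time

# Illustrative timing guard against a large (e.g. 2x) training slowdown.
# BASELINE_SECONDS would need calibrating per CI machine.
BASELINE_SECONDS = 180

start = time.time()
subprocess.run(["bash", "finetune_command.sh"], check=True)  # hypothetical wrapper around the command above
elapsed = time.time() - start

assert elapsed < 1.5 * BASELINE_SECONDS, f"training took {elapsed:.0f}s, possible slowdown"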

I hope this approach makes sense?

Yeah you are right, let me try to isolate the bad commit https://github.com/huggingface/transformers/commits/master/examples/seq2seq

related issue: #8154

I don't think there was an actual regression, I think my command lines are subtly different.
I still think the current test in the linked PR is more aggressive/better and should be added to the test suite in some form, but I am open to other opinions.

edit: reusing the same output_dir during debug is a terrible idea - it gives totally bogus test results - it basically remembers the very first run and generates test reports based on it on all subsequent runs, ignoring the actual test results. Why is that?

I am growing to dislike --overwrite_output_dir - it's so ambiguous - but I guess it was never meant to be used as a debug flag.

This works for debug:

        if DEBUG:
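            # before=True wipes "./xxx" at the start so stale results from a previous run can't leak in;
            # after=False keeps the dir after the test so its contents can be inspected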
            output_dir = self.get_auto_remove_tmp_dir("./xxx", before=True, after=False)

So after re-evaluating:

Try to cut n_train further

40k works.

25k w/ 2 epochs is almost there, but it's slower than adding a bit more data, so I went with 40k.

going with a subset "tr40k-va0.5k-te0.5k"

Created https://cdn-datasets.huggingface.co/translation/wmt_en_ro-tr40k-va0.5k-te0.5k.tar.gz - hope the name is intuitive/self-documenting. It's just 3.6M (vs. 56M for the original).

I made it using this script:
https://github.com/stas00/porting/blob/master/transformers/translation/make-wmt_en_ro-subset.md

In all these tests where we measure relatively exact quality metrics - should we use a fixed seed?
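
If a fixed seed is adopted, a minimal sketch of what that could look like - this assumes transformers' set_seed is an acceptable way to pin the run:

from transformers import set_seed

# Pin Python, NumPy and torch RNGs so BLEU/loss thresholds are reproducible across runs.
set_seed(42)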

