Transformers: examples/seq2seq/test_bash_script.py :: actually learn something

Created on 27 Jul 2020 · 23 Comments · Source: huggingface/transformers

At the moment validation BLEU barely gets above zero in the tests, so they don't really prove much about our code.

We could use a larger model like sshleifer/student_marian_6_3, more data, and train for 10 minutes. This would allow us to test whether changing default parameters/batch techniques obviously degrades performance.

The GitHub Actions CI reuses its own disk, so this will only run there and hopefully won't have super slow downloads.

Labels: Help wanted, Tests

All 23 comments

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

This is still an issue :)

This command runs in 3 minutes (not including downloads):

# export WANDB_PROJECT=dmar
export MAX_LEN=64
python finetune.py \
  --learning_rate=3e-4 \
  --do_train \
  --do_predict \
  --fp16 \
  --val_check_interval 0.25 --n_train 100000 --n_val 500 --n_test 500 \
  --data_dir wmt_en_ro \
  --max_source_length $MAX_LEN --max_target_length $MAX_LEN --val_max_target_length $MAX_LEN --test_max_target_length $MAX_LEN \
  --freeze_encoder --freeze_embeds \
  --train_batch_size=64 --eval_batch_size=64 \
  --tokenizer_name Helsinki-NLP/opus-mt-en-ro \
  --model_name_or_path sshleifer/mar_enro_6_3_student \
  --warmup_steps 500 --sortish_sampler \
  --gpus 1 --fp16_opt_level=O1 --task translation --num_sanity_val_steps=0 --output_dir dmar_utest_1gpu --num_train_epochs=1 \
  --overwrite_output_dir

Test results

cat dmar_utest_1gpu/test_results.txt
bs: 32.000000
loss: 1.122035
src_pad_frac: 0.322266
src_pad_tok: 330.000000
step_count: 5.000000
test_avg_bleu: 20.660713
test_avg_gen_len: 63.750000
test_avg_gen_time: 0.033025
test_avg_loss: 2.215564
test_bleu: 20.660713
test_loss: 2.215564
tpb: 1475.000000
val_avg_bleu: 20.099000
val_avg_gen_len: 63.125000
val_avg_gen_time: 0.031409
val_avg_loss: 2.265883
val_bleu: 20.099001
val_loss: 2.265883






 cat dmar_utest_1gpu/metrics.json | grep bleu






            "val_avg_bleu": 16.3561625,
            "val_avg_bleu": 19.0204625,
            "val_avg_bleu": 19.704875,
            "val_avg_bleu": 20.099,
            "test_avg_bleu": 20.660712500000002

So this would be a good template for the test.

Spec

  • Convert the command into a unit test (with programmatic download of the right amount of data, possibly via a new S3 .tgz file).
  • Replace the existing test_bash_script Marian test.
  • Try to cut n_train further.
  • Anything less than 15 mins is fine, but the faster the better.
  • Minimum learning requirement 1: BLEU improves over the course of training by more than 2 pts
  • Minimum learning requirement 2: BLEU finishes above 17
  • Minimum learning requirement 3: test BLEU and val BLEU within 1 pt (see the assertion sketch below).

(this command meets all 3 learning requirements).
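
A rough sketch of how those three requirements could be asserted, assuming (as the grep output above suggests) that metrics.json maps each split to a list of per-checkpoint metric dicts; the path and key handling here are illustrative, not the actual test:

import json

# Hypothetical checks for the three minimum learning requirements,
# run against the metrics.json produced by the command above.
with open("dmar_utest_1gpu/metrics.json") as f:
    metrics = json.load(f)

val_bleus = [step["val_avg_bleu"] for step in metrics["val"]]
test_bleu = metrics["test"][-1]["test_avg_bleu"]

assert val_bleus[-1] - val_bleus[0] > 2      # 1: BLEU improves by more than 2 pts
assert val_bleus[-1] > 17                    # 2: BLEU finishes above 17
assert abs(test_bleu - val_bleus[-1]) < 1    # 3: test and val BLEU within 1 pt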

Wdyt @stas00 ?

I will work on that, thank you.

@sshleifer, could you please validate that this is the command you run?

I get very different (bad) results:

bs: 9.000000
loss: 6.701375
src_pad_frac: 0.118056
src_pad_tok: 68.000000
step_count: 2.000000
test_avg_bleu: 0.021700
test_avg_gen_len: 512.000000
test_avg_gen_time: 0.439663
test_avg_loss: 5.679669
test_bleu: 0.021700
test_loss: 5.679669
tpb: 1025.000000
val_avg_bleu: 0.082700
val_avg_gen_len: 512.000000
val_avg_gen_time: 0.483860
val_avg_loss: 5.404536
val_bleu: 0.082700
val_loss: 5.404536

In the OP you mentioned "sshleifer/student_marian_6_3" but here you used "sshleifer/mar_enro_6_3_student" - not sure if that's the difference.

Also, another difference: you use wmt_en_ro instead of test_data/wmt_en_ro - are you using a different dataset?

Your spec on timing could be a small issue, since what takes 3 min on your hardware takes 33 sec for me (unoptimized RTX 3090), so the timing might have to be re-tested on CI. But again, I'm not sure we are testing against the same dataset, since my results are terrible.

Retested with sshleifer/student_marian_en_ro_6_3 and 5 epochs - still under 1 BLEU - so this is probably an issue of insufficient data and you must be using a different dataset.

I am using the full dataset (as in README.md).

Ah, that explains it.

So run the slow test with the full dataset downloaded at runtime, right?

OK, I was able to reproduce your results with the full dataset - slightly under 3 min and slightly better BLEU scores.

Not sure if there is a point to it, but 7zip shaves off about 35% in download size (but CI might not have it installed).

-rw-rw-r--  1 stas stas  58M Nov  4 11:40 wmt_en_ro.tar.gz
-rw-rw-r--  1 stas stas  37M Nov  4 11:39 wmt_en_ro.7z

Another way to save download time would be to only zip up 100k (or fewer) training examples, 500 val examples, and 500 test examples - those are all we use given the --n_train/--n_val/--n_test flags.
I would also check whether 10k/25k/50k meet the learning requirements.
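
For illustration, a minimal subsetting sketch - assuming the usual examples/seq2seq layout of line-aligned {train,val,test}.{source,target} files; the paths, sizes and output name are placeholders:

from pathlib import Path

# Hypothetical sketch: carve a small subset out of the full wmt_en_ro dataset
# so the archive to download stays tiny; tar the result and upload it once.
SIZES = {"train": 100_000, "val": 500, "test": 500}
src_dir, dst_dir = Path("wmt_en_ro"), Path("wmt_en_ro_small")
dst_dir.mkdir(exist_ok=True)

for split, n in SIZES.items():
    for ext in ("source", "target"):
        lines = (src_dir / f"{split}.{ext}").read_text().splitlines()[:n]
        (dst_dir / f"{split}.{ext}").write_text("\n".join(lines) + "\n")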

While trying to match the suggested hparams to the ones in train_mbart_cc25_enro.sh, I've been wondering whether I'm missing the point of this whole thing: if the intention is to test a bash script with specific fixed hparams, but the test replaces half of those presets and adds quite a few new ones, how are we testing that script?

Why do we use "--foo=bar" and "--foo bar" seemingly at random? Half the args are set the first way, the other half the other way.

Question: do you want this as a new test, or should I modify the existing test_train_mbart_cc25_enro_script? I'm trying to do the latter at the moment - perhaps that's why I'm questioning what we are testing here.

The high level goal originally was to test that the bash scripts we check in work.
I have a secondary goal of making sure the training code is actually good at training models.
I am fine with any setup that accomplishes both of those tasks, with bonus points for enough traceback control that a developer could tell that they have made performance or speed worse, or some such.

As I mentioned on Slack, we want a test that detects whether we've regressed the training code. For example, if you set dropout=0.95, or freeze all parameters, or set the LR too low, or mess up the special tokens logic, the test should fail. Does that make sense? I didn't test all of these with my command line, but it would be nice.

Relatedly, we just had a 2x slowdown in the finetune_trainer.py code that was not detected by unit-tests.

I know this is something of a scope expansion, so feel free to break it up/ignore parts as you see fit. I trust you to make reasonable decisions.

Thank you for this useful brain dump. Let's take it point by point.

  1. the bash scripts

    If a test rewrites the script's guts before doing the testing, should we not just modify those scripts themselves? We want to test that the script works, so we should test it as is, with the only changes allowed being some numerical settings that make the tests faster.

    If we want different presets for different purposes - then have a set of scripts rather than do-them-all in one?

  2. The best regression tests are written when an actual regression is discovered, because then you know exactly which side of things to put under the "magnifying glass". When another regression is discovered down the road, a new test should be made that focuses just on that part. Over time one ends up with great coverage and the test suite becomes strong. Trying to accomplish all of this in one test will, over time, lose the very specific setups that exposed very specific sides of things. It also helps to annotate that a test covers a regression from a given git sha, so that future developers know not to refactor it, slip in extra checks, or slightly change things in the existing test.

    It's very difficult to think of all the possible things that could regress in a vacuum, but this is surely a great start.

Relatedly, we just had a 2x slowdown in the finetune_trainer.py code that was not detected by unit-tests.

That's great! Let's find a git sha before and after and write a test that detects that regression.
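
For the slowdown case specifically, one illustrative shape such a guard could take - assuming the test itself times the training run and a baseline has been calibrated for the CI machine; the wrapper script name and the numbers are placeholders:

import subprocess
import time

# Illustrative timing guard against a large (e.g. 2x) training slowdown.
# BASELINE_SECONDS would need calibrating per CI machine.
BASELINE_SECONDS = 180

start = time.time()
subprocess.run(["bash", "finetune_command.sh"], check=True)  # hypothetical wrapper around the command above
elapsed = time.time() - start

assert elapsed < 1.5 * BASELINE_SECONDS, f"training took {elapsed:.0f}s, possible slowdown"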

I hope this approach makes sense?

Yeah you are right, let me try to isolate the bad commit https://github.com/huggingface/transformers/commits/master/examples/seq2seq

related issue: #8154

I don't think there was an actual regression, I think my command lines are subtly different.
I still think the current test in the linked PR is more aggressive/better and should be added to the test suite in some form, but I am open to other opinions.

edit: reusing the same output_dir during debug is a terrible idea - it gives totally bogus test results - it basically remembers the very first run and generates test reports based on it on all subsequent runs, ignoring the actual test results. Why is that?

I am growing to dislike --overwrite_output_dir - it's so ambiguous - but I guess it was never meant to be used as a debug flag.

This works for debug:

        if DEBUG:
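            # before=True wipes "./xxx" at the start so stale results from a previous run can't leak in;
            # after=False keeps the dir after the test so its contents can be inspected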
            output_dir = self.get_auto_remove_tmp_dir("./xxx", before=True, after=False)

So after re-evaluating:

Try to cut n_train further

40k works.

25k w/ 2 epochs is almost there, but it's slower than adding a bit more data, so I went with 40k.

going with a subset "tr40k-va0.5k-te0.5k"

Created https://cdn-datasets.huggingface.co/translation/wmt_en_ro-tr40k-va0.5k-te0.5k.tar.gz - hope the name is intuitive/self-documenting. It's just 3.6M (vs. 56M for the original).

I made it using this script:
https://github.com/stas00/porting/blob/master/transformers/translation/make-wmt_en_ro-subset.md

In all these tests where we measure relatively exact quality metrics - should we use a fixed seed?
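
If a fixed seed is adopted, a minimal sketch of what that could look like - this assumes transformers' set_seed is an acceptable way to pin the run:

from transformers import set_seed

# Pin Python, NumPy and torch RNGs so BLEU/loss thresholds are reproducible across runs.
set_seed(42)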

