Dali: Is reinstantiating FileReader before an epoch ends safe to do?

Created on 24 Jun 2020 · 3 Comments · Source: NVIDIA/DALI

Hi!
Let's say I don't want to iterate over all my data during an epoch (too time-consuming), but only over a randomly chosen 5% sample.
Once I have iterated over 5% of the data, an "epoch" has elapsed in my code, but not for DALI: DALI considers an epoch finished only after it has iterated over 100% of the data. I would like DALI to start a new epoch once I have iterated over the 5%.
One way I found to force DALI to restart as if an epoch had just ended was to reinstantiate the Reader object. But I think this renders the shuffle_after_epoch and stick_to_shard parameters useless (because DALI never sees an epoch end; it starts over before that can happen).
I use:

  • random_shuffle=True
  • stick_to_shard=False/True (does this matter if I reinstantiate the Reader at each epoch?)
  • shuffle_after_epoch=False (does this matter if I reinstantiate the Reader at each epoch?)
  • I reinstantiate the Reader every epoch

Could this reinstantiation mess up something inside DALI that I am not aware of and that could affect training results?
Thanks in advance!
question

All 3 comments

Hi,
by reinstantiating the Reader object, do you mean you create a new Pipeline for every "5% epoch"? Can you show exactly what you are doing?
Creating a new Pipeline would mean you free the resources every epoch and start from scratch (the buffers grow to accommodate the new samples they see until they reach a steady state).

If you want your Pipeline to run on a subset of the data and you're using FileReader, I see two options:

  • prepare a specific file list with only a subset of your data - this way you can choose what data is sampled on your own and prevent any skew between labels,
  • You could specify that you have 20 shards and only use one. To not visit any other parts of the dataset, the stick_to_shard=True could be used. There will be no guarantees on the data distribution AFAIK. Maybe @JanuszL can confirm.

Randomly reading only the "prefix" of the dataset may lead to similar problems with distribution of samples as in option two.
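
For the first option, a minimal sketch of building such a file list in plain Python, assuming the reader's file_list format of one "relative_path label" pair per line. The helper names (sample_file_list, write_file_list) and the per-label sampling are illustrative assumptions, not part of DALI; sampling within each label is one way to keep the label proportions roughly intact.

```python
import random
from collections import defaultdict


def sample_file_list(entries, fraction, seed):
    """Return a label-stratified random subset of (relative_path, label) pairs.

    Hypothetical helper: `entries` is whatever listing of the dataset you
    already have; `fraction` is the share to keep (0.05 for 5%).
    """
    by_label = defaultdict(list)
    for path, label in entries:
        by_label[label].append(path)

    rng = random.Random(seed)
    subset = []
    for label in sorted(by_label):
        paths = sorted(by_label[label])
        # Keep at least one file per label so that no class disappears.
        k = max(1, round(len(paths) * fraction))
        subset.extend((p, label) for p in rng.sample(paths, k))
    rng.shuffle(subset)
    return subset


def write_file_list(subset, list_path):
    """Write one 'relative_path label' pair per line.

    Paths containing whitespace would need extra care and are not handled.
    """
    with open(list_path, "w") as f:
        for path, label in subset:
            f.write(f"{path} {label}\n")
```

Calling sample_file_list with a different seed for each "5% epoch" would give a fresh subset, but a reader built on the resulting file would still have to be recreated to pick it up, so this mainly helps when a fixed subset is acceptable.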

Regards,
Krzysztof.

> You could specify that you have 20 shards and only use one. To not visit any other parts of the dataset, the stick_to_shard=True could be used. There will be no guarantees on the data distribution AFAIK. Maybe @JanuszL can confirm.

That sounds like a doable workaround: make the number of shards 20 times bigger and set shuffle_after_epoch=True, so that after each epoch the whole dataset is reshuffled globally.

What I was doing was reinstantiating the Pipeline at each epoch, but this sharding trick sounds more elegant. I will do that, thanks a lot!
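
The sharding workaround discussed above could be sketched as a small helper that works out the reader arguments. The function name and its parameters (fraction, world_size, rank, reshuffle_globally) are hypothetical; only num_shards, shard_id, random_shuffle, stick_to_shard and shuffle_after_epoch are reader arguments mentioned in this thread. As far as I understand the reader documentation, shuffle_after_epoch=True cannot be combined with random_shuffle or stick_to_shard, so the sketch treats the two variants as alternatives; please verify this against your DALI version.

```python
def shard_reader_kwargs(fraction, world_size=1, rank=0, reshuffle_globally=True):
    """Build reader keyword arguments so each worker reads ~`fraction` of the data.

    Hypothetical helper: it only assembles arguments, it does not touch DALI.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    # 5% of the data per "epoch" means 20 shards per worker.
    shards_per_worker = round(1 / fraction)
    kwargs = {
        "num_shards": world_size * shards_per_worker,
        "shard_id": rank,
    }
    if reshuffle_globally:
        # Variant from the last reply: the whole dataset is reshuffled after
        # every (short) epoch, so the shard's content changes each time.
        # Assumption: this flag excludes random_shuffle and stick_to_shard.
        kwargs["shuffle_after_epoch"] = True
    else:
        # Variant from the first reply: always stay on the same shard and
        # shuffle only within it.
        kwargs["random_shuffle"] = True
        kwargs["stick_to_shard"] = True
    return kwargs
```

The returned dictionary could then be unpacked into the reader, for example ops.FileReader(file_root=..., **shard_reader_kwargs(0.05)), with file_root pointing at your own dataset.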
