Keras: shuffle is applied after validation_split

Created on 27 Apr 2017 · 3 Comments · Source: keras-team/keras

When using validation_split and shuffle with the fit() function, the shuffle is applied after the data is separated into training and validation sets. That's probably not intended.
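To make the order of operations concrete, here is a minimal NumPy sketch of what fit() appears to do, as described above: the last fraction of the samples is held out first, and only the remaining training samples are shuffled. The helper name `split_then_shuffle` is made up for illustration and is not a Keras function.

```python
import numpy as np

def split_then_shuffle(x, validation_split, seed=0):
    """Hypothetical helper mimicking the behavior described in this issue."""
    # Hold out the LAST fraction of the samples as validation data.
    split_at = int(len(x) * (1.0 - validation_split))
    x_train, x_val = x[:split_at], x[split_at:]
    # Shuffle only the training part; the validation part keeps its input order.
    rng = np.random.default_rng(seed)
    x_train = x_train[rng.permutation(len(x_train))]
    return x_train, x_val

x = np.arange(10)  # ordered input: 0, 1, ..., 9
x_train, x_val = split_then_shuffle(x, validation_split=0.2)
print(sorted(x_train.tolist()))  # training samples are always drawn from 0..7
print(x_val.tolist())            # validation is always the tail: [8, 9]
```

If the input is ordered (for example, sorted by class), the validation set is not a random sample even with shuffle=True.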

stale

All 3 comments

That is intended. To quote fchollet,

No; in many cases this would lead to user errors, e.g. any case where the data is generated by a sequential process. I've seen many cases where this feature of Keras saved a user from validating on past data when they should have been using only future data.

See this issue comment.

Thanks for linking the quote. If this is how it should be, so be it. I just want to say that I don't agree with it.

  1. If you use a feature called "shuffle", you expect it to destroy any order in the input. That's the whole purpose of using it. The current behavior works against this expectation and leads to hard-to-spot bugs.
  2. fchollet's example seems like the only situation where this behavior might be helpful. Keras is (probably) more often used for other kinds of problems, where the combination of shuffle and validation_split is now useless.
  3. If you work on a project where the sequence is essential, wouldn't you engineer the precise split by hand anyway? If I wanted a specific part of the data as validation, intuition would keep me from using a feature called "shuffle".
  4. The current workaround for non-sequential data involves more operations and computation than my proposed solution for sequential data.
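For reference, the workaround for non-sequential data mentioned in point 4 is to shuffle the arrays once before calling fit(), so that the held-out tail becomes a random sample. A minimal sketch, assuming the data are NumPy arrays; `shuffle_in_unison` is a hypothetical helper name, and the commented-out fit() call assumes an already-compiled `model` that is not defined here.

```python
import numpy as np

def shuffle_in_unison(x, y, seed=None):
    """Hypothetical helper: apply the same random permutation to x and y."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x))
    return x[perm], y[perm]

# Toy data in which row i of x belongs to label y[i].
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

x_shuf, y_shuf = shuffle_in_unison(x, y, seed=42)

# The tail that validation_split holds out is now a random subset.
# Assumes `model` is a compiled Keras model (not defined in this sketch):
# model.fit(x_shuf, y_shuf, validation_split=0.2, shuffle=True)
```

The cost is one extra permutation and a copy of the data before training starts.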

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.
