Keras: __getitem__ gets called too many times

Created on 20 Jan 2019 · 14 Comments · Source: keras-team/keras

Hi,
I was trying to write a sequential generator for videos. Instead of using `while True` inside `__getitem__`, I decided to rely on `index`, since `__getitem__` is defined with the parameter list `(self, index)`.
I specified the number of train steps and the number of validation steps in my model.
What surprised me is that the indices with which Keras calls `__getitem__` at some point go far beyond, for instance, the defined steps per epoch. Is that expected behavior, or am I doing something wrong?

I just cannot understand the reasoning in this case: if I am concurrently looping forever over more than one video, I may read a video in randomised rather than sequential order and, even worse, repeat batches of frames. A video of a hundred frames I would read as 10 batches of 10 frames, for instance. Simply counting how many times `__getitem__` returned tuples to the model in such a setup will not ensure proper training, or will it? Am I wrong, or is there a bug?

Labels: Enhancement, Good first issue, contributions welcome

Most helpful comment

The training Sequence is shuffled, the validation sequence is not. That's normal and desirable in 99% of use cases. I don't know if it's documented somewhere.

All 14 comments

Please post a minimal code example to reproduce this; it will help us understand what you mean.

    classifier.fit_generator(
        trainGen,
        steps_per_epoch=numTrainFrames // framesBatch,
        epochs=1,
        validation_data=validGen,
        workers=1,
        use_multiprocessing=False,
        validation_steps=numValidationFrames // framesBatch)


    def __getitem__(self, index):
        # Slice out the batch for this batch index.
        batch_x = self.videos[index*self.batch_size : (index + 1)*self.batch_size]
        print("The log below comes from here")
        ...
        return x, y

(Log) FULL INFO:
id: Train Set ; index: 6901 ; batch_x=[] ; full_set_size: (18416,)
batch_size: 16 ; fromIndex: 110416 ; toIndex: 110432 ; maxIndices: 1151

Here 'maxIndices' is the highest index that can be looked up in my set, and index is the current index with which Keras calls my `__getitem__(self, index)`.
steps_per_epoch and maxIndices agree as values. The former is in the parameter list of fit_generator, the latter in the particular instance of the generator.

Hence, it seems there is no connection between steps_per_epoch and the index values that are fed into `__getitem__(self, index)`, which Keras controls.

You shouldn't use steps_per_epoch with a Sequence. Keras will use your sequence's `__len__` to know when to stop. But I agree that this is confusing, and we should have an error message for when users pass steps_per_epoch together with a Sequence. PR welcome to add the error message.
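To illustrate, here is a minimal sketch of a `keras.utils.Sequence` whose `__len__` reports the number of batches; the names (`VideoFrameSequence`, `videos`, `labels`, `batch_size`) are placeholders, not the original poster's code. With such a sequence, `fit_generator` infers the number of steps from `len(sequence)` and only requests indices `0 .. len(sequence) - 1`.

    import numpy as np
    from keras.utils import Sequence

    class VideoFrameSequence(Sequence):
        """Hypothetical sequence serving fixed-size batches of frames."""

        def __init__(self, videos, labels, batch_size):
            self.videos = videos          # e.g. an array of preloaded frames
            self.labels = labels
            self.batch_size = batch_size

        def __len__(self):
            # Number of batches per epoch; this replaces steps_per_epoch.
            return int(np.ceil(len(self.videos) / self.batch_size))

        def __getitem__(self, index):
            # index is always in [0, len(self) - 1]
            lo = index * self.batch_size
            hi = (index + 1) * self.batch_size
            return self.videos[lo:hi], self.labels[lo:hi]

    # fit_generator then needs no steps_per_epoch / validation_steps:
    # classifier.fit_generator(train_seq, epochs=1, validation_data=valid_seq)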

Thank you very much. Now it is almost working.
The only issue is that:

    id: Train Set ; index: 1382 ;
    id: Train Set ; index: 1361 ;
    id: Train Set ; index: 282 ;
    id: Train Set ; index: 648 ;
    id: Train Set ; index: 1548 ;
    id: Train Set ; index: 726 ;
    id: Train Set ; index: 980 ;
    id: Train Set ; index: 874 ;
    id: Train Set ; index: 1009 ;
    id: Train Set ; index: 1680 ;
    id: Train Set ; index: 1168 ;
Whereas the same thing for the validation set results in perfectly ascending, sequential indexing: 0, 1, 2, ..., 60.

@kdx2 I noticed this too when I built my project. The reason is that Keras starts a new thread to load batches of data. It doesn't matter much, but yes, it could be improved so that when steps_per_epoch is used only the batches that are actually needed get loaded.

Does anyone know why the first dataset gets loaded in such a scrambled manner whereas the other one loads in perfectly sequential order? I ran the program around 7-8 times and never once saw `__getitem__` on the train set called with index 0; it is always scrambled like above, whereas the validation generator loads perfectly from 0, 1, 2, ... to the end.

The training Sequence is shuffled, the validation sequence is not. That's normal and desirable in 99% of use cases. I don't know if it's documented somewhere.

And yes, Keras uses multithreading and multiprocessing to load the data faster. That also confuses users because it's asynchronous, but it's perfectly normal.
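Conceptually, the effect on the order of indices looks roughly like the sketch below. This is a simplified illustration, not Keras's actual enqueuer code: training batch indices are a shuffled permutation of `range(len(sequence))` each epoch, validation indices stay in natural order, and a few of them are requested ahead of time to keep the queue full.

    import random

    def epoch_indices(sequence, shuffle):
        """Rough sketch of the batch-index order per epoch (not Keras's real code)."""
        indices = list(range(len(sequence)))
        if shuffle:              # training sequence: shuffled each epoch
            random.shuffle(indices)
        return indices           # validation sequence: 0, 1, 2, ... in order

    # train_order = epoch_indices(train_seq, shuffle=True)   # e.g. [1382, 1361, 282, ...]
    # valid_order = epoch_indices(valid_seq, shuffle=False)  # [0, 1, 2, ..., 60]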

@gabrieldemarmiesse Yeah... it's normal, but my question is: why load 5 batches if you only need 1? This can be improved, right?

Because it's a queue system. In this way, the GPU never waits for the next batch of data. Does that make sense?

At most, we will queue max_queue_size + 1 items. In most cases, this is not significant.

I'm trying to build a wiki page for multiprocessing here. It's not completed yet.

Any suggestions would be appreciated.

@gabrieldemarmiesse Yes, I know. But if the GPU doesn't need those batches, why load them?

For example, steps_per_epoch=1 and epochs=1 require only one batch to train the model, but Keras will still load several. Can't we make the queue dynamic, taking into account the total number of batches required multiplied by the batch_size?

@Dref360 I would like to help.

@kdx2 It seems I answered the same question here. Notice the arg "max_queue_size" in fit_generator:
https://github.com/keras-team/keras/issues/11878#issuecomment-450834985
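For reference, a small sketch of how prefetching can be kept to a minimum: passing a smaller max_queue_size to fit_generator bounds how many batches are prepared ahead of what training actually consumes. The variable names reuse those from the snippet earlier in the thread; the value 1 is just an example.

    # Prefetch at most one batch ahead; only ~max_queue_size + 1 batches
    # are ever produced beyond what the model has already consumed.
    classifier.fit_generator(
        trainGen,
        epochs=1,
        validation_data=validGen,
        max_queue_size=1,        # default is 10
        workers=1,
        use_multiprocessing=False)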
