Keras: Meaning of fit_generator() Epoch numbers?

Created on 16 Feb 2018 · 4 comments · Source: keras-team/keras

  • [x] Check that you are up-to-date with the master branch of Keras. You can update with:
    pip install git+git://github.com/keras-team/keras.git --upgrade --no-deps

  • [x] If running on TensorFlow, check that you are up-to-date with the latest version. The installation instructions can be found here.

Hi, I am training a model with the fit_generator() method. Here is my high-level setup:

Generator:

import numpy as np

def generate_arrays_from_file(feature_file, label_file, batch_size):
    """
    Generate one batch of (X, Y) of size == batch_size.
    """
    while 1:
        x, y = [], []
        i = 0
        with open(label_file, 'r') as f1, open(feature_file, 'r') as f2:
            for label, feature in zip(f1, f2):
                # assume `label` is a utils.to_categorical()-like sample
                # and `feature` is a numpy-array-like sample
                y.append(label)
                x.append(feature)
                i += 1
                if i == batch_size:
                    yield (np.array(x), np.array(y))
                    i = 0
                    x, y = [], []
        # any leftover samples that don't fill a batch are dropped
        # before the files are re-read on the next pass
BATCH_SIZE = 512
train_gen = generate_arrays_from_file(TRAIN_FEATURES, TRAIN_LABELS, BATCH_SIZE)
valid_gen = generate_arrays_from_file(VALID_FEATURES, VALID_LABELS, BATCH_SIZE)

train_data_count = 65160000
valid_data_count = 7240000

ST_PR_EP_TR = int(train_data_count // BATCH_SIZE) # total train count / batch size
ST_PR_EP_VA = int(valid_data_count // BATCH_SIZE) # total valid count / batch
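
As a sanity check (these figures are computed here, not taken from the original post), the step counts work out to full batches only; the trailing 320 samples in each split do not fill a batch and are dropped:

assert 65160000 // 512 == 127265   # ST_PR_EP_TR: training batches per epoch
assert 7240000 // 512 == 14140     # ST_PR_EP_VA: validation batches per epoch

The training figure is exactly the 127265 total that appears in the progress bar below.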



Fit call:

history = model.fit_generator(
    train_gen,
    steps_per_epoch=ST_PR_EP_TR,
    epochs=50,
    verbose=1,
    validation_data=valid_gen,
    validation_steps=ST_PR_EP_VA,
    callbacks=callback_list,
    max_queue_size=100,
    use_multiprocessing=False,
    shuffle=True,
    initial_epoch=0)



Output:

Epoch 1/50
 39591/127265 [========>.....................] - ETA: 1:05:27 - loss: 2.3666 - acc: 0.5520

My questions are:

  1. Am I using steps_per_epoch and validation_steps correctly?
  2. Is my generate_arrays_from_file() generator correct? The output from the generator is a tuple (x, y) with shapes x = (512, 100) and y = (512, 20).
  3. If yes, do the numbers 39591/127265 in the epoch output represent total batches yielded from the generator, or total samples? I'm confused because model.fit() reports sample counts, right?

Thanks!

All 4 comments

  1. Yes.
  2. Yes, though you could implement it better as a Sequence (give it a length and allow indexing), which would let you actually use shuffle=True; right now you go through the data the same way, front to back, every time. (A minimal sketch follows these answers.)
  3. The numbers represent current batch / total batches (training):
    int(train_data_count // BATCH_SIZE) = 127265 in your case
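
For reference, a minimal sketch of that Sequence suggestion, assuming the features and labels are already loaded as NumPy arrays (the class name FileSequence and the variable names are illustrative, not from the thread):

import numpy as np
from keras.utils import Sequence

class FileSequence(Sequence):
    """Batch provider with a length and index-based access, so Keras can reorder batches."""
    def __init__(self, x, y, batch_size):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        # batches per epoch; drops the final partial batch, like the generator above
        return len(self.x) // self.batch_size

    def __getitem__(self, idx):
        start = idx * self.batch_size
        return self.x[start:start + self.batch_size], self.y[start:start + self.batch_size]

Because a Sequence has a known length and supports random access, fit_generator's shuffle=True actually takes effect: Keras shuffles the order in which batch indices are requested each epoch (shuffle is ignored for plain generators).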

@BrashEndeavours Thank you for the reply and the suggestion. Could you please validate this approach? Inside the generator, I'm thinking of using from sklearn.utils import shuffle, which I assume will shuffle each batch before it is fed to the fit_generator() method.

                ......
                if i == batch_size:
                    x, y = shuffle(x, y, random_state=0)
                    yield (np.array(x), np.array(y))
                    i = 0
                    x, y = [], []
                ......

I think you did it wrong. It makes no sense to shuffle within a batch; what you should do is shuffle the entire training set.

For example, if you only have 4 samples [a, b, c, d] and the batch size is 2:

  • shuffling within a batch only decides whether you see (a, b) or (b, a) as the 1st batch
  • shuffling across batches, i.e. shuffling the entire dataset, means you first permute [a, b, c, d] to, say, [a, c, d, b], and generate (a, c) as the 1st batch (a sketch of this follows the example)
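
Concretely, a minimal sketch of epoch-level shuffling, assuming the whole dataset fits in memory as NumPy arrays (for file-backed data you would shuffle an index of line offsets instead; the function name here is illustrative):

import numpy as np

def shuffled_batches(x, y, batch_size):
    """Reshuffle the entire dataset at the start of every pass, then yield contiguous batches."""
    while True:
        perm = np.random.permutation(len(x))  # new sample order each epoch
        x_shuf, y_shuf = x[perm], y[perm]
        for start in range(0, len(x) - batch_size + 1, batch_size):
            yield x_shuf[start:start + batch_size], y_shuf[start:start + batch_size]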

@rex-yue-wu Well, that makes perfect sense. Thank you!
