[x] Check that you are up-to-date with the master branch of Keras. You can update with:
pip install git+git://github.com/keras-team/keras.git --upgrade --no-deps
[x] If running on TensorFlow, check that you are up-to-date with the latest version. The installation instructions can be found here.
Hi, I am training a model with the fit_generator() method. Here is my high-level setup:
Generator:
import numpy as np

def generate_arrays_from_file(feature_file, label_file, batch_size):
    """
    Generate 1 batch of (X, Y) of size == BATCH_SIZE.
    Loops over the files forever, as fit_generator() expects.
    """
    while True:
        x, y = [], []
        i = 0
        with open(label_file, 'r') as f1, open(feature_file, 'r') as f2:
            for label, feature in zip(f1, f2):
                # label: assume a utils.to_categorical()-like sample
                # feature: assume a numpy-array-like sample
                y.append(label)
                x.append(feature)
                i += 1
                if i == batch_size:
                    yield (np.array(x), np.array(y))
                    i = 0
                    x, y = [], []
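One quick way to sanity-check a generator like this is to drive it with next() on tiny synthetic files and confirm the batch shapes before handing it to fit_generator(). The sketch below assumes each line is a row of comma-separated floats (the original elides how lines are parsed), with 100 features per sample and a 20-way one-hot label, matching the shapes mentioned later in the thread:

```python
import tempfile
import numpy as np

def generate_arrays_from_file(feature_file, label_file, batch_size):
    """Yield (X, Y) batches forever; each line is parsed as
    comma-separated floats (an assumption for this sketch)."""
    while True:
        x, y = [], []
        with open(label_file) as f1, open(feature_file) as f2:
            for label_line, feature_line in zip(f1, f2):
                y.append([float(v) for v in label_line.split(',')])
                x.append([float(v) for v in feature_line.split(',')])
                if len(x) == batch_size:
                    yield np.array(x), np.array(y)
                    x, y = [], []

# Write 8 synthetic samples to temporary files.
features = tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False)
labels = tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False)
for _ in range(8):
    features.write(','.join(['0.1'] * 100) + '\n')     # 100 features/sample
    labels.write(','.join(['0'] * 19 + ['1']) + '\n')  # 20-way one-hot label
features.close()
labels.close()

gen = generate_arrays_from_file(features.name, labels.name, batch_size=4)
xb, yb = next(gen)
print(xb.shape, yb.shape)  # (4, 100) (4, 20)
```

Note that, as in the original, any trailing partial batch (fewer than batch_size samples left in the file) is silently dropped before the generator loops back to the start.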
BATCH_SIZE = 512
train_gen = generate_arrays_from_file(TRAIN_FEATURES, TRAIN_LABELS, BATCH_SIZE)
valid_gen = generate_arrays_from_file(VALID_FEATURES, VALID_LABELS, BATCH_SIZE)
train_data_count = 65160000
valid_data_count = 7240000
ST_PR_EP_TR = int(train_data_count // BATCH_SIZE) # total train count / batch size
ST_PR_EP_VA = int(valid_data_count // BATCH_SIZE) # total valid count / batch
history = model.fit_generator(
    train_gen,
    steps_per_epoch=ST_PR_EP_TR,
    epochs=50,
    verbose=1,
    validation_data=valid_gen,
    validation_steps=ST_PR_EP_VA,
    callbacks=callback_list,
    max_queue_size=100,
    use_multiprocessing=False,
    shuffle=True,
    initial_epoch=0)
Epoch 1/50
39591/127265 [========>.....................] - ETA: 1:05:27 - loss: 2.3666 - acc: 0.5520
My questions are:
1. Did I compute steps_per_epoch and validation_steps correctly?
2. Is the generate_arrays_from_file() generator correct? The output from the generator is a tuple (x, y) of shape x = (512, 100) and y = (512, 20).
3. In 39591/127265 from the epoch output, do those numbers represent total batches yielded from the generator, or total samples? I'm confused because model.fit() gives you sample counts, right?

Thanks!
@BrashEndeavours Thank you for the reply and suggestion. Will you please validate this approach? In the generator method, I'm thinking about using from sklearn.utils import shuffle, which I assume will shuffle each batch before feeding it to the fit_generator() method.
......
                if i == batch_size:
                    x, y = shuffle(x, y, random_state=0)
                    yield (np.array(x), np.array(y))
                    i = 0
                    x, y = [], []
......
I think you did it wrong. It makes no sense to shuffle within a batch; what you should do is shuffle the entire training data.
For example, if you only have 4 samples [a, b, c, d] and the batch size is 2, then without a global shuffle you always get the same two batches, [a, b] and [c, d], every epoch. Shuffling within a batch only reorders each pair; it never mixes, say, a with c.
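A tiny sketch of what shuffling the entire training data means here: reshuffle the sample order each epoch before slicing it into batches, rather than reordering samples inside a batch:

```python
import random

samples = ['a', 'b', 'c', 'd']
batch_size = 2

# Shuffle a copy of the whole dataset (do this once per epoch),
# then slice the shuffled order into batches. This can pair a with c
# or d, which within-batch shuffling never does.
random.seed(0)
order = samples[:]
random.shuffle(order)
batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
print(batches)
```

With file-backed generators like the one above, the same effect is usually achieved by pre-shuffling the files themselves, since lines are read sequentially.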
@rex-yue-wu Well, that makes perfect sense. Thank you!
Most helpful comment
int(train_data_count // BATCH_SIZE) = 127265 in your case
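In other words, the progress bar in fit_generator() counts batches (steps), not individual samples the way model.fit() does. A quick check of the arithmetic from the setup above:

```python
BATCH_SIZE = 512
train_data_count = 65160000
valid_data_count = 7240000

# fit_generator() progress shows step counts: 39591/127265 means
# 39591 of 127265 batches completed in the current epoch.
steps_per_epoch = train_data_count // BATCH_SIZE
validation_steps = valid_data_count // BATCH_SIZE
print(steps_per_epoch, validation_steps)  # 127265 14140
```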