Keras: Cross-validation or validation split for ImageDataGenerator/flow_from_directory?

Created on 30 Apr 2017 · 7 Comments · Source: keras-team/keras

I'm working on a classifier. Currently my data structure looks like this:

data/
        dogs/
            dog001.jpg
            dog002.jpg
            ...
        cats/
            cat001.jpg
            cat002.jpg

I want to use flow_from_directory for both training and validation, but it is a bit clumsy to move the files into a different directory each time. Is there an option that can help with that?

stale

All 7 comments

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

Extremely relevant issue.

Super late response, but for whoever finds this: flow_from_directory() is a generator, which means it doesn't have knowledge of the entire directory, so it can't create a training and validation set.

I was reading another post where they made a script that splits the folders into training/testing splits here:

import os
import shutil

import numpy as np


def split_dataset_into_test_and_train_sets(all_data_dir, training_data_dir, testing_data_dir, testing_data_pct):
    # Recreate testing and training directories
    if testing_data_dir.count('/') > 1:
        # ignore_errors=True so the first run works when the directory does not exist yet
        shutil.rmtree(testing_data_dir, ignore_errors=True)
        os.makedirs(testing_data_dir)
        print("Successfully cleaned directory " + testing_data_dir)
    else:
        print("Refusing to delete testing data directory " + testing_data_dir + " as we prevent you from doing stupid things!")

    if training_data_dir.count('/') > 1:
        shutil.rmtree(training_data_dir, ignore_errors=True)
        os.makedirs(training_data_dir)
        print("Successfully cleaned directory " + training_data_dir)
    else:
        print("Refusing to delete training data directory " + training_data_dir + " as we prevent you from doing stupid things!")

    num_training_files = 0
    num_testing_files = 0

    for subdir, dirs, files in os.walk(all_data_dir):
        category_name = os.path.basename(subdir)

        # Don't create a subdirectory for the root directory
        print(category_name + " vs " + os.path.basename(all_data_dir))
        if category_name == os.path.basename(all_data_dir):
            continue

        training_data_category_dir = training_data_dir + '/' + category_name
        testing_data_category_dir = testing_data_dir + '/' + category_name

        if not os.path.exists(training_data_category_dir):
            os.mkdir(training_data_category_dir)

        if not os.path.exists(testing_data_category_dir):
            os.mkdir(testing_data_category_dir)

        for file in files:
            input_file = os.path.join(subdir, file)
            if np.random.rand(1) < testing_data_pct:
                shutil.copy(input_file, testing_data_dir + '/' + category_name + '/' + file)
                num_testing_files += 1
            else:
                shutil.copy(input_file, training_data_dir + '/' + category_name + '/' + file)
                num_training_files += 1

    print("Processed " + str(num_training_files) + " training files.")
    print("Processed " + str(num_testing_files) + " testing files.")

source:

https://github.com/keras-team/keras/issues/5862

@atoaster As far as I can tell, this code snippet performs a hold-out split, not cross-validation. So far I have not found any example that combines fit_generator with cross-validation in Keras.

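Yes, there is a built-in way to do this in Keras.

from keras.preprocessing.image import ImageDataGenerator

# train_data_dir, img_height, img_width and batch_size are assumed to be defined by the caller.
# Both generators below come from train_datagen, so the shear/zoom augmentation
# is applied to the validation images as well.
train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    shear_range=0.2,
    zoom_range=0.2,
    validation_split=0.2)  # hold out 20% of the images for validation

train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_height, img_width),  # Keras expects (height, width)
    batch_size=batch_size,
    class_mode='categorical',
    subset='training', seed=42)  # set as training data

validation_generator = train_datagen.flow_from_directory(
    train_data_dir,  # same directory as the training generator
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='categorical',
    subset='validation', seed=42)  # set as validation data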

Note that this provides a _single_ validation split and does not perform cross-validation.

To anyone who runs into this problem: as of the date of this answer, there is no simple (or even relatively simple) out-of-the-box solution, at least in my opinion and judging by my own searches.

The only solution I came up with, while solving a similar problem in my own project, was to partition the dataset into as many partitions as there are folds and save them as a dictionary, with the partition number as the key and the list of file paths in that partition as the value. After that, you still have to sort the files into class folders for the train and validation subsets. So you still need two generators: the validation generator processes one of the partitions, and the train generator processes the rest.
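
The partitioning described above could be sketched roughly as follows. This is only an illustration under assumptions, not the commenter's actual code: the helper names make_fold_partitions and materialize_fold, the seed parameter, the round-robin assignment per class, and the train/validation folder names are all hypothetical choices. It uses only the standard library and assumes one subdirectory per class, as in the layout from the question.

```python
import os
import random
import shutil


def make_fold_partitions(data_dir, n_folds, seed=0):
    """Return {fold_number: [file paths]} for the images under data_dir.

    Assumes data_dir holds one subdirectory per class. The files of each
    class are shuffled and dealt round-robin across the folds, so every
    fold receives a similar share of each class.
    """
    rng = random.Random(seed)
    partitions = {fold: [] for fold in range(n_folds)}
    for class_name in sorted(os.listdir(data_dir)):
        class_dir = os.path.join(data_dir, class_name)
        if not os.path.isdir(class_dir):
            continue
        files = sorted(os.listdir(class_dir))
        rng.shuffle(files)
        for i, file_name in enumerate(files):
            partitions[i % n_folds].append(os.path.join(class_dir, file_name))
    return partitions


def materialize_fold(partitions, val_fold, out_dir):
    """Copy one partition into out_dir/validation/<class>/ and the rest
    into out_dir/train/<class>/. Use a fresh out_dir for every fold,
    otherwise files from an earlier fold would be mixed in.
    """
    for fold, paths in partitions.items():
        subset = "validation" if fold == val_fold else "train"
        for path in paths:
            # The class name is the name of the file's parent directory.
            class_name = os.path.basename(os.path.dirname(path))
            target_dir = os.path.join(out_dir, subset, class_name)
            os.makedirs(target_dir, exist_ok=True)
            shutil.copy(path, target_dir)
    return os.path.join(out_dir, "train"), os.path.join(out_dir, "validation")
```

For each fold number k, one would call materialize_fold(partitions, k, some_fresh_dir) and point two flow_from_directory generators at the two returned directories, training a new model per fold. Note that this copies the dataset once per fold; symlinks instead of copies might be an option where disk space matters.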
