Keras: generalize flow_from_directory(directory) method to include regression models

Created on 24 Jan 2017 · 13 Comments · Source: keras-team/keras

The flow_from_directory(directory) method of the ImageDataGenerator is currently designed to be used with classification models. To use it with regression models, the following hack is necessary:

http://stackoverflow.com/questions/41749398/using-keras-imagedatagenerator-in-a-regression-model?noredirect=1#comment70692649_41749398

I suggest adding functionality to support a mapping between image file names and target values. A non-breaking and highly flexible approach would be to add an optional callback function to the flow_from_directory(directory) signature, taking a file name as its parameter and returning the target value.
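A minimal sketch of the callback idea, for illustration only. The names iter_samples, target_fn and target_from_name are hypothetical and not part of the Keras API, and the file naming scheme is an assumption. It walks a directory and pairs each image path with whatever the callback returns for that file name:

```python
import os

# Assumed set of image extensions; adjust as needed.
IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg', '.bmp')


def iter_samples(directory, target_fn):
    """Yield (path, target) pairs for every image file under `directory`.

    `target_fn` is the user-supplied callback: it receives the bare file
    name and returns the regression target for that image.
    """
    for root, _dirs, files in sorted(os.walk(directory)):
        for name in sorted(files):
            if name.lower().endswith(IMAGE_EXTENSIONS):
                yield os.path.join(root, name), target_fn(name)


def target_from_name(filename):
    # Hypothetical naming scheme: "<id>_<target>.<ext>", e.g. "0001_3.75.png".
    stem = os.path.splitext(filename)[0]
    return float(stem.split('_')[1])
```

A real integration would feed these pairs into the generator's batching and augmentation code; the sketch only shows the file-name-to-target mapping.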

All 13 comments

@fera0013 I've implemented something similar for my own use. In this branch, I add a custom class mode that calls a user-supplied argument, custom_output_fn(class_name, filename, idx). I also added a file filter predicate, included_file_filter, which is given the same arguments.

If there's interest in this functionality in Keras, I can make a PR which includes the documentation updates.

Since I created this issue, I am definitely interested in that kind of functionality.

@davidvetrano Looks like a useful PR, but it will need some work:

  • Documentation (as you mentioned)
  • Error checking, like the assertion that if mode is custom, custom_output_fn must be provided and must be callable.
  • Write out some unit tests to make sure nothing gets broken.
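The error check in the list above could look roughly like the following. This is a sketch under assumptions, not the code from the branch: check_class_mode is a hypothetical helper, and the set of built-in modes listed here is assumed for the example.

```python
# Assumed built-in class modes; the real list depends on the Keras version.
BUILTIN_CLASS_MODES = {'categorical', 'binary', 'sparse', None}


def check_class_mode(class_mode, custom_output_fn=None):
    """Validate the arguments for a hypothetical 'custom' class mode."""
    if class_mode == 'custom':
        if custom_output_fn is None:
            raise ValueError(
                "class_mode='custom' requires custom_output_fn to be provided.")
        if not callable(custom_output_fn):
            raise TypeError("custom_output_fn must be callable.")
    elif class_mode not in BUILTIN_CLASS_MODES:
        raise ValueError('Invalid class_mode: {!r}'.format(class_mode))
```

Raising explicit exceptions rather than using assert keeps the check active even when Python runs with optimizations enabled.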

If you take point on the PR we can chip in on testing and cleanup.

Cheers

@bstriner I added some documentation, error checks, and simple tests, and submitted PR #5172. Thoughts?

Can we please re-open this?

@aditya17a Agreed, it is useful for a large set of problems.

I am extremely interested in this. I have 100k+ images (X_train) and associated floating-point properties (Y_train). I would like to perform regression on them, and I don't think I can do this easily without the flow_from_directory method.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

Another request to look into this.

I am also interested in this and came up with my own solution, flow_from_dataframe, where you specify a path column (to load the images from) and a y column (to use as the target values). This also solves the problem of loading from paths where the folder structure is not as prescribed. A demo is available on Kaggle Kernels.

import os

import numpy as np


def flow_from_dataframe(img_data_gen, in_df, path_col, y_col, **dflow_args):
    # Create a directory iterator only to reuse its loading/augmentation logic.
    base_dir = os.path.dirname(in_df[path_col].values[0])
    print('## Ignore next message from keras, values are replaced anyways')
    df_gen = img_data_gen.flow_from_directory(base_dir,
                                              class_mode='sparse',
                                              **dflow_args)
    # Overwrite what the directory scan found with the dataframe contents.
    df_gen.filenames = in_df[path_col].values
    df_gen.classes = np.stack(in_df[y_col].values)
    df_gen.samples = in_df.shape[0]
    df_gen.n = in_df.shape[0]
    df_gen._set_index_array()  # private method: rebuild the index for the new n
    df_gen.directory = ''  # since we have the full path
    print('Reinserting dataframe: {} images'.format(in_df.shape[0]))
    return df_gen

@kmader What is the purpose of the _set_index_array() call here? This function is not present in Keras 2.0.5.

It is on the master branch, as part of the Iterator class: https://github.com/keras-team/keras/blob/master/keras/preprocessing/image.py#L807

@kmader: Thanks for your flow_from_dataframe function above, which I have used so far for a few training runs. I was happy to try switching to the official flow_from_dataframe in keras_preprocessing. However, I saw that it behaves differently from yours.

  • Your flow_from_dataframe allows a folder to hold lots of images, only some of which are listed in the dataframe passed as a parameter. The function takes the filenames from the dataframe paths and does not care whether more images are available on disk.

  • The official implementation is less flexible. It takes the filenames from all valid files on disk:

    # image.py, line 2134
    filenames = _list_valid_filenames_in_directory(
        directory,
        white_list_formats,
        self.split,
        class_indices=self.class_indices,
        follow_links=follow_links,
        df=True)

If the dataframe contains fewer images than the directory, it throws an exception when trying to reindex with the full list of on-disk filenames (and even if it did not, the result would not make sense, since it would add rows with NaN values):

    # image.py, line 2151
    temp_df = pd.DataFrame({x_col: filenames}, dtype=str)
    temp_df = self.df.merge(temp_df, how='right', on=x_col)
    temp_df = temp_df.set_index(x_col)
    # following line does not make sense if filenames is a superset of self.df[x_col].values
    temp_df = temp_df.reindex(filenames)
    classes = temp_df[y_col].values

@all: Is there any reason why flow_from_dataframe was implemented that way? I like the flexibility of Kevin's implementation.
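The difference between the two behaviours can be sketched without Keras or pandas. Both functions below are hypothetical illustrations, not code from either implementation. The first lets the table alone decide which files are used; the second starts from the files on disk, so any file without a table row ends up with a missing target (None here, NaN in the pandas right merge).

```python
def table_driven(rows):
    """Use exactly the files listed in the table; extra files on disk are ignored."""
    return [(row['path'], row['y']) for row in rows]


def disk_driven(rows, files_on_disk):
    """Use every file on disk; files missing from the table get no target."""
    lookup = {row['path']: row['y'] for row in rows}
    return [(name, lookup.get(name)) for name in sorted(files_on_disk)]


rows = [{'path': 'a.png', 'y': 1.0}, {'path': 'b.png', 'y': 2.0}]
files_on_disk = ['a.png', 'b.png', 'c.png']  # c.png has no row in the table

print(table_driven(rows))
print(disk_driven(rows, files_on_disk))
```

In the second result, c.png is paired with None, which is the kind of row that cannot be used as a regression target.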
