The flow_from_directory(directory) method of the ImageDataGenerator is currently designed to be used with classification models. To use it with regression models, the following hack is necessary:
I suggest adding support for a mapping between image file names and target values. A non-breaking and highly flexible approach would be to add an optional callback function to the flow_from_directory(directory) signature, which takes a filename as its parameter and returns the target value.
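To make the proposal concrete, here is a minimal sketch of the callback idea in plain Python. It does not touch Keras; the names targets_from_filenames, target_fn and parse_target, and the file naming scheme, are hypothetical and only serve as illustration.

```python
import os


def targets_from_filenames(filenames, target_fn):
    """Return one regression target per file, computed by the callback."""
    return [float(target_fn(name)) for name in filenames]


def parse_target(filename):
    # Assumed naming scheme: "<id>_<target>.png", e.g. "cat_0.75.png".
    stem = os.path.splitext(os.path.basename(filename))[0]
    return stem.rsplit("_", 1)[1]


filenames = ["data/cat_0.75.png", "data/dog_1.5.png"]
targets = targets_from_filenames(filenames, parse_target)
```

A generator that accepted such a callback could call it once per discovered file and use the results in place of the class indices.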
@fera0013 I've implemented something similar for my own use. In this branch, I add a custom class mode which calls an argument custom_output_fn(class_name, filename, idx). I also added a file filter predicate, included_file_filter, which is provided the same arguments; see this branch.
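For readers without access to the branch, the following is only a guess at how the two hooks could fit together, written in plain Python. The helper collect_samples, the meaning of idx (taken here to be the position of the file in the scan), and the example folder names are assumptions, not the branch's actual code.

```python
def collect_samples(entries, custom_output_fn, included_file_filter=None):
    """Build (filename, target) pairs from (class_name, filename) entries."""
    samples = []
    for idx, (class_name, filename) in enumerate(entries):
        # Skip files the predicate rejects; both hooks get the same arguments.
        if included_file_filter is not None and not included_file_filter(
                class_name, filename, idx):
            continue
        samples.append((filename, custom_output_fn(class_name, filename, idx)))
    return samples


# Assumed layout: the subfolder name encodes the regression target.
entries = [("age_30", "p1.jpg"), ("age_45", "p2.jpg"), ("age_45", "notes.txt")]
samples = collect_samples(
    entries,
    custom_output_fn=lambda cls, fname, idx: float(cls.split("_")[1]),
    included_file_filter=lambda cls, fname, idx: fname.endswith(".jpg"),
)
```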
If there's interest in this functionality in Keras, I can make a PR which includes the documentation updates.
Since I created this issue, I am definitely interested in that kind of functionality.
@davidvetrano looks like a useful PR but will need some work
If you take point on the PR we can chip in on testing and cleanup.
Cheers
@bstriner I added some documentation, error checks, and simple tests, and submitted PR #5172. Thoughts?
Can we please re-open this?
@aditya17a Agreed, it is useful for a large set of problems
I am extremely interested in this. I have 100k+ images (X_train) and associated floating point properties (Y_train). I would like to perform regression on them, and I don't think I can do this easily without the flow_from_directory method.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.
Another request to look into this.
I am also interested in this and came up with my own solution, flow_from_dataframe, where you specify a path column (to load the image from) and a y column (to use as the target value). This also solves the problem of loading from paths where the folder structure is not as prescribed. A demo is on Kaggle Kernels.
import os

import numpy as np


def flow_from_dataframe(img_data_gen, in_df, path_col, y_col, **dflow_args):
    # Point Keras at the folder of the first image so it can build an iterator.
    base_dir = os.path.dirname(in_df[path_col].values[0])
    print('## Ignore next message from keras, values are replaced anyways')
    df_gen = img_data_gen.flow_from_directory(base_dir,
                                              class_mode='sparse',
                                              **dflow_args)
    # Overwrite the scanned files and classes with the dataframe contents.
    df_gen.filenames = in_df[path_col].values
    df_gen.classes = np.stack(in_df[y_col].values)
    df_gen.samples = in_df.shape[0]
    df_gen.n = in_df.shape[0]
    df_gen._set_index_array()
    df_gen.directory = ''  # since we have the full path
    print('Reinserting dataframe: {} images'.format(in_df.shape[0]))
    return df_gen
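The core idea (a path column plus a y column) can also be sketched without Keras. The snippet below is an illustration only, not the Keras iterator: iter_dataframe_batches and the column names are hypothetical, and loading the images from the yielded paths is left to the caller.

```python
import numpy as np
import pandas as pd


def iter_dataframe_batches(in_df, path_col, y_col, batch_size,
                           shuffle=True, seed=None):
    """Yield (paths, targets) batches for one pass over the dataframe."""
    order = np.arange(len(in_df))
    if shuffle:
        np.random.default_rng(seed).shuffle(order)
    paths = in_df[path_col].to_numpy()
    targets = in_df[y_col].to_numpy(dtype=float)
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        yield paths[batch], targets[batch]


df = pd.DataFrame({"path": ["/imgs/a.png", "/imgs/b.png", "/imgs/c.png"],
                   "y": [0.1, 0.2, 0.3]})
batches = list(iter_dataframe_batches(df, "path", "y",
                                      batch_size=2, shuffle=False))
```

Because the targets come straight from the dataframe, the folder layout on disk plays no role here.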
@kmader, what is the purpose of the _set_index_array() function here? This function is not present in Keras 2.0.5.
It is on the master branch as part of the Iterator class: https://github.com/keras-team/keras/blob/master/keras/preprocessing/image.py#L807
@kmader: thanks for your flow_from_dataframe function above, which I have used for a few trainings so far. I was happy to try switching to the official keras_preprocessing flow_from_dataframe. However, I saw that it behaves differently from yours.
Your flow_from_dataframe allows a folder to contain many images, only some of which are listed in the dataframe passed in. It takes the filenames from the dataframe paths and does not care whether more images are available on disk.
The official implementation is less flexible. It takes the filenames from all valid files on disk:
# image.py, line 2134
filenames = _list_valid_filenames_in_directory(
    directory,
    white_list_formats,
    self.split,
    class_indices=self.class_indices,
    follow_links=follow_links,
    df=True)
If the dataframe passed in contains fewer images, it throws an exception when trying to reindex with the full on-disk filenames (and if it did not throw, the result would not make sense, since it would add rows with NaN values):
# image.py, line 2151
temp_df = pd.DataFrame({x_col: filenames}, dtype=str)
temp_df = self.df.merge(temp_df, how='right', on=x_col)
temp_df = temp_df.set_index(x_col)
# following line does not make sense if filenames is a superset of self.df[x_col].values
temp_df = temp_df.reindex(filenames)
classes = temp_df[y_col].values
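The NaN rows can be seen on a toy example. This is a sketch with made-up data and hypothetical helper names; it mirrors the merge/reindex steps quoted above but does not try to reproduce the exception, and the second function is just one possible way to get the dataframe-driven behavior.

```python
import pandas as pd


def targets_from_disk_listing(df, filenames, x_col, y_col):
    """Mirror the quoted merge/reindex: one row per on-disk file."""
    temp_df = pd.DataFrame({x_col: filenames})
    temp_df = df.merge(temp_df, how="right", on=x_col)
    temp_df = temp_df.set_index(x_col).reindex(filenames)
    return temp_df[y_col]


def targets_from_dataframe(df, filenames, x_col, y_col):
    """Alternative: keep only dataframe rows whose file exists on disk."""
    on_disk = set(filenames)
    kept = df[df[x_col].isin(on_disk)]
    return kept.set_index(x_col)[y_col]


df = pd.DataFrame({"filename": ["a.png", "b.png"], "target": [0.1, 0.2]})
on_disk_files = ["a.png", "b.png", "c.png"]  # c.png is not in the dataframe

from_disk = targets_from_disk_listing(df, on_disk_files, "filename", "target")
from_df = targets_from_dataframe(df, on_disk_files, "filename", "target")
```

Here from_disk has a NaN target for c.png, while from_df contains only the two files the dataframe actually lists.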
@all: any reason why flow_from_dataframe was implemented that way? I like the flexibility of Kevin's implementation.