Scikit-learn: API for returning datasets as DataFrames

Created on 1 Mar 2018  Â·  78Comments  Â·  Source: scikit-learn/scikit-learn

Users should be able to get datasets returned as [Sparse]DataFrames with named columns in this day and age, for those datasets otherwise providing a Bunch with feature_names. This would be controlled with an as_frame parameter (though return_X_y='frame' would mean the common usage is more succinct).

Easy Enhancement Sprint good first issue

Most helpful comment

Just an FYI for everybody: the decision was to include the target into the dataframe.

All 78 comments

I would like to work upon this. Seems like a good first issue. If I understand it correctly, the task is to add an argument to the load function for returning a DataFrame. So in case of iris dataset, this can be dataset.load_iris(as_frame=True). Am I right?

yes, i think it would be good, but there may not be consensus on making
this change. So if you make a pull request, it is with some risk that other
core devs will object to it

Okay, but I might as well create a PR and see how the conversation turns out.

you are welcome to

@layog are you still working on this? if not, I would like to give it a try.

@nsorros No, I have not started working on this one. You can take it.

@jnothman It would be helpful to get some thoughts in the PR which simply implements what you suggested for one dataset.

@jnotham is there agreement on the desired interface? as_frame=True/False ? (or return_frame to be consistent with return_X_y?)

On the PR it was suggested to have a return_X_y='pandas' instead.
For openml, I was thinking that a return='array' could also make sense to ensure a numerical array is returned (drop string features, https://github.com/scikit-learn/scikit-learn/issues/11819). A return='array'|'pandas'|... might be more flexible future-wise to add new things without the need to add new keywords, but the question is of course this would be needed.

I believe it should be possible to get a dataframe as the bunch's data
attribute, rather than this being an alternative value of return_X_y. And
that's why as_frame or frame is better than return_frame.

On 18 August 2018 at 03:39, Joris Van den Bossche notifications@github.com
wrote:

@jnotham is there agreement on the desired interface? as_frame=True/False
? (or return_frame to be consistent with return_X_y?)

On the PR it was suggested to have a return_X_y='pandas' instead.
For openml, I was thinking that a return='array' could also make sense to
ensure a numerical array is returned (drop string features, #11819
https://github.com/scikit-learn/scikit-learn/issues/11819). A
return='array'|'pandas'|... might be more flexible future-wise to add new
things without the need to add new keywords, but the question is of course
this would be needed.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/scikit-learn/scikit-learn/issues/10733#issuecomment-413938083,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz6_q1x5Sx7p-r78D8dAt2gyfX2YNLks5uRv_OgaJpZM4SXlNK
.

@jnothman Yes, that is how I understood it, and also what I meant above with return_frame (to return a frame instead of an array; both in the bunch or as direct value depending on return_X_yis True or False)
(I meant return_frame only as a name suggestion compared to as_frame, not a different behaviour)

Hey y'all,

Seems like this has gone cold. I'd be keen to take care of the last bit.

Currently, this change requires modifying each load_*() function. What are y'alls thoughts on creating an abstract class DataSetLoader that is subclassed for each type of dataset like:

  • SimpleDataSetLoader (only a single data file is loaded from)
  • TargetDataSetLoader (sources from a target data file)
  • ImageDataSetLoader (image data)
  • and so on

Then those classes could hide dataset specific implementation in classes like:

  • Iris
  • Digits
  • BreastCancer
  • and so on

The abstract class DataSetLoader would have a method load(return_X_y=False, as_frame=False) -> Bunch, which would call overrideable getter methods (I'm not married to 'getter' method patterns, but it might do) to modify the behavior of DataSetLoader.load like:

  • get_data_file_basename() -> str
  • get_feature_names() -> List[str]
  • etc

Just spitballing here but while the SimpleDataSetLoader, TargetDataSetLoader, ImageDataSetLoader etc. classes could include some implementation, thye could also include overrideable methods of like:

  • preprocess_data()
  • preprocess_target()
  • preprocess_images()
  • get_target_file_basename() -> str
  • get_image_file_basename() -> str

to leave open for classes like Iris or Digits to take care of any dataset-specific implementation.

My thought is that by adopting an OO approach it will be easier to add datasets in the future, or add features to dataset loading that make it compatible with other Python packages (like this issue originally was trying to do with Pandas).

I'd really appreciate y'alls thoughts and feedback, as this is my first attempt to contribute to SciKit-Learn (aka Noobie here!).

Thanks for reading :)

p.s I'd be keen to collaborate with anyone interested (especially if they're experienced in OO design) - it would be great to learn from y'all

And here's a preview of the suggested API - which could replace the load_wine and load_iris functions:

NOTE: this preview is outdated. A more recent version of the API preview can be found in the comments below

class DataSetLoader:
    from typing import Union, Tuple, Any

    descr_string = 'descr'
    csv_ext = '.csv'
    rst_ext = '.rst'

    def _load(self, return_X_y=False) -> Union[Bunch, Tuple[Any, Any]]:
        raise NotImplementedError


class SimpleDataSetLoader(DataSetLoader):
    from typing import Union, Tuple, Any, List

    def _load(self, return_X_y=False) -> Union[Bunch, Tuple[Any, Any]]:
        module_path = dirname(__file__)
        base_filename = self.get_data_file_basename()
        data, target, target_names = load_data(module_path, base_filename + self.csv_ext)
        dataset_csv_filename = join(module_path, 'data', base_filename + self.csv_ext)

        with open(join(module_path, self.descr_string, base_filename + self.rst_ext)) as rst_file:
            fdescr = rst_file.read()

        if return_X_y:
            return data, target

        return Bunch(data=data, target=target,
                     target_names=target_names,
                     DESCR=fdescr,
                     feature_names=self.get_feature_names(),
                     filename=dataset_csv_filename)

    def load(self, return_X_y=False) -> Union[Bunch, Tuple[Any, Any]]::
        return self._load(return_X_y=return_X_y)

    def get_data_file_basename(self) -> str:
        raise NotImplementedError

    def get_feature_names(self) -> List[str]:
        raise NotImplementedError


class Wine(SimpleDataSetLoader):
    from typing import Union, Tuple, Any, List

    def get_data_file_basename(self) -> str:
        return 'wine_data'

    def get_feature_names(self) -> List[str]:
        return ['alcohol',
                'malic_acid',
                'ash',
                'alcalinity_of_ash',
                'magnesium',
                'total_phenols',
                'flavanoids',
                'nonflavanoid_phenols',
                'proanthocyanins',
                'color_intensity',
                'hue',
                'od280/od315_of_diluted_wines',
                'proline']

    def load(self, return_X_y=False) -> Union[Bunch, Tuple[Any, Any]]:
        """Load and return the wine dataset (classification).
        *** omitted some docstring for clarity ***
        Examples
        --------
        Let's say you are interested in the samples 10, 80, and 140, and want to
        know their class name.

        >>> from sklearn.datasets import Wine
        >>> data = Wine.load()
        >>> data.target[[10, 80, 140]]
        array([0, 1, 2])
        >>> list(data.target_names)
        ['class_0', 'class_1', 'class_2']
        """
        return self._load(return_X_y=return_X_y)


class Iris(SimpleDataSetLoader):
    from typing import Union, Tuple, Any, List

    def get_data_file_basename(self) -> str:
        return 'iris'

    def get_feature_names(self) -> List[str]:
        return ['sepal length (cm)', 'sepal width (cm)',
                'petal length (cm)', 'petal width (cm)']

    def load(self, return_X_y=False) -> Union[Bunch, Tuple[Any, Any]]:
        """Load and return the iris dataset (classification).
        *** omitted some docstring for clarity ***
        Examples
        --------
        Let's say you are interested in the samples 10, 25, and 50, and want to
        know their class name.

        >>> from sklearn.datasets import Iris
        >>> data = Iris.load()
        >>> data.target[[10, 25, 50]]
        array([0, 0, 1])
        >>> list(data.target_names)
        ['setosa', 'versicolor', 'virginica']
        """
        return self._load(return_X_y=return_X_y)


# Backward compatibility
load_wine = Wine().load
load_iris = Iris().load

It would be great that target_column was an argument of load for every dataset (as in fetch_openml). That allows for using the same dataset for different tasks, or to have all the data (including the target) in the same array. Also, are you proposing something similar for the fetchers?

@vnmabus I agree that being able to set the target_column would be a helpful feature - I've run into the same issue with wanting all of the data in the same array. But I'm hesitant to add it as I'm wary of 'feature creep'. Is there a reasoning for how a target_column param would ease API restructuring?

I really like the idea of bringing the fetch_openml and openml API into this restructuring. In the code below, the abstract _fetch_dataset method could be overridden in RemoteDataSetLoader to work with openml remote datasets. I've also added an entrypoint and psuedocode for the original as_frame param in AbstractDataSetLoader:

from abc import ABC, abstractmethod
from typing import Union, Tuple, Any, List


class AbstractDataSetLoader(ABC):

    descr_string = 'descr'
    csv_ext = '.csv'
    rst_ext = '.rst'

    def load(self, return_X_y=False, as_frame=False) -> Union[Bunch, Tuple[Any, Any]]:
        """Override this method to insert the docstring of the dataset"""

        fetched_dataset = self._fetch_dataset()

        if as_frame:
            # Pseudo Code: Convert 'data' and 'target' of fetched_dataset bunch to pandas DataFrames
            # fetched_dataset['data'] = pd.DataFrame(fetched_dataset['data'])
            # fetched_dataset['target'] = pd.DataFrame(fetched_dataset['target'])
            pass

        if return_X_y:
            return fetched_dataset['data'], fetched_dataset['target']

        return fetched_dataset

    @abstractmethod
    def _fetch_dataset(self) -> Bunch:
        pass


class RemoteDataSetLoader(AbstractDataSetLoader):
    pass


class LocalDataSetLoader(AbstractDataSetLoader):
    def _fetch_dataset(self) -> Union[Bunch, Tuple[Any, Any]]:
        module_path = dirname(__file__)
        base_filename = self._get_data_file_basename()
        data, target, target_names = load_data(module_path, base_filename + self.csv_ext)
        dataset_csv_filename = join(module_path, 'data', base_filename + self.csv_ext)

        with open(join(module_path, self.descr_string, base_filename + self.rst_ext)) as rst_file:
            fdescr = rst_file.read()

        return Bunch(data=data, target=target,
                     target_names=target_names,
                     DESCR=fdescr,
                     feature_names=self._get_feature_names(),
                     filename=dataset_csv_filename)

    @abstractmethod
    def _get_data_file_basename(self) -> str:
        pass

    @abstractmethod
    def _get_feature_names(self) -> List[str]:
        pass


class Wine(LocalDataSetLoader):
    def _get_data_file_basename(self):
        return 'wine_data'

    def _get_feature_names(self):
        return ['alcohol',
                'malic_acid',
                'ash',
                'alcalinity_of_ash',
                'magnesium',
                'total_phenols',
                'flavanoids',
                'nonflavanoid_phenols',
                'proanthocyanins',
                'color_intensity',
                'hue',
                'od280/od315_of_diluted_wines',
                'proline']

    def load(self, return_X_y=False, as_frame=False) -> Union[Bunch, Tuple[Any, Any]]:
        """Load and return the wine dataset (classification).
        *** omitted some docstring for clarity ***
        Examples
        --------
        Let's say you are interested in the samples 10, 80, and 140, and want to
        know their class name.

        >>> from sklearn.datasets import Wine
        >>> data = Wine().load()
        >>> data.target[[10, 80, 140]]
        array([0, 1, 2])
        >>> list(data.target_names)
        ['class_0', 'class_1', 'class_2']
        """
        return super().load(return_X_y=return_X_y, as_frame=as_frame)


class Iris(LocalDataSetLoader):
    def _get_data_file_basename(self):
        return 'iris'

    def _get_feature_names(self):
        return ['sepal length (cm)', 'sepal width (cm)',
                'petal length (cm)', 'petal width (cm)']

    def load(self, return_X_y=False, as_frame=False) -> Union[Bunch, Tuple[Any, Any]]:
        """Load and return the iris dataset (classification).
        *** omitted some docstring for clarity ***
        >>> from sklearn.datasets import Iris
        >>> data = Iris().load()
        >>> data.target[[10, 25, 50]]
        array([0, 0, 1])
        >>> list(data.target_names)
        ['setosa', 'versicolor', 'virginica']
        """
        return super().load(return_X_y=return_X_y, as_frame=as_frame)


# Backward compatibility
load_wine = Wine().load
load_iris = Iris().load

I haven't yet had time to properly evaluate the benefits of an object
interface. I am not readily convinced that this particular feature
requirement necessitates the redesign: until we have to worry about
heterogeneous dtype, a DataFrame can be constructed from a Bunch, can't it?

@jnothman Yes I agree that the OO API isn't strictly necessary to implement returning dataframes. However, without an OO API, each function would need to be individually modified.

IMO Refactoring the API would make a change like this trivial, and I see it as a good opportunity to refactor the API rather than add weight and complexity to the current functions.

Refactoring the datasets API is getting away from the spirit of this issue, so I'll open up a new one with a reference here.

also see #13902

Is this solved by #13902 being merged? Do we want this for all dataset loaders?

I'm happy to see as_frame=True for other dataset loaders...

ok cool. these are actually nicely sprintable. should we tag "good first issue"?

I will take a look into it, NYC WiMLDS Sprint

@alankritaghosh which dataset loader do you want to work on?

I am checking sklearn/datasets/base.py and I see most dataset already has that feather (return_X_y=False) and the only dataset left is from function load_sample_images(). If I am tracing this issue correctly, I think that is the dataset I am going to look at.

@alankritaghosh hm I'm not sure if that actually makes sense for the images, let me check.

I saw 10 functions in total, and 8 of them already has feature return_X_y, the two remaining part is load_sample_images() and load_sample_image(). So I think we can close the issue.

It doesn't make sense for the images, but this is about adding an as_frame=True, which these functions don't have yet.

And I agree with you. It doesn't make sense for the image. I can add that paras so all load function will have similar structure, and show a warning message if anyone set return_X_y = True for load_sample_images() function, how does that sound?

I think that's fine as it is now. But we can still add the as_frame=True to other loaders.

@vmanisha and I'd like to work on this: adding as_frame=True to relevant functions.

@aditi9783 great, maybe pick one dataset loader to start with

Hi! I'd like to work on adding as_frame=True for the California housing dataset. It doesn't seem to have anyone working on it yet, so I'll just get started with it now at the sprint. :)

I'd like to work on this by adding as_frame=True to the load_diabetes loader.

I'll work on this by adding as_frame=True to the load_wine loader!

I will work on this by adding as_frame=True to the load_breast_cancer.

I'm working on this today by adding as_frame=True to load_linnerud(). Afterwards, I will look into the same for one of the real life datasets.

Hey guys I think we should write a general _bunch_to_dataframe() function that can be called in all of these loaders. Instead of writing separate code in each loader, are we on the same page?

Yes, I had thought that was the request originally.

I'd like to claim it if no one else that posted this morning before me doesn't (or doesn't respond soon).

I think we have made some progress already if you don't mind!

Okay

Referencing the example from the linked issue's implementation (#13902), which adds as_frame to fetch_openml -- it seems that some of these datasets have different formats that require custom handling. For example the openml dataset has an ARFF file format, while the one (_california_housing.py) I'm working on uses RST.

I agree that it'd be good to follow a standard and generalized naming convention and pattern, though I don't think we could simply call a shared _bunch_to_dataframe without shadowing it in individual loaders to handle (or at least check) these varying formats. Also, the existing merged openml implementation adds frame to the bunch, rather than converting the bunch to dataframe as the proposed name (_bunch_to_dataframe) implies. I propose an abstract method called _convert_data_dataframe, which is a generalization of openml.py's _convert_arff_data_dataframe method.

@gitsteph not sure I follow. I think for many of the cases (thought not for openml probably) we can use the existing bunch to create a dataframe. The bunch has numpy arrays and standardized ways to name features.

@amueller it seems cleaner to create the frame before the bunch, and add it to the bunch as I do in PR #15486 -- and more consistent with what happens in the existing merged example for openml.

Though I'm open to revising what I have there already, especially if you're already noticing that most cases can use the same method/are often in the same formats.

Erg, I'm noticing a couple things I want to change already... so going to temporarily close the PR for a few moments... whoops...

Just an FYI for everybody: the decision was to include the target into the dataframe.

It seems to me that each dataset loader massages the data into an appropriate format for a data bunch...a simple function to return a dataframe from this structure is trivial

def bunch_to_dataframe(bunch):
    data= np.c_[bunch.data, bunch.target]
    target_cols = ["target"]

    # Check if target is n-dimensional
    if len(bunch.target.shape) > 1:
        target_cols = ["target_" + col for col in bunch.target_names]

    columns = np.append(bunch.feature_names, target_cols)
    return pd.DataFrame(data, columns=columns)

So add the following code into the dataset loaders, not including image datasets:

bunch = Bunch(data=data,
                 target=target,
                 feature_names=feature_names,
                 DESCR=descr)
if as_frame:
    return bunch_to_dataframe(bunch)

This doesn't include metadata for feature_names or DESCR in the final data frame.

the target_names are names of the categories in classification, not different columns, right?

Yes, except for the linnerud datatset which is for multiple multivariate regression. There are 3 target columns which are named under target_names.

🤔 Hmm... I think I might be a bit confused.
Are we _only_ returning the dataframe, or a Bunch containing the frame if as_frame=True (as you see in the merged OpenML example)? I implemented it to be the latter, for consistency with that other merged example.

That's what I meant by saying it was confusing to me that we'd convert bunch_to_dataframe. The way I've implemented it in the above PR, when as_frame=True, I provide data and target as pandas objects (instead of np arrays) and also provide a frame that isn't None, all wrapped in a returned bunch. The bunch is still there, but with the data further altered/augmented if as_frame=True.

@wconnell yeah that sounds about right.

@gitsteph yes we should return a bunch containing a dataframe I think.

linnerud probably needs it's own thing, but most of the rest should be able to share a function.

15486 has a good helper

Waiting for #15486 to merge so I can apply this function to the builitin data loaders. Although I am curious if there is a good way to work with the new code from this PR locally so I can begin without this being officially merged? Would I have to clone @gitsteph branch and then set upstream? Or is this a bad idea and patience a better virtue in this case?

Here are some dataset fetchers that would still benefit from a treatment similar to #15980 and #15950.

  • fetch_covtype but we would need to hunt the original website to find the feature names (and data types);
  • fetch_kddcup99: this one is "interesting" as it can already return a recarray with a structured dtype with named int or float fields that should be mapped to similarly named and typed columns in a dataframe;
  • fetch_20newsgroups_vectorized that should use the tokenized words as column names;
  • fetch_rcv1 similar but I am not sure we can retrieve the words...

There are weirder dataset loaders though:

  • fetch_olivetti_faces, fetch_lfw_pairs and fetch_lfw_people load pixel values hence would not really benefit from being loaded as a dataframe.
  • fetch_species_distributions is weird and would require its own issue to specify what we would want to do with it.

@cmarmo sprint?

@cmarmo sprint?

Yes! Except that I would be happier with a more accessible list of the remaining tasks: @ogrisel do you mind to link your comment in the description? Thanks!

This may be slightly harder since we usually block the fetch_* from running on the CI in pull requests.

@lucyleeow and @cmarmo - do your comments about suggesting this issue for the sprint, or saying that you're planning to work on it?

@lucyleeow and @cmarmo - do your comments about suggesting this issue for the sprint, or saying that you're planning to work on it?

Suggesting for the sprint. :)

ok, then @tianchuliang and I would take this for the sprint

When testing be sure to set SKLEARN_SKIP_NETWORK_TESTS=0 and the fixtures in sklearn/datasets/tests/conftest.py.

take

@erodrago I believe @bsipocz and @tianchuliang is working on this.

@bsipocz, @erodrago there is a number of function left, do you mind to specify on which one you are working? Thanks!

@cmarmo @bsipocz and I will basically split 4 half and half; I will work on first 2, @bsipocz works on the second 2

@cmarmo can we each make a PR since our tasks is pretty well divided? Or do you suggest one combined PR for this issue?

@thomasjpfan sorry I picked it up since it was still on the to do pane.. okay will leave you guys to work on it

@tianchuliang @bsipocz I think that one PR per dataset fetcher will be easier to review. Thanks!

@erodrago - I've started to work on the 20newsgroup one, if you want to pick up fetch_rcv1 just let me know so I don't duplicate the effort

I've started workingon the first one, fetch_covtype

Hi, can we start working on fetch_rcv1? Just wanted to make sure there is no duplicate work being done before we begin.

@jaketae go for it!

Be sure to look at https://github.com/scikit-learn/scikit-learn/issues/10733#issuecomment-640086273 when working on this locally.

FYI: Started working on adding as_frame for fetch_kddcup99

@thomasjpfan Should the returned DataFrame have labeled columns? As noted in the comments above, I'm not entirely sure where to get the column labels (words) for the dataset.

cc: @amueller

Let's not work on fetch_rcv1 since the words are difficult to get.

Closing, all data fetchers suitable for this feature have been modified.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

amueller picture amueller  Â·  64Comments

jhennrich picture jhennrich  Â·  61Comments

MechCoder picture MechCoder  Â·  165Comments

jorisvandenbossche picture jorisvandenbossche  Â·  63Comments

mikeroberts3000 picture mikeroberts3000  Â·  102Comments