Users should be able to get datasets returned as [Sparse]DataFrames with named columns in this day and age, for those datasets otherwise providing a Bunch with `feature_names`. This would be controlled with an `as_frame` parameter (though `return_X_y='frame'` would mean the common usage is more succinct).
I would like to work upon this. Seems like a good first issue. If I understand it correctly, the task is to add an argument to the load function for returning a DataFrame. So in the case of the iris dataset, this can be `datasets.load_iris(as_frame=True)`. Am I right?
Yes, I think it would be good, but there may not be consensus on making this change. So if you make a pull request, there is some risk that other core devs will object to it.
Okay, but I might as well create a PR and see how the conversation turns out.
you are welcome to
@layog are you still working on this? if not, I would like to give it a try.
@nsorros No, I have not started working on this one. You can take it.
@jnothman It would be helpful to get some thoughts in the PR which simply implements what you suggested for one dataset.
@jnothman is there agreement on the desired interface? `as_frame=True/False`? (or `return_frame` to be consistent with `return_X_y`?)
On the PR it was suggested to have `return_X_y='pandas'` instead.
For openml, I was thinking that a `return='array'` could also make sense to ensure a numerical array is returned (drop string features, https://github.com/scikit-learn/scikit-learn/issues/11819). A `return='array'|'pandas'|...` might be more flexible future-wise to add new things without the need to add new keywords, but the question is of course whether this would be needed.
I believe it should be possible to get a dataframe as the bunch's `data` attribute, rather than this being an alternative value of `return_X_y`. And that's why `as_frame` or `frame` is better than `return_frame`.
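For context, the bunch-attribute behaviour proposed here would look something like the sketch below. It was hypothetical at the time of writing, but modern scikit-learn releases (0.23+) do accept exactly these arguments:

```python
# Sketch of the proposed usage: as_frame controls whether pandas objects are
# returned, independently of return_X_y.
import pandas as pd
from sklearn.datasets import load_iris

bunch = load_iris(as_frame=True)
print(type(bunch.data))  # bunch.data is a pandas DataFrame with named columns

X, y = load_iris(as_frame=True, return_X_y=True)
print(isinstance(X, pd.DataFrame), isinstance(y, pd.Series))  # True True
```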
@jnothman Yes, that is how I understood it, and also what I meant above with `return_frame` (to return a frame instead of an array; either in the bunch or as the direct return value, depending on whether `return_X_y` is True or False). (I meant `return_frame` only as a name suggestion compared to `as_frame`, not different behaviour.)
Hey y'all,
Seems like this has gone cold. I'd be keen to take care of the last bit.
Currently, this change requires modifying each `load_*()` function. What are y'all's thoughts on creating an abstract class `DataSetLoader` that is subclassed for each type of dataset, like:

- `SimpleDataSetLoader` (only a single data file is loaded from)
- `TargetDataSetLoader` (sources from a target data file)
- `ImageDataSetLoader` (image data)

Then those classes could hide dataset-specific implementation in classes like:
- `Iris`
- `Digits`
- `BreastCancer`
The abstract class `DataSetLoader` would have a method `load(return_X_y=False, as_frame=False) -> Bunch`, which would call overridable getter methods (I'm not married to 'getter' method patterns, but it might do) to modify the behavior of `DataSetLoader.load`, like:

- `get_data_file_basename() -> str`
- `get_feature_names() -> List[str]`
Just spitballing here, but while the `SimpleDataSetLoader`, `TargetDataSetLoader`, `ImageDataSetLoader`, etc. classes could include some implementation, they could also include overridable methods like:

- `preprocess_data()`
- `preprocess_target()`
- `preprocess_images()`
- `get_target_file_basename() -> str`
- `get_image_file_basename() -> str`
to leave room for classes like `Iris` or `Digits` to take care of any dataset-specific implementation.
My thought is that by adopting an OO approach it will be easier to add datasets in the future, or add features to dataset loading that make it compatible with other Python packages (like this issue originally was trying to do with Pandas).
I'd really appreciate y'all's thoughts and feedback, as this is my first attempt to contribute to scikit-learn (aka noobie here!).
Thanks for reading :)
P.S. I'd be keen to collaborate with anyone interested (especially if they're experienced in OO design) - it would be great to learn from y'all.
And here's a preview of the suggested API, which could replace the `load_wine` and `load_iris` functions:

NOTE: this preview is outdated. A more recent version of the API preview can be found in the comments below.
    from os.path import dirname, join
    from typing import Any, List, Tuple, Union

    from sklearn.utils import Bunch
    # load_data is the existing private helper in sklearn/datasets/base.py,
    # where this code would live.


    class DataSetLoader:
        descr_string = 'descr'
        csv_ext = '.csv'
        rst_ext = '.rst'

        def _load(self, return_X_y=False) -> Union[Bunch, Tuple[Any, Any]]:
            raise NotImplementedError


    class SimpleDataSetLoader(DataSetLoader):

        def _load(self, return_X_y=False) -> Union[Bunch, Tuple[Any, Any]]:
            module_path = dirname(__file__)
            base_filename = self.get_data_file_basename()
            data, target, target_names = load_data(module_path,
                                                   base_filename + self.csv_ext)
            dataset_csv_filename = join(module_path, 'data',
                                        base_filename + self.csv_ext)
            with open(join(module_path, self.descr_string,
                           base_filename + self.rst_ext)) as rst_file:
                fdescr = rst_file.read()
            if return_X_y:
                return data, target
            return Bunch(data=data, target=target,
                         target_names=target_names,
                         DESCR=fdescr,
                         feature_names=self.get_feature_names(),
                         filename=dataset_csv_filename)

        def load(self, return_X_y=False) -> Union[Bunch, Tuple[Any, Any]]:
            return self._load(return_X_y=return_X_y)

        def get_data_file_basename(self) -> str:
            raise NotImplementedError

        def get_feature_names(self) -> List[str]:
            raise NotImplementedError


    class Wine(SimpleDataSetLoader):

        def get_data_file_basename(self) -> str:
            return 'wine_data'

        def get_feature_names(self) -> List[str]:
            return ['alcohol',
                    'malic_acid',
                    'ash',
                    'alcalinity_of_ash',
                    'magnesium',
                    'total_phenols',
                    'flavanoids',
                    'nonflavanoid_phenols',
                    'proanthocyanins',
                    'color_intensity',
                    'hue',
                    'od280/od315_of_diluted_wines',
                    'proline']

        def load(self, return_X_y=False) -> Union[Bunch, Tuple[Any, Any]]:
            """Load and return the wine dataset (classification).

            *** omitted some docstring for clarity ***

            Examples
            --------
            Let's say you are interested in the samples 10, 80, and 140, and
            want to know their class name.

            >>> from sklearn.datasets import Wine
            >>> data = Wine().load()
            >>> data.target[[10, 80, 140]]
            array([0, 1, 2])
            >>> list(data.target_names)
            ['class_0', 'class_1', 'class_2']
            """
            return self._load(return_X_y=return_X_y)


    class Iris(SimpleDataSetLoader):

        def get_data_file_basename(self) -> str:
            return 'iris'

        def get_feature_names(self) -> List[str]:
            return ['sepal length (cm)', 'sepal width (cm)',
                    'petal length (cm)', 'petal width (cm)']

        def load(self, return_X_y=False) -> Union[Bunch, Tuple[Any, Any]]:
            """Load and return the iris dataset (classification).

            *** omitted some docstring for clarity ***

            Examples
            --------
            Let's say you are interested in the samples 10, 25, and 50, and
            want to know their class name.

            >>> from sklearn.datasets import Iris
            >>> data = Iris().load()
            >>> data.target[[10, 25, 50]]
            array([0, 0, 1])
            >>> list(data.target_names)
            ['setosa', 'versicolor', 'virginica']
            """
            return self._load(return_X_y=return_X_y)


    # Backward compatibility
    load_wine = Wine().load
    load_iris = Iris().load
It would be great if `target_column` was an argument of `load` for every dataset (as in `fetch_openml`). That allows using the same dataset for different tasks, or having all the data (including the target) in the same array. Also, are you proposing something similar for the fetchers?
@vnmabus I agree that being able to set the `target_column` would be a helpful feature - I've run into the same issue with wanting all of the data in the same array. But I'm hesitant to add it, as I'm wary of 'feature creep'. Is there a reasoning for how a `target_column` param would ease API restructuring?

I really like the idea of bringing `fetch_openml` and the openml API into this restructuring. In the code below, the abstract `_fetch_dataset` method could be overridden in `RemoteDataSetLoader` to work with openml remote datasets. I've also added an entry point and pseudocode for the original `as_frame` param in `AbstractDataSetLoader`:
    from abc import ABC, abstractmethod
    from os.path import dirname, join
    from typing import Any, List, Tuple, Union

    import pandas as pd

    from sklearn.utils import Bunch
    # load_data is the existing private helper in sklearn/datasets/base.py.


    class AbstractDataSetLoader(ABC):
        descr_string = 'descr'
        csv_ext = '.csv'
        rst_ext = '.rst'

        def load(self, return_X_y=False, as_frame=False) -> Union[Bunch, Tuple[Any, Any]]:
            """Override this method to insert the docstring of the dataset."""
            fetched_dataset = self._fetch_dataset()
            if as_frame:
                # Convert 'data' and 'target' of the fetched Bunch to pandas objects
                fetched_dataset['data'] = pd.DataFrame(
                    fetched_dataset['data'],
                    columns=fetched_dataset['feature_names'])
                fetched_dataset['target'] = pd.Series(
                    fetched_dataset['target'], name='target')
            if return_X_y:
                return fetched_dataset['data'], fetched_dataset['target']
            return fetched_dataset

        @abstractmethod
        def _fetch_dataset(self) -> Bunch:
            pass


    class RemoteDataSetLoader(AbstractDataSetLoader):
        pass


    class LocalDataSetLoader(AbstractDataSetLoader):

        def _fetch_dataset(self) -> Bunch:
            module_path = dirname(__file__)
            base_filename = self._get_data_file_basename()
            data, target, target_names = load_data(module_path,
                                                   base_filename + self.csv_ext)
            dataset_csv_filename = join(module_path, 'data',
                                        base_filename + self.csv_ext)
            with open(join(module_path, self.descr_string,
                           base_filename + self.rst_ext)) as rst_file:
                fdescr = rst_file.read()
            return Bunch(data=data, target=target,
                         target_names=target_names,
                         DESCR=fdescr,
                         feature_names=self._get_feature_names(),
                         filename=dataset_csv_filename)

        @abstractmethod
        def _get_data_file_basename(self) -> str:
            pass

        @abstractmethod
        def _get_feature_names(self) -> List[str]:
            pass


    class Wine(LocalDataSetLoader):

        def _get_data_file_basename(self):
            return 'wine_data'

        def _get_feature_names(self):
            return ['alcohol',
                    'malic_acid',
                    'ash',
                    'alcalinity_of_ash',
                    'magnesium',
                    'total_phenols',
                    'flavanoids',
                    'nonflavanoid_phenols',
                    'proanthocyanins',
                    'color_intensity',
                    'hue',
                    'od280/od315_of_diluted_wines',
                    'proline']

        def load(self, return_X_y=False, as_frame=False) -> Union[Bunch, Tuple[Any, Any]]:
            """Load and return the wine dataset (classification).

            *** omitted some docstring for clarity ***

            Examples
            --------
            Let's say you are interested in the samples 10, 80, and 140, and
            want to know their class name.

            >>> from sklearn.datasets import Wine
            >>> data = Wine().load()
            >>> data.target[[10, 80, 140]]
            array([0, 1, 2])
            >>> list(data.target_names)
            ['class_0', 'class_1', 'class_2']
            """
            return super().load(return_X_y=return_X_y, as_frame=as_frame)


    class Iris(LocalDataSetLoader):

        def _get_data_file_basename(self):
            return 'iris'

        def _get_feature_names(self):
            return ['sepal length (cm)', 'sepal width (cm)',
                    'petal length (cm)', 'petal width (cm)']

        def load(self, return_X_y=False, as_frame=False) -> Union[Bunch, Tuple[Any, Any]]:
            """Load and return the iris dataset (classification).

            *** omitted some docstring for clarity ***

            >>> from sklearn.datasets import Iris
            >>> data = Iris().load()
            >>> data.target[[10, 25, 50]]
            array([0, 0, 1])
            >>> list(data.target_names)
            ['setosa', 'versicolor', 'virginica']
            """
            return super().load(return_X_y=return_X_y, as_frame=as_frame)


    # Backward compatibility
    load_wine = Wine().load
    load_iris = Iris().load
I haven't yet had time to properly evaluate the benefits of an object interface. I am not readily convinced that this particular feature requirement necessitates the redesign: until we have to worry about heterogeneous dtype, a DataFrame can be constructed from a Bunch, can't it?
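A minimal sketch of that construction, assuming only a Bunch with `data`, `feature_names`, and `target` (homogeneous-dtype numpy arrays):

```python
import pandas as pd
from sklearn.datasets import load_iris

bunch = load_iris()  # plain Bunch: numpy arrays plus feature_names
df = pd.DataFrame(bunch.data, columns=bunch.feature_names)
df["target"] = bunch.target  # append the target as one more column
print(df.shape)  # (150, 5)
```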
@jnothman Yes, I agree that the OO API isn't strictly necessary to implement returning dataframes. However, without an OO API, each function would need to be individually modified.
IMO, refactoring the API would make a change like this trivial, and I see it as a good opportunity to refactor the API rather than add weight and complexity to the current functions.
Refactoring the datasets API is getting away from the spirit of this issue, so I'll open up a new one with a reference here.
also see #13902
Is this solved by #13902 being merged? Do we want this for all dataset loaders?
I'm happy to see as_frame=True for other dataset loaders...
ok cool. these are actually nicely sprintable. should we tag "good first issue"?
I will take a look into it at the NYC WiMLDS Sprint.
@alankritaghosh which dataset loader do you want to work on?
I am checking sklearn/datasets/base.py and I see most datasets already have that feature (`return_X_y=False`), and the only one left is the function `load_sample_images()`. If I am tracing this issue correctly, I think that is the dataset I am going to look at.
@alankritaghosh hm I'm not sure if that actually makes sense for the images, let me check.
I saw 10 functions in total, and 8 of them already have the `return_X_y` feature; the two remaining are `load_sample_images()` and `load_sample_image()`. So I think we can close the issue.
It doesn't make sense for the images, but this is about adding `as_frame=True`, which these functions don't have yet.
And I agree with you. It doesn't make sense for the images. I can add that parameter so all load functions will have a similar structure, and show a warning message if anyone sets `return_X_y=True` for the `load_sample_images()` function; how does that sound?
I think that's fine as it is now. But we can still add `as_frame=True` to the other loaders.
@vmanisha and I'd like to work on this: adding `as_frame=True` to relevant functions.
@aditi9783 great, maybe pick one dataset loader to start with
Hi! I'd like to work on adding `as_frame=True` for the California housing dataset. It doesn't seem to have anyone working on it yet, so I'll just get started with it now at the sprint. :)
I'd like to work on this by adding as_frame=True to the load_diabetes loader.
I'll work on this by adding `as_frame=True` to the `load_wine` loader!
I will work on this by adding `as_frame=True` to the `load_breast_cancer` loader.
I'm working on this today by adding as_frame=True to load_linnerud(). Afterwards, I will look into the same for one of the real life datasets.
Hey guys, I think we should write a general `_bunch_to_dataframe()` function that can be called in all of these loaders, instead of writing separate code in each loader. Are we on the same page?
Yes, I had thought that was the request originally.
I'd like to claim it if no one who posted this morning before me does (or if they don't respond soon).
I think we have made some progress already if you don't mind!
Okay
Referencing the example from the linked issue's implementation (#13902), which adds `as_frame` to `fetch_openml` -- it seems that some of these datasets have different formats that require custom handling. For example, the openml dataset has an ARFF file format, while the one (`_california_housing.py`) I'm working on uses RST.

I agree that it'd be good to follow a standard and generalized naming convention and pattern, though I don't think we could simply call a shared `_bunch_to_dataframe` without shadowing it in individual loaders to handle (or at least check) these varying formats. Also, the existing merged openml implementation adds `frame` to the bunch, rather than converting the bunch to a dataframe as the proposed name (`_bunch_to_dataframe`) implies. I propose an abstract method called `_convert_data_dataframe`, which is a generalization of `openml.py`'s `_convert_arff_data_dataframe` method.
@gitsteph not sure I follow. I think for many of the cases (though not for openml, probably) we can use the existing bunch to create a dataframe. The bunch has numpy arrays and standardized ways to name features.
@amueller it seems cleaner to create the frame before the bunch, and add it to the bunch as I do in PR #15486 -- and more consistent with what happens in the existing merged example for openml.
Though I'm open to revising what I have there already, especially if you're already noticing that most cases can use the same method/are often in the same formats.
Erg, I'm noticing a couple things I want to change already... so going to temporarily close the PR for a few moments... whoops...
Just an FYI for everybody: the decision was to include the target into the dataframe.
It seems to me that each dataset loader massages the data into an appropriate format for a Bunch... a simple function to return a dataframe from this structure is trivial:
    import numpy as np
    import pandas as pd

    def bunch_to_dataframe(bunch):
        data = np.c_[bunch.data, bunch.target]
        target_cols = ["target"]
        # Check if target is n-dimensional
        if bunch.target.ndim > 1:
            target_cols = ["target_" + col for col in bunch.target_names]
        columns = np.append(bunch.feature_names, target_cols)
        return pd.DataFrame(data, columns=columns)
So add the following code into the dataset loaders (not including the image datasets):

    bunch = Bunch(data=data,
                  target=target,
                  feature_names=feature_names,
                  DESCR=descr)
    if as_frame:
        return bunch_to_dataframe(bunch)
This doesn't include metadata for feature_names or DESCR in the final data frame.
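For illustration, here is the proposed helper exercised against a synthetic Bunch (the helper body is repeated so the snippet runs standalone; the function name is the one proposed above, not an existing scikit-learn API):

```python
import numpy as np
import pandas as pd
from sklearn.utils import Bunch

def bunch_to_dataframe(bunch):
    # Stack features and target side by side, then label the columns
    data = np.c_[bunch.data, bunch.target]
    target_cols = ["target"]
    if bunch.target.ndim > 1:  # multi-output targets get one column each
        target_cols = ["target_" + col for col in bunch.target_names]
    columns = np.append(bunch.feature_names, target_cols)
    return pd.DataFrame(data, columns=columns)

bunch = Bunch(
    data=np.array([[1.0, 2.0], [3.0, 4.0]]),
    target=np.array([0, 1]),
    feature_names=["a", "b"],
)
df = bunch_to_dataframe(bunch)
print(list(df.columns))  # ['a', 'b', 'target']
```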
the `target_names` are names of the categories in classification, not different columns, right?
Yes, except for the linnerud dataset, which is for multivariate regression. There are 3 target columns, which are named under `target_names`.
🤔 Hmm... I think I might be a bit confused.
Are we _only_ returning the dataframe, or a Bunch containing the frame if `as_frame=True` (as you see in the merged OpenML example)? I implemented it to be the latter, for consistency with that other merged example.
That's what I meant by saying it was confusing to me that we'd convert `bunch_to_dataframe`. The way I've implemented it in the above PR, when `as_frame=True`, I provide `data` and `target` as pandas objects (instead of np arrays) and also provide a `frame` that isn't None, all wrapped in a returned `bunch`. The bunch is still there, but with the data further altered/augmented if `as_frame=True`.
@wconnell yeah that sounds about right.
@gitsteph yes, we should return a bunch containing a dataframe, I think.
linnerud probably needs its own thing, but most of the rest should be able to share a function.
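A minimal sketch of what such a shared function could look like (the name and signature here are assumptions for illustration, not the merged scikit-learn implementation): it wraps `data`/`target` as pandas objects and stores the combined frame, target included, on the Bunch, as decided above.

```python
import numpy as np
import pandas as pd
from sklearn.utils import Bunch

def _as_frame_bunch(data, target, feature_names, target_names=None):
    # Wrap the raw arrays as labeled pandas objects
    X = pd.DataFrame(data, columns=feature_names)
    y = pd.Series(target, name="target")
    frame = pd.concat([X, y], axis=1)  # target included in the frame
    return Bunch(data=X, target=y, frame=frame,
                 feature_names=feature_names, target_names=target_names)

b = _as_frame_bunch(np.eye(2), np.array([0, 1]), ["f1", "f2"])
print(b.frame.shape)  # (2, 3)
```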
Waiting for #15486 to merge so I can apply this function to the built-in data loaders. Although I am curious if there is a good way to work with the new code from this PR locally so I can begin without it being officially merged? Would I have to clone @gitsteph's branch and then set upstream? Or is this a bad idea and patience a better virtue in this case?
Here are some dataset fetchers that would still benefit from a treatment similar to #15980 and #15950:

- `fetch_covtype`: but we would need to hunt the original website to find the feature names (and data types);
- `fetch_kddcup99`: this one is "interesting" as it can already return a recarray with a structured dtype with named int or float fields that should be mapped to similarly named and typed columns in a dataframe;
- `fetch_20newsgroups_vectorized`: should use the tokenized words as column names;
- `fetch_rcv1`: similar, but I am not sure we can retrieve the words...

There are weirder dataset loaders though:

- `fetch_olivetti_faces`, `fetch_lfw_pairs` and `fetch_lfw_people` load pixel values, hence would not really benefit from being loaded as a dataframe.
- `fetch_species_distributions` is weird and would require its own issue to specify what we would want to do with it.

@cmarmo sprint?
> @cmarmo sprint?
Yes! Except that I would be happier with a more accessible list of the remaining tasks: @ogrisel do you mind to link your comment in the description? Thanks!
This may be slightly harder since we usually block the `fetch_*` calls from running on the CI in pull requests.
@lucyleeow and @cmarmo - are your comments suggesting this issue for the sprint, or saying that you're planning to work on it?
> @lucyleeow and @cmarmo - are your comments suggesting this issue for the sprint, or saying that you're planning to work on it?
Suggesting for the sprint. :)
ok, then @tianchuliang and I would take this for the sprint
When testing, be sure to set `SKLEARN_SKIP_NETWORK_TESTS=0` and use the fixtures in `sklearn/datasets/tests/conftest.py`.
take
@erodrago I believe @bsipocz and @tianchuliang are working on this.
@bsipocz, @erodrago there are a number of functions left; do you mind specifying which ones you are working on? Thanks!
@cmarmo @bsipocz and I will basically split 4 half and half; I will work on first 2, @bsipocz works on the second 2
@cmarmo can we each make a PR, since our tasks are pretty well divided? Or do you suggest one combined PR for this issue?
@thomasjpfan sorry I picked it up since it was still on the to do pane.. okay will leave you guys to work on it
@tianchuliang @bsipocz I think that one PR per dataset fetcher will be easier to review. Thanks!
@erodrago - I've started to work on the 20newsgroups one; if you want to pick up `fetch_rcv1`, just let me know so I don't duplicate the effort.
I've started working on the first one, `fetch_covtype`.
Hi, can we start working on `fetch_rcv1`? Just wanted to make sure there is no duplicate work being done before we begin.
@jaketae go for it!
Be sure to look at https://github.com/scikit-learn/scikit-learn/issues/10733#issuecomment-640086273 when working on this locally.
FYI: Started working on adding as_frame for fetch_kddcup99
@thomasjpfan Should the returned DataFrame have labeled columns? As noted in the comments above, I'm not entirely sure where to get the column labels (words) for the dataset.
cc: @amueller
Let's not work on `fetch_rcv1`, since the words are difficult to get.
Closing, all data fetchers suitable for this feature have been modified.