xref #11173
or IMHO simply replace by use of pd.to_datetime, pd.to_timedelta, pd.to_numeric
Having an auto-guesser is ok, but when you try to forcefully coerce, things can easily go awry.
cc @bashtage
@jorisvandenbossche @shoyer @TomAugspurger @sinhrks
There is already _convert which could be promoted.
The advantage of a well-designed convert is that it works on DataFrames. All of the to_* functions are only for 1-d types.
@bashtage oh I agree.
The problem is with coerce: you basically have to not auto-coerce things partially, and so leave ambiguous things up to the user (via a 1-d use of the pd.to_* functions). But assuming we do that, then yes, you could make it work.
I was just thinking of the case where I imported data that should be numeric into a DataFrame, but it has some mixed characters, and I want just numbers or NaNs. This type of conversion is what I ultimately wanted when I started looking at convert_objects, when I was surprised that asking to coerce a column of all strings didn't coerce it to NaN.
but the problem is that a mixed boolean/NaN column is ambiguous (so maybe we just need to 'handle' that)
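As a concrete illustration of that ambiguity (a minimal sketch, not tied to any proposed API):

```python
import numpy as np
import pandas as pd

s = pd.Series([True, False, np.nan])
s.dtype
# dtype('O') -- a numpy bool array cannot hold NaN, so the column stays object.
# Coercing could plausibly mean float64 ([1.0, 0.0, NaN]) or a boolean column
# with the missing value dropped, so an auto-guesser cannot decide for you.
```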
Some comments/observations:
- I like the name convert_objects more than convert, because it more clearly says what it does: try to convert object-dtyped columns to a builtin dtype (convert is rather general).
- I don't think we need to deprecate convert_objects for a new convert. I think it should be technically possible to deprecate the old _keywords_ (and not the function) in favor of new keywords (actually the original approach in the reverted PR).
- I do think convert_objects is useful (as already stated above: you can do something like to_datetime/to_numeric/.. on dataframes). Using the to_.. functions on each series separately will always be the preferable solution for robust code, but as long as convert_objects is very clearly defined (now there are some strange inconsistencies), I think it is useful to have it. It would be very nice if this could just be implemented in terms of the to_.. methods, for example:
    def convert_objects(self, numeric=False, datetime=False, timedelta=False, coerce=False):
        errors = 'coerce' if coerce else 'ignore'
        for col in self.columns:
            if numeric:
                self[col] = pd.to_numeric(self[col], errors=errors)
            elif datetime:
                self[col] = pd.to_datetime(self[col], errors=errors)
            elif timedelta:
                self[col] = pd.to_timedelta(self[col], errors=errors)
The reason that convert_objects is useful now is precisely because it has an extra 'rule' that the to_.. methods don't have: _only convert the column if there is at least one value that can be converted_. For example:
```
In [2]: df = pd.DataFrame({'int_str':['1', '2'], 'real_str':['a', 'b']})
In [3]: df.convert_objects(convert_numeric=True)
Out[3]:
int_str real_str
0 1 a
1 2 b
In [4]: df.convert_objects(convert_numeric=True).dtypes
Out[4]:
int_str int64
real_str object
dtype: object
```
and does not give:
Out[3]:
int_str real_str
0 1 NaN
1 2 NaN
which would not be really useful (although maybe more predictable). The fact that it did not always coerce to NaNs was considered a bug, for which @bashtage did a PR (and for to_numeric, it is also logical that it returns NaNs). But this made convert_objects less useful (so it was reverted in the end).
So I think that in this case, we will have to deviate from the to_.. behaviour.
Maybe this could be an extra parameter to convert/convert_objects: whether or not to coerce non-convertible columns to NaN (meaning: columns for which not a single element is convertible and which would therefore become all-NaN). @bashtage then you could have the behaviour you want, but the method can still be used for dataframes where not all columns should be considered numeric.
ok so the question is: should we un-deprecate convert_objects then?
I actually think convert is a much better name and we certainly could add the options you describe to make it more useful
convert_objects just seems like a bad API feature since it has this path dependence where it:
- tries to convert to type a
- tries to convert to type b if a fails, but not if a succeeds
- tries to convert to type c if a and b fail, but not if either succeeds

A better design would only convert a single type, which removes any ambiguity if some data is ever convertible to more than one type. The to_* functions sort of get there, with the caveat that they operate column by column.
Long live convert_objects!
maybe what we need in the docs are some examples showing:
df.apply(pd.to_numeric)
and such, which effectively (and more safely) replaces .convert_objects
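Something along these lines, for instance (a sketch of the kind of doc example meant here; errors='ignore' leaves columns that cannot be parsed untouched):

```python
import pandas as pd

df = pd.DataFrame({'int_str': ['1', '2'], 'real_str': ['a', 'b']})

# apply to_numeric column by column; unparseable columns are returned unchanged
df.apply(lambda col: pd.to_numeric(col, errors='ignore')).dtypes
# int_str      int64
# real_str    object
```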
Hi all,
I currently use convert_objects in a lot of my code and I think this functionality is very useful when importing datasets whose column composition may differ from day to day. Is it really necessary to deprecate it, or is there a chance to keep it alive?
Many thanks,
Umberto
.convert_objects was inherently ambiguous and was deprecated multiple versions ago. See the docs here for how to explicitly do object conversion.
I agree with @jreback - convert_objects was full of magic and had difficult-to-guess behavior that was inconsistent across different conversion targets (e.g. numbers were not forced if not all values were numbers, even when told to coerce).
A well-designed guesser with clear, simple rules and no option to coerce could be useful, but it isn't hard to write your own with your favorite set of rules.
FYI, the convert-all (errors='coerce') and ignore (errors='ignore') options in .to_numeric are a problem for data files containing both columns of strings that you want to keep and columns of strings that are actually numbers expressed in scientific notation (e.g. 6.2e+15), which require 'coerce' to convert from strings to float64.
The (deprecated) convert.py file has a handy soft-convert function that checks whether a forced conversion produces all NaNs (such as a string column that you want to keep) and then declines to convert the whole column.
A fourth error option, such as 'soft-coerce', would catch scientific-notation numbers while not forcing all strings to NaNs.
At the moment, my workaround is:

    for col in df.columns:
        # hard-convert, but keep the result only if at least one value
        # actually parsed; otherwise leave the original column alone
        converted = pd.to_numeric(df[col], errors='coerce')
        df[col] = converted if not pd.isnull(converted).all() else df[col]
The great thing about convert_objects over the various to_* methods is that you don't need to know the datatypes in advance. As @usagliaschi said, you may have heterogeneous data coming in and want a single function to handle it. This is exactly my current situation.
Is there any replacement for a function that will match this functionality, in particular infer dates/datetimes?
xref https://github.com/pandas-dev/pandas/pull/15757#issuecomment-288090118
I think it would be worth exposing whatever the new soft convert api is in 0.20 (I haven't looked at it in detail), referencing it in the convert_objects deprecation message, and then deferring the removal of convert_objects to the next version, if possible.
I say this because I know there are people (for example, me) who have ignored the convert_objects deprecation message in a couple of cases, in particular when working with data where you don't necessarily know the columns. Real instance:
df = pd.read_html(source)[0] # poorly formatted table, everything inferred to object
# exact columns can vary
df.columns = df.loc[0, :]
df = df.drop(0).dropna()
df = df.convert_objects()
Looking at this again, I realize df.apply(lambda x: pd.to_numeric(x, errors='ignore')) would also work fine in this case, but that wasn't immediately obvious, and I'm not sure we've done enough handholding (for lack of a better term) to help people transition.
IF we decide to expose a 'soft convert objects', would we want this called .convert_objects()? Or a different name, maybe .convert()? (e.g. instead of removing the deprecation, we simply change it - which is probably more of a break in back-compat).
xref #15550
so I think a resolution to this could be:
- add .to_* to Series (#15550)
- add .to_* to DataFrame
- add a soft option

then easy enough to do:
df.to_numeric(errors='soft')
and if you really really want to actually convert things a la the original .convert_objects():
df.to_datetime(errors='soft').to_timedelta(errors='soft').to_numeric(errors='soft')
And I suppose we could offer a convenience method for this:
- df.to_converted()
- df.convert() (maybe too generic)
- df.convert_objects() (resurrect)
- df.to_just_figure_this_out()
I think the most useful soft-conversion function would have either the ability to order the to_* rules, e.g. numeric-date-time or time-date-numeric, since there are occasionally data that could be interpreted as multiple types - at least this was the case in convert_objects. Alternatively, one could select only a subset of the filters, such as only considering numeric and date (see the sketch below).
I agree that extending the to_* functions to correctly operate on DataFrames would be useful.
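For illustration, a rough sketch of what an ordered, user-selectable soft converter could look like. This is a purely hypothetical helper, not an existing or proposed pandas API; the name `soft_convert` and the "keep the first conversion that produces at least one real value" rule are assumptions for the example:

```python
import pandas as pd

def soft_convert(df, order=('numeric', 'datetime', 'timedelta')):
    funcs = {'numeric': pd.to_numeric,
             'datetime': pd.to_datetime,
             'timedelta': pd.to_timedelta}
    out = df.copy()
    # only object-dtyped columns are candidates for conversion
    for col in out.columns[out.dtypes == object]:
        for kind in order:
            converted = funcs[kind](out[col], errors='coerce')
            # accept the conversion only if at least one value survived
            if not converted.isnull().all():
                out[col] = converted
                break
    return out
```

Reordering `order` (or passing a subset of it) is what resolves the "data convertible to more than one type" ambiguity explicitly, instead of leaving it to a fixed internal rule.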
Thanks @jreback - I like adding to_... to the DataFrame api, although maybe it's worth splitting out use cases. Consider this ill-formed frame:
df = pd.DataFrame({'num_objects': [1, 2, 3], 'num_str': ['1', '2', '3']}, dtype=object)
df
Out[2]:
num_objects num_str
0 1 1
1 2 2
2 3 3
df.dtypes
Out[3]:
num_objects object
num_str object
dtype: object
The default behavior of convert_objects is to only reinterpret the python ints into a proper int dtype, not to cast the strings. This is the behavior that I'd really miss if convert_objects were killed, and I suspect others might too.
df.convert_objects().dtypes
Out[4]:
num_objects int64
num_str object
dtype: object
In [5]: df.apply(pd.to_numeric).dtypes
Out[5]:
num_objects int64
num_str int64
dtype: object
So is it worth adding a convert_pyobjects (...not in love with that name) just for this case?
infer_python_types or convert_python_types??
I think it's easy enough to add a soft option to errors to do exactly this.
Would pd.Series(['1', '2', '3']).to_numeric(errors='soft') cast?
soft would just return [3] (as would coerce). The difference is [4] (the Series with 'foo' in it). I think soft would return [5] and coerce would return [4]:
In [3]: pd.to_numeric(pd.Series(['1', '2', '3']), errors='coerce')
Out[3]:
0 1
1 2
2 3
dtype: int64
In [4]: pd.to_numeric(pd.Series(['1', '2', 'foo']), errors='coerce')
Out[4]:
0 1.0
1 2.0
2 NaN
dtype: float64
In [5]: pd.to_numeric(pd.Series(['1', '2', 'foo']), errors='ignore')
Out[5]:
0 1
1 2
2 foo
dtype: object
Thanks for the examples.
I still think "only losslessly convert python objects into proper dtypes" might be better as a separate operation from to_numeric? There wouldn't be any way to produce Out[4] from my example above?
I don't think that "losslessly convert python objects into proper dtypes" is generally well defined. There are certainly some objects that don't have a lossless native representation (e.g. str->float).
The ambiguity just described is precisely the challenge in writing a useful, correct and precise converter.
Should the set of conversion options and the rules that will be used be described prior to implementing them? I think they should be, or the code will default to being the reference set of rules (which was one of the problems with convert_objects).
To be clear, what I mean by losslessly converting is doing exactly what pd.Series([<python objects>]) would do - converting to a numpy dtype if possible, otherwise leaving as object.
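For example (just restating the constructor inference, not any new API):

```python
import pandas as pd

pd.Series([1, 2, 3]).dtype        # int64: plain python ints unbox losslessly
pd.Series(['1', '2', '3']).dtype  # object: strings are left alone, even numeric-looking ones
```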
I think the point of convert_objects and any successor is to go strictly beyond what the io tools will automatically do. IOW, some coercion of some objects some of the time is essential. The old convert_objects would, for example, coerce mixed strings and numbers to numbers and nulls. Tools like read_csv intentionally don't do this since it is fairly arbitrary.
The to_* functions are pretty precise and do what you tell them, even to non-objects. For example:
import pandas as pd
import datetime as dt
t = pd.Series([dt.datetime.now(), dt.datetime.now()])
pd.to_numeric(t)
Out[7]:
0 1490739351272159000
1 1490739351272159000
dtype: int64
I would assume that a successor to convert_objects would only convert object dtype columns and would not behave like this.
The reason that I don't like adding the .to_ functions as methods on DataFrame (or at least not as the solution in this discussion) is because IMO you typically do not want to apply them to all columns and/or not in the same way (and if you do want this, you can easily use the apply approach as you can now).
E.g. with DataFrame.to_datetime, I would expect that it does this for all columns, which means converting both numerical columns and string columns. I don't think this is typically what you want.
So for me one of the reasons to have a convert_objects method (regardless of the exact behavioral details) is that it would only try to convert actual object-dtyped columns.
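That restriction is easy to sketch on top of the existing functions. A hypothetical helper (not pandas API; the name and default are made up) that applies to_numeric only to object-dtyped columns:

```python
import pandas as pd

def to_numeric_object_cols(df, errors='ignore'):
    # touch only object-dtyped columns, leaving datetimes, floats, etc.
    # exactly as they are
    out = df.copy()
    obj_cols = out.select_dtypes(include=['object']).columns
    out[obj_cols] = out[obj_cols].apply(pd.to_numeric, errors=errors)
    return out
```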
ok if we resurrect this with an all-new signature. This is the current one:
In [1]: DataFrame.convert_objects?
Signature: DataFrame.convert_objects(self, convert_dates=True, convert_numeric=False, convert_timedeltas=True, copy=True)
Docstring:
Deprecated.
Attempt to infer better dtype for object columns
Parameters
----------
convert_dates : boolean, default True
If True, convert to date where possible. If 'coerce', force
conversion, with unconvertible values becoming NaT.
convert_numeric : boolean, default False
If True, attempt to coerce to numbers (including strings), with
unconvertible values becoming NaN.
convert_timedeltas : boolean, default True
If True, convert to timedelta where possible. If 'coerce', force
conversion, with unconvertible values becoming NaT.
copy : boolean, default True
If True, return a copy even if no copy is necessary (e.g. no
conversion was done). Note: This is meant for internal use, and
should not be confused with inplace.
IIRC @jorisvandenbossche suggested (with a mod):
DataFrame.convert_object(self, datetime=True, timedelta=True, numeric=False, copy=True)
Though if everything is changed, then maybe we should just rename it (note the .convert_object).
Sorry I'm just getting back to this. Here's a proposal of how I think this could work, open to suggestions on any piece.
- 0.20.1 - leave convert_objects but update the deprecation message with the new methods I'll go through
- 0.20.2 - remove convert_objects
First, for conversions that are simply unboxing of python objects, add a new method infer_objects with no options. This essentially re-applies our constructor inference to any object columns: if a column can be losslessly unboxed to a native type, do it, otherwise leave it unchanged. Useful in munging scenarios where the original inference fails. Example:
df = pd.DataFrame({'a': ['a', 1, 2, 3],
'b': ['b', 2.0, 3.0, 4.1],
'c': ['c', datetime.datetime(2016, 1, 1), datetime.datetime(2016, 1, 2),
datetime.datetime(2016, 1, 3)]})
df = df.iloc[1:]
In [194]: df
Out[194]:
a b c
1 1 2 2016-01-01 00:00:00
2 2 3 2016-01-02 00:00:00
3 3 4.1 2016-01-03 00:00:00
In [195]: df.dtypes
Out[195]:
a object
b object
c object
dtype: object
# exactly what convert_objects does in this scenario today!
In [196]: df.infer_objects().dtypes
Out[196]:
a int64
b float64
c datetime64[ns]
dtype: object
Second, for all other conversions, add to_numeric, to_datetime, and to_timedelta to the DataFrame API, with the following sig. They'd basically work as they do today, but with some convenient column-selection options. Not sure on the defaults here; starting with the most 'convenient':
"""
DataFrame.to_...(self, errors='ignore', object_only=True, include=None, exclude=None)
Parameters
------------
errors: {'ignore', 'coerce', 'raise'}
error mode passed to `pd.to_....`
object_only: boolean
if True, only apply inference to object typed columns
include / exclude: column selection
"""
Example frame, with what is needed today:
df1 = pd.DataFrame({
'date': pd.date_range('2014-01-01', periods=3),
'date_unconverted': ['2014-01', '2015-01', '2016-01'],
'number': [1, 2, 3],
'number_unconverted': ['1', '2', '3']})
In [198]: df1
Out[198]:
date date_unconverted number number_unconverted
0 2014-01-01 2014-01 1 1
1 2014-01-02 2015-01 2 2
2 2014-01-03 2016-01 3 3
In [199]: df1.dtypes
Out[199]:
date datetime64[ns]
date_unconverted object
number int64
number_unconverted object
dtype: object
In [202]: df1.convert_objects(convert_numeric=True, convert_dates='coerce').dtypes
C:\Users\chris.bartak\AppData\Local\Continuum\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
"""Entry point for launching an IPython kernel.
Out[202]:
date datetime64[ns]
date_unconverted datetime64[ns]
number int64
number_unconverted int64
dtype: object
With the new api:
In [202]: df1.to_numeric().to_datetime()
Out[202]:
date datetime64[ns]
date_unconverted datetime64[ns]
number int64
number_unconverted int64
dtype: object
And to be honest, I don't personally care much about the second API; my pushback over deprecating convert_objects was entirely based on the lack of something like infer_objects.
I would second infer_objects(), as long as the rules were crystal clear and the implementation matched the description. Another important use case is when one ends up with a transposed DF with all object columns, where something like df = df.T.infer_objects() would produce properly typed columns again.
I think functions like to_numeric, etc. shouldn't be methods on a dataframe and instead should just be stand-alone. I don't think they are used frequently enough to justify polluting the namespace.
Cool, yeah the more I think about it the less I think adding to_... to the DataFrame api is a good idea. In terms of infer_objects, the impl would basically be as follows - based on maybe_convert_objects, which has generally unsurprising (in my opinion) behavior:
In [251]: from pandas._libs.lib import maybe_convert_objects
In [252]: converter = lambda x: maybe_convert_objects(np.asarray(x, dtype='O'), convert_datetime=True, convert_timedelta=True)
In [253]: converter([1,2,3])
Out[253]: array([1, 2, 3], dtype=int64)
In [254]: converter([1,2,3])
Out[254]: array([1, 2, 3], dtype=int64)
In [255]: converter([1,2,'3'])
Out[255]: array([1, 2, '3'], dtype=object)
In [256]: converter([datetime.datetime(2015, 1, 1), datetime.datetime(2015, 1, 2)])
Out[256]: array(['2015-01-01T00:00:00.000000000', '2015-01-02T00:00:00.000000000'], dtype='datetime64[ns]')
In [257]: converter([datetime.datetime(2015, 1, 1), 'a'])
Out[257]: array([datetime.datetime(2015, 1, 1, 0, 0), 'a'], dtype=object)
In [258]: converter([datetime.datetime(2015, 1, 1), 1])
Out[258]: array([datetime.datetime(2015, 1, 1, 0, 0), 1], dtype=object)
In [259]: converter([datetime.timedelta(seconds=1), datetime.timedelta(seconds=1)])
Out[259]: array([1000000000, 1000000000], dtype='timedelta64[ns]')
In [260]: converter([datetime.timedelta(seconds=1), 1])
Out[260]: array([datetime.timedelta(0, 1), 1], dtype=object)
yes, maybe_convert_objects is a soft conversion - it will only convert if all of the values are strictly convertible
I could be on board with a very simple .infer_objects() in that case. It wouldn't accept any arguments, I think?
could add the new function and change the msg on the convert_objects deprecation to point to .infer_objects() and .to_* for 0.21, then remove in 1.0
@jreback: Judging from this conversation, it seems that removal of convert_objects will not be happening in 0.21. Would it be best to close #15757 and let a fresh PR take its place for the implementation of infer_objects (which, BTW, seems like a good idea)?
IIUC, to what extent is infer_objects just a port of convert_objects to being a method of DataFrame (or just NDFrame in general)?
convert_objects has its own logic and has options. infer_objects should use the default inference, as if on a DataFrame (but only on object columns).
Ah right, so do you mean then that infer_objects is convert_objects with the defaults passed in (more or less, maybe some tweaked specifically for DataFrame)?
infer_objects should have no options; it would simply do soft conversion (it would basically just call maybe_convert_objects with the default options).
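Roughly, something like this sketch, built on the private maybe_convert_objects helper exercised above (not the final implementation; the standalone function name is just for illustration):

```python
import numpy as np
import pandas as pd
from pandas._libs.lib import maybe_convert_objects  # private helper, subject to change

def infer_objects(df):
    # soft-convert only object-dtyped columns; everything else is left untouched
    out = df.copy()
    for col in out.columns[out.dtypes == object]:
        out[col] = maybe_convert_objects(
            np.asarray(out[col], dtype='O'),
            convert_datetime=True, convert_timedelta=True)
    return out
```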
Ah, okay, that makes sense. I was just trying to understand and collate the comments made in this discussion in my mind.
fyi, opened #16915 for infer_objects if anyone is interested - in particular if you have edge test cases in mind