xref #11173
or IMHO simply replace by use of pd.to_datetime, pd.to_timedelta, pd.to_numeric
Having an auto-guesser is ok, but when you try to forcefully coerce, things can easily go awry.
cc @bashtage
@jorisvandenbossche @shoyer @TomAugspurger @sinhrks
There is already _convert which could be promoted.
The advantage of a well-designed convert is that it works on DataFrames. All of the to_* functions are only for 1-d types.
@bashtage oh I agree.
The problem is with coerce: you basically have to not auto-coerce things partially, and so leave ambiguous things up to the user (via a 1-d use of the pd.to_* functions). But assuming we do that, then yes, you could make it work.
I was just thinking of the case where I imported data that should be numeric into a DataFrame, but it has some mixed characters, and I want just numbers or NaNs. This type of conversion is what I ultimately wanted when I started looking at convert_objects, when I was surprised that asking to coerce a column of all strings didn't coerce it to NaN.
but the problem is that a mixed boolean/NaN column is ambiguous (so maybe we just need to 'handle' that)
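As a concrete illustration of that ambiguity (a minimal sketch, not tied to any proposed API):

```python
import numpy as np
import pandas as pd

s = pd.Series([True, False, np.nan])
s.dtype
# dtype('O') -- a numpy bool array cannot hold NaN, so the column stays object.
# Coercing could plausibly mean float64 ([1.0, 0.0, NaN]) or a boolean column
# with the missing value dropped, so an auto-guesser cannot decide for you.
```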
Some comments/observations:
- I like the name convert_objects more than convert, because it more clearly says what it does: try to convert object-dtyped columns to a builtin dtype (convert is rather general).
- I don't think we need to deprecate convert_objects for a new convert. I think it should be technically possible to deprecate the old _keywords_ (and not the function) in favor of new keywords (actually the original approach in the reverted PR).
- I do think convert_objects is useful (as already stated above: you can do something like to_datetime/to_numeric/.. on dataframes). Using the to_.. functions on each series separately will always be the preferable solution for robust code, but as long as convert_objects is very clearly defined (now there are some strange inconsistencies), I think it is useful to have it. It would be very nice if this could just be implemented in terms of the to_.. methods, for example:
    def convert_objects(self, numeric=False, datetime=False, timedelta=False, coerce=False):
        errors = 'coerce' if coerce else 'ignore'
        for col in self.columns:
            if numeric:
                self[col] = pd.to_numeric(self[col], errors=errors)
            elif datetime:
                self[col] = pd.to_datetime(self[col], errors=errors)
            elif timedelta:
                self[col] = pd.to_timedelta(self[col], errors=errors)
The reason that convert_objects is useful now is precisely because it has an extra 'rule' that the to_.. methods don't have: _only convert the column if there is at least one value that can be converted_. For example:
```
In [2]: df = pd.DataFrame({'int_str':['1', '2'], 'real_str':['a', 'b']})
In [3]: df.convert_objects(convert_numeric=True)
Out[3]:
int_str real_str
0 1 a
1 2 b
In [4]: df.convert_objects(convert_numeric=True).dtypes
Out[4]:
int_str int64
real_str object
dtype: object
```
and does not give:
Out[3]:
int_str real_str
0 1 NaN
1 2 NaN
which would not be really useful (although maybe more predictable). The fact that it did not always coerce to NaNs was considered a bug, for which @bashtage did a PR (and for to_numeric, it is also logical that it returns NaNs). But this made convert_objects less useful (so it was reverted in the end).
So I think that in this case, we will have to deviate from the to_.. behaviour.
Maybe this could be an extra parameter to convert/convert_objects: whether or not to coerce non-convertible columns to NaN (meaning: columns for which not a single element is convertible and which would therefore become all-NaN). @bashtage then you could have the behaviour you want, but the method can still be used for dataframes where not all columns should be considered numeric.
ok so the question is: should we un-deprecate convert_objects then?
I actually think convert is a much better name and we certainly could add the options you describe to make it more useful
convert_objects just seems like a bad API feature since it has this path dependence where it:
- tries to convert to type a
- tries to convert to type b if a fails, but not if a succeeds
- tries to convert to type c if a and b fail, but not if either succeeds

A better design would only convert a single type, which removes any ambiguity if some data is ever convertible to more than one type. The to_* functions sort of get there, with the caveat that they operate column by column.
Long live convert_objects!
maybe what we need in the docs are some examples showing:
df.apply(pd.to_numeric)
and such, which effectively (and more safely) replaces .convert_objects
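Something along these lines, for instance (a sketch of the kind of doc example meant here; errors='ignore' leaves columns that cannot be parsed untouched):

```python
import pandas as pd

df = pd.DataFrame({'int_str': ['1', '2'], 'real_str': ['a', 'b']})

# apply to_numeric column by column; unparseable columns are returned unchanged
df.apply(lambda col: pd.to_numeric(col, errors='ignore')).dtypes
# int_str      int64
# real_str    object
```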
Hi all,
I currently use convert_objects in a lot of my code and I think this functionality is very useful when importing datasets whose column composition may differ from day to day. Is it really necessary to deprecate it, or is there a chance to keep it alive?
Many thanks,
Umberto
.convert_objects was inherently ambiguous and was deprecated multiple versions ago. See the docs here for how to explicitly do object conversion.
I agree with @jreback - convert_objects was full of magic and had difficult-to-guess behavior that was inconsistent across different conversion targets (e.g. numbers were not forced if not all values were numbers, even when told to coerce).
A well-designed guesser with clear, simple rules and no option to coerce could be useful, but it isn't hard to write your own with your favorite set of rules.
FYI, the convert-all (errors='coerce') and ignore (errors='ignore') options in .to_numeric are a problem for data files containing both columns of strings that you want to keep and columns of strings that are actually numbers expressed in scientific notation (e.g. 6.2e+15), which require 'coerce' to convert from strings to float64.
The (deprecated) convert.py file has a handy soft-convert function that checks whether a forced conversion produces all NaNs (such as a string column that you want to keep) and then declines to convert the whole column.
A fourth error option, such as 'soft-coerce', would catch scientific-notation numbers while not forcing all strings to NaNs.
At the moment, my workaround is:

    for col in df.columns:
        # hard-convert, but keep the result only if at least one value
        # actually parsed; otherwise leave the original column alone
        converted = pd.to_numeric(df[col], errors='coerce')
        df[col] = converted if not pd.isnull(converted).all() else df[col]
The great thing about convert_objects over the various to_* methods is that you don't need to know the datatypes in advance. As @usagliaschi said, you may have heterogeneous data coming in and want a single function to handle it. This is exactly my current situation.
Is there any replacement for a function that will match this functionality, in particular infer dates/datetimes?
xref https://github.com/pandas-dev/pandas/pull/15757#issuecomment-288090118
I think it would be worth exposing whatever the new soft convert api is in 0.20 (I haven't looked at it in detail), referencing it in the convert_objects deprecation message, and then deferring the removal of convert_objects to the next version, if possible.
I say this because I know there are people (for example, me) who have ignored the convert_objects deprecation message in a couple of cases, in particular when working with data where you don't necessarily know the columns. Real instance:
df = pd.read_html(source)[0] # poorly formatted table, everything inferred to object
# exact columns can vary
df.columns = df.loc[0, :]
df = df.drop(0).dropna()
df = df.convert_objects()
Looking at this again, I realize df.apply(lambda x: pd.to_numeric(x, errors='ignore')) would also work fine in this case, but that wasn't immediately obvious, and I'm not sure we've done enough handholding (for lack of a better term) to help people transition.
IF we decide to expose a 'soft convert objects', would we want this called .convert_objects()? Or a different name, maybe .convert()? (e.g. instead of removing the deprecation, we simply change it - which is probably more of a break in back-compat).
xref #15550
so I think a resolution to this could be:
- add .to_* to Series (#15550)
- add .to_* to DataFrame
- add a soft option

then easy enough to do:
df.to_numeric(errors='soft')
and if you really really want to actually convert things a la the original .convert_objects():
df.to_datetime(errors='soft').to_timedelta(errors='soft').to_numeric(errors='soft')
And I suppose we could offer a convenience method for this:
- df.to_converted()
- df.convert() (maybe too generic)
- df.convert_objects() (resurrect)
- df.to_just_figure_this_out()
I think the most useful soft-conversion function would have either the ability to order the to_* rules, e.g. numeric-date-time or time-date-numeric, since there are occasionally data that could be interpreted as multiple types - at least this was the case in convert_objects. Alternatively, one could select only a subset of the filters, such as only considering numeric and date (see the sketch below).
I agree that extending the to_* functions to correctly operate on DataFrames would be useful.
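For illustration, a rough sketch of what an ordered, user-selectable soft converter could look like. This is a purely hypothetical helper, not an existing or proposed pandas API; the name `soft_convert` and the "keep the first conversion that produces at least one real value" rule are assumptions for the example:

```python
import pandas as pd

def soft_convert(df, order=('numeric', 'datetime', 'timedelta')):
    funcs = {'numeric': pd.to_numeric,
             'datetime': pd.to_datetime,
             'timedelta': pd.to_timedelta}
    out = df.copy()
    # only object-dtyped columns are candidates for conversion
    for col in out.columns[out.dtypes == object]:
        for kind in order:
            converted = funcs[kind](out[col], errors='coerce')
            # accept the conversion only if at least one value survived
            if not converted.isnull().all():
                out[col] = converted
                break
    return out
```

Reordering `order` (or passing a subset of it) is what resolves the "data convertible to more than one type" ambiguity explicitly, instead of leaving it to a fixed internal rule.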
Thanks @jreback - I like adding to_... to the DataFrame api, although maybe it's worth splitting out use cases. Consider this ill-formed frame:
df = pd.DataFrame({'num_objects': [1, 2, 3], 'num_str': ['1', '2', '3']}, dtype=object)
df
Out[2]:
num_objects num_str
0 1 1
1 2 2
2 3 3
df.dtypes
Out[3]:
num_objects object
num_str object
dtype: object
The default behavior of convert_objects is to only reinterpret the python ints into a proper int dtype, not to cast the strings. This is the behavior that I'd really miss if convert_objects were killed, and I suspect others might too.
df.convert_objects().dtypes
Out[4]:
num_objects int64
num_str object
dtype: object
In [5]: df.apply(pd.to_numeric).dtypes
Out[5]:
num_objects int64
num_str int64
dtype: object
So is it worth adding a convert_pyobjects (...not in love with that name) just for this case?
infer_python_types or convert_python_types??
I think it's easy enough to add a soft option to errors to do exactly this.
Would pd.Series(['1', '2', '3']).to_numeric(errors='soft') cast?
soft would just return [3] (as would coerce). The difference is [4] (the Series with 'foo' in it). I think soft would return [5] and coerce would return [4]:
In [3]: pd.to_numeric(pd.Series(['1', '2', '3']), errors='coerce')
Out[3]:
0 1
1 2
2 3
dtype: int64
In [4]: pd.to_numeric(pd.Series(['1', '2', 'foo']), errors='coerce')
Out[4]:
0 1.0
1 2.0
2 NaN
dtype: float64
In [5]: pd.to_numeric(pd.Series(['1', '2', 'foo']), errors='ignore')
Out[5]:
0 1
1 2
2 foo
dtype: object
Thanks for the examples.
I still think "only losslessly convert python objects into proper dtypes" might be better as a separate operation from to_numeric? There wouldn't be any way to produce Out[4] from my example above?
I don't think that "losslessly convert python objects into proper dtypes" is generally well defined. There are certainly some objects that don't have a lossless native representation (e.g. str->float).
The ambiguity just described is precisely the challenge in writing a useful, correct and precise converter.
Should the set of conversion options and the rules that will be used be described prior to implementing them? I think they should be, or the code will default to being the reference set of rules (which was one of the problems with convert_objects).
To be clear, what I mean by losslessly converting is doing exactly what pd.Series([<python objects>]) would do - converting to a numpy dtype if possible, otherwise leaving as object.
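For example (just restating the constructor inference, not any new API):

```python
import pandas as pd

pd.Series([1, 2, 3]).dtype        # int64: plain python ints unbox losslessly
pd.Series(['1', '2', '3']).dtype  # object: strings are left alone, even numeric-looking ones
```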
I think the point of convert_objects and any successor is to go strictly beyond what the io tools will automatically do. IOW, some coercion of some objects some of the time is essential. The old convert_objects would, for example, coerce mixed strings and numbers to numbers and nulls. Tools like read_csv intentionally don't do this since it is fairly arbitrary.
The to_* functions are pretty precise and do what you tell them, even to non-objects. For example:
import pandas as pd
import datetime as dt
t = pd.Series([dt.datetime.now(), dt.datetime.now()])
pd.to_numeric(t)
Out[7]:
0 1490739351272159000
1 1490739351272159000
dtype: int64
I would assume that a successor to convert_objects would only convert object dtype columns and would not behave like this.
The reason that I don't like adding the .to_ functions as methods on DataFrame (or at least not as the solution in this discussion) is because IMO you typically do not want to apply them to all columns and/or not in the same way (and if you do want this, you can easily use the apply approach as you can now).
E.g. with DataFrame.to_datetime, I would expect that it does this for all columns, which means converting both numerical columns and string columns. I don't think this is typically what you want.
So for me one of the reasons to have a convert_objects method (regardless of the exact behavioral details) is that it would only try to convert actual object-dtyped columns.
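That restriction is easy to sketch on top of the existing functions. A hypothetical helper (not pandas API; the name and default are made up) that applies to_numeric only to object-dtyped columns:

```python
import pandas as pd

def to_numeric_object_cols(df, errors='ignore'):
    # touch only object-dtyped columns, leaving datetimes, floats, etc.
    # exactly as they are
    out = df.copy()
    obj_cols = out.select_dtypes(include=['object']).columns
    out[obj_cols] = out[obj_cols].apply(pd.to_numeric, errors=errors)
    return out
```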
ok if we resurrect this with an all-new signature. This is the current one:
In [1]: DataFrame.convert_objects?
Signature: DataFrame.convert_objects(self, convert_dates=True, convert_numeric=False, convert_timedeltas=True, copy=True)
Docstring:
Deprecated.
Attempt to infer better dtype for object columns
Parameters
----------
convert_dates : boolean, default True
If True, convert to date where possible. If 'coerce', force
conversion, with unconvertible values becoming NaT.
convert_numeric : boolean, default False
If True, attempt to coerce to numbers (including strings), with
unconvertible values becoming NaN.
convert_timedeltas : boolean, default True
If True, convert to timedelta where possible. If 'coerce', force
conversion, with unconvertible values becoming NaT.
copy : boolean, default True
If True, return a copy even if no copy is necessary (e.g. no
conversion was done). Note: This is meant for internal use, and
should not be confused with inplace.
IIRC @jorisvandenbossche suggested (with a mod):
DataFrame.convert_object(self, datetime=True, timedelta=True, numeric=False, copy=True)
Though if everything is changed, then maybe we should just rename it (note the .convert_object).
Sorry I'm just getting back to this. Here's a proposal of how I think this could work, open to suggestions on any piece.
- 0.20.1 - leave convert_objects but update the deprecation message with the new methods I'll go through
- 0.20.2 - remove convert_objects
First, for conversions that are simply unboxing of python objects, add a new method infer_objects with no options. This essentially re-applies our constructor inference to any object columns: if a column can be losslessly unboxed to a native type, do it, otherwise leave it unchanged. Useful in munging scenarios where the original inference fails. Example:
df = pd.DataFrame({'a': ['a', 1, 2, 3],
'b': ['b', 2.0, 3.0, 4.1],
'c': ['c', datetime.datetime(2016, 1, 1), datetime.datetime(2016, 1, 2),
datetime.datetime(2016, 1, 3)]})
df = df.iloc[1:]
In [194]: df
Out[194]:
a b c
1 1 2 2016-01-01 00:00:00
2 2 3 2016-01-02 00:00:00
3 3 4.1 2016-01-03 00:00:00
In [195]: df.dtypes
Out[195]:
a object
b object
c object
dtype: object
# exactly what convert_objects does in this scenario today!
In [196]: df.infer_objects().dtypes
Out[196]:
a int64
b float64
c datetime64[ns]
dtype: object
Second, for all other conversions, add to_numeric, to_datetime, and to_timedelta to the DataFrame API, with the following sig. They'd basically work as they do today, but with some convenient column-selection options. Not sure on the defaults here; starting with the most 'convenient':
"""
DataFrame.to_...(self, errors='ignore', object_only=True, include=None, exclude=None)
Parameters
------------
errors: {'ignore', 'coerce', 'raise'}
error mode passed to `pd.to_....`
object_only: boolean
if True, only apply inference to object typed columns
include / exclude: column selection
"""
Example frame, with what is needed today:
df1 = pd.DataFrame({
'date': pd.date_range('2014-01-01', periods=3),
'date_unconverted': ['2014-01', '2015-01', '2016-01'],
'number': [1, 2, 3],
'number_unconverted': ['1', '2', '3']})
In [198]: df1
Out[198]:
date date_unconverted number number_unconverted
0 2014-01-01 2014-01 1 1
1 2014-01-02 2015-01 2 2
2 2014-01-03 2016-01 3 3
In [199]: df1.dtypes
Out[199]:
date datetime64[ns]
date_unconverted object
number int64
number_unconverted object
dtype: object
In [202]: df1.convert_objects(convert_numeric=True, convert_dates='coerce').dtypes
C:\Users\chris.bartak\AppData\Local\Continuum\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
"""Entry point for launching an IPython kernel.
Out[202]:
date datetime64[ns]
date_unconverted datetime64[ns]
number int64
number_unconverted int64
dtype: object
With the new api:
In [202]: df1.to_numeric().to_datetime()
Out[202]:
date datetime64[ns]
date_unconverted datetime64[ns]
number int64
number_unconverted int64
dtype: object
And to be honest, I don't personally care much about the second API; my pushback over deprecating convert_objects was entirely based on the lack of something like infer_objects.
I would second infer_objects(), as long as the rules were crystal clear and the implementation matched the description. Another important use case is when one ends up with a transposed DF with all object columns, where something like df = df.T.infer_objects() would produce properly typed columns again.
I think functions like to_numeric, etc. shouldn't be methods on a dataframe and instead should just be stand-alone. I don't think they are used frequently enough to justify polluting the namespace.
Cool, yeah the more I think about it the less I think adding to_... to the DataFrame api is a good idea. In terms of infer_objects, the impl would basically be as follows - based on maybe_convert_objects, which has generally unsurprising (in my opinion) behavior:
In [251]: from pandas._libs.lib import maybe_convert_objects
In [252]: converter = lambda x: maybe_convert_objects(np.asarray(x, dtype='O'), convert_datetime=True, convert_timedelta=True)
In [253]: converter([1,2,3])
Out[253]: array([1, 2, 3], dtype=int64)
In [254]: converter([1,2,3])
Out[254]: array([1, 2, 3], dtype=int64)
In [255]: converter([1,2,'3'])
Out[255]: array([1, 2, '3'], dtype=object)
In [256]: converter([datetime.datetime(2015, 1, 1), datetime.datetime(2015, 1, 2)])
Out[256]: array(['2015-01-01T00:00:00.000000000', '2015-01-02T00:00:00.000000000'], dtype='datetime64[ns]')
In [257]: converter([datetime.datetime(2015, 1, 1), 'a'])
Out[257]: array([datetime.datetime(2015, 1, 1, 0, 0), 'a'], dtype=object)
In [258]: converter([datetime.datetime(2015, 1, 1), 1])
Out[258]: array([datetime.datetime(2015, 1, 1, 0, 0), 1], dtype=object)
In [259]: converter([datetime.timedelta(seconds=1), datetime.timedelta(seconds=1)])
Out[259]: array([1000000000, 1000000000], dtype='timedelta64[ns]')
In [260]: converter([datetime.timedelta(seconds=1), 1])
Out[260]: array([datetime.timedelta(0, 1), 1], dtype=object)
yes, maybe_convert_objects is a soft conversion - it will only convert if all of the values are strictly convertible
I could be on board with a very simple .infer_objects() in that case. It wouldn't accept any arguments, I think?
could add the new function and change the msg on the convert_objects deprecation to point to .infer_objects() and .to_* for 0.21, then remove in 1.0
@jreback: Judging from this conversation, it seems that removal of convert_objects will not be happening in 0.21. Would it be best to close #15757 and let a fresh PR take its place for the implementation of infer_objects (which, BTW, seems like a good idea)?
IIUC, to what extent is infer_objects just a port of convert_objects to being a method of DataFrame (or just NDFrame in general)?
convert_objects has its own logic and has options. infer_objects should use the default inference, as if on a DataFrame (but only on object columns).
Ah right, so do you mean then that infer_objects is convert_objects with the defaults passed in (more or less, maybe some tweaked specifically for DataFrame)?
infer_objects should have no options; it would simply do soft conversion (it would basically just call maybe_convert_objects with the default options).
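Roughly, something like this sketch, built on the private maybe_convert_objects helper exercised above (not the final implementation; the standalone function name is just for illustration):

```python
import numpy as np
import pandas as pd
from pandas._libs.lib import maybe_convert_objects  # private helper, subject to change

def infer_objects(df):
    # soft-convert only object-dtyped columns; everything else is left untouched
    out = df.copy()
    for col in out.columns[out.dtypes == object]:
        out[col] = maybe_convert_objects(
            np.asarray(out[col], dtype='O'),
            convert_datetime=True, convert_timedelta=True)
    return out
```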
Ah, okay, that makes sense. I was just trying to understand and collate the comments made in this discussion in my mind.
fyi, opened #16915 for infer_objects if anyone is interested - in particular if you have edge test cases in mind