enough said.
What is the suggested replacement for the deprecated .ix? Is it .loc?
For me .ix works 5-10% faster than .loc:
>>> df.shape
(10000, 211)
>>> df.index
CategoricalIndex(['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A',
...
'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
categories=['A', 'B', 'C'], ordered=False, dtype='category', length=10000)
>>> df.loc[['C']].shape
(8000, 211)
>>> %timeit df.loc['C']
100 loops, best of 3: 5.61 ms per loop
>>> %timeit df.ix['C']
100 loops, best of 3: 5.37 ms per loop
BTW, passing a list into the indexer adds roughly another 40-80% overhead in the timings below:
>>> %timeit df.loc['C']
100 loops, best of 3: 5.61 ms per loop
>>> %timeit df.loc[['C']]
100 loops, best of 3: 9.97 ms per loop
>>> %timeit df.ix['C']
100 loops, best of 3: 5.37 ms per loop
>>> %timeit df.ix[['C']]
100 loops, best of 3: 7.57 ms per loop
Yes, .loc and .iloc are the expected replacements. Timings are expected to eventually be faster, though a sub-millisecond difference on a single access is pretty meaningless in any real use case.
@jreback Having terabytes of data and processing it with a help of Dask DataFrame which uses Pandas DataFrames as chunks turns "milliseconds" into minutes...
@frol It doesn't matter how much data you have. You are almost certainly using indexing operations inefficiently.
@frol the indexing code paths are going to be rewritten in C/C++ as part of the pandas 2.0 effort, so the microperformance should improve by a factor of 10 or more. Some refactoring or Cythonization may be able to give some quick perf wins in .loc or .iloc
Question on .ix deprecation-- suppose you want to set the first row of a DataFrame in a particular column with a value (assume that the index is not an Int64Index). Then you can currently use:
df.ix[0, 'colname'] = 5
In the future can you safely do:
df.iloc[0].loc['colname'] = 5
(this seems to beg for SettingWithCopyWarning)? Or is the only proper option going to be
df.loc[df.index[0], 'colname'] = 5
?
Our experience has been that mixing positional and label indexing is a significant source of problems for users. Here you might want to do df['colname'][0]
For unambiguously safe setting (the syntax may be nicer in 2.0), use
df.iloc[0, df.columns.get_loc('colname')] = 5
or
df.loc[df.index[0], 'colname'] = 5
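A minimal sketch of the two unambiguous forms above, on a made-up three-row frame (the data, the string index and the second column `other` are assumptions for illustration only):

```python
import pandas as pd

# Toy frame with a non-integer index, as in the question.
df = pd.DataFrame(
    {"colname": [1, 2, 3], "other": [10, 20, 30]},
    index=["x", "y", "z"],
)

# Purely positional: look up the column's integer position first.
df.iloc[0, df.columns.get_loc("colname")] = 5

# Purely label-based: look up the first row's label first.
df.loc[df.index[0], "other"] = 50

print(df)
```

One caveat on the label-based form: if the index contains duplicate labels, df.loc[df.index[0], ...] assigns to every row carrying that label, whereas the .iloc form always touches exactly one row.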
@jreback Thanks, makes sense.
@jreback I think you have a typo with square brackets used instead of parens?
df.iloc[0, df.columns.get_loc['colname']] = 5
should be
df.iloc[0, df.columns.get_loc('colname')] = 5
@johne13 yes that was a typo, thanks!
This looks like it will be really painful for me. Rather than removing ix entirely, could it be switched to a function with keyword only args?
df.ix(row_idx=[0,2], col_name=["foo", "bar"])
Then I can take a dangerous df.ix[[0,2], ["foo", "bar"]] and, in a fairly straightforward fashion, convert it into an unambiguous index without having to repeat my index name or use df.columns.get_loc?
@DavidEscott well, you are only delaying the inevitable, so you have some choices:
DeprecationWarning (note this will eventually turn into a FutureWarning and eventually then be removed, but that is a ways down the road)
No, converting .ix to a function is not possible; it's an indexer, e.g. ix[ ], which is syntactically different.
@DavidEscott you're more than welcome to monkey-patch in your own function that does what you want. Since .ix has been a significant source of bugs and user problems, we no longer wish to support it.
@wesm I understand that this is not an easy function to maintain, but still I find it unfortunate as it was a VERY expressive way to manipulate DataFrames... I hope someone will be able to make a code snippet to replace ix via monkey-patching?
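For what it's worth, here is a rough sketch of what such a user-side helper could look like. The function name `select_by_position_and_label` is hypothetical (it is not part of pandas); the keyword names are borrowed from the df.ix(row_idx=..., col_name=...) suggestion above. It only covers selection with rows given by position and columns by label, and it assumes unique column labels:

```python
import pandas as pd


def select_by_position_and_label(df, *, row_idx, col_name):
    """Hypothetical helper: rows by integer position, columns by label."""
    # Translate column labels to integer positions; -1 marks a missing label.
    col_pos = df.columns.get_indexer(col_name)
    if (col_pos == -1).any():
        missing = [c for c, p in zip(col_name, col_pos) if p == -1]
        raise KeyError(f"columns not found: {missing}")
    # Everything is positional from here on, so nothing is left to inference.
    return df.iloc[row_idx, col_pos]


df = pd.DataFrame(
    {"foo": [1, 2, 3], "bar": [4, 5, 6], "baz": [7, 8, 9]},
    index=["x", "y", "z"],
)
subset = select_by_position_and_label(df, row_idx=[0, 2], col_name=["foo", "bar"])
print(subset)
```

Because the caller has to say which axis is positional and which is label-based, the ambiguity that made .ix hard to maintain does not come up. It is not a drop-in replacement for .ix, and it does not support assignment.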
I just found a use case that makes ix quite valuable to me. I have a DataFrame df such that df['mask'] is a boolean mask that I'd like to filter df on. With ix, I can do df.ix[df.mask,:n] to get the first n columns, filtered by mask. Now the best way seems to be df.loc[df.mask,:].iloc[:,:n], which just reads terribly. Using df.columns.get_loc as an indexing workaround feels very kludgy, whereas the ix solution made for elegant code.
Of course I can assign a temporary df2 = df.loc[df.mask] and work from there, but that's inelegant as well.
@JonathanTay To support the boolean indexing case with first-n-columns, in addition to
df.loc[df.mask, :].iloc[:, :n]
you can use the (perhaps prettier, although same length)
df.iloc[df.mask.values, :n]
or
df.loc[df.mask, df.columns[:n]]
Yes it's 7 more characters than
df.ix[df.mask, :n]
but generally not having to worry about subtle bugs from .ix inference is worth the typing.
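A small sketch checking that the three spellings above select the same thing, on made-up data. One assumption to flag: the mask is pulled out with df["mask"] rather than df.mask, because attribute access on a DataFrame resolves to the DataFrame.mask method, not to a column of that name.

```python
import pandas as pd

# Toy frame: three value columns plus a boolean column named "mask".
df = pd.DataFrame(
    {
        "a": [1, 2, 3, 4],
        "b": [5, 6, 7, 8],
        "c": [9, 10, 11, 12],
        "mask": [True, False, True, False],
    },
    index=["w", "x", "y", "z"],
)
n = 2
keep = df["mask"]  # boolean Series aligned with df's index

chained = df.loc[keep, :].iloc[:, :n]       # label filter, then positional slice
positional = df.iloc[keep.to_numpy(), :n]   # fully positional, mask as ndarray
by_label = df.loc[keep, df.columns[:n]]     # fully label-based

print(chained)
```

All three return rows "w" and "y" with columns "a" and "b"; only the chained version goes through an intermediate frame.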
Can .ix be replaced by a .loc chained with an .iloc, or by a simple .loc or .iloc?
If so, why not have a wrapper around this and keep backward compatibility, and a useful method?
@ManuelLevi The issue is, _each call_ can be replaced with .iloc, .loc, or a combination, but there's no good way for .ix to tell which to use.
E.g. suppose you have a DataFrame with the Index([0, 2, 4, 6, 8]) and call .ix[:4] on it. Did you want .ix to implicitly use .iloc (returning the first 4 elements) or .loc (returning the first 3 elements)?
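To make the two readings concrete, a short sketch with that index (the column name and values are made up):

```python
import pandas as pd

df = pd.DataFrame({"v": [10, 20, 30, 40, 50]}, index=pd.Index([0, 2, 4, 6, 8]))

by_position = df.iloc[:4]  # positions 0..3, end excluded
by_label = df.loc[:4]      # labels from the start up to and including 4

print(len(by_position), list(by_position.index))
print(len(by_label), list(by_label.index))
```

The same slice gives four rows one way and three the other, so whichever .ix guessed would be wrong for some callers.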
@Liam3851 I see what you mean.
I usually use .iloc and .loc combined, but the impact of this goes well beyond me. I believe it affects the whole pandas community.
A quick search for df.ix
on GitHub shows almost 4M results. Maybe half a million notebooks and almost 200k Python files will break after this. Many of these are open-source tutorials and libraries that people are counting on.
Could there be a simple way to change the function behaviour instead of removing it? Maybe assume integers to always be locations, and other types to always be a label?
This is such a great feature, would be a shame to get it lost...
Please consider some of the suggestions above as a way to ease maintenance
@ManuelLevi As I understand it, ix treats anything that could be a label as a label. This was a source of bugs. For example, if a Series s is indexed by integers [5,3,2,4], then should s.ix[0] return the 0th element or raise KeyError? What if s.index = ['a','b','c'] or [0,1,2,3]? @Liam3851 has a point that the bugs and unexpected behaviour just keep coming once you allow the ambiguity. For example, label-based slicing (loc) includes both endpoints, while position-based slicing (iloc) includes the start but not the end.
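A quick sketch of those cases with the explicit indexers (the values are made up; only the index shapes come from the comment above):

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d"], index=[5, 3, 2, 4])

first_by_position = s.iloc[0]  # position 0, whatever its label is
first_by_label = s.loc[5]      # label 5, which happens to sit at position 0

# Label 0 does not exist, so the label-based lookup fails loudly.
try:
    s.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

t = pd.Series([1, 2, 3], index=["a", "b", "c"])
label_slice = t.loc["a":"b"]   # both endpoints included
position_slice = t.iloc[0:1]   # end excluded

print(first_by_position, first_by_label, label_zero_found)
print(len(label_slice), len(position_slice))
```

With .loc and .iloc each of these has exactly one answer; with .ix the answer depended on the index dtype.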