Pandas: Setting values in empty multi-dimensional dataframe with multi-dimensional data?

Created on 9 Apr 2017 · 7 Comments · Source: pandas-dev/pandas

I am trying to find a clean, concise way of setting the values of a multi-dimensional, multi-indexed dataframe, using data of lower dimensionality. In this case, I am trying to use two-dimensional data to set values in a four-dimensional dataframe.

Unfortunately, the syntax I am using only works with 2D data if the keys I'm using are already in the dataframe's index/column. But empty dataframes do not have any keys (yet) in their indices/columns.

For single points (0D), this is not a problem. Pandas just adds the missing key(s) appropriately and sets the value. For anything else, the key must be already there, it seems, as the below code shows.

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

# Create an empty 2-level mux (multi-index) for the index.
# The first level is run number ('r'). The second is x-axis values ('x').
# Note: pandas 0.24 renamed the `labels` argument to `codes`.
mux = pd.MultiIndex(levels=[[]]*2,labels=[[]]*2,names=['r','x'])

# Create an empty 2-level mux for the column
# The first level is parameter value ('p'). The second is y-axis values ('y').
mux2 = pd.MultiIndex(levels=[[]]*2,labels=[[]]*2,names=['p','y'])

# Create the empty multi-indexed and multi-columned dataframe
df = pd.DataFrame(index=mux,columns=mux2)

# run number 0 (r=0), using parameter value 1.024 (p=1.024)...
# ... produces 2D data on an x-y grid.
data = np.array([[1,2,3],[4,5,6]])
ys = np.array([0,1,2])
xs = np.array([0,1])

# Now we want to set values in the 4D dataframe with our 2D data. Throws error.
df.loc[(0,list(xs)),(1.024,list(ys))] = data

Traceback (most recent call last):
    KeyError: 0

But single points work fine.

# Single points automatically result in new keys
df.loc[(0,xs[0]),(1.024,ys[0])] = 1
df.loc[(0,xs[0]),(1.024,ys[1])] = 2
df.loc[(0,xs[1]),(1.024,ys[0])] = 3
df.loc[(0,xs[1]),(1.024,ys[1])] = 4

# Keys are now found, and this now works.
df.loc[(0,list(xs[0:2])),(1.024,list(ys[0:2]))] = ((5,6),(7,8))

# But this does not work. '1' is not currently a key.
df.loc[(1,list(xs[0:2])),(1.024,list(ys[0:2]))] = ((1,2),(3,4))

Traceback (most recent call last):
    KeyError: 1L

Problem description

It seems the default behavior for setting single points (that is, auto-creation of keys) is different than the behavior of setting multiple points (no auto-creation of keys). This seems pretty arbitrary from my outsider perspective; not sure why the behavior shouldn't be identical.

If there is another way of accomplishing this, I would love to hear about it. But perhaps the point behavior should be extended to multiple dimensions.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 33.1.0.post20170122
Cython: 0.25.2
numpy: 1.10.4
scipy: 0.17.1
statsmodels: 0.8.0
xarray: 0.9.1
IPython: 5.2.2
sphinx: 1.5.2
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.45.0
pandas_datareader: 0.2.1

Indexing · MultiIndex · Usage Question

All 7 comments

What you are doing is about as inefficient as possible. You want to create everything in one go. Setting single keys is a convenience; the expansion is what makes it inefficient, because each new key copies the entire structure.

In [17]: df = pd.DataFrame([[1, 2], [3, 4]],
    ...:                   columns=pd.MultiIndex.from_product([[1.024], [0, 1]], names=list('py')),
    ...:                   index=pd.MultiIndex.from_product([[0], [0, 1]], names=list('rx')))

In [18]: df
Out[18]: 
p   1.024   
y       0  1
r x         
0 0     1  2
  1     3  4

Ok, now to address my question: You generated an xy dataset for r=0, p=1.024. Suppose you now have another xy dataset for r=1, p=1.024. How do you obtain a dataframe with both xy sets for r = [0,1] and p=1.024?

Continuing on to arbitrary r, arbitrary p, how do you obtain a dataframe with any arbitrary collection of xy data for each r and p? Especially when you do not know ahead of time what r and p values you will end up with? (And hence, cannot create it from scratch).

The efficiency of this is really not important. I don't care if it takes fifty milliseconds or fifty seconds. I just need to obtain the required multi-dimensional data frame.

Show what you mean.

# Pseudocode
higher_df = Higher_Dimensional_Dataframe() # 4-D Dataframe
for i in range(100):
    p = rand() # Some random float
    r = randint() # Some random integer
    # Dataframe of unique xy values for a particular p and r.
    df = DataFrame(data = [[ randint() , randint() ], [ randint() ,  randint() ]], 
        columns=pd.MultiIndex.from_product([[ p ], [0, 1]], names=list('py')),
        index=pd.MultiIndex.from_product([[ r ], [0, 1]], names=list('rx')))
    # Put each of these dataframes into a single higher-dimensional dataframe.
    higher_df.loc[(r,df.index),(p,df.columns)] = df
# Now I have a dataframe with xy-datasets for an arbitrary collection of p's and r's
do_stuff(higher_df)

There are lots of ways to do this.

Here is one. This is a better question for Stack Overflow, or you can read some tutorials (and the docs).

In [6]: df = pd.DataFrame({(1.024, 0): np.random.randn(10), (1.024, 1): np.random.randint(0, 10, size=10)},
   ...:                   index=pd.MultiIndex.from_product([range(5), [0, 1]], names=list('rx')))
   ...: df.columns.names = ['foo', 'bar']
   ...: df
   ...: 
Out[6]: 
foo     1.024   
bar         0  1
r x             
0 0  1.215597  1
  1  0.475140  3
1 0  1.610304  7
  1 -0.261228  5
2 0  0.476945  6
  1 -0.257677  3
3 0 -2.170884  0
  1  0.743454  3
4 0 -1.721198  4
  1  0.487578  4

Thanks for your help. But I don't think I'm successfully conveying what the problem is. I provided a suggestion, as per the rules.

In the end, I just had to iterate point-by-point to get what I want. Inefficient, but my time is worth more.

@joseortiz3 Using a simpler example (without multi-indexes). What you are trying to do is this:

In [14]: df = pd.DataFrame()

In [15]: df.loc[[1,2], [1,2,3]] = data
...
KeyError: '[1 2] not in index'

And indeed, this is not supported by pandas at the moment.
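
One possible workaround for the simple (non-multi-indexed) case is to add the missing labels with `reindex` first and only then assign the block. The sketch below is an illustration rather than an officially recommended pattern; the names `rows` and `cols` are made up for the example, and the frame is created with `dtype=float` so the new cells can hold NaN. Note that `reindex` builds a new frame on every call, so this shares the copying cost mentioned earlier in the thread.

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2, 3], [4, 5, 6]])
rows = [1, 2]      # hypothetical row labels for the block
cols = [1, 2, 3]   # hypothetical column labels for the block

df = pd.DataFrame(dtype=float)

# Add the missing labels first; cells that already exist are kept,
# newly created cells start out as NaN.
df = df.reindex(index=df.index.union(rows), columns=df.columns.union(cols))

# Every label now exists, so the 2-D block assignment goes through.
df.loc[rows, cols] = data
```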

BTW, if you don't know the keys in advance to create the dataframe, my suggestion would be to gather the data in something else (eg append a list, or a few lists for the data, index, columns), and only create the dataframe at the end.
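
A minimal sketch of that suggestion, assuming each run yields a 2-D array together with its `r`, `p`, `xs` and `ys`: flatten every block into a long Series indexed by `(r, x, p, y)`, collect the pieces in a list, and build the 4-D frame once at the end with `pd.concat` and `unstack`. The helper `block_to_long` and the sample values are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

def block_to_long(data, r, p, xs, ys):
    """Flatten one 2-D x-y block into a Series indexed by (r, x, p, y)."""
    index = pd.MultiIndex.from_product(
        [[r], xs, [p], ys], names=["r", "x", "p", "y"]
    )
    # Row-major ravel matches from_product, where y varies fastest.
    return pd.Series(np.asarray(data).ravel(), index=index)

xs, ys = [0, 1], [0, 1, 2]
pieces = []  # gather blocks here; r and p need not be known in advance
pieces.append(block_to_long([[1, 2, 3], [4, 5, 6]], r=0, p=1.024, xs=xs, ys=ys))
pieces.append(block_to_long([[7, 8, 9], [10, 11, 12]], r=1, p=1.024, xs=xs, ys=ys))
pieces.append(block_to_long([[13, 14, 15], [16, 17, 18]], r=0, p=2.5, xs=xs, ys=ys))

# Build the frame once: rows are (r, x), columns are (p, y).
long_form = pd.concat(pieces)
df = long_form.unstack(["p", "y"])
```

Combinations of `r` and `p` that were never supplied (here `r=1`, `p=2.5`) come out as NaN, and `unstack` raises an error if the same `(r, x, p, y)` entry is added twice.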
