ds1 = df.set_index(['lat','lon']).stack()
ds1.index.names = ['lat', 'lon', 'time']
ds1 = ds1.sort_index()
ds1.columns = ['T']
xr.Dataset(ds1)
I tried to transform a dataset with 2D latitude and longitude into an xarray dataset, but I failed because a RAM error occurred during the process.
I also tried to set lat and lon as coordinates directly, but that makes plotting and the subsequent geographic manipulation complex. The dataset covers a non-rectangular area, so lat and lon cannot be replaced by the corner values.
In short, I hope this data can be transformed into xarray and resampled onto a traditional rectangular grid, which is much easier to work with.
Any code and suggestions are sincerely welcome.
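(For context, one generic way to get from a scattered point cloud to a rectangular grid is to snap each point to the nearest cell of a regular grid and average duplicates. This is only a sketch: the column names `lat`, `lon`, `T`, the sample values, and the 0.01-degree target resolution are all illustrative assumptions, not the actual data.)

```python
import numpy as np
import pandas as pd

# Hypothetical frame of scattered points: one row per (lat, lon) sample.
df = pd.DataFrame({
    "lat": [37.501, 37.503, 37.512, 37.511],
    "lon": [96.461, 96.462, 96.471, 96.472],
    "T":   [1.0, 3.0, 5.0, 7.0],
})

res = 0.01  # assumed target grid resolution in degrees

# Snap each point to the nearest regular-grid cell and average any
# points that land in the same cell; this turns the non-rectangular
# point cloud into a rectangular grid with NaN in the empty cells.
df["lat_bin"] = (df["lat"] / res).round() * res
df["lon_bin"] = (df["lon"] / res).round() * res
grid = (
    df.groupby(["lat_bin", "lon_bin"])["T"]
    .mean()
    .unstack("lon_bin")  # rows = lat, columns = lon
)
print(grid.shape)  # (2, 2)
```

The resulting rectangular DataFrame can then be wrapped in an `xr.DataArray` with `lat_bin`/`lon_bin` as 1D coordinates.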
Please could you fill out the issue template, including a reproducible example? A CSV could be OK if you include the reproduction steps.
Thank you, updated.
Thanks, that helps. First of all (unless I did something wrong with the read_csv call), there's an Unnamed: 0 column that has to be removed.
Other than that, your data seems to be quite sparse, so it's an ideal fit for sparse:
In [38]: %%time
...: df = pd.read_csv("/tmp/data.csv")
...: a = df.drop("Unnamed: 0", axis=1).set_index(["lat", "lon"])
...: a = a.stack()
...: a.index.names = ["lat", "lon", "time"]
...: a = a.sort_index()
...: a.name = "T"
...: xr.DataArray.from_series(a, sparse=True)
...:
...:
CPU times: user 606 ms, sys: 63.9 ms, total: 670 ms
Wall time: 670 ms
Out[38]:
<xarray.DataArray 'T' (lat: 16100, lon: 29959, time: 31)>
<COO: shape=(16100, 29959, 31), dtype=float64, nnz=1003191, fill_value=nan>
Coordinates:
* lat (lat) float64 37.5 37.5 37.5 37.5 37.5 ... 43.1 43.1 43.1 43.1 43.1
* lon (lon) float64 96.46 96.46 96.46 96.47 ... 102.6 102.6 102.6 102.6
* time (time) object '2011-01-01 00:00:00' ... '2011-01-31 00:00:00'
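(For readers without the original CSV, the same pipeline can be reproduced on a tiny synthetic frame. The column names and values below are made up; with small dense data the `sparse=True` flag can simply be dropped, everything else is identical.)

```python
import numpy as np
import pandas as pd
import xarray as xr

# Tiny synthetic stand-in for the CSV: one row per (lat, lon) point,
# one column per day.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "lat": [37.5, 37.5, 38.0],
    "lon": [96.5, 97.0, 96.5],
    "2011-01-01": rng.normal(size=3),
    "2011-01-02": rng.normal(size=3),
})

# Same steps as above: move the day columns into a third index level,
# name the levels, and convert the resulting Series to a DataArray.
a = df.set_index(["lat", "lon"]).stack()
a.index.names = ["lat", "lon", "time"]
a = a.sort_index()
a.name = "T"
da = xr.DataArray.from_series(a)

print(da.dims, da.shape)  # ('lat', 'lon', 'time') (2, 2, 2)
```

The (38.0, 97.0) cell was never observed, so it comes out as NaN across all times; on the real data that missing fraction is what makes the sparse backend pay off.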
Thanks for your code! I noticed that only two decimal places are kept for lat and lon. Does that mean a resampling happened? The data is a grid with 0.005-degree resolution; can I keep that resolution in the results?
That's only the shortened repr; the values are not modified:
In [5]: da.lat
Out[5]:
<xarray.DataArray 'lat' (lat: 16100)>
array([37.49944, 37.5004 , 37.50135, ..., 43.1014 , 43.10143, 43.10144])
Coordinates:
* lat (lat) float64 37.5 37.5 37.5 37.5 37.5 ... 43.1 43.1 43.1 43.1 43.1
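(The display-only rounding can be demonstrated with plain NumPy; the values below are made up to match the repr above. Printing with a lower precision changes nothing about the stored float64 values.)

```python
import numpy as np

lat = np.array([37.49944, 37.50040, 43.10144])

# Lowering the print precision only affects the display,
# e.g. roughly [37.4994 37.5004 43.1014]:
with np.printoptions(precision=4):
    print(lat)

# The underlying float64 values are untouched by how they are printed:
print(lat[0])  # 37.49944
```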
Thanks for the help! I found that sparse grids are not easy to plot, so I changed my code to something like the Colab code, which is similar to the 'rasm' example in xarray. Maybe you could show how to create example datasets like these (beyond the toy weather data) in the tutorial; that would be helpful.