The use case is netCDF files stored on S3 or other generic cloud storage:
import requests
import xarray as xr

fp = 'http://nasanex.s3.amazonaws.com/NEX-GDDP/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_MPI-ESM-LR_2029.nc'
data = requests.get(fp, stream=True)
ds = xr.open_dataset(data.content)  # raises TypeError: embedded NUL character
Ideal would be integration with the (hopefully) soon-to-be implemented dask.distributed features discussed in #798.
This _does_ work for netCDF3 files, if you provide a file-like object (e.g., the response bytes wrapped in io.BytesIO) and set engine='scipy'.
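As a self-contained illustration of that netCDF3 path (a sketch assuming scipy is installed; the in-memory bytes below stand in for bytes fetched over HTTP):

```python
import io

import numpy as np
import xarray as xr

# Serialize a tiny dataset to netCDF3 bytes (xarray uses the scipy backend
# when no path is given), standing in for a downloaded response body.
src = xr.Dataset({"t": ("x", np.arange(3.0))})
nc3_bytes = src.to_netcdf()

# netCDF3 bytes wrapped in a file-like object open fine with engine='scipy'.
ds = xr.open_dataset(io.BytesIO(nc3_bytes), engine="scipy")
print(ds["t"].values)
```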
Unfortunately, this is a netCDF4/HDF5 file:
>>> data.raw.read(8)
'\x89HDF\r\n\x1a\n'
And as yet, there is no support for reading from file-like objects in either h5py (https://github.com/h5py/h5py/issues/552) or python-netCDF4 (https://github.com/Unidata/netcdf4-python/issues/295). So we're currently stuck :(.
One possibility is to use the new HDF5 library pyfive with h5netcdf (https://github.com/shoyer/h5netcdf/issues/25). But pyfive doesn't have enough features yet to read netCDF files.
Got it. :( Thanks!
Is this issue resolvable now that unidata/netcdf4-python#652 has been merged?
Yes, we could support initializing a Dataset from a netCDF4 file image in a bytes object.
FWIW this would be really useful 👍 from me, specifically for the use case above of reading from s3
Just to clarify: I wrote above that we could support initializing a Dataset from a netCDF4 file image. But this wouldn't help yet for streaming access.
Initializing a Dataset from a netCDF4 file image should actually work with the latest versions of xarray and netCDF4-python:
import netCDF4
import xarray

# netcdf_bytes holds the entire file contents, e.g. from requests or boto3
nc4_ds = netCDF4.Dataset('arbitrary-name', memory=netcdf_bytes)
store = xarray.backends.NetCDF4DataStore(nc4_ds)
ds = xarray.open_dataset(store)
Thanks @shoyer. So you can download the entire object into memory and then create a file image and read that? While not a full fix, it's definitely an improvement over the download-to-disk-then-read workflow!
@delgadom Yes, that should work (I haven't tested it, but yes in principle it should all work now).
@delgadom - did you find a solution here?
A few more references: we're exploring ways to do this in the Pangeo project using FUSE (https://github.com/pangeo-data/pangeo/issues/52). There is an S3 equivalent of the gcsfs library used in that issue: https://github.com/dask/s3fs
yes! Thanks @jhamman and @shoyer. I hadn't tried it yet, but just did. worked great!
In [1]: import xarray as xr
...: import requests
...: import netCDF4
...:
...: %matplotlib inline
In [2]: res = requests.get(
...: 'http://nasanex.s3.amazonaws.com/NEX-GDDP/BCSD/rcp45/day/atmos/tasmin/' +
...: 'r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_CESM1-BGC_2073.nc')
In [3]: res.status_code
Out [3]: 200
In [4]: res.headers['content-type']
Out [4]: 'application/x-netcdf'
In [5]: nc4_ds = netCDF4.Dataset('tasmin_day_BCSD_rcp45_r1i1p1_CESM1-BGC_2073', memory=res.content)
In [6]: store = xr.backends.NetCDF4DataStore(nc4_ds)
In [7]: ds = xr.open_dataset(store)
In [8]: ds.tasmin.isel(time=0).plot()
/global/home/users/mdelgado/git/public/xarray/xarray/plot/utils.py:51: FutureWarning: 'pandas.tseries.converter.register' has been moved and renamed to 'pandas.plotting.register_matplotlib_converters'.
converter.register()
Out [8]: <matplotlib.collections.QuadMesh at 0x2aede3c922b0>

In [9]: ds
Out [9]:
<xarray.Dataset>
Dimensions: (lat: 720, lon: 1440, time: 365)
Coordinates:
* time (time) datetime64[ns] 2073-01-01T12:00:00 2073-01-02T12:00:00 ...
* lat (lat) float32 -89.875 -89.625 -89.375 -89.125 -88.875 -88.625 ...
* lon (lon) float32 0.125 0.375 0.625 0.875 1.125 1.375 1.625 1.875 ...
Data variables:
tasmin (time, lat, lon) float64 ...
Attributes:
parent_experiment: historical
parent_experiment_id: historical
parent_experiment_rip: r1i1p1
Conventions: CF-1.4
institution: NASA Earth Exchange, NASA Ames Research C...
institute_id: NASA-Ames
realm: atmos
modeling_realm: atmos
version: 1.0
downscalingModel: BCSD
experiment_id: rcp45
frequency: day
realization: 1
initialization_method: 1
physics_version: 1
tracking_id: 1865ff49-b20c-4268-852a-a9503efec72c
driving_data_tracking_ids: N/A
driving_model_ensemble_member: r1i1p1
driving_experiment_name: historical
driving_experiment: historical
model_id: BCSD
references: BCSD method: Thrasher et al., 2012, Hydro...
DOI: http://dx.doi.org/10.7292/W0MW2F2G
experiment: RCP4.5
title: CESM1-BGC global downscaled NEX CMIP5 Cli...
contact: Dr. Rama Nemani: [email protected], Dr...
disclaimer: This data is considered provisional and s...
resolution_id: 0.25 degree
project_id: NEXGDDP
table_id: Table day (12 November 2010)
source: BCSD 2014
creation_date: 2015-01-07T19:18:31Z
forcing: N/A
product: output
We could potentially add a from_memory() constructor to NetCDF4DataStore to
simplify this process.
@delgadom which version of netCDF4 are you using? I'm following your same steps but am still receiving an `[Errno 2] No such file or directory` error.
xarray==0.10.2
netCDF4==1.3.1
Just tried it again and didn't have any issues:
import os

import netCDF4
import requests
import xarray as xr

patt = (
    'http://nasanex.s3.amazonaws.com/NEX-GDDP/BCSD/{scen}/day/atmos/{var}/' +
    'r1i1p1/v1.0/{var}_day_BCSD_{scen}_r1i1p1_{model}_{year}.nc')

def open_url_dataset(url):
    fname = os.path.splitext(os.path.basename(url))[0]
    res = requests.get(url)
    # pass the raw response bytes straight to netCDF4 -- no temp file needed
    nc4_ds = netCDF4.Dataset(fname, memory=res.content)
    store = xr.backends.NetCDF4DataStore(nc4_ds)
    ds = xr.open_dataset(store)
    return ds

ds = open_url_dataset(url=patt.format(
    model='GFDL-ESM2G', scen='historical', var='tasmax', year=1988))
ds
@delgadom Ah, I see. I needed libnetcdf=4.5.0, I had been using an earlier version. Sounds like prior to 4.5.0 there were still some issues with the name of the file being passed into netCDF4.Dataset, as is mentioned here: https://github.com/Unidata/netcdf4-python/issues/295
Is this now implemented (and hence can this issue be closed?) It appears that this works well:
import io

import boto3
import xarray as xr

boto_s3 = boto3.client('s3')
s3_object = boto_s3.get_object(Bucket=bucket, Key=key)  # bucket/key name the S3 object
netcdf_bytes = s3_object['Body'].read()
netcdf_bytes_io = io.BytesIO(netcdf_bytes)
ds = xr.open_dataset(netcdf_bytes_io)
Is that the right approach to opening a NetCDF file on S3, using the latest xarray code?
FWIW, I've also tested @delgadom's technique, using netCDF4 and it also works well (and is useful in situations where we don't want to install h5netcdf). Thanks!