Xarray: Use xarray.open_dataset() for password-protected Opendap files

Created on 30 Oct 2016 · 33 Comments · Source: pydata/xarray

I've been using xarray.open_dataset() to read Opendap netcdf files from NASA's MERRA-2 data archive. Recently they changed their site so that now you must enter a username and password to read any files. They describe here how to access data with Pydap: http://disc.sci.gsfc.nasa.gov/registration/registration-for-data-access#python.

I experimented with a similar approach (adding username and password to the url) with xarray.open_dataset() and specifying engine='pydap', but no luck. Is there a way to use xarray.open_dataset() to read password-protected Opendap files? Thanks!

backends help wanted

All 33 comments

If you write engine='pydap' in open_dataset, the URL should be passed directly on to pydap, but you'll still need to follow all of their other instructions. If you're getting an error message from xarray, let us know but otherwise I'm at a loss -- you should check with the folks at NASA.

Thanks very much for your reply! I still get an error from xarray when I use the engine='pydap' option. Here's a minimum (almost) working example (almost because you need an account with the server so you can substitute your username/password into the url string):

import xarray
from pydap.client import open_url

url = 'http://<username>:<password>@goldsmr5.sci.gsfc.nasa.gov/opendap/MERRA2/M2I3NPASM.5.12.4/1986/01/MERRA2_100.inst3_3d_asm_Np.19860101.nc4'

ds1 = open_url(url)    # Works but data isn't in xarray format
ds2 = xarray.open_dataset(url, engine='pydap')    # Error message, see attached

I've attached the error message here --
error_msg.txt
I don't know enough about the inner workings of xarray to trace through it. Please let me know if any of this means anything to you and has a reasonably easy fix or workaround. Thank you!

If the dataset has a "time" dimension, try accessing the first few values. Can you view them in pydap? Xarray's open_dataset does a little more work than pydap's open_url, insofar as it actually downloads some array data.

Ah, I see. Thanks for the suggestion. Using Pydap I'm able to see all the variables and their metadata, so I thought it was working, but when I try to actually access the data values, I get the same error message as from Xarray. The issue must be something unrelated to Xarray -- I'll keep investigating. Thanks for your help!

@jenfly did you find a way to make opendap authentication work with xarray? It might be worthwhile posting it here, even though the issue has to do with the backends.

@j08lue no, not yet. I've been in touch with the folks at NASA who run the server, but their suggestions didn't work for me and I haven't had time to keep troubleshooting. I will need to sort out this issue in the next couple of months to get some data that I need, so if/when I ever resolve it, I'll post the solution here.

I've finally found something useful online and am able to use Pydap to open these files -- hoping someone can help me find a way to integrate this into an xarray.open_dataset() function call and then I will be a very happy camper!

Turns out much of the info posted by NASA online is out of date and based on a different implementation of Pydap than what is actually being used currently (argh). Here is something that actually works, from http://www.pydap.org/en/latest/client.html#urs-nasa-earthdata:

from pydap.client import open_url
from pydap.cas.urs import setup_session

url = 'https://goldsmr4.gesdisc.eosdis.nasa.gov/opendap/MERRA2/M2T1NXSLV.5.12.4/2016/06/MERRA2_400.tavg1_2d_slv_Nx.20160601.nc4'

session = setup_session(username, password)
dataset = open_url(url, session=session)

where I've assigned the username and password variables with the appropriate values in another function.

I've tested this and it is working, but I would prefer to do things within Xarray since all my code is already using it. Just for fun, I tried ds = xarray.open_dataset(url, engine='pydap', session=session), to see if the extra keyword would be magically sent to the pydap engine, but got an error message. Is there a way to incorporate this functionality into xarray.open_dataset? Thank you so much for any assistance!

Hi @jenfly, it's great to see that you have tracked down this root issue! I agree we should be able to support direct access to these sort of opendap resources within xarray. It should not be too tricky to implement, and in fact, if you are interested, it could be a great opportunity for you to open a pull request and become directly involved in the project. We would be very happy to gain another contributor.

You can see the line where pydap.open_url gets called here:
https://github.com/pydata/xarray/blob/master/xarray/backends/pydap_.py#L64

We just need a mechanism to pass the username and password from open_dataset to the pydap backend. There are two possible options I see:

  1. we could add new username and password keyword args to open_dataset. This is the most straightforward, but open_dataset already has a ton of arguments, so maybe it is not ideal.
  2. we could parse out the username and password from a url like https://username:password@... within the pydap backend.

It would be good to get some other opinions on which approach would be preferable.
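For illustration, option 2 could look roughly like the sketch below. It uses only the standard library; the helper name split_credentials and the example URL are hypothetical and not part of xarray or pydap.

```python
from urllib.parse import unquote, urlsplit, urlunsplit


def split_credentials(url):
    """Return (clean_url, username, password) for a URL that may embed credentials."""
    parts = urlsplit(url)
    # Rebuild the network location without the "user:password@" prefix.
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    clean_url = urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
    # Credentials in URLs are percent-encoded, so decode them if present.
    username = unquote(parts.username) if parts.username is not None else None
    password = unquote(parts.password) if parts.password is not None else None
    return clean_url, username, password


# Hypothetical URL, for illustration only.
clean, user, pw = split_credentials("https://alice:s3cret@example.org/opendap/data.nc4")
print(clean, user)
```

The cleaned URL and the extracted credentials could then be handed to whatever authentication mechanism the backend ends up using.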

Thanks, @rabernat! I'd be happy to try implementing this in the project. I'm a newbie when it comes to contributing to big projects like this (so far I've just used Github for my own little projects) so I might have some naive questions as I figure out how things work.

The two options you mentioned for passing username and password info to open_dataset both sound good to me. I don't have any strong preference between them. How do I get other opinions on which approach to use? Should I start a new issue thread?

Also, I realized that there is another hiccup along the way. When I try to specify engine='pydap' in open_dataset, I get the same error message as mentioned in #1174, that the object has no attribute iteritems. When I wrote the first post in this thread, back in October, I was able to use engine='pydap' without any problems. This seems to be related to recent upstream changes in Pydap: https://github.com/pydap/pydap/issues/43 and I presume might require more substantial changes either in Xarray or Pydap so that they can work together again. Any thoughts on how to handle this?

Parsing username/password from the URL would be very easy to add.

We need to figure out a solution for the proliferating arguments on open_dataset before we add many more, so I would prefer that for now.

Another option is to add session as an argument on xarray.backends.PydapDataStore, and encourage passing PydapDataStore objects into xarray.open_dataset for extra customizability, e.g.,

store = xarray.backends.PydapDataStore(url, session)
ds = xarray.open_dataset(store)

Pydap has a new v3.2 release, but it still needs some fixes to work with xarray -- or xarray needs to be updated to work with the new version of pydap. I think https://github.com/pydap/pydap/pull/48 once merged would probably be enough to restore xarray compatibility.

I like the idea of passing PydapDataStore objects that include the session object. It seems more likely to be forward compatible, especially if Central Authentication Services multiply (as one would expect) with different authentication mechanisms.

I also like the idea of passing PydapDataStore objects that include the session object. Delving deeper into the pydap authentication, I found that there are already several different setup_session functions available to create the session object, corresponding to different authentication procedures (pydap.cas.get_cookies.setup_session, pydap.cas.urs.setup_session, pydap.cas.esgf.setup_session) as well as additional arguments to setup_session beyond username and password. Best to deal with all this separately with pydap rather than trying to embed it within xarray.

I'm still having problems trying to get xarray.open_dataset to work with pydap. Using the latest commit on pydap/master (in which https://github.com/pydap/pydap/pull/48 is merged) I'm now getting a new error: AttributeError: '<class 'pydap.model.BaseType'>' object has no attribute 'encode'. When I have some time, I'll look into it further and try to see what else is needed to restore compatibility.

Indeed, it would be great if someone using pydap could take a look into this. You can find our logic for interoperating with pydap here: https://github.com/pydata/xarray/blob/master/xarray/backends/pydap_.py

@shoyer @jenfly:
Good news, I think I was able to track down the bug in pydap that was preventing compatibility. I'm putting a PR together and we could expect it to be merged pretty soon into the master. I wanted to give you a heads up so that you don't waste more time on this.

Awesome, thanks so much @laliberte!

@jenfly and @shoyer pydap version 3.2.2 (newly released last week) should have fixed this issue. Could you verify?

I spent a few minutes on this but am still getting AttributeError. It would be great if someone could put some time into debugging this. Should be as simple as installing pydap (in both python 2 and 3 virtual/conda environments) and getting py.test -k PydapTest to pass.

Never mind, I figured it out (I was using an old version of pydap by mistake). See #1439 for the pydap fix.

@shoyer @jenfly Has this been implemented? I can't see any open PRs relating to this, so I guess no one is working on it?

I would be happy to try and implement it, if that's fine with you? It seems like you settled on the solution of passing a session object to a PydapDataStore and then passing that to open_dataset(), correct?

Thanks in advance!

@mrpgraae no, I don't think this has been implemented yet.

Please take a look at #1508 for an example of the model to use:

  • Define an open classmethod for loading from a URL.
  • __init__ should accept a pydap dataset object (whatever is returned from pydap.client.open_url)

You are also welcome to add any keyword parameters (e.g., session) that open_url accepts to the open method.

So the user API becomes:

pydap_ds = pydap.client.open_url(url, session=session)
store = xarray.backends.PydapDataStore(pydap_ds)
ds = xarray.open_dataset(store)

or

store = xarray.backends.PydapDataStore.open(url, session=session)
ds = xarray.open_dataset(store)
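Putting the pieces from this thread together, a minimal end-to-end sketch might look like the following. It assumes pydap and xarray versions that provide pydap.cas.urs.setup_session and PydapDataStore.open as described above; the wrapper name open_urs_dataset is hypothetical.

```python
def open_urs_dataset(url, username, password):
    """Open a URS-protected OPeNDAP URL as an xarray Dataset (sketch)."""
    # Imported lazily so the function can be defined without the libraries installed.
    import xarray as xr
    from pydap.cas.urs import setup_session

    # Log in through NASA's URS and keep the resulting session for later requests.
    session = setup_session(username, password, check_url=url)
    # Hand the authenticated session to the pydap backend, then open as usual.
    store = xr.backends.PydapDataStore.open(url, session=session)
    return xr.open_dataset(store)
```

Calling open_urs_dataset(url, username, password) requires network access and a valid account, so it is not exercised here.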

Thank you @shoyer, I'll start work on the implementation.

Dear all,
Thank you very much for all the time you've put into fixing this issue. I'm a new PhD student who started working on solar radiation forecasting four months ago, and right now I'm trying to use MERRA-2 aerosol data to initialize WRF-Solar. The fix in this thread has helped me a lot, since I was trying to avoid the straightforward method of downloading the files by date and then merging them into a single Python object. This way I can create my Python object directly, without downloading files one by one and then merging them! It's awesome! Thank you all very much!

@juliancanellas
Great! Good to see that someone else actually benefits from this feature, years after it was implemented 😄

I am trying to load MERRA2 data via the NASA password-protected opendap server. Although it sounds like both pydap and xarray have been fixed to support this, I am still having basically the same problem @jenfly described over three years ago. At this point it feels like a pydap issue, but I'm asking in this thread anyway.

Here's a fully reproducible example, password and all 😄

from pydap.client import open_url
from pydap.cas.urs import setup_session

username = 'rabernat'
password = '%8rTMU6VT37r&%3e'
url = 'https://goldsmr5.gesdisc.eosdis.nasa.gov:443/opendap/MERRA2_MONTHLY/M2IMNPANA.5.12.4/2019/MERRA2_400.instM_3d_ana_Np.201901.nc4'

session = setup_session(username, password, check_url=url)
dataset = open_url(url, session=session)
assert 'USVS' in dataset
_ = dataset['USVS'][:]

raises

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-7-56bfca618586> in <module>
----> 1 _ = dataset['USVS'][:]

/srv/conda/envs/notebook/lib/python3.7/site-packages/pydap/model.py in __getitem__(self, index)
    318     def __getitem__(self, index):
    319         out = copy.copy(self)
--> 320         out.data = self._get_data_index(index)
    321         return out
    322 

/srv/conda/envs/notebook/lib/python3.7/site-packages/pydap/model.py in _get_data_index(self, index)
    347             return np.vectorize(decode_np_strings)(self._data[index])
    348         else:
--> 349             return self._data[index]
    350 
    351     def _get_data(self):

/srv/conda/envs/notebook/lib/python3.7/site-packages/pydap/handlers/dap.py in __getitem__(self, index)
    140         logger.info("Fetching URL: %s" % url)
    141         r = GET(url, self.application, self.session, timeout=self.timeout)
--> 142         raise_for_status(r)
    143         dds, data = r.body.split(b'\nData:\n', 1)
    144         dds = dds.decode(r.content_encoding or 'ascii')

/srv/conda/envs/notebook/lib/python3.7/site-packages/pydap/net.py in raise_for_status(response)
     37             detail=response.status+'\n'+response.text,
     38             headers=response.headers,
---> 39             comment=response.body
     40         )
     41 

HTTPError: 302 Found
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="https://urs.earthdata.nasa.gov/oauth/authorize/?scope=uid&amp;app_type=401&amp;client_id=e2WVk8Pw6weeLUKZYOxvTQ&amp;response_type=code&amp;redirect_uri=http%3A%2F%2Fgoldsmr5.gesdisc.eosdis.nasa.gov%2Fdata-redirect&amp;state=aHR0cHM6Ly9nb2xkc21yNS5nZXNkaXNjLmVvc2Rpcy5uYXNhLmdvdi9vcGVuZGFwL01FUlJBMl9NT05USExZL00ySU1OUEFOQS41LjEyLjQvMjAxOS9NRVJSQTJfNDAwLmluc3RNXzNkX2FuYV9OcC4yMDE5MDEubmM0LmRvZHM%2FVVNWUyU1QjA6MTowJTVEJTVCMDoxOjQxJTVEJTVCMDoxOjM2MCU1RCU1QjA6MTo1NzUlNUQ">here</a>.</p>
</body></html>

Is this a problem with pydap? Or the NASA server?

https://en.wikipedia.org/wiki/HTTP_302

Looks like you need a better URL, and that pydap can't deal with redirects?

Yes, seems like a redirect issue. The URL is fine.

No, actually the problem was with my authorization. I had to accept a EULA before my password was valid. Once I did that, everything worked.
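Since the 302 response above turned out to mean "not yet authorized" rather than a bad URL, a small check on the error text can make that case easier to spot. This is only a sketch: the helper name looks_like_urs_redirect is hypothetical, and the matching is based solely on the error text shown in this thread.

```python
def looks_like_urs_redirect(message):
    """Return True if an error message resembles a 302 redirect to the URS login page."""
    text = message.lower()
    return "302 found" in text and "urs.earthdata.nasa.gov/oauth/authorize" in text


# Shortened version of the error body quoted earlier in the thread.
sample = (
    "302 Found\n"
    '<p>The document has moved <a href="https://urs.earthdata.nasa.gov/oauth/authorize/?scope=uid">here</a>.</p>'
)
if looks_like_urs_redirect(sample):
    print("Redirected to the URS login: check credentials and data-access approvals (e.g. the EULA).")
```

In practice the message would come from str(err) inside an except block around the data access.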

One can also add username and password to the .netrc file and all works very smoothly, without a need for explicit username and password in the script.
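As a rough illustration of the .netrc approach, the sketch below writes a single entry to a temporary file and reads it back with Python's standard netrc module. The machine name urs.earthdata.nasa.gov is taken from the redirect shown earlier in this thread; the login and password values are placeholders, and in real use the entry would live in ~/.netrc with restrictive permissions.

```python
import netrc
import os
import tempfile

# Placeholder credentials, for illustration only.
entry = "machine urs.earthdata.nasa.gov login my_username password my_password\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "netrc")
    with open(path, "w") as f:
        f.write(entry)
    os.chmod(path, 0o600)  # keep the file readable by the owner only
    # authenticators() returns a (login, account, password) tuple for the host.
    login, _account, password = netrc.netrc(path).authenticators("urs.earthdata.nasa.gov")

print(login)
```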

However, there was one more issue. With Python 3.7.6, I was getting the following error:

Traceback (most recent call last):
  File "MERRA2.py", line 16, in <module>
    session = setup_session(username, password, check_url=url)
  File "/groups/FutureWind/xesmf_env/lib/python3.7/site-packages/pydap/cas/urs.py", line 19, in setup_session
    verify=verify)
  File "/groups/FutureWind/xesmf_env/lib/python3.7/site-packages/pydap/cas/get_cookies.py", line 75, in setup_session
    password_field=password_field)
  File "/groups/FutureWind/xesmf_env/lib/python3.7/site-packages/pydap/cas/get_cookies.py", line 123, in soup_login
    soup = BeautifulSoup(resp.content, 'lxml')
  File "/groups/FutureWind/xesmf_env/lib/python3.7/site-packages/bs4/__init__.py", line 228, in __init__
    % ",".join(features))
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

That was solved by pip install lxml
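To catch this earlier, one could check for the parser before calling setup_session. The sketch below uses only the standard library; the helper name parser_available is hypothetical.

```python
import importlib.util


def parser_available(name="lxml"):
    """Return True if the named top-level module can be imported in this environment."""
    return importlib.util.find_spec(name) is not None


if not parser_available():
    print("lxml is not installed; try: pip install lxml")
```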

So, I tried Ryan's example and got the same error. Where do you accept the EULA? It doesn't pop up on screen.

https://urs.earthdata.nasa.gov/app_eula/nasa_gesdisc_data_archive

I'm trying this example:

import xarray as xr
from pydap.client import open_url
from pydap.cas.urs import setup_session

# username and password are assumed to be defined elsewhere
url = 'https://gpm1.gesdisc.eosdis.nasa.gov:443/opendap/hyrax/GPM_L3/GPM_3IMERGHH.06/2019/087/3B-HHR.MS.MRG.3IMERG.20190328-S000000-E002959.0000.V06B.HDF5'
try:
    session = setup_session(username, password, check_url=url)
    pydap_ds = open_url(url, session=session)
    store = xr.backends.PydapDataStore(pydap_ds)
    ds = xr.open_dataset(store)
except Exception as err:
    print(err)

which returns:

302 Found
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="https://urs.earthdata.nasa.gov/oauth/authorize/?scope=uid&amp;app_type=401&amp;client_id=e2WVk8Pw6weeLUKZYOxvTQ&amp;response_type=code&amp;redirect_uri=https%3A%2F%2Fgpm1.gesdisc.eosdis.nasa.gov%2Fdata-redirect&amp;state=aHR0cHM6Ly9ncG0xLmdlc2Rpc2MuZW9zZGlzLm5hc2EuZ292L29wZW5kYXAvaHlyYXgvR1BNX0wzL0dQTV8zSU1FUkdISC4wNi8yMDE5LzA4Ny8zQi1ISFIuTVMuTVJHLjNJTUVSRy4yMDE5MDMyOC1TMDAwMDAwLUUwMDI5NTkuMDAwMC5WMDZCLkhERjUuZG9kcz90aW1lX2JuZHMlNUIwOjE6MCU1RCU1QjA6MTowJTVE">here</a>.</p>
</body></html>
/usr/local/lib/python3.8/site-packages/xarray/backends/common.py:87: FutureWarning: The ``variables`` property has been deprecated and will be removed in xarray v0.11.
  return len(self.variables)

The error only appears when I call xr.open_dataset.
I've already accepted the EULA.
Does anyone know what the cause could be?

Dear all, does anyone know whether it is possible in xarray.open_dataset (pydap or netcdf engines) to pass an Authorization or Cookie header along with the opendap request? For example: Authorization: Bearer u32t4o3tb3gg43 or Cookie: foo=u32t4o3tb3gg43
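One possible approach with the pydap engine, sketched below, is to reuse the session mechanism discussed earlier in this thread: build a requests session that carries the header and pass it to PydapDataStore.open. This has not been verified against any particular server, the helper names bearer_headers and open_with_headers are hypothetical, and whether a given server accepts the header is an assumption.

```python
def bearer_headers(token):
    """Build an Authorization header dict for a bearer token."""
    return {"Authorization": f"Bearer {token}"}


def open_with_headers(url, headers):
    """Open an OPeNDAP URL with extra HTTP headers attached to every request (sketch)."""
    # Imported lazily so the helpers can be defined without the libraries installed.
    import requests
    import xarray as xr

    session = requests.Session()
    # Headers set on the session are sent with each request made through it.
    session.headers.update(headers)
    store = xr.backends.PydapDataStore.open(url, session=session)
    return xr.open_dataset(store)


print(bearer_headers("u32t4o3tb3gg43"))
```

A Cookie header could be passed the same way, for example open_with_headers(url, {"Cookie": "foo=u32t4o3tb3gg43"}).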
