Google-cloud-python: [Discussion] HTTP library

Created on 5 Jan 2016  Â·  62Comments  Â·  Source: googleapis/google-cloud-python

Presently, we use httplib2 by default but allow users to specify their own http library (#908) so long as it conforms to httplib2.Http's signature.

httplib2 was chosen because it's the underlying http client used by google/oauth2client. However, httplib2 has a variety of issues include not being thread-safe (#1274), not doing any connection pooling, etc.

We should consider what it would take to move to another http library. The major considerations are:

  1. We must support using oauth2client's credentials with the library. The essential functionality is adding the auth header and performing refresh & retry. This can be done either here or within oauth2client.
  2. The library must work on Google App Engine.
core

Most helpful comment

Update:

This library now uses google-auth instead of oauth2client. google-auth supports httplib2, requests, and urllib3.

We still need a plan of attack for removing httplib2 as the transport here, but we no longer have a hard dependency on it.

All 62 comments

As a start, it'd be nice to

  • Get oauth2client to 100% code coverage (https://github.com/google/oauth2client/issues/212)
  • Replace monkey-patching in Credentials.authorize. Idea would be that

``` python

http = httplib2.Http()
auth_http = credentials.authorize(http)
auth_http

```

and auth_http.request would perform the same functionality that the monkey-patched stuff would

What I would prefer even more:

>>> auth_http = AuthenticatedHttp(credentials)
>>> auth_http
<oauth2client.transport.AuthenticatedHttp at 0x00deadbeaf00>

Which would allow something like:

>>> auth_session = AuthenticatedSession(credentials)
>>> auth_session
<oauth2client.transport.requests.AuthenticatedSession at 0x00deadbeaf00>

To be idiomatic.

Does it make sense to have transports be a separate library all together?

Does it make sense to have transports be a separate library all together?

nvm, the refresh logic is very oauth-centric so it doesn't quite make sense. Though I am curious what special considerations we'll need to make to use those transports in this library.

Wow, there's a ton of code in gcloud.streaming that seems to be working around httplib2.

What's the utility in continuing to support httplib2 directly? Could we move to urllib3 and be better served?

gcloud.streaming was for #935. We would have used apitools but it's future was unclear and it was (is?) completely undocumented. Hopefully we can strip down gcloud.streaming to just the core of what we need.


Your suggestion

>>> auth_http = AuthenticatedHttp(credentials)

is not agnostic of the transport. Instead you'd need

>>> auth_http = AuthenticatedHttp(credentials, http)

and you'd need a well-defined interface for http (e.g. defines http.request with signature same as httplib2.Http.request)

Sorry, I may not have been clear: AuthenticatedHttp is the transport - it's httplib2, whereas AuthenticatedSession would be a requests-based transport. Perhaps this is better:

authed_transport = transports.httplib2.Transport(credentials)
authed_transport = transports.urllib3.Transport(credentials)
authed_transport = transports.requests.Transport(credentials)

Got it. I think it'd be easier to have a simple default that can use a httplib2.Http object with no modification and then maybe have libraries like oauth2client-urllib3 and oauth2client-requests that define the necessary interface.

I'm not a fan of httplib2's interface, and I think we'd be doing ourselves a disservice if we tried to make that our definition for the future.

Personally, I'd be in favor of dropping httplib2 altogether in favor of urllib3. Requests support would then be pretty straightforward. But that's pretty extreme and the waters for urllib3 and app engine are rocky.

Now we're talking! :grinning:

Can you describe your desired interface?

Not sure yet. :)

Maybe something along the lines of how urllib3 does things already: the urlopen method:

    def urlopen(self, method, url, body=None, headers=None, retries=None,
                redirect=True, assert_same_host=True, timeout=_Default,
                pool_timeout=None, release_conn=None, **response_kw):

Our transport for urllib3 could just be a wrapper around urlopen that does the authentication and refresh logic. Any other transport would need to implement the same urlopen interface.

Maybe it's better to put it this way: I think it's fine to let urllib3's poolmanager define our interface. We don't necessarily need to create any abstract classes or anything like that.

Very similar to httplib2.Http.request:

def request(self, uri, method="GET", body=None, headers=None,
            redirections=DEFAULT_MAX_REDIRECTS, connection_type=None):

also google.appengine.api.urlfetch.fetch

def fetch(self, url, payload=None, method=1, headers={}, allow_truncated=False,
          follow_redirects=True, deadline=None, validate_certificate=None):

and urllib2.urlopen in 2.7.6

def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT):

and urllib2.urlopen in 2.7.10 (latest as of Jan. 5, 2016)

def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
            cafile=None, capath=None, cadefault=False, context=None):

We don't necessarily need to create any abstract classes or anything like that.

I agree. I think it'd be nice just to allow someone to create any old transport they like and have it work with oauth2client provided it knew how to quack.

agree. I think it'd be nice just to allow someone to create any old transport they like and have it work with oauth2client provided it knew how to quack.

That's sort of the beauty of urllib3's design - the 'Manager' is the top-level component and there's currently multiple managers (e.g, ProxyManager, AppEngineManager, etc.).

Seems like we agree on the interface part, but maybe not on the dependency bit. I'd like it to work without any assumption that the user has urllib3 or knows what it is. (That might be a bit extreme if we want to provide a nice default.)

Considering requests (and by extension urllib3) are currently the 3rd most popular python package (above the package manager, even!) it doesn't seem like much of a stretch for us to use urrlib3 by default and say that anyone who wishes to use another transport should implement urllib3's PoolManager interface. We can even provide the existing AppEngineManager as an example.

Yeah that may be the right route. (We should loop in httplib2 at some point.)

So does our plan from here look like:

  1. 100% coverage on oauth2client.
  2. Create a AuthenticatedManager class that handles auth and retry logic and forwards to a urllib3.PoolManager compliant class.
  3. Replace httplib2 usage in in oauth2client and just expect and object with urllib3.PoolManager's signature and exceptions.
  4. Implement aforementioned interface to httplib2 for historical reasons (read: App Engine).
  5. Come up for a plan of attack for this library.

You don't think App Engine will be ready to go with urllib3?

It should be, but I'm unsure. It's not as thoroughly tested as raw httplib2. Though the surface area is quite small. What do you think?

Raw httplib2 has lots of problems too, but at least it has a lot of cycles.

What are you feelings on replacing httplib2 wholesale with urllib3 in this library once oauth2client is ready?

I'm fine with it. I have no allegiance to httplib2 and it is essentially unmaintained, which isn't very desirable.

Cool. Okay. Seems like we have at least a half-baked plan here. Do we want to file bugs on oauth2client to bring the discussion over there?

FWIW @shazow is currently asking for opinions while designing urllib3 2.0 here.

Fun fact: I wrote urllib3 partly because of very many httplib2-related issues in 2008. :P (Not much had changed.)

Ha! Nice. I wish I would've known in 2011 what I know now when I started working with the Google client libraries.

@jonparrott I've already filed https://github.com/google/oauth2client/issues/128 for the http interface discussion

I've been thinking about this a bit more. Replacing httplib2 down the entire stack is a really difficult maneuver. There's a lot of users of oauth2client and a few (such as google-api-python-client) that aren't maintained well enough to make this easily feasible. I don't personally have enough time to dedicate to refactoring httplib2 out of several libraries that I don't directly maintain. I'm unsure where to go from here, but I still feel strongly that httplib2 isn't the right choice for this library.

So I'm open to thoughts from the maintainers (and @waprin).

Some off-the-cuff ideas, half-baked, possibly terrible ideas:

  1. Once oauth2client has 100% coverage, fork it into oauth2client-urllib3. Depend on that instead.
  2. Switch this library to urllib3, leave oauth2client alone, re-implement the OAuth2 auth, retry, and refresh logic in our Connection class or via a urllib3 Retry class.
  3. Switch this library requests and requests-oauthlib, write an adapter to translate oauth2client credentials. Requests is far more high-level and may be overkill.

I personally like option 2 right now, but I could be swayed.

Another thing I think we should do is switch classes like pubsub.Connection to _not_ inherit from the base connection class. Instead, it should have an instance of the base connection or we could adopt a different object model. This way users (and us) can easily swap out the base connection.

This is why I've been advocating for


    1. Define a well-understood interface for HTTP requests and make oauth2client transport agnostic.

So I'm unsure how that will play out. It seems that oauth2client will pretty much have to be trapped in httplib2's interface unless we also change all of the clients (untenable).

Does that mean that in this library, we'll use urllib3 but write an adapter to make it pretend to be httplib2 for oauth2client?

make it pretend to be httplib2 for oauth2client

Yes, that's what I've been advocating for. More specifically, make it pretend to be httplib2.Http.request. AFAIK that is the only interface needed.

Hmm. Alright. I'm on board with that.

Yeah, it's definitely a hmm situation. I'm not saying I like the interface but it could be worse. I do like the fact that it's only one method/function interface that needs to be used.

the other side of it is adapting the exceptions, but hopefully that'll be straightforward.

Oh yeah? The exceptions from where?

Any exceptions thrown by httplib2.

Gotcha. I don't know about google-api-python-client but oauth2client doesn't rely on any. I just did a quick check via:

$ git log -1 --pretty=%H
6531263b7ed73de23d56562fa2d86c12e3b264c7
$ git grep httplib2 -- '*.py' | grep import
oauth2client/client.py:import httplib2
oauth2client/contrib/appengine.py:import httplib2
oauth2client/contrib/django_util/__init__.py:import httplib2
oauth2client/contrib/flask_util.py:import httplib2
samples/oauth2_for_devices.py:import httplib2
scripts/run_system_tests.py:import httplib2
tests/contrib/test_appengine.py:import httplib2
tests/contrib/test_flask_util.py:import httplib2
tests/http_mock.py:import httplib2
$ git grep 'httplib2\.' -- oauth2client/client.py
...
$ git grep 'httplib2\.' -- oauth2client/contrib/appengine.py
...
$ git grep 'httplib2\.' -- oauth2client/contrib/django_util/__init__.py
...
$ git grep 'httplib2\.' -- oauth2client/contrib/flask_util.py
...

Good deal.

@dhermes FYI I'm doing some experiments with replacing httplib2 with urllib3 in this library. So far, I managed to get pubsub to work. I'll let you know when I have something worth looking at.

Thanks for the heads up. The main workload was re-implementing oauth2client, yes? Because if I had working auth, I could do it in about 3 lines.

Also something to consider: most services (including pubsub) will prefer using gRPC as the transport, so make sure your work isn't fruitless.

Thanks for the heads up. The main workload was re-implementing oauth2client, yes? Because if I had working auth, I could do it in about 3 lines.

Actually you depend a little on how httplib2 returns responses. I'm pulling that up a layer into a Transport class. For oauth2client, I'm likely going to write a small library urllib3_httplib2_shim to use for oauth2client stuff.

Also something to consider: most services (including pubsub) will prefer using gRPC as the transport, so make sure your work isn't fruitless.

Absolutely, which is why a Transport class makes a little bit of sense.

Yep. :+1: on it.

Okay, I got datastore to work to- which is enough for me to consider this proof-of-concept proved. I'm going to work on the httplib2-to-urllib3 shim and I'll publish that as a separate library. I'll come back to ripping httplib2 out of here after that, as having that library allow us to do it in a more gradual fashion (and we'll immediately get the benefit of connection pooling, thread safety, and ssl verification).

Looking forward to it.

I feel like I'm taking crazy pills to consider writing such a thing. ¯_(ツ)_/¯

Looking forward to it.

This weekend I got a working prototype of it going. I need to fill in a few gaps (SSL verification had some .. interesting results on OS X, thanks Apple) and get it through the Google OSS process but I should be able to put it out there this week.

It your call, but I feel like for the time being we can just document how to use it instead of directly depending on it. This way, people who need threadsafety etc, can easily drop this in, and those that don't can continue using what we know works.

Prototype coming anytime soon?

Going through OSS process now. It will live at
GoogleCloudPlatform/httplib2shim.

On Wed, Jan 20, 2016, 5:57 PM Danny Hermes [email protected] wrote:

Prototype coming anytime soon?

—
Reply to this email directly or view it on GitHub
https://github.com/GoogleCloudPlatform/gcloud-python/issues/1346#issuecomment-173426591
.

:+1:

httplib2shim has been published.

@dhermes what do you think we should do from here? My initial thoughts:

  1. Merge #1274 for the near-term.
  2. Get system-tests to run with both httplib2 and httplib2shim.
  3. Figure out how to use urllib3 for all of our httplib2 usage, but only use httplib2shim when interacting with oauth2client.
  4. Figure out what to do about oauth2client, if anything.

Thanks for the heads-up!

I think we should attack oauth2client first. I don't really like the idea of merging #1274 as-is, but maybe with some design discussions / me calming down, it'll be fine.

What are you ideas for oauth2client? I feel like our hands are kind of tied as we don't control all of the downstream clients.

I have the same issue.

@ds-hwang Feel free to give httplib2shim a go and report any issues.

I think httplib2shim is fine.
The root issue is that oauth2client doesn't set proxy up to httplib2shim properly
I found the temporary solution; https://code.google.com/p/googleappengine/issues/detail?id=9533#c4

Update:

This library now uses google-auth instead of oauth2client. google-auth supports httplib2, requests, and urllib3.

We still need a plan of attack for removing httplib2 as the transport here, but we no longer have a hard dependency on it.

hi all, i appreciate that this issue is being more broadly discussed and will be fixed with #1346. but we've been seeing this issue for a couple years now and it really hasn't been documented at a very visible level, which would have helped when we first ran into this.
i'm a member of the ISB-CGC NCI cloud pilot team and we're tasked with uploading the files from the Genomic Data Center (GDC) into Google cloud storage. we've been collaborating with the Google Genomics team. because the data files are organized by program, cancer type, and data type, the upload can be implemented as embarrassingly parallel but we noticed the errors when trying to use gcloud python client as depicted in #1214. we've been using boto, that doesn't have that problem, but boto3 no longer supports GCS. as one workaround, i tried gcloud_requests which was mentioned in #1214 and that seemed to work fine, but saw that the issue was closed in favor of this issue (makes perfect sense), so i tried httplib2shim just now on an upload of ~1500 files and did see one problem. the code is set up to retry a couple of times on exception. for five files, on the first try, there was the following:

problem uploading gdc/test_local_gdc_upload_run/TCGA-ACC/Isoform Expression Quantification/48081b15-4369-431f-a344-663135e9fa91_isoforms.quantification.txt due to 'http:metadata.google.internal'

but on the subsequent retries, they failed (our application logic) because the file had evidently been uploaded and this error occurred after the successful upload on the first try.
look forward to the final solution being well publicized.

What's the current status of thread-safety in the python APIs? https://github.com/GoogleCloudPlatform/google-cloud-python/issues/1214 was closed and redirected to this issue, but it'd be nice to have a clearer explanation of what's going on in this issue with respect to thread safety. FWIW https://developers.google.com/api-client-library/python/guide/thread_safety still mentions only httplib2 and a somewhat cumbersome workaround to ensure thread safety.

Hi @asilversempirical:
@dhermes is in the process of moving us over to urllib3, which is thread-safe.
I actually thought there was an issue for this, but I can not find it, so I am reopening this one. :-)

@lukesneeringer I'd guess that issue was #1998

Okay. Let's keep that one and close this one.

Thanks for the pointer!

Was this page helpful?
0 / 5 - 0 ratings