Taking the comment on https://github.com/GoogleCloudPlatform/gcloud-python/pull/889/files#r31042158 over here.
Requiring explicit paging for topics is really ugly, there must be a better way than:
all_topics = []
topics, token = client.list_topics()
all_topics.extend(topics)
while token is not None:
    topics, token = client.list_topics(page_token=token)
    all_topics.extend(topics)
If our concern is "people could make a boatload of requests and they don't realize it", we could always limit this like we do with the recursive deleting in storage?
topics = client.list_topics(limit=100)
We can do a bunch of this with the page_size parameter, but that might mean that I have to wait for _everything_ to come back before starting any work, which seems kind of ridiculous.
It'd be really nice if that limit and page-size stuff was both there, so it's easy to do things like "I want the first 5000 topics, and I want to pull them from the server in chunks of 50":
for topic in client.list_topics(page_size=50, limit=5000):
    push_work_to_other_system(topic)
To add a bit more context, I'd like to toss out: what if we made all of our list operations return iterators?
The use cases I see here are...
- Iterate over everything (for topic in list_topics())
- Iterate over a limited number of things (for topic in list_topics(limit=100))
- Iterate until I find what I want (for topic in list_topics(): if topic.name == 'foo': break)
- Jump to a specific spot and iterate (offset)!
- Manually paginate (for topic in list_topics(page_token=token, limit=100))

The "let's just always return page, page_token" thing doesn't really make all of those use-cases all that fun... But if we always return iterators, they are all easy.
Further, let's say I have a weird case where I just want one page worth of stuff... list_topics().get_current_page() could return what you want, no?
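For concreteness, here's a minimal sketch of what that could look like, assuming (as in the snippets above) that the underlying call returns a (topics, token) pair and accepts page_size/page_token; TopicIterator and get_current_page are hypothetical names for illustration, not the existing gcloud API:

class TopicIterator(object):
    """Hypothetical lazy iterator over topics; one API call per page."""

    def __init__(self, client, page_size=None):
        self.client = client
        self.page_size = page_size
        self._current_page = None
        self._next_token = None
        self._started = False

    def _fetch_page(self):
        # Assumes list_topics(page_size=..., page_token=...) -> (topics, token).
        self._current_page, self._next_token = self.client.list_topics(
            page_size=self.page_size, page_token=self._next_token)
        self._started = True

    def get_current_page(self):
        # "Just one page worth of stuff" -- no looping over everything.
        if not self._started:
            self._fetch_page()
        return self._current_page

    def __iter__(self):
        if not self._started:
            self._fetch_page()
        while True:
            for topic in self._current_page:
                yield topic
            if self._next_token is None:
                break
            self._fetch_page()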
In general I like the idea, but was likely intimidated due to the sheer length of what you wrote.
Recall #90 was in place just in case this was necessary. We have a nice Iterator matching most of this pattern in datastore.query.
Sorry about that. In short, I want to write:
for topic in client.list_topics(page_size=50, limit=5000):
    do_something_with(topic)
with the ability to short-circuit out (break or return).
@tseaver Do you want to take a stab at this in pubsub?
Should be pretty easy now that we have gcloud.iterator.Iterator
Indeed. I am moving the iterator in storage into the client right now (for #952) and noticed we had no iterator in pubsub.
Is this not addressed?
I just looked at documentation for pubsub and storage, both seem to use the iterator, so I'm happy.
I'll open a separate issue if I see anything else.
Yea, this isn't actually fixed.
I think we're making a mistake returning a tuple here, and instead should return an iterator.
This would mean that we can treat it as a list (not giving a shit about pagination -- aka, good for scripts that need to go through all topics) as well as a paged-resource if needed.
for topic in client.list_topics():
    do_something_with_topic(topic)
token = request.GET['token']
topics = client.list_topics(next_page_token=token)
html = template.render({'token': topics.next_page_token, 'topics': topics.page})
response.write(html)
The reason I care so much about this is that the first case is incredibly common when people first start using this library... So I want to make the common (intro) case fast.
Compared to today:
token = None
while True:
    topics, token = client.list_topics(next_page_token=token)
    for topic in topics:
        do_something_with_topic(topic)
    if token is None:
        break
# or maybe...
token = True
while token is not None:
    topics, token = client.list_topics(
        next_page_token=None if token is True else token)
    for topic in topics:
        do_something_with_topic(topic)
(correct me if there's a better way?)
token = request.GET.get('page_token', None)
topics, token = client.list_topics(next_page_token=token)
html = template.render({'page_token': token, 'topics': topics})
response.write(html)
Note that the paginated version doesn't really change all that much, but the infiniscroll version looks way worse off.... It forces people to learn about page tokens when they aren't interested in paginating anything themselves. They just want to iterate over all the stuff.
@jgeewax #1699 added a much more natural spelling for the "just give me all the results of a method" crowd:
>>> from gcloud.iterator import MethodIterator
>>> topics = list(MethodIterator(client.list_topics))
(We can rename it just Iterator later, after we finish removing all the subclasses of the current iterator).
We _do_ make those users explicitly choose the "iterate everything" option, but they don't have to deal with paging / tokens. In exchange, we make it possible for those who care about distributed-systems performance, or their API budgets, to do their thing cleanly: it is a "good API smell" when the "convenience" layer uses the "less convenient but more controlled" layer.
I actually mean to revise the existing list_foo() methods (e.g., Bucket.list_blobs) to emulate this pattern.
So you're saying that .list_foo() will look like the example above?
for topic in client.list_topics():
    do_something_with_topic(topic)
And
token = request.GET['token']
topics = client.list_topics(next_page_token=token)
html = template.render({'token': topics.next_page_token, 'topics': topics.page})
response.write(html)
?
The example of using MethodIterator and passing in a method (client.list_topics) seems ... not very friendly to me...
No, list_topics won't return an iterable: it will return the batch, token pair (as it does now). If you don't care about when the API calls are made, or how many of them, you use the MethodIterator wrapper, which is an iterable:
for topic in MethodIterator(client.list_topics):
    do_something_with(topic)
OK, I really don't like that. It feels like we're just being jerks...
Why wouldn't we just return a thing that is both iterable, and gives you access to the tuple (batch, token) if you want it...?
I'd say it's the opposite of jerks. Encouraging paging is discouraging behavior that could accidentally use too many resources. Though we do fine with a hybrid approach / offer both in datastore Query.
I'm proposing a way to do this where both cases ("let me iterate over everything" and "I want to manually page through") are intuitive, easy, and simple. The answer being "import a special class, and wrap a method with it" doesn't seem like the right one to me. It seems like we're saying "we can make this easier for you, but we won't".
I can't imagine why the result of list_something() wouldn't be iterable... It just seems wrong.
Can someone give me a great reason why we wouldn't just return an iterable class that exposes the batch and the next page token if people want it? In other words, why the code I write shouldn't be:
# I want to iterate through everything...
for topic in client.list_topics():
    do_something_with_topic(topic)
And
# I want to go through things one page at a time...
token = request.GET['token']
topics = client.list_topics(next_page_token=token)
html = template.render({'token': topics.next_page_token, 'topics': topics.page})
response.write(html)
I suspect there is also a bit of concern about doing the equivalent of a HEAD request, as well as the "I only want a few items back, please don't loop forever" situation. In those cases, here's the code I'm proposing:
# I just want the first page:
for topic in client.list_topics().page:
    do_something_with(topic)

# Please only give me a few items, not everything I've ever added...
for topic in client.list_topics(limit=100):
    do_something_with(topic)
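To make that concrete, here is a rough sketch (not the current gcloud code) of an object that list_topics() could return: iterable for the "loop over everything" case, while still exposing page and next_page_token for manual paging. The fetch_page callable and the (items, token) shape are assumptions for illustration:

class TopicList(object):
    """Hypothetical result of client.list_topics()."""

    def __init__(self, fetch_page, page, next_page_token, limit=None):
        self._fetch_page = fetch_page      # callable(token) -> (items, token)
        self.page = page                   # batch from the first API call
        self.next_page_token = next_page_token
        self._limit = limit

    def __iter__(self):
        count = 0
        page, token = self.page, self.next_page_token
        while True:
            for topic in page:
                if self._limit is not None and count >= self._limit:
                    return
                count += 1
                yield topic
            if token is None:
                return
            # Each additional page is one more API call, made lazily.
            page, token = self._fetch_page(token)

With something shaped like that, both snippets above work unchanged: iterating makes as many API calls as needed (up to limit), while the template-rendering case only ever touches .page and .next_page_token.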
@tseaver @dhermes : Any thoughts here?
/cc @daspecster
Hello!
From the discussion that I've read here I'm not sure what's holding us back from making this return an iterator? I could be missing something but that seems like the natural result to me.
@dhermes, @tseaver are there more implications to this that I'm not seeing?
@daspecster the issue is that we will be encouraging users to naively loop over all results of a paged query API. Such looping cannot be best-practice in a distributed system where the users get charged (or exhaust quotas) based on API usage: it hides when / where the API calls are made.
@tseaver, ahhh, I think I see.
So having the API call abstracted down one level forces end users to dig deeper in order to confirm that they want to make an API call and use some of their quota?
Would renaming the method help clarify this? I personally _would_ expect a method called list_topics to return an iterator.
Calling it client.get_topics could help imply quota usage. I'm a fan of very descriptive method names so I might even propose client.get_list_of_topics().
Feedback?
UPDATE:
Sorry typo. I meant that I _would_ expect an iterator from a method called list_topics. Corrected.
Also, @stephenplusplus passed this on to me from previous discussions of a similar topic. Hopefully it's helpful.
https://github.com/GoogleCloudPlatform/gcloud-node/pull/692#issuecomment-121021623
TL;DR: It's time to use the iterator with a reasonable default limit.
The idea that we're making clear where the API calls are doesn't actually fix the problem. If you give me back chunks and page tokens, I'll write a loop and then we're back where we started. If I forget my exit condition, we're in the same boat I was before, except you made me write more code to do it:
token = True
api_call_count = 0
while token is not None and api_call_count < 10:
    topics, token = client.list_topics(
        next_page_token=None if token is True else token)
    api_call_count += 1
    for topic in topics:
        do_something_with_topic(topic)
This sucks for two reasons:
- I don't really care about the number of API calls here (they cost $0.0000004 each), I care about the number of topics I get back to make my app work.
- There's no guarantee I'll get N items per page, so depending on how PubSub feels today, I'll get a different number of topics as a result.
The right answer for me is being able to say "don't give me more than N topics", and understanding that this could mean a varying number of API calls:
token = True
topic_count = 0
while token is not None and topic_count < 500:
    topics, token = client.list_topics(
        next_page_token=None if token is True else token)
    for topic in topics:
        topic_count += 1
        do_something_with_topic(topic)
That seems more reasonable to me... however I'm still writing more code, and I still have to remember the exit condition. As a matter of fact, it's more difficult for me to do what I want and just as easy for me to mess up and loop forever over billions of topics. There's no way to get multiple pages with this model _without_ writing my own loop and putting in a manual exit condition.
In other words, if the problem we're trying to solve is "don't make people stupidly loop over everything", the right answer is likely:
class Client:
    def list_topics(self, ..., limit=1000):
        return Iterator(...)

for topic in client.list_topics():
    do_something_with(topic)
This means the "loop forever" call would be...
for topic in client.list_topics(limit=None):
    do_something_with(topic)
IIRC, we had the same discussion before with listing the objects in a bucket, and it currently returns an iterator of sorts (https://github.com/GoogleCloudPlatform/gcloud-python/blob/cefff49fb57eea68e15e3a64f1a71c1ca107f44d/gcloud/storage/bucket.py#L245).
I believe that we had decided to have a reasonable default (https://github.com/GoogleCloudPlatform/gcloud-python/blob/cefff49fb57eea68e15e3a64f1a71c1ca107f44d/gcloud/storage/bucket.py#L84) on how much listing we'd do, however it seems we only implemented that for the recursive delete and the global make_public.
Regardless of all of this, based on our pricing, I could understand the argument for Datastore, but that goes out the window due to Datastore's LIMIT operator (aka, you're expected to say how many results you want back). For PubSub, if you created 1m topics and looped over them, it'd take hours and cost you a whole 40 cents.
I think it's unreasonable to leak this paging abstraction which makes the API obnoxious to use simply to avoid someone spending $0.40. If you want the library to show you exactly where your API calls are made, you should be using https://github.com/google/google-api-python-client which is 1:1 with the API...
Limiting the number of items by default seems arbitrary and incorrect to me, except when you are listing items like log entries in reverse time order. If the ordering is not defined, limiting is useless.
I agree that the limit should be set manually, however if the argument against returning an iterator is "it might loop forever! we can't do that to our users!", an arbitrary limit (not many people will have 1k topics) is one way to do it.
Regardless of that arbitrary limit, list_* methods should return something I can iterate over rather than a batch, token tuple.
Here's my input, for what it's worth.
I agree that it is important to provide a simple interface that doesn't require the user to deal with page tokens. If we don't, people will often choose to fetch just the first page. I know I've done that.
Whether it is better to return a list or an iterator will depend on the API, as will whether it makes sense to offer a limit parameter.
Default limits are rarely going to be appropriate. A default limit on PubSub topics would surely be bad, for instance. Failing to override such a default with limit=None would almost always be an insidious bug.
If listing PubSub topics returns an iterator, I actually don't think it makes sense to offer a limit parameter at all. I can't think of a use-case for it. If the user really did want to retrieve some limited number of arbitrary topics, they could do that by terminating the iteration early.
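For example, if list_topics() returned an iterator, "some limited number" falls out of the standard library with no limit parameter at all (just a sketch, assuming the iterator pages lazily under the hood):

import itertools

# Take at most 500 topics, stopping the underlying paging early.
for topic in itertools.islice(client.list_topics(), 500):
    do_something_with_topic(topic)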
For comparison, boto's sns.get_all_topics() returns 100 at a time with a next_token in the results. However this limit isn't documented very well. It appears that questions on stackoverflow came up from people that knew they had over 100 topics. (http://stackoverflow.com/a/32019553)
That's a fair point... The difference is that our answer will be:
for topic in client.list_topics(limit=None) if you really want ALL of them. Otherwise, you'll get a maximum of 1000 (or some sufficiently large default).
I agree that we should have a simple iterate that iterates over everything, however, I want to make sure we don't remove the ability to manually page. This use case is extremely important for datastore where the pagination can happen across multiple user requests.
My $0.02 on the strategy:
- Rename list_foo -> list_foo_paged.
- Add list_foo methods which just return an iterator already wrapped around list_foo_paged. These methods do _not_ carry params for all the possible iterator properties to list_foo: instead, users who care can configure them before beginning to iterate (roughly as sketched below).

FYI I am really close to done
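A rough sketch of that split (the method names and the (topics, token) return shape are assumptions here, not the shipped API):

class Client(object):

    def list_topics_paged(self, page_size=None, page_token=None):
        """One API request; returns (topics, next_page_token), as today."""
        ...

    def list_topics(self):
        """Convenience iterator wrapped around list_topics_paged()."""
        token = None
        while True:
            topics, token = self.list_topics_paged(page_token=token)
            for topic in topics:
                yield topic
            if token is None:
                return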
0.21 releases include the new iterators.
@tseaver There are still some list_ type methods in BigQuery and Monitoring that need to be changed over.