Athens: Introduction of a new endpoint that lists the names of all the packages that are in the storage

Created on 28 Oct 2018 · 19 Comments · Source: gomods/athens

Is your feature request related to a problem? Please describe.
Sometimes, users would want to know what packages are stored in their storage module.
This can be useful when we have a Web UI as well.

Describe the solution you'd like
An API like
GET /storage/modules/list

would be super useful to get info on what is in the storage.

Describe alternatives you've considered
I thought about overloading the list endpoint, but I do not like that as much, since it is also used by the go command.

good first issue

manugupt1

All 19 comments

I see only one drawback: listing the blobs might be really slow, unless you keep a registry as another blob, but then you have an additional set of problems.
I did something like this before, and listing more than 15k or 50k blobs (I don't really remember the precise number) ran into a request timeout, so we would need to provide some sort of paging with a default page size.

michalpristas on 28 Oct 2018

Ah.. yes! I thought about the drawback earlier and did not really see any way out of it. An ls does not necessarily mean that we are reading the contents of the blob though. Although, I agree that the implementation of ls will depend on the backend storage. Some storages like S3 have a recursive ls. I see that GCP and Azure did not have a recursive one. For fs, recursive walking is easy. For storages that do not have ls, we will end up making multiple classes. Thoughts on this?

A metadata database would be cool (which would open us to a lot of possibilities) as well. Right now, I do not think we should add anything extra unless absolutely required.

I agree that a pagination API will be required in this case.

manugupt1 on 29 Oct 2018

I think this would be really useful for an "admin" API, and also for use in a UI for Athens - for example if you GET / it could list all the packages it has. That would be awesome

arschles on 30 Oct 2018

Yep, I actually stole this proposal from you when you were talking about it on slack, totally forgot to mention it though.

manugupt1 on 31 Oct 2018

Can I take this one?

fedepaol on 1 Nov 2018

👍1

Yes, that would be awesome. Would it make sense to separate it into smaller issues for different storages?

manugupt1 on 2 Nov 2018

Agree. What matters now is to define the interface & the API. I will implement one storage and the others will come afterwards.
My understanding is that we rely on the storage to fetch the whole list instead of having to maintain the list in a separate storage, right?

fedepaol on 2 Nov 2018

yes it should be the first one. also, i suggest starting implementation with a more complicated storage e.g GCP or S3 so you can hit the caveats in the design as soon as possible

michalpristas on 2 Nov 2018

👍2

yes! I did jump ahead. I think this issue is more of a design issue than code issue. Once we have settled down, we can write the code.

manugupt1 on 4 Nov 2018

So, I took a look at the s3 / gcp implementations. I think it would not be efficient if we don't store the information separately.
Both s3 and gcp store the modules together with the versions, which means that we should iterate across all the objects (meaning all the versions of all the modules), extract the module name, put them in a set and return them. This would make the pagination really difficult to implement imho.
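To make the "iterate, extract, put in a set" idea concrete, here is a minimal sketch. It assumes object keys look like `<module>/@v/<version>.<ext>`; that layout, and the helper names `moduleFromKey` and `distinctModules`, are illustrative assumptions, not the actual storage code.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// moduleFromKey extracts the module path from an object key.
// ASSUMPTION: keys have the form "<module>/@v/<version>.<ext>".
func moduleFromKey(key string) (string, bool) {
	i := strings.Index(key, "/@v/")
	if i <= 0 {
		return "", false
	}
	return key[:i], true
}

// distinctModules walks every key (all versions of all modules),
// collects the module paths in a set, and returns them sorted.
func distinctModules(keys []string) []string {
	seen := make(map[string]struct{})
	for _, k := range keys {
		if m, ok := moduleFromKey(k); ok {
			seen[m] = struct{}{}
		}
	}
	mods := make([]string, 0, len(seen))
	for m := range seen {
		mods = append(mods, m)
	}
	sort.Strings(mods)
	return mods
}

func main() {
	keys := []string{
		"github.com/a/b/@v/v1.0.0.mod",
		"github.com/a/b/@v/v1.0.0.zip",
		"github.com/a/b/@v/v1.1.0.mod",
		"github.com/c/d/@v/v0.1.0.info",
	}
	fmt.Println(distinctModules(keys))
}
```

The whole key space has to be visited before the set is complete, which is why page-number style pagination is awkward with this approach.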

What we can do is

1. rely on some kind of separate storage (bringing back redis?)
2. have version-independent objects in the storage with a path along the lines of mod-modulename (with no version)

The second alternative would work because both s3 and gcp allow iterating with a prefix filter (but again, I am not sure about the pagination), so listing all the mod- objects would return all the modules stored.
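A rough sketch of the marker-object alternative, using an in-memory key list in place of a real bucket. The `mod-` prefix comes from the proposal above; `listByPrefix` and `modulesFromMarkers` are hypothetical helpers that only mimic what a prefix-filtered listing would return.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// markerPrefix is the prefix of the version-independent marker objects.
const markerPrefix = "mod-"

// listByPrefix mimics a prefix-filtered listing over an in-memory key set
// and returns the matching keys in lexicographic order.
func listByPrefix(keys []string, prefix string) []string {
	var out []string
	for _, k := range keys {
		if strings.HasPrefix(k, prefix) {
			out = append(out, k)
		}
	}
	sort.Strings(out)
	return out
}

// modulesFromMarkers turns marker keys into module paths.
func modulesFromMarkers(keys []string) []string {
	markers := listByPrefix(keys, markerPrefix)
	mods := make([]string, 0, len(markers))
	for _, k := range markers {
		mods = append(mods, strings.TrimPrefix(k, markerPrefix))
	}
	return mods
}

func main() {
	keys := []string{
		"github.com/a/b/@v/v1.0.0.mod",
		"github.com/a/b/@v/v1.1.0.mod",
		"mod-github.com/a/b",
		"mod-github.com/c/d",
	}
	fmt.Println(modulesFromMarkers(keys))
}
```

With one marker per module, each listed key maps to exactly one module, so no deduplication is needed; the cost is keeping the markers in sync when modules are saved or deleted.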

Will ping tomorrow on slack to discuss this.

fedepaol on 7 Nov 2018

do they provide some continuation token?
azure returns a continuation token together with the result, and you pass that token with the next request, so pagination is not that difficult to implement

michalpristas on 7 Nov 2018

@michalpristas they do. I assumed that by pagination we meant page number / num elems per page.

It also seems that the results are returned in lexicographic order, which means that we would be able to detect when the iterator goes from moduleA to moduleB and add a record to the resulting set.

The only drawback with this is when the jump falls across two pages, i.e. the last element of a request is moduleA and the first element of the continued request is still moduleA. I guess we can store that somehow in memory (with an expiration, maybe).

Let me know what you think, apart from iterating (a lot) and supporting only paging tokens and not page number / num of elements per page it seems doable.
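Here is a sketch of token-based listing that handles the page-boundary case. Instead of an in-memory cache, it folds the last module seen into the token handed back to the caller; that is a variation chosen for brevity, not something settled in this thread. `fakeStore`, `listModules`, the `|` separator, and the `<module>/@v/<version>.<ext>` key layout are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// fakeStore stands in for a blob store that lists keys in lexicographic
// order, pageSize at a time, and returns an opaque continuation token.
type fakeStore struct {
	keys     []string // must be sorted
	pageSize int
}

// list returns one page of keys starting at token ("" means the start)
// and the token for the next page ("" when the listing is exhausted).
func (s fakeStore) list(token string) (keys []string, next string) {
	start := 0
	if token != "" {
		start, _ = strconv.Atoi(token)
	}
	if start < 0 || start > len(s.keys) {
		start = len(s.keys)
	}
	end := start + s.pageSize
	if end >= len(s.keys) {
		return s.keys[start:], ""
	}
	return s.keys[start:end], strconv.Itoa(end)
}

// listModules returns the distinct modules found in one storage page.
// The returned token is "<storage token>|<last module seen>", so the next
// call can skip a module that straddles the page boundary.
func listModules(s fakeStore, token string) (mods []string, next string) {
	storeTok, last := "", ""
	if token != "" {
		parts := strings.SplitN(token, "|", 2)
		storeTok = parts[0]
		if len(parts) == 2 {
			last = parts[1]
		}
	}
	keys, storeNext := s.list(storeTok)
	for _, k := range keys {
		i := strings.Index(k, "/@v/")
		if i <= 0 {
			continue // not a versioned module object
		}
		if m := k[:i]; m != last {
			mods = append(mods, m)
			last = m
		}
	}
	if storeNext == "" {
		return mods, ""
	}
	return mods, storeNext + "|" + last
}

func main() {
	keys := []string{"a/@v/v1.mod", "a/@v/v1.zip", "a/@v/v2.mod", "b/@v/v1.mod"}
	sort.Strings(keys)
	s := fakeStore{keys: keys, pageSize: 2}
	for token := ""; ; {
		mods, next := listModules(s, token)
		fmt.Println(mods)
		if next == "" {
			break
		}
		token = next
	}
}
```

Note that a page can come back with fewer modules than the storage page size, or even empty, when most of its keys belong to a module that was already reported.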

fedepaol on 7 Nov 2018

yes that would be great meaning page/count
I would rather have this logic than a continuation as it is more intuitive and you can easily imagine where you stand at the exact point of time

one suboptimal solution would be to load blobs until you encounter the last item of the previous page and then retrieve pageSize more
from my experience with listing blobs, every solution had some major drawback.

michalpristas on 7 Nov 2018

Which means that if the user asks for page #3, we would scroll through all the blobs until we have collected and discarded 2 * pageSize modules.
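A small sketch of that scroll-and-discard emulation of page/count on top of a sorted listing. `modulesPage` is a hypothetical helper, pages are 1-based here, and keys are again assumed to look like `<module>/@v/<version>.<ext>`.

```go
package main

import (
	"fmt"
	"strings"
)

// modulesPage returns the modules on the given 1-based page by scanning
// sortedKeys from the start and discarding (page-1)*pageSize modules.
func modulesPage(sortedKeys []string, page, pageSize int) []string {
	if page < 1 || pageSize < 1 {
		return nil
	}
	skip := (page - 1) * pageSize
	var out []string
	last := ""
	seen := 0
	for _, k := range sortedKeys {
		i := strings.Index(k, "/@v/")
		if i <= 0 {
			continue // not a versioned module object
		}
		m := k[:i]
		if m == last {
			continue // another version of the same module
		}
		last = m
		seen++
		if seen <= skip {
			continue // belongs to an earlier page
		}
		out = append(out, m)
		if len(out) == pageSize {
			break
		}
	}
	return out
}

func main() {
	keys := []string{
		"a/@v/v1.mod", "a/@v/v2.mod", "b/@v/v1.mod", "c/@v/v1.mod",
		"c/@v/v1.zip", "d/@v/v1.mod", "e/@v/v1.mod",
	}
	fmt.Println(modulesPage(keys, 3, 2))
}
```

The cost grows with the page number, since every request rescans from the beginning of the listing.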

fedepaol on 7 Nov 2018

I like the last option. Scroll by token. Does the token have a time to live?
Essentially, I would expect
GET /storage
GET a token
GET /storage with token
and so on..

Every time a request finishes, the time to live for the token gets refreshed. This can be on the order of a few seconds.

manugupt1 on 7 Nov 2018

I would propagate the token directly from s3 / gcp to the user and use that for the new requests. So our behaviour would be aligned to the one of the storage which sounds reasonable whatever this behaviour is.
As I said, the only problem with this approach is having the same module returned again in the second request, which we can avoid by providing some kind of caching (maybe bound to the token itself?).

fedepaol on 7 Nov 2018

regarding paging yes unfortunately 😢
continuation seems like a good idea resource wise. i wouldn't see the problem in displaying one item multiple times, it can be a case even with .Skip(page * pageSize).Take(pageSize) if the item is inserted somewhere at the start
it is counterintuitive but probably the best we can get
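To illustrate the point about offset paging, this sketch mirrors `.Skip(page * pageSize).Take(pageSize)` with a hypothetical `pageOf` helper (0-based pages) and shows an item appearing on two pages after an insert at the start of the list.

```go
package main

import "fmt"

// pageOf returns the 0-based page of items, equivalent to
// skipping page*pageSize items and taking the next pageSize.
func pageOf(items []string, page, pageSize int) []string {
	if page < 0 || pageSize < 1 {
		return nil
	}
	start := page * pageSize
	if start >= len(items) {
		return nil
	}
	end := start + pageSize
	if end > len(items) {
		end = len(items)
	}
	return items[start:end]
}

func main() {
	items := []string{"b", "c", "d", "e"}
	first := pageOf(items, 0, 2)

	// A new item lands at the start between the two requests.
	items = append([]string{"a"}, items...)
	second := pageOf(items, 1, 2)

	fmt.Println(first, second) // prints "[b c] [c d]": "c" shows up twice
}
```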

michalpristas on 7 Nov 2018

What if we started with an API that lists the total number of modules we're serving?

arschles on 9 Nov 2018

Closing since #955 is in.