Wallabag: performance issue in entries API

Created on 31 Jan 2017 · 4 comments · Source: wallabag/wallabag

Issue details

It seems to me there is a big performance issue in the entries listing API. From what I can tell, the entries are listed in the _embedded JSON field, and each one has more or less the following structure (sorry for the golang, I couldn't find that documentation in the API directly):

type Item struct {
    Links       Links         `json:"_links"`
    Annotations []interface{} `json:"annotations"`
    CreatedAt   WallabagTime  `json:"created_at"`
    DomainName  string        `json:"domain_name"`
    ID          int           `json:"id"`
    IsArchived  int           `json:"is_archived"`
    IsStarred   int           `json:"is_starred"`
    Mimetype    string        `json:"mimetype"`
    ReadingTime int           `json:"reading_time"`
    Tags        []interface{} `json:"tags"`
    UpdatedAt   WallabagTime  `json:"updated_at"`
    UserEmail   string        `json:"user_email"`
    UserID      int           `json:"user_id"`
    UserName    string        `json:"user_name"`
}

Most of those fields are fine and small, but wise API developers will notice one field is missing: content. It is missing because I optimized it out of my program: it's just too frigging huge to carry around in memory all the time.

Unfortunately, even though I don't parse it, it is still sent by the API. This means that, in my use case, even if I just list the unread entries with the API, I receive a 2MB JSON blob. And that's for 66 unread entries. I can't imagine the monstrosity the API would return if I asked it to list all 5000 archived entries.
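
To illustrate the point: Go's encoding/json skips JSON keys that have no matching struct field, so leaving content out of the struct saves memory after decoding but not bytes on the wire. Here is a minimal sketch, where the payload is made up and listItem is a hypothetical, trimmed-down version of the struct above (neither is real API output):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// listItem is a hypothetical, trimmed-down stand-in for the Item struct.
// It deliberately has no field for "content".
type listItem struct {
	ID         int    `json:"id"`
	DomainName string `json:"domain_name"`
}

// decodeItems parses a JSON array of entries. Keys without a matching
// struct field (such as "content") are skipped by encoding/json.
func decodeItems(payload []byte) ([]listItem, error) {
	var items []listItem
	err := json.Unmarshal(payload, &items)
	return items, err
}

// samplePayload builds a made-up one-entry response whose "content"
// field holds contentSize bytes of filler text.
func samplePayload(contentSize int) []byte {
	body := strings.Repeat("x", contentSize)
	return []byte(fmt.Sprintf(`[{"id": 1, "domain_name": "example.org", "content": %q}]`, body))
}

func main() {
	payload := samplePayload(30000)
	items, err := decodeItems(payload)
	if err != nil {
		panic(err)
	}
	// The content is dropped after decoding, but every byte of it was
	// still transferred and scanned by the parser.
	fmt.Printf("bytes received: %d, entries kept: %d\n", len(payload), len(items))
}
```

The decoded slice stays small, yet the payload length grows with the content size, which is the cost described above.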

I know there's paging and everything, but it seems to me that listing entries and fetching individual entries should be separate operations.

It is unclear to me why the full content is sent in API operations, given that there's a distinct /api/entries/{entry}.{format} endpoint. I understand it may be useful for certain endpoints to avoid making multiple roundtrips, but I suspect it makes a lot of API calls needlessly slow...

Could there be a flag to keep the API from sending the actual content? That would retain backwards compatibility and allow for faster requests for those who just want a list of article titles (for example).
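
From the client side, such a flag could look roughly like this. This is only a sketch: the detail query parameter and its metadata value are hypothetical names picked for illustration, and the base URL is a placeholder. Leaving the flag out keeps today's behavior, which is what would preserve backwards compatibility.

```go
package main

import (
	"fmt"
	"net/url"
)

// listURL builds an entries-listing URL. The "detail" query parameter is
// hypothetical: it stands for the flag proposed in this issue and is not
// something the thread confirms the API supports.
func listURL(base string, withContent bool) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	q := u.Query()
	if !withContent {
		// Ask the server to leave the article body out of each entry.
		q.Set("detail", "metadata")
	}
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	// Placeholder host and path, assumed for the example.
	light, err := listURL("https://example.org/api/entries.json", false)
	if err != nil {
		panic(err)
	}
	fmt.Println(light)
}
```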

This is a key performance issue I found while developing the wallabako downloader, which fetches the list of articles and then downloads EPUB versions of given entries. It doesn't need the actual content of the items when it does the listing, because it wants the EPUB version. To give you an idea, it takes at least one full second to list the 66 unread entries in the API here. Downloading them afterwards of course takes much longer, but once that's done, there's nothing else to do and most of the time the program is waiting on the API:

2017/01/31 10:14:21 logging in to https://lib3.net/wallabag
2017/01/31 10:14:21 CSRF token found: 200 OK
2017/01/31 10:14:21 logged in successful: 302 Found
2017/01/31 10:14:22 found 66 unread entries
2017/01/31 10:14:22 completed in 1.07s
2017/01/31 10:14:22 URL https://lib3.net/wallabag/export/23128.epub older than local file /home/anarcat/tmp/epubs2/23128.epub, skipped
2017/01/31 10:14:22 URL https://lib3.net/wallabag/export/19964.epub older than local file /home/anarcat/tmp/epubs2/19964.epub, skipped
2017/01/31 10:14:22 URL https://lib3.net/wallabag/export/23152.epub older than local file /home/anarcat/tmp/epubs2/23152.epub, skipped
2017/01/31 10:14:22 URL https://lib3.net/wallabag/export/23160.epub older than local file /home/anarcat/tmp/epubs2/23160.epub, skipped
2017/01/31 10:14:22 URL https://lib3.net/wallabag/export/23170.epub older than local file /home/anarcat/tmp/epubs2/23170.epub, skipped
2017/01/31 10:14:22 processed: 5, downloaded: 0
2017/01/31 10:14:22 completed in 1.53s

1.07 seconds for the API call, about 70% of the 1.53 seconds total here...

Can this be improved?

API Improvement

anarcat

👍2

All 4 comments

Of course we need improvement on that part.
Regarding what Pocket does, they provide two kinds of _retrieval_ type:

detailType
simple = only return the titles and urls of each item
complete = return all data about each item, including tags, images, authors, videos and more

It could be good inspiration.
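
As a rough server-side illustration of the same idea (a sketch only: the entry type, its field names, and the render helper are hypothetical and not taken from wallabag's or Pocket's code), a listing could blank out the heavy field when the simple level is requested:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// entry is a hypothetical server-side record with illustrative fields.
type entry struct {
	Title   string `json:"title"`
	URL     string `json:"url"`
	Content string `json:"content,omitempty"`
}

// render serializes entries for a listing. With detailType "simple" the
// content is blanked so that omitempty drops it from the output; any
// other value returns everything, keeping the full response as default.
func render(entries []entry, detailType string) ([]byte, error) {
	// Work on a copy so the caller's entries are left untouched.
	out := make([]entry, len(entries))
	copy(out, entries)
	if detailType == "simple" {
		for i := range out {
			out[i].Content = ""
		}
	}
	return json.Marshal(out)
}

func main() {
	entries := []entry{{Title: "Example", URL: "https://example.org/a", Content: "long article body"}}
	for _, level := range []string{"simple", "complete"} {
		data, err := render(entries, level)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: %d bytes\n", level, len(data))
	}
}
```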

j0k3r on 31 Jan 2017

👍3

A similar issue is that when we PATCH an entry, we get the whole dump of the entry back instead of just a confirmation, which is rather silly: we already have the data...

anarcat on 10 Feb 2017

Is there any change here? This is quite annoying: if you have limited traffic and a big list of entries in your wallabag, it is sometimes completely impossible to fetch the entries. I am traveling right now and I can see how this issue hits the iOS client: basically it doesn't work even on wifi (at least I highly suspect that this issue is the reason).