(I debated whether to add this as a comment to #1817, but it seems like too much text and detail for that.)
Currently if CKAN downloads many files from GitHub at the same time, they often fail with HTTP status code 403-Forbidden. #1817 contains an example, but these reports are common and I've definitely seen it happen myself several times.
The GitHub API uses 403 codes for throttling; you get 60 unauthenticated requests per hour, and any beyond that return a 403. I encountered this while working on an unrelated project, and I had to use a GitHub token to allow 5000/hour, passed in the HTTP request headers:
Authorization: token <OAuth token here>
Currently CKAN's downloads do not go through the GitHub API, so this does not necessarily indicate exactly what's going on with them. However, it establishes that 403-Forbidden is sometimes used for throttling, and it becomes more relevant later in discussion of the API.
Sample API data for releases, minus the author and uploader fields since they're long and not relevant to this issue:
{
"url": "https://api.github.com/repos/HebaruSan/Astrogator/releases/7538924",
"assets_url": "https://api.github.com/repos/HebaruSan/Astrogator/releases/7538924/assets",
"upload_url": "https://uploads.github.com/repos/HebaruSan/Astrogator/releases/7538924/assets{?name,label}",
"html_url": "https://github.com/HebaruSan/Astrogator/releases/tag/v0.7.8",
"id": 7538924,
"tag_name": "v0.7.8",
"target_commitish": "master",
"name": "Frictionless toilet",
"draft": false,
"prerelease": false,
"created_at": "2017-08-28T06:00:32Z",
"published_at": "2017-08-28T06:09:07Z",
"assets": [
{
"url": "https://api.github.com/repos/HebaruSan/Astrogator/releases/assets/4682631",
"id": 4682631,
"name": "Astrogator.zip",
"label": null,
"content_type": "application/zip",
"state": "uploaded",
"size": 114372,
"download_count": 5243,
"created_at": "2017-08-28T06:08:01Z",
"updated_at": "2017-08-28T06:08:02Z",
"browser_download_url": "https://github.com/HebaruSan/Astrogator/releases/download/v0.7.8/Astrogator.zip"
}
],
"tarball_url": "https://api.github.com/repos/HebaruSan/Astrogator/tarball/v0.7.8",
"zipball_url": "https://api.github.com/repos/HebaruSan/Astrogator/zipball/v0.7.8",
"body": "Fix glitches when the settings file is invalid"
}
The zip file that we want to download is associated with assets[0], and there are two fields for it, url and browser_download_url. This becomes important later.
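To make the difference between the two fields concrete, here is a minimal Python sketch that pulls both out of a release object shaped like the sample above. The helper name `asset_urls` is hypothetical and is not part of CKAN or Netkan; the JSON is a trimmed copy of the sample data.

```python
import json

def asset_urls(release, index=0):
    """Return (api_url, browser_url) for one asset of a release object."""
    asset = release["assets"][index]
    return asset["url"], asset["browser_download_url"]

# Trimmed copy of the sample release data shown above
sample = json.loads("""
{
  "tag_name": "v0.7.8",
  "assets": [
    {
      "url": "https://api.github.com/repos/HebaruSan/Astrogator/releases/assets/4682631",
      "name": "Astrogator.zip",
      "browser_download_url": "https://github.com/HebaruSan/Astrogator/releases/download/v0.7.8/Astrogator.zip"
    }
  ]
}
""")

api_url, browser_url = asset_urls(sample)
print(api_url)      # the API endpoint for the asset
print(browser_url)  # the link meant for humans in a browser
```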
I used the "Contact GitHub" link to reach out to GitHub about how their download throttling works. Surprisingly, the person who replied understood exactly what I was talking about and how to fix it :+1:. It turns out that these problems happen because CKAN is not using GitHub as intended. From my conversation with the very helpful support person:
If I understood your message correctly, it seems like you're programmatically downloading resources from github.com, is that right? If that's so, then you shouldn't be doing that. GitHub.com wasn't built for programmatic use like that, it was built for humans. For programmatic use, you should be using the API. The API has well defined rate limits and caching behavior you can rely on, while GitHub.com doesn't. That doesn't mean that github.com doesn't have any rate limits, it only means that you can be rate limited at any time and without warning.
So, we'd like to ask you to switch and use the API for downloading the data you need, and respect the defined rate limits (that's what a good citizen app should be doing, instead of hitting github.com):
("good citizen" was my phrasing in my original message, so don't take that as an unprovoked criticism of our civic virtues.)
If I'm interpreting that code snippet correctly, you're using the browser_download_url link, which, as the name suggests, is intended to be used by human users via a browser.
For downloading release assets via the API, you should be using this endpoint:
https://developer.github.com/v3/repos/releases/#get-a-single-release-asset
Notice this note: "To download the asset's binary content, set the Accept header of the request to application/octet-stream. The API will either redirect the client to the location, or stream it directly if possible. API clients should handle both a 200 or 302 response."
That would be the "url" field of a particular asset (which are listed when you fetch a release e.g. via https://developer.github.com/v3/repos/releases/#get-a-single-release), but with the addition of the special Accept header.
Key points:
The URL CKAN currently downloads from (browser_download_url) is for users and browsers only, not applications. It can be throttled, but there is no explicit policy or workaround.
Applications should download through the API instead. This can be done by requesting the url field instead of browser_download_url and setting a custom HTTP header:
Accept: application/octet-stream
I tested this with wget, and setting the Accept header did indeed give me the download. Without this header, it returns a JSON object describing the asset.
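The same request can be sketched in Python with only the standard library. This is an illustration of the header handling under the assumption that a plain `urllib` client is acceptable; it is not CKAN's actual download code, and the function names are made up for the example.

```python
import urllib.request

def build_asset_request(api_url):
    """Build a request for an asset's API url that asks for the binary content."""
    # Without this Accept header the API returns JSON describing the asset
    return urllib.request.Request(
        api_url, headers={"Accept": "application/octet-stream"}
    )

def download_asset(api_url, dest_path):
    """Download the asset to dest_path (performs network access when called)."""
    req = build_asset_request(api_url)
    # urlopen follows the 302 redirect that the API may answer with
    with urllib.request.urlopen(req) as resp, open(dest_path, "wb") as out:
        out.write(resp.read())
```

Calling `download_asset` with the `url` value from the sample data and a local file name should behave like the wget test described above.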
The download metadata field currently holds browser_download_url.
When downloading from GitHub, we need to send the custom HTTP header. This cannot be accomplished simply by swapping out the bad URL for the good URL in the download metadata field.
Proposed new metadata field:
github_download - The assets[0].url value from the API
Specific changes:
If a GitHub download returns a 403 status, we should handle the exception and notify the user that their downloads are being throttled. We could direct them to the setting (see below) and web page dealing with GitHub auth tokens, and/or advise them to wait 60 minutes for their limit to reset. https://api.github.com/rate_limit can be used to get the exact limit and timing numbers.
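A rough Python sketch of turning the rate_limit response into a user-facing message follows. It assumes the response contains `resources.core` with `limit`, `remaining`, and `reset` (epoch seconds), which is how I understand the endpoint to respond; the function names and the message wording are hypothetical.

```python
import json
import math
import urllib.request

RATE_LIMIT_URL = "https://api.github.com/rate_limit"

def fetch_rate_limit():
    """Fetch the current rate limit status (performs network access when called)."""
    with urllib.request.urlopen(RATE_LIMIT_URL) as resp:
        return json.load(resp)

def throttle_message(rate_limit, now):
    """Return a message for the user if the core quota is exhausted, else None.

    rate_limit is the parsed JSON; now is the current time in epoch seconds.
    """
    core = rate_limit["resources"]["core"]
    if core["remaining"] > 0:
        return None
    wait_minutes = max(0, math.ceil((core["reset"] - now) / 60))
    return (
        "GitHub downloads are being throttled (limit {}/hour). "
        "Try again in about {} minutes, or add a GitHub auth token in the settings."
    ).format(core["limit"], wait_minutes)
```

A client could call `throttle_message(fetch_rate_limit(), time.time())` after catching a 403 to decide what to show.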
Users will be limited to 60 GitHub downloads per hour, because this is the limit of the GitHub API. 140+ mod installs are pretty commonly mentioned on the forums, and reinstalling everything from scratch is a common method for dealing with compatible upgrades, so some users would probably encounter this limit and not appreciate the 60-minute wait to be able to download more. The only way around this is to use a GitHub auth token, which boosts the limit to 5000/hour per token.
It would be nice to ship a single internal auth token for all of CKAN, since then users would have the 5000/hour limit by default without having to worry about any of the details. More responses from the GitHub contact person:
Including a single pre-defined token with the app so that this token is used by all users of the app is possible. You could create a scopeless token here https://github.com/settings/tokens/new and include that. A scopeless token doesn't have any special permissions -- it can be used for read-only access of public data. So, it would be safe in that way. However, someone could easily take your token from the app, and then drain the API quota for the user who owns the token by making lots of unnecessary API requests. At that point, the app would stop working for everyone who uses the app.
Deliberate abuse like that is unlikely, but assuming 200 downloads per active user per hour, a 5000/hour limit across all CKAN users would support 25 active users in a given hour. The number of active users at a given time isn't known, but the latest CKAN release has over 60000 downloads, so it's probably more than 25. If we were able to determine the limit we needed per hour, we could divide it by 5000 and then generate that many tokens and pick one randomly per request, but that might not be in the spirit of the API's rules.
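The arithmetic above can be written out as a small Python sketch. The 200 downloads per user figure is the assumption from the paragraph, and the 100 active users in the second call is purely a hypothetical number for illustration, since the real figure isn't known.

```python
import math

def users_supported(limit_per_hour, downloads_per_user):
    """How many active users one shared quota can serve in an hour."""
    return limit_per_hour // downloads_per_user

def tokens_needed(required_per_hour, limit_per_token=5000):
    """How many tokens would be needed to cover a required hourly volume."""
    return math.ceil(required_per_hour / limit_per_token)

print(users_supported(5000, 200))  # 25 users on a single token
print(tokens_needed(200 * 100))    # 4 tokens for a hypothetical 100 active users
```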
We could create a new settings field called GitHub Auth Token, where the user could fill in their own tokens to allow more downloads. This could be instead of or in addition to any built-in tokens we may or may not use, and it should support all the UIs.
If Netkan was updated to use this new scheme tomorrow, current CKAN clients would break unless the old download field was still populated. So we should not remove support for the old metadata immediately; GitHub downloads should use both download and github_download until all clients are updated.
The API docs say:
Make requests for a single user or client ID serially. Do not make requests for a single user or client ID concurrently.
So even with a token, CKAN's parallel download method would still be in violation of the letter of the law.
As a halfway measure, we could try scaling back the parallelization of downloads.
This might solve the problem without messing with all the API/token stuff. We would still technically be misusing GitHub, but users should no longer encounter failed downloads as frequently.
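One possible reading of "scaling back" is to keep downloads from different hosts parallel while running each host's downloads one after another. The Python sketch below illustrates that idea only; `download_one` is a caller-supplied function, and none of these names come from CKAN.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

def group_by_host(urls):
    """Group URLs by host name, preserving the original order within each host."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).hostname].append(url)
    return dict(groups)

def download_all(urls, download_one, max_hosts_in_parallel=4):
    """Run downloads serially per host, with different hosts in parallel.

    Returns a dict mapping each host to the list of download_one results.
    """
    def run_serially(host_urls):
        return [download_one(u) for u in host_urls]

    groups = group_by_host(urls)
    with ThreadPoolExecutor(max_workers=max_hosts_in_parallel) as pool:
        results = list(pool.map(run_serially, groups.values()))
    return dict(zip(groups.keys(), results))
```

With this shape, no single host ever sees more than one request at a time from the client, which also lines up with the serial-requests guidance quoted from the API docs.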
Well, we're already on track for a point release with a spec update, so it's a good time to do this. We've had projects before to go through old metadata and update it, so adding GitHub download info isn't impossible (though it may be impracticable). Even if we just force the current releases to be repopulated, and all new ones going forward, we'll be ahead of the game, and if our trigger for the different download behaviour is the existence of the github_download URL, the worst that will happen is that things work as they do now.
I like the multi-pass approach of starting with no token and failing over to a common token. The number of users that would still have issues is probably pretty small (there aren't that many that have issues currently), and adding the ability for users to add their own API token would probably work for that minority of users wanting three-figure mod downloads.
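The multi-pass idea could look roughly like the following Python sketch. It assumes a caller-supplied `fetch(url, token)` that returns a `(status, body)` pair and treats 403 as "throttled, try the next token"; the function name and the order of tokens are illustrative, not a settled design.

```python
def download_with_failover(url, fetch, tokens):
    """Try without a token first, then each token in order, stopping at the first non-403.

    fetch(url, token) must return (status, body); token is None for the first pass.
    Returns the (status, body) of the first non-403 attempt, or (403, None) if all fail.
    """
    attempts = [None] + [t for t in tokens if t]
    for token in attempts:
        status, body = fetch(url, token)
        if status != 403:
            return status, body
    return 403, None
```

A caller might pass `tokens=[common_token, user_token]`, with empty entries skipped, so that users who never configure anything still get the first two passes.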
Could you post something on the CKAN thread explaining what you've found and what users can do to avoid the download throttling in the meantime? (i.e., download contents in blocks of 60 mods and wait in between)
I would love to give users more info, however right now there is no useful info to give.
The 60/hour limit assumes use of the API, which is not how the currently released CKAN client works. It instead uses the browser download URL, which has secret and unknown throttling characteristics. This was what I was hoping to learn from the GitHub contact person in the first place, and he only confirmed that they're not going to explain it publicly.
There's also the bit at the end about parallel downloads; it's quite possible that the secret throttling rules would trigger for (far) fewer than 60 simultaneous downloads, and in fact I think I've seen it do that. So unfortunately I'm not able to outline an effective workaround for end users, other than downloading mods one at a time.
We'd want to be exceptionally careful about baking tokens into the client: firstly, they could be taken and abused by a third party, and secondly, rotating tokens to work around the limits may be against the terms of service.
Probably in the interim we can start generating URLs that point to the API download and use some logic to use it instead, but I'm guessing the unauthenticated limit is quite low?
We also have the client implementation part of #1682, which would allow FOSS mods to fall back to the Internet Archive to get the download. That is one of the driving factors in building that system.
Ok, if I'm synthesizing the data above correctly, we should be doing this:
And as a separate exercise, implement a way for users to enter their own GitHub token for the GitHub downloads and throw a message advising of the option when we get a GitHub 403 error.
Is the only reason to have a github_download field so that the HTTP client knows to set the Accept header? In that case I'd rather just bump the spec version, dump that information in the download field, and just always set the Accept header to application/zip, application/octet-stream.
Actually, since we store the content type in the metadata as download_content_type, the Accept header should default to: ${download_content_type}, application/octet-stream.
Actually, throw in some quality values to assert that we would prefer the specified content type over some arbitrary binary stream and we should have something that works for anything: ${download_content_type};q=1.0,application/octet-stream;q=0.9.
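As a sketch of how that header could be assembled, assuming `download_content_type` may be missing for some modules, a small Python helper might look like this (the function name is hypothetical):

```python
def accept_header(download_content_type=None):
    """Build an Accept header preferring the declared type over a generic binary stream."""
    fallback = "application/octet-stream"
    # With no declared type, or when it already is the fallback, a single value suffices
    if not download_content_type or download_content_type == fallback:
        return fallback
    return "{};q=1.0,{};q=0.9".format(download_content_type, fallback)

print(accept_header("application/zip"))
# application/zip;q=1.0,application/octet-stream;q=0.9
```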
Essentially it's because older clients will fail to download the API URLs, and the metadata is shared between old and new clients. If we switch all the URLs over in CKAN-meta, then everyone who hasn't upgraded to the newest CKAN will get download errors for all GitHub URLs, every time.
The way that I would roll it out would be as follows:
Add the Accept header for all downloads. Let's call this version 1.xx.
Update the spec to say that clients that abide by spec > 1.xx must send appropriate Accept headers.
Update documentation to state that all GitHub downloads should use a spec version of at least 1.xx.
Find all .netkan files with github kref and set their spec to 1.xx. This will trigger a mass commit to the CKAN-meta repo for all the most recent versions.
Update existing .ckan files with a download url that points to github.com to change their spec to 1.xx.
Update .ckan files to use API download URLs.
One thing I've been meaning to do is add a SpecVersionTransformer to NetKAN that will automatically spit out the oldest possible spec version that supports the features used in the .netkan file. Shouldn't be that difficult. Standard practice for .netkan files would then be to not specify a spec_version and have it automatically be populated.
Sounds like a good plan to me. We can probably add the Accept header in 1.24...
Wait, if the spec version requirement prevents them from downloading, then what do steps 3-5 buy us as compared to just switching the URLs immediately? GitHub downloads break either way, but changing the URLs would actually fix something as well.
I suppose we don't need step 5 and can do 3-7 pretty much simultaneously.
OK, sorry, I forgot some pieces of the puzzle since writing this up.
The github_download property isn't needed (only) for old clients. It's because of the OAuth token stuff to make API throttling less severe. If we want users to be able to download more than 60 API URLs from GitHub per hour, then we need some way of sending an auth token that is either hard coded, configured, or whatever. This would require special handling just for GitHub downloads, which would be tied to the github_download property.
Couldn't that be specially handled in the client by checking for the presence of an api.github.com host in the download URL, rather than having extra metadata?
Yeah, that makes sense. In fact, we can have a Dictionary<string, string> to store a mapping from any download host to an auth token, in case other servers add this in the future.
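The host-to-token mapping idea, sketched in Python rather than C# for brevity: a plain dict plays the role of the `Dictionary<string, string>`, and the `Authorization: token ...` format is the one GitHub uses as described earlier in this issue. The function name and the placeholder token are hypothetical, and other hosts might need a different header format.

```python
from urllib.parse import urlparse

def auth_headers(url, tokens_by_host):
    """Return extra headers for url based on its host, or {} if no token is configured."""
    host = urlparse(url).hostname
    token = tokens_by_host.get(host)
    if token is None:
        return {}
    return {"Authorization": "token " + token}

# Placeholder value, not a real token
tokens = {"api.github.com": "PLACEHOLDER_TOKEN"}
print(auth_headers("https://api.github.com/repos/HebaruSan/Astrogator/releases/assets/4682631", tokens))
print(auth_headers("https://example.org/some/mod.zip", tokens))
```

Keying on the host means no extra metadata field is needed to decide when a token applies.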
Where would we store the tokens? registry.json or GUIConfig.xml would be instance-specific, which wouldn't be great.
Where would we store the tokens?
Right now it would be the registry, which is where user configuration data is stored. I'm not really a fan of that, however, and have been meaning to replace it with a file-based solution with JSON files.
Accept can take multiple comma-delimited values. Use application/zip;q=1.0,application/octet-stream;q=0.9, which tells the server "I prefer application/zip, but will also accept application/octet-stream if you can provide it".
curl -L -H "Accept: application/zip;q=1.0,application/octet-stream;q=0.9" https://api.github.com/repos/HebaruSan/Astrogator/releases/assets/4682631 -o Astrogator.zip
Works as expected.
Yeah, I realized that after I posted and deleted that comment. It does indeed work even in a patched CKAN.
Not sure how feasible this is, but maybe offer the option for users to use their own API keys?
As of today, most CKAN users will be on 1.24, which supports API URLs. We can consider migrating the metadata as we please now.
Asterisk: Mono users might still be able to run 1.22.6 and earlier, since it worked for me on Ubuntu even after #2293's GitHub Apocalypse. Maybe allow a few weeks for this group to upgrade.
Summary of current status:
Throttled GitHub downloads now raise DownloadThrottledException, which the UIs catch and use to prompt the user to add auth tokens (#2277)
Best of all, everyone now has these improvements, because #2293 broke all older clients.
Remaining to be done:
However, complaints about the throttling have largely gone quiet lately. It may be wise to leave this as-is for a while lest we spoil a calm situation.