I'm not sure whether this is due to one of the recent changes, or simply #1119, which is a bug that causes a token to erroneously be considered exhausted once it's used for a search request.
People can't add new tokens either (#1243), exacerbating this slightly, but that will be fixed in #1038.
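For anyone who hasn't followed #1119: GitHub tracks the search quota separately from the core quota, so a token that has just made a search request can still have plenty of core quota left. A minimal sketch of the kind of per-resource bookkeeping that would avoid this (hypothetical names, not the actual github-auth code):

```js
// Hypothetical sketch: track core and search quotas separately per token, so
// using a token for a search request never marks it exhausted for core work.
class TokenState {
  constructor(token) {
    this.token = token;
    // GitHub reports these limits independently.
    this.quota = {
      core: { remaining: Infinity, reset: 0 },
      search: { remaining: Infinity, reset: 0 },
    };
  }

  // Update one resource's quota from a response's X-RateLimit-* headers.
  update(resource, headers) {
    this.quota[resource] = {
      remaining: Number(headers['x-ratelimit-remaining']),
      reset: Number(headers['x-ratelimit-reset']) * 1000, // epoch ms
    };
  }

  // Usable for a resource if quota remains, or if its window already reset.
  isUsable(resource, now = Date.now()) {
    const { remaining, reset } = this.quota[resource];
    return remaining > 0 || now >= reset;
  }
}

// Exhausting the search quota should not make the token unusable for core.
const t = new TokenState('example-token');
t.update('search', {
  'x-ratelimit-remaining': '0',
  'x-ratelimit-reset': String(Math.floor(Date.now() / 1000) + 60),
});
console.log(t.isUsable('core'));   // true
console.log(t.isUsable('search')); // false until the search window resets
```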
The first report was roughly 16 hours after deploy.
cc @espadrine
The badges are working again, and I think our "main" rate limit just reset:
core:
  remaining: 12489 of 12500
  reset: in an hour
search:
  remaining: 30 of 30
  reset: in a minute
graphql:
  remaining: 5000 of 5000
  reset: in an hour
Generated with https://github.com/paulmelnikow/github-limited
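For reference, numbers like these come from GitHub's /rate_limit endpoint, which reports core, search, and graphql separately; the github-limited tool above presumably does something along these lines. A minimal sketch of the query (assumes Node 18+ for global fetch, run as an ES module; GITHUB_TOKEN is an illustrative name):

```js
// Query GitHub's rate-limit endpoint and print remaining/limit per resource.
// Without a token you see the anonymous per-IP quota instead.
const headers = process.env.GITHUB_TOKEN
  ? { Authorization: `token ${process.env.GITHUB_TOKEN}` }
  : {};

const res = await fetch('https://api.github.com/rate_limit', { headers });
const { resources } = await res.json();

for (const name of ['core', 'search', 'graphql']) {
  const { limit, remaining, reset } = resources[name];
  console.log(`${name}: ${remaining} of ${limit}, resets ${new Date(reset * 1000).toISOString()}`);
}
```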
Now intermittently broken, though plenty of rate limit left.
core:
  remaining: 12479 of 12500
  reset: in 18 minutes
search:
  remaining: 30 of 30
  reset: in a minute
graphql:
  remaining: 5000 of 5000
  reset: in an hour
Here's an example: https://img.shields.io/github/tag/expressjs/express.svg
I sent this to @espadrine about an hour ago:
The Github badges are failing intermittently. Are you seeing crashes on the server?
It might be an old bug related to our handling of quotas for the GitHub search API, though the timing of the incident makes me suspect recent changes. I've made some recent changes to the GitHub auth code, but nothing jumps out at me from rereading them.
It's difficult to debug without server access. I'm thinking I should add an endpoint that returns all the user tokens, or else hashed user tokens with stats. That way I could troubleshoot a bit better locally.
Is there currently any backup of the user tokens, apart from the other servers?
I do have some new github token code to fix the search API quota issue, though it's a rewrite and I'd like to test it more first. Before merging I also want to add some optional trace logging we can turn on in cases like this.
I feel like I need deploy access + logs + a way to restore the token file in order to deploy that with confidence that I can find and fix whatever might be wrong with it.
Any thoughts on what could be causing the ssh issue?
I like getting to the bottom of things, and want to fix this, however my options are limited.
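On the "hashed user tokens with stats" idea in that email, roughly what I have in mind is something like this (a sketch with hypothetical names, assuming an Express-style handler; not the actual code):

```js
const crypto = require('crypto');

// Report per-token stats keyed by a short hash, never exposing the token itself.
// `tokenPool` is assumed to be an iterable of { token, usage, quota } records.
function tokenDebugHandler(tokenPool) {
  return (req, res) => {
    const tokens = [...tokenPool].map(({ token, usage, quota }) => ({
      id: crypto.createHash('sha256').update(token).digest('hex').slice(0, 8),
      usage,
      quota,
    }));
    res.json({ count: tokens.length, tokens });
  };
}

// e.g. app.get('/debug/tokens', tokenDebugHandler(pool));
```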
I set up a status page:
https://status.shields-server.com/
It runs a static badge, the GitHub license badge, and the npm license badge, and checks each one for some of the expected text.
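In code terms, a keyword check like that amounts to something along these lines (a sketch only; the URLs and expected strings are illustrative, not the exact UptimeRobot keywords; assumes Node 18+ run as an ES module):

```js
// Fetch a badge and verify its SVG contains some expected text.
async function checkBadge(url, expectedTexts) {
  const res = await fetch(url);
  const svg = await res.text();
  const up = res.ok && expectedTexts.some((text) => svg.includes(text));
  console.log(`${up ? 'UP  ' : 'DOWN'} ${url}`);
  return up;
}

// Illustrative targets and keywords.
await checkBadge('https://img.shields.io/github/license/expressjs/express.svg', ['MIT']);
await checkBadge('https://img.shields.io/npm/l/express.svg', ['MIT']);
```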
I'm happy to cover the cost for a couple months ($5.50) but it might be good to migrate to something else soon.
When I created shields-server.com, I set up CNAMEs for s0.shields-server.com, s1.shields-server.com, and s2.shields-server.com, though it'd be better to make these subdomains of shields.io and dump the extra domain.
Nice page. Should help get some insight into what's going wrong.
The GitHub licence badge seems to be failing a fair amount (~20% currently).
Are s1, s2, and s3 running different code, or are they all the same?
Yea, thanks, it should help. The code on the three servers should be the same.
There are interesting patterns in the downtime:
https://status.shields-server.com/779605524
https://status.shields-server.com/779605526
https://status.shields-server.com/779605529
The three servers had correlated downtime around 15:30 (that’s NY time). One of them also had downtime an hour earlier, around 14:30. Two had downtime around 13:33 / 13:43.
The duration of the downtime varies from server to server. For example, s0 was down from 15:28 to 15:49, s1 from 15:30 to 15:36, and s2 from 15:32 to 15:48.
Correlated downtime suggests there is some shared state, pointing to rate limit exhaustion as a factor. Downtime about an hour apart might correlate with rate limit resets.
The skew in recovery time might be explained by caching, though there might be other explanations too.
Yeah, it's quite strange that the downtimes are so similar. Would setting a very low max-age help with possible caching issues?
As far as I can tell, maxAge only affects cache headers – and potentially the behavior of the client – though not the behavior of the Shields server. I wouldn't think UptimeRobot did any caching. It wouldn't really make sense for a monitoring service. So I don't think setting maxAge would have any effect.
I just wanted to clarify that the caching I think might be involved is the Shields internal vendor cache in lib/request-handler.js.
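To make that distinction concrete, here's a rough sketch of the two mechanisms (simplified, not the actual lib/request-handler.js code):

```js
// 1) maxAge only shapes the Cache-Control header, i.e. what the client
//    (or a CDN in front of us) is allowed to cache. The server still runs.
function sendBadge(res, svg, maxAgeSeconds) {
  res.setHeader('Cache-Control', `max-age=${maxAgeSeconds}`);
  res.setHeader('Content-Type', 'image/svg+xml;charset=utf-8');
  res.end(svg);
}

// 2) The vendor cache is in-process state: repeated requests inside the TTL
//    are answered from memory without hitting GitHub at all, whatever maxAge is.
const vendorCache = new Map(); // cacheKey -> { value, expires }

async function cachedVendorRequest(cacheKey, ttlMs, fetchFromVendor) {
  const hit = vendorCache.get(cacheKey);
  if (hit && hit.expires > Date.now()) return hit.value;
  const value = await fetchFromVendor();
  vendorCache.set(cacheKey, { value, expires: Date.now() + ttlMs });
  return value;
}
```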
Interesting that we're still seeing hourly downtime, though it's less correlated between servers. I wonder if it's related to how long each server has been up.
Still seems to be failing ~20% of the time.
Any clues yet as to what the problem could be?
It still seems they generally go down and come back up within 5-20 minutes of each other.
Three things.
s1 is European (located at Gravelines, IIRC), while s0 and s2 are in Canada (Montreal?).
Most of our vendors (typically, GitHub) have US servers.
It is inevitable that crossing the Atlantic yields a poorer SLA.
On the plus side, it is the least infuriating SSH session for me, and Europeans enjoy a faster static badge thanks to it.
Second, the worldwide load looks like this.

(Local time probably means UTC? Hard to tell. It's 10:40am here in France.)
In which case, we can call the two low points "Pacific daytime" and… "Chinese lunch break"?
I can't recall what the third thing was, but maybe it was related to describing exactly the shape of the failures? Like, is it failing once every ten during the high-load hour?
Just bumping to say I've been experiencing non-loading badges for several days now. Every other refresh I get "Invalid upstream response (521)" from githubusercontent.com.
I've been seeing a lot of this the last few days:

Indeed, this has happened with a good chunk of requests over the last few days.
https://status.shields-server.com/
Things have been much worse the last 22 hours because of #1263, which is unrelated service-provider downtime that took out one of our servers.
s1 is European (located at Gravelines, IIRC), while s0 and s2 are in Canada (Montreal?).
Most of our vendors (typically, GitHub) have US servers.
Good to know. That explains why the stats for s1 are sometimes slightly worse.
To re-summarize:
I just emailed this plan:
To solve #1119, I rewrote the GitHub auth logic; it's in an unmerged PR. I found a few other minor bugs along the way: a logic error in the token sorting and a missing callback.
I’d like to deploy that new code, but it’s a big change and I don’t feel comfortable doing it without some way to back up and restore the tokens, and deploy and logs access or else a deploy window when you’re around.
Here’s what I’ll do:
- Add some debug output and/or debug API to the current github-auth code
- Self-review, again, the new github-auth PR
- Add debug output and/or debug API to the new github-auth PR
Could I ask you to:
- Check how many tokens we have in production
- Deploy latest so we can start collecting additional tokens (it’ll help a little, I think)
- Sort out the logging
- Debug the ssh issue
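For a sense of what I mean by "a logic error in the token sorting" in the plan above (a hypothetical illustration, not the actual bug): a JavaScript sort comparator has to return a signed number, and a boolean comparison quietly produces an arbitrary order.

```js
// Hypothetical token records with remaining search quota (illustrative data).
const tokens = [
  { id: 'a', searchRemaining: 3 },
  { id: 'b', searchRemaining: 27 },
  { id: 'c', searchRemaining: 0 },
];

// Buggy: the comparator returns a boolean, which coerces to 1 or 0 and never
// -1, so Array.prototype.sort sees an inconsistent ordering.
// tokens.sort((a, b) => a.searchRemaining < b.searchRemaining);

// Correct: return a signed number so the most remaining quota sorts first.
tokens.sort((a, b) => b.searchRemaining - a.searchRemaining);
console.log(tokens[0].id); // 'b'
```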
@paulmelnikow I was checking the links:
https://img.shields.io/codecov/c/github/bragful/ephp.svg
https://img.shields.io/travis/bragful/ephp/master.svg
They are working, but they take too long to load (around 15 seconds). GitHub retrieves this kind of image through a proxy, so the error the browser gets is a 504 (Gateway Timeout).
Have you checked how many requests your system is receiving to generate the badges? If I can help you with something just let me know.
@manuel-rubio Yea, that's unfortunate. See #1263.
While working on
- Add some debug output and/or debug API to the current github-auth code
I found the issue. It's a dumb thing I introduced in #1118. Fixed in #1266.
AFAICT production has been running on anonymous quota. I'm shocked this has been working as well as it has. Admittedly, not that well, though I'd have expected it to work only for the first few _seconds_ of every hour.
Either the server is using a different GitHub secret from the one I expect, or, just as likely, the Shields IPs do indeed get special treatment from GitHub.
I'm still eager to get the new code shipped, as it has a lot more tests. And of course #1263 remains an issue.
Opened #1267 with an auth debug endpoint + logging.
If I can help you with something just let me know.
I didn't really answer this question @manuel-rubio!
There are four ways you can help:
The fix is deployed. Status looks good:
