I'm not sure whether this is due to one of the recent changes, or simply #1119, which is a bug that causes a token to erroneously be considered exhausted once it's used for a search request.
People can't add new tokens either (#1243), exacerbating this slightly, but that will be fixed in #1038.
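For anyone who hasn't followed #1119: GitHub tracks the search quota separately from the core quota, so a token that has just made a search request can still have plenty of core quota left. A minimal sketch of the kind of per-resource bookkeeping that would avoid this (hypothetical names, not the actual github-auth code):

```js
// Hypothetical sketch: track core and search quotas separately per token, so
// using a token for a search request never marks it exhausted for core work.
class TokenState {
  constructor(token) {
    this.token = token;
    // GitHub reports these limits independently.
    this.quota = {
      core: { remaining: Infinity, reset: 0 },
      search: { remaining: Infinity, reset: 0 },
    };
  }

  // Update one resource's quota from a response's X-RateLimit-* headers.
  update(resource, headers) {
    this.quota[resource] = {
      remaining: Number(headers['x-ratelimit-remaining']),
      reset: Number(headers['x-ratelimit-reset']) * 1000, // epoch ms
    };
  }

  // Usable for a resource if quota remains, or if its window already reset.
  isUsable(resource, now = Date.now()) {
    const { remaining, reset } = this.quota[resource];
    return remaining > 0 || now >= reset;
  }
}

// Exhausting the search quota should not make the token unusable for core.
const t = new TokenState('example-token');
t.update('search', {
  'x-ratelimit-remaining': '0',
  'x-ratelimit-reset': String(Math.floor(Date.now() / 1000) + 60),
});
console.log(t.isUsable('core'));   // true
console.log(t.isUsable('search')); // false until the search window resets
```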
The first report was roughly 16 hours after deploy.
cc @espadrine
The badges are working again, and I think our "main" rate limit just reset:
core:
  remaining: 12489 of 12500
  reset: in an hour
search:
  remaining: 30 of 30
  reset: in a minute
graphql:
  remaining: 5000 of 5000
  reset: in an hour
Generated with https://github.com/paulmelnikow/github-limited
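For reference, numbers like these come from GitHub's /rate_limit endpoint, which reports core, search, and graphql separately; the github-limited tool above presumably does something along these lines. A minimal sketch of the query (assumes Node 18+ for global fetch, run as an ES module; GITHUB_TOKEN is an illustrative name):

```js
// Query GitHub's rate-limit endpoint and print remaining/limit per resource.
// Without a token you see the anonymous per-IP quota instead.
const headers = process.env.GITHUB_TOKEN
  ? { Authorization: `token ${process.env.GITHUB_TOKEN}` }
  : {};

const res = await fetch('https://api.github.com/rate_limit', { headers });
const { resources } = await res.json();

for (const name of ['core', 'search', 'graphql']) {
  const { limit, remaining, reset } = resources[name];
  console.log(`${name}: ${remaining} of ${limit}, resets ${new Date(reset * 1000).toISOString()}`);
}
```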
Now intermittently broken, though plenty of rate limit left.
core:
  remaining: 12479 of 12500
  reset: in 18 minutes
search:
  remaining: 30 of 30
  reset: in a minute
graphql:
  remaining: 5000 of 5000
  reset: in an hour
Here's an example: https://img.shields.io/github/tag/expressjs/express.svg
I sent this to @espadrine about an hour ago:
The Github badges are failing intermittently. Are you seeing crashes on the server?
It might be an old bug related to our handling of quotas for the GitHub search API, though the timing of the incident makes me suspect recent changes. I've made some recent changes to the GitHub auth code, but nothing jumps out at me from rereading them.
It's difficult to debug without server access. I'm thinking I should add an endpoint that returns all the user tokens, or else hashed user tokens with stats. That way I could troubleshoot a bit better locally.
Is there currently any backup of the user tokens, apart from the other servers?
I do have some new github token code to fix the search API quota issue, though it's a rewrite and I'd like to test it more first. Before merging I also want to add some optional trace logging we can turn on in cases like this.
I feel like I need deploy access + logs + a way to restore the token file in order to deploy that with confidence that I can find and fix whatever might be wrong with it.
Any thoughts on what could be causing the ssh issue?
I like getting to the bottom of things, and want to fix this, however my options are limited.
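On the "hashed user tokens with stats" idea in that email, roughly what I have in mind is something like this (a sketch with hypothetical names, assuming an Express-style handler; not the actual code):

```js
const crypto = require('crypto');

// Report per-token stats keyed by a short hash, never exposing the token itself.
// `tokenPool` is assumed to be an iterable of { token, usage, quota } records.
function tokenDebugHandler(tokenPool) {
  return (req, res) => {
    const tokens = [...tokenPool].map(({ token, usage, quota }) => ({
      id: crypto.createHash('sha256').update(token).digest('hex').slice(0, 8),
      usage,
      quota,
    }));
    res.json({ count: tokens.length, tokens });
  };
}

// e.g. app.get('/debug/tokens', tokenDebugHandler(pool));
```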
I set up a status page:
https://status.shields-server.com/
It runs a static badge, the GitHub license badge, and the npm license badge, and checks each one for some of the expected text.
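In code terms, a keyword check like that amounts to something along these lines (a sketch only; the URLs and expected strings are illustrative, not the exact UptimeRobot keywords; assumes Node 18+ run as an ES module):

```js
// Fetch a badge and verify its SVG contains some expected text.
async function checkBadge(url, expectedTexts) {
  const res = await fetch(url);
  const svg = await res.text();
  const up = res.ok && expectedTexts.some((text) => svg.includes(text));
  console.log(`${up ? 'UP  ' : 'DOWN'} ${url}`);
  return up;
}

// Illustrative targets and keywords.
await checkBadge('https://img.shields.io/github/license/expressjs/express.svg', ['MIT']);
await checkBadge('https://img.shields.io/npm/l/express.svg', ['MIT']);
```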
I'm happy to cover the cost for a couple months ($5.50) but it might be good to migrate to something else soon.
When I created shields-server.com, I set up CNAMEs for s0.shields-server.com, s1.shields-server.com, and s2.shields-server.com, though it'd be better to make these subdomains of shields.io and dump the extra domain.
Nice page. Should help get some insight into what's going wrong.
The GitHub licence badge seems to be failing a fair amount (~20% currently).
Are s1, s2, and s3 running different code, or are they all the same?
Yea, thanks, it should help. The code on the three servers should be the same.
There are interesting patterns in the downtime:
https://status.shields-server.com/779605524
https://status.shields-server.com/779605526
https://status.shields-server.com/779605529
The three servers had correlated downtime around 15:30 (that’s NY time). One of them also had downtime an hour earlier, around 14:30. Two had downtime around 13:33 / 13:43.
The duration of the downtime varies from server to server. For example, s0 was down from 15:28 to 15:49, s1 from 15:30 to 15:36, and s2 from 15:32 to 15:48.
Correlated downtime suggests there is some shared state, pointing to rate limit exhaustion as a factor. Downtime about an hour apart might correlate with rate limit resets.
The skew in recovery time might be explained by caching, though there might be other explanations too.
Yeah, it's quite strange that the downtimes are so similar. Would setting a very low max-age help with possible caching issues?
As far as I can tell, maxAge only affects cache headers – and potentially the behavior of the client – though not the behavior of the Shields server. I wouldn't think UptimeRobot did any caching. It wouldn't really make sense for a monitoring service. So I don't think setting maxAge would have any effect.
I just wanted to clarify that the caching I think might be involved is the Shields internal vendor cache in lib/request-handler.js.
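To make that distinction concrete, here's a rough sketch of the two mechanisms (simplified, not the actual lib/request-handler.js code):

```js
// 1) maxAge only shapes the Cache-Control header, i.e. what the client
//    (or a CDN in front of us) is allowed to cache. The server still runs.
function sendBadge(res, svg, maxAgeSeconds) {
  res.setHeader('Cache-Control', `max-age=${maxAgeSeconds}`);
  res.setHeader('Content-Type', 'image/svg+xml;charset=utf-8');
  res.end(svg);
}

// 2) The vendor cache is in-process state: repeated requests inside the TTL
//    are answered from memory without hitting GitHub at all, whatever maxAge is.
const vendorCache = new Map(); // cacheKey -> { value, expires }

async function cachedVendorRequest(cacheKey, ttlMs, fetchFromVendor) {
  const hit = vendorCache.get(cacheKey);
  if (hit && hit.expires > Date.now()) return hit.value;
  const value = await fetchFromVendor();
  vendorCache.set(cacheKey, { value, expires: Date.now() + ttlMs });
  return value;
}
```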
Interesting that we're still seeing hourly downtime, though it's less correlated between servers. I wonder if it's related to how long each server has been up.
Still seems to be failing ~20% of the time.
Any clues yet as to what the problem could be?
It still seems they generally go down and come back up within 5-20 minutes of each other.
Three things.
s1 is European (located at Gravelines, IIRC), while s0 and s2 are in Canada (Montreal?).
Most of our vendors (typically, GitHub) have US servers.
It is inevitable that crossing the Atlantic yields a poorer SLA.
On the plus side, it is the least infuriating SSH session for me, and Europeans enjoy a faster static badge thanks to it.
Second, the worldwide load looks like this.

(Local time probably means UTC? Hard to tell. It's 10:40am here in France.)
In which case, we can call the two low points "Pacific daytime" and… "Chinese lunch break"?
I can't recall what the third thing was, but maybe it was related to describing exactly the shape of the failures? Like, is it failing once every ten during the high-load hour?
Just bumping to say I've been experiencing non-loading badges for several days now. Every other refresh I get "Invalid upstream response (521)" from githubusercontent.com.
I've been seeing a lot of this the last few days:

Indeed, this has happened with a good chunk of requests over the last few days.
https://status.shields-server.com/
Things have been much worse the last 22 hours because of #1263, which is unrelated service-provider downtime that took out one of our servers.
s1 is European (located at Gravelines, IIRC), while s0 and s2 are in Canada (Montreal?).
Most of our vendors (typically, GitHub) have US servers.
Good to know. That explains why the stats for s1 are sometimes slightly worse.
To re-summarize:
I just emailed this plan:
To solve #1119, I rewrote the GitHub auth logic; it's in an unmerged PR. I found a few other minor bugs along the way: a logic error in the token sorting and a missing callback.
I’d like to deploy that new code, but it’s a big change and I don’t feel comfortable doing it without some way to back up and restore the tokens, and deploy and logs access or else a deploy window when you’re around.
Here’s what I’ll do:
- Add some debug output and/or debug API to the current github-auth code
- Self-review, again, the new github-auth PR
- Add debug output and/or debug API to the new github-auth PR
Could I ask you to:
- Check how many tokens we have in production
- Deploy latest so we can start collecting additional tokens (it’ll help a little, I think)
- Sort out the logging
- Debug the ssh issue
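For a sense of what I mean by "a logic error in the token sorting" in the plan above (a hypothetical illustration, not the actual bug): a JavaScript sort comparator has to return a signed number, and a boolean comparison quietly produces an arbitrary order.

```js
// Hypothetical token records with remaining search quota (illustrative data).
const tokens = [
  { id: 'a', searchRemaining: 3 },
  { id: 'b', searchRemaining: 27 },
  { id: 'c', searchRemaining: 0 },
];

// Buggy: the comparator returns a boolean, which coerces to 1 or 0 and never
// -1, so Array.prototype.sort sees an inconsistent ordering.
// tokens.sort((a, b) => a.searchRemaining < b.searchRemaining);

// Correct: return a signed number so the most remaining quota sorts first.
tokens.sort((a, b) => b.searchRemaining - a.searchRemaining);
console.log(tokens[0].id); // 'b'
```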
@paulmelnikow I was checking the links:
https://img.shields.io/codecov/c/github/bragful/ephp.svg
https://img.shields.io/travis/bragful/ephp/master.svg
They are working, but they take too long to load (around 15 seconds). GitHub retrieves this kind of image through a proxy, so the error the browser gets is a 504 (Gateway Timeout).
Have you checked how many requests your system is receiving to generate the badges? If I can help you with something just let me know.
@manuel-rubio Yea, that's unfortunate. See #1263.
While working on
- Add some debug output and/or debug API to the current github-auth code
I found the issue. It's a dumb thing I introduced in #1118. Fixed in #1266.
AFAICT production has been running on anonymous quota. I'm shocked this has been working as well as it has. Admittedly, not that well, though I'd have expected it to work only for the first few _seconds_ of every hour.
Either the server is using a different GitHub secret from the one I expect, or, just as likely, the Shields IPs do indeed get special treatment from GitHub.
I'm still eager to get the new code shipped, as it has a lot more tests. And of course #1263 remains an issue.
Opened #1267 with an auth debug endpoint + logging.
If I can help you with something just let me know.
I didn't really answer this question @manuel-rubio!
There are four ways you can help:
The fix is deployed. Status looks good:
