Shields: Servers fail to respond to certain requests fast enough at peak time

Created on 1 Feb 2017  ·  31 Comments  ·  Source: badges/shields

My GitHub release shield is hanging:

https://img.shields.io/github/release/tony19/named-regexp.svg

which causes it to appear like this:
[screenshot: badge not rendering, 2017-02-01 8:28 AM]

I don't think it's a GitHub limit because I authorized Shields.io with GitHub. How can I fix this?

operations


All 31 comments

Same issue here: the badge takes too long to load and doesn't show up in my README file ☹️

I am sorry. I currently have two servers: one in France and one in Canada. The Canadian one gets too much load. I need to add a server in North America. I had planned to do it this past weekend, but unexpected work, also on Shields, prevented that. I plan on doing it next week, or this week if I can dedicate a couple of hours to it.

If I have two servers at the same location, presumably DNS round-robin would be enough to avoid one of them taking too much of the load, right?

Many providers have a free load balancer, but DNS is also an option. If you use DNS, you might want to configure floating IPs and then have the servers monitor each other. If one of them goes down, another server should take the IP. If you don't do that, a failure of one of the servers will cause half of the requests to be dropped until you fix it manually. If you use a cloud LB, there is no need to do all of that.
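
For illustration, the "servers monitor each other" part can be as small as each machine polling its peer and claiming the floating IP when the peer stops answering. A minimal Node.js sketch, assuming a hypothetical provider-specific `promote-floating-ip.sh` script and a placeholder peer hostname:

```js
// failover-watch.js - minimal peer monitor (a sketch, not production code).
// Assumes a provider-specific ./promote-floating-ip.sh script exists.
const http = require('http');
const { execFile } = require('child_process');

const PEER = process.env.PEER_HOST || 's1.example.com'; // placeholder peer hostname
const CHECK_INTERVAL_MS = 10 * 1000;
let failures = 0;

function checkPeer() {
  const req = http.get({ host: PEER, path: '/', timeout: 5000 }, res => {
    res.resume();          // drain the response body
    failures = 0;          // peer answered; reset the failure counter
  });
  req.on('timeout', () => req.destroy(new Error('timeout')));
  req.on('error', () => {
    failures += 1;
    if (failures >= 3) {   // three misses in a row: take over the floating IP
      execFile('./promote-floating-ip.sh', err => {
        if (err) console.error('failover failed:', err);
        else console.log('floating IP promoted to this server');
      });
      failures = 0;
    }
  });
}

setInterval(checkPeer, CHECK_INTERVAL_MS);
```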

The issues I have experienced today in #872 were 12s from Poland and 7s from Sweden. Are those routed to Canada?

Hi!
I'm the author of #874, and I'm in France, but it's exactly the same problem.
16 s to load a badge... :-/

I'm in the UK and have had Shields badges timing out most of the time for the past couple of months.

Add my name to the list of interested parties. A couple days ago I added a comment about the current outage to an old outage issue. I didn't realize this newer outage issue was open. My comment is at: https://github.com/badges/shields/issues/191#issuecomment-278058650

I had included these examples from this repo's README.md: the Gratipay, npm version, and build status badges.

As I write this, I see the images load sometimes, so it's now an intermittent problem. That's progress. 😃

What could be done to help the situation, @espadrine? Additional hosting?

Given how these badges are literally everywhere, maybe @GitHub would be willing to help with the hosting?

This was discussed; they said no.

I will add the IP of the s2 server to the DNS settings on Monday after I wake up.

The traffic on the s0 server looks more manageable (the s1 server has a similar graph).

[traffic graph: shields-s2]

The CPU usage is at about 70% at peak time (down from a consistent 100% at peak time before).

We now average 130 req/s. We are getting near 10 Mreq/day. It seems like I will need to add another server when we reach 160 req/s, which I will monitor.

I don't know the architecture of the service, but I assume you are responding with 302 and the client gets the actual image from another server, right?

@ppolewicz No. All three servers directly respond with images.

If they responded with 302, there would be less load, and clients could also cache the images. Is there any reason for the images to be served directly?


@ppolewicz Why would there be less load? 302 just means "now you have to make a new request". The amount of work is the same, and whatever the 302 points to would then have to do it. It simply moves work elsewhere.

As for caching, we use HTTP caching already. But a badge's value can change from downloads: 378k to downloads: 379k, so clients still need to request updates after a couple of seconds.
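
For a concrete picture of that tradeoff, here is a sketch of a handler that serves a badge with a short default max-age that clients can only lengthen (the 120 s default and the handler itself are illustrative, not shields' actual code; the `maxAge` query parameter mirrors the one shields exposes):

```js
// cache-headers.js - sketch of a short-lived badge cache policy.
const http = require('http');
const { URL } = require('url');

const DEFAULT_MAX_AGE = 120; // seconds; illustrative default, not shields' real value

http.createServer((req, res) => {
  const url = new URL(req.url, 'http://localhost');
  // Let the badge author lengthen the cache via ?maxAge=86400, but never shorten
  // it below the default, so the vendor is not re-queried on every page view.
  const requested = parseInt(url.searchParams.get('maxAge'), 10);
  const maxAge = Number.isFinite(requested) ? Math.max(requested, DEFAULT_MAX_AGE) : DEFAULT_MAX_AGE;

  res.writeHead(200, {
    'Content-Type': 'image/svg+xml;charset=utf-8',
    'Cache-Control': `max-age=${maxAge}`,
  });
  res.end('<svg xmlns="http://www.w3.org/2000/svg"></svg>'); // placeholder badge body
}).listen(3000);
```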

Unless GitHub or somebody similar helps out, this is surely going to need some help from the community.

As soon as I realised what was happening, I reorganised my build status badges as a static snapshot of the current build and linked the badge to a build status page which gave the live update -- going directly to travis-ci, appveyor, etc.

This way the shields only get used to generate the static copies once per build. My site is immune to the outages, the load on the shields infrastructure is effectively nil, and my visitors get the info they really want -- the state of the current release, not head.
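
For anyone who wants to copy this pattern, a build step along these lines can snapshot the badges into the repository while the README's badge images link out to the live status pages (a sketch; the badge URLs and output paths are placeholders):

```js
// snapshot-badges.js - run once per build to freeze badge images locally (sketch).
const https = require('https');
const fs = require('fs');

// Placeholder badge URLs and output paths; replace with the ones from your README.
const badges = {
  'badges/build.svg': 'https://img.shields.io/travis/tony19/named-regexp.svg',
  'badges/release.svg': 'https://img.shields.io/github/release/tony19/named-regexp.svg',
};

fs.mkdirSync('badges', { recursive: true });

for (const [outPath, url] of Object.entries(badges)) {
  https.get(url, res => {
    if (res.statusCode !== 200) {
      console.error(`skipping ${url}: HTTP ${res.statusCode}`); // keep the previous snapshot
      res.resume();
      return;
    }
    res.pipe(fs.createWriteStream(outPath)); // overwrite the committed snapshot
  }).on('error', err => console.error(`skipping ${url}: ${err.message}`));
}
```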

Can we not make this the recommended best practice?

Thanks for the lovely service @espadrine and all!

@cdornan That's certainly a welcome pattern!

That said, I wouldn't worry too much. The slowdowns were simply a consequence of the number of requests per second going up. I did a bit of benchmarking; the current bottleneck is the use of SVGO (to compress results), which takes about 8 ms and is completely avoidable by pre-compressing templates. I'll try doing that next weekend. Once that is taken care of, the next bottleneck will be text width computation, at about 2 ms. But a single server spending 2 ms per request can bear ~500 req/s, which is much better than the ~125 req/s implied by an 8 ms bottleneck.
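
As a rough sketch of what pre-compressing templates could look like, SVGO can be run once over each template at build time instead of on every request (this assumes the `optimize` API of svgo 2.x+; the template directory and file naming are placeholders, not the actual shields build step):

```js
// precompress-templates.js - optimize badge templates once, ahead of time (sketch).
const fs = require('fs');
const path = require('path');
const { optimize } = require('svgo'); // assumes svgo >= 2.x

const templateDir = path.join(__dirname, 'templates'); // placeholder location

for (const file of fs.readdirSync(templateDir)) {
  if (!file.endsWith('.svg')) continue;
  const source = fs.readFileSync(path.join(templateDir, file), 'utf8');
  // Pay the ~8 ms compression cost here, once per template, instead of on
  // every request. Real templates with {{placeholders}} may need extra care
  // so the optimizer does not mangle them.
  const { data } = optimize(source, { multipass: true });
  fs.writeFileSync(path.join(templateDir, file.replace(/\.svg$/, '.min.svg')), data);
}
```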

Happy hacking! You guys are great!

@espadrine

**Image backend perspective**

Before the change: one request per user request, unless the image is cached by the shields server.

After the change: one request per user request, unless the image is cached by the user's browser - and most likely it will be, as the generated image would use a practically infinite cache max-age.

**Shields perspective**

Before the change: a request looks like this: the user wants something, we check the state and look in the cache. If the image is not there, we download it from the image server. We actually transmit each and every byte of the image.

After the change: a request looks like this: the user wants something, we check the state and return a 302. We never transmit images over the network; the only thing we use memory for is to cache the states that we check.

**User point of view**

Before the change: request the shield (unless it is cached, but the cache expires in an hour or so) and display it; after an hour, throw the cache away and start over.

After the change: request the shield (unless it is cached, but the cache expires in an hour or so) and get a 302. If that 302 points at something we already have in cache (and most likely it does, because the browser keeps the image in its cache for a very long time), display it from the cache. If not, get the image from the image server directly.

There is one more advantage to this approach. A standard dynamic "License: MIT" image (or "Travis: pass", or "monthly downloads: 0", etc.) can be seen on many GitHub pages, but today the browser needs to download the actual bytes of the image again for each repository. After the change, the user will get the license-mit image from the image server only once - the rest of the time it will stop after getting a 302, because that 302 points at something already in the cache.

I'm not sure if this description is clear. If any part of it requires an explanation, please point it out.
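
To make the proposal concrete, here is a rough sketch of the redirect layer being described (the `fetchVendorState` and `imagePathFor` helpers, the URLs, and the cache lifetimes are all hypothetical; this illustrates the idea, not shields' actual code):

```js
// redirect-layer.js - sketch of the proposed split: a cheap, short-lived 302
// pointing at a value-addressed image URL that browsers may cache for months.
const http = require('http');

// Hypothetical helpers: look up the vendor state (e.g. the Travis build status)
// and turn a label/value pair into a canonical image path.
async function fetchVendorState(path) {
  return { label: 'build', value: 'passing', color: 'brightgreen' }; // stubbed
}
function imagePathFor({ label, value, color }) {
  return `/static/${encodeURIComponent(label)}-${encodeURIComponent(value)}-${color}.svg`;
}

http.createServer(async (req, res) => {
  const state = await fetchVendorState(req.url);
  res.writeHead(302, {
    // The redirect itself stays fresh only briefly, so values can change...
    'Cache-Control': 'max-age=1800',
    Location: imagePathFor(state),
    // ...while the image it points at would be served elsewhere with
    // something like 'Cache-Control: max-age=10368000' (120 days).
  });
  res.end();
}).listen(3000);
```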

It took me a while to get @ppolewicz's suggestion, but I think I got it:

User Request: https://img.shields.io/travis/rails/rails.svg
Server Response: 302 https://cache.shields.io/build/passing.svg

So _build/passing.svg_ is created once and cached for the long run, and all the time spent on the first request goes into deciding which image to serve, not serving the image. Similarly, images that change more often will only be created once:

https://cache.shields.io/downloads/398k.svg
https://cache.shields.io/downloads/395k.svg

I guess it would be a tradeoff between CPU and storage / @espadrine's time.

Actually, in my proposal the image cache will be much more efficient, as it will deduplicate across pages. There will be only one "Travis CI: passing" image in the memory of the cache server... So storage-wise it should be better. From @espadrine's last post we can see that the CPU cost is image generation - and as the cache hit rate will improve a lot, the CPU needed to generate the images should go down too.

@ppolewicz First of all, it's really good to have suggestions, so thanks for that!

Now, I assume that by saying "302", you mean that there will be a server or two solely dedicated to the /badge/ endpoint (which only produces images fully determined by the path). With three servers, that leaves only one to receive vendor APIs (/travis/ and the like), which is a single point of failure that I would be sorry to have. Moreover, it would mean going from three servers performing the image generation to only two, which means fewer machines to churn through the bottleneck.

As far as the number of calls to shields.io servers goes, let's say there are A requests on average for images that the user has never seen, and B for images that they have. Currently, the number of shields.io calls is A+B; with dedicated /badge/ servers, it would be roughly A*2. But A is much larger than B, so the total number of requests per second over all shields servers would increase. For the servers, it isn't too big an issue, as the number of image generations would still be of the same order of magnitude. For clients, however, the added call means that badges take one more round-trip to fetch the image in the majority of cases (A), which means that on average, users wait longer.

But the good news in your idea is that, crucially, B is an amount that any of the current servers can hold in memory. So we can keep a cache in each server that holds all frequently requested badges, and short-circuit image generation for them. It certainly is low-hanging fruit; just one that I haven't gone for yet.
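
A minimal sketch of that short-circuit, assuming a `renderBadge` function that stands in for the expensive generation step and an arbitrary 10,000-entry cap (neither is a real shields internal):

```js
// badge-cache.js - short-circuit image generation for frequently requested badges (sketch).
const MAX_ENTRIES = 10000; // assumed cap; enough for the "B" set to fit in memory

const cache = new Map(); // Map preserves insertion order, so the oldest entry is first

// `renderBadge` stands in for the expensive text-measuring + SVG generation step.
function cachedBadge(label, message, color, renderBadge) {
  const key = `${label}\u0000${message}\u0000${color}`;
  const hit = cache.get(key);
  if (hit !== undefined) {
    // Re-insert to mark the entry as recently used (cheap LRU behaviour).
    cache.delete(key);
    cache.set(key, hit);
    return hit;
  }
  const svg = renderBadge(label, message, color);
  if (cache.size >= MAX_ENTRIES) {
    cache.delete(cache.keys().next().value); // evict the least recently used entry
  }
  cache.set(key, svg);
  return svg;
}

module.exports = { cachedBadge };
```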

I assumed that (after deduplication), B is larger than A. Are you sure they are not?

All the servers can have two IPs (or vhosts) and perform both roles ("redirector" and "generator"). This will eliminate the SPOF.

The total number of requests may increase, but many of them will be just cheap 302 ones.

It is true that for the user, loading the images for the first time may take twice as long, but if the user visits the site multiple times and sees a few sites, the average time will go back down.

My impression (from another ticket) was that the actual images determined by the path are not hosted by the service, but by a GitHub CDN, where they are placed for hosting. That would take practically all the image-generation requests away from the server. Isn't it like this? Maybe it could be?


@ppolewicz

> I assumed that (after deduplication), B is larger than A. Are you sure they are not?

The badges likely to be seen multiple times only include things like "build: passed", "dependency: none", essentially all the boolean information and the licenses. They're a minority, both in variety and in frequency in the wild. The others have variety: badge downloads, versions, coverage, and of course the bulk of it all, custom badges.

> All the servers can have two IPs (or vhosts) and perform both roles ("redirector" and "generator"). This will eliminate the SPOF.

They can even do without that; we can 302 to the corresponding /badge/ URL on the same server. But then we don't need the 302 at all.

> My impression (from another ticket) was that the actual images determined by the path are not hosted by the service, but by a GitHub CDN

I think you're talking about camo, which GitHub uses to make sure all image requests on github.com hit camo.githubusercontent.com first. But camo doesn't cache images: the camo URL includes the original image URL, which camo calls every time. On the other hand, camo changes the HTTP headers, and forces browser caching unless we explicitly ask it not to.

> They can even do without that; we can 302 to the corresponding /badge/ URL on the same server. But then we don't need the 302 at all.

We do. The 302 will have a cache timeout of half an hour, but the image it redirects to will have a timeout of 120 days or something like that.

Even if someone has a custom badge, it will be generated once per user and stored in the long-term browser cache. After that, the browser won't need to download the image again (unless it flushes the image from the cache, that is).

Similar to what @cdornan mentioned, I was thinking of requesting badges when my projects are built, then keeping them with the project. It's not ideal, but it could be helpful until a more robust system is available. I was also thinking that I could modify the SVG badges I receive and inject JavaScript to try to make them self-updating.

I should clarify what I meant by throwing "additional hosting" at this problem. I'm glad @espadrine added some hosts to the collection, but when I suggested that, I was thinking of asking my employer or the standards organization with which I'm working to contribute to the hosting.

For that matter, if my employer/organization decides the badges are useful enough, they could just set up their own copy of the shields application for our use. Does shields have any features to allow only requests for specific services/projects? If my employer/organization doesn't want to altruistically add resources to the pool for everybody to use, they may like this approach.

> > They can even do without that; we can 302 to the corresponding /badge/ URL on the same server. But then we don't need the 302 at all.
>
> We do. The 302 will have a cache timeout of half an hour, but the image it redirects to will have a timeout of 120 days or something like that.

With the 302, an unknown badge first waits 900 ms: 600 ms for the redirect, then 300 ms to follow it and get the image; a known badge waits 600 ms for the redirect and uses its cache. With server-side caching, an unknown badge waits 600 ms and gets the image, while a known badge also waits 600 ms. On the user's side, what they wait for is us calling the vendor, but that has no impact on the server's requests-per-second limit, since it isn't spending server CPU time.

@lsloan There isn't a really practical way to pool external servers within the shields.io DNS. However, the hosting costs are covered by donations, which are the best way to contribute to the hosting.

Time spent in badge.js (computing text width and compressing the SVG with SVGO) clocks in at 8.4 ms on average. Since the servers seem roughly able to handle 100 requests per second individually (requests started being dropped at the 200 req/s mark), they probably spend about 10 ms/req of CPU time, so badge.js is a good chunk of that.

With patch 2f97be9, time spent in badge.js clocks in at 2.4 ms on average. That should remove 6 ms of CPU time per request, leaving about 4 ms, so each server should be able to process 1/0.004 = 250 requests per second, which hopefully means that we can operate with three servers for a few years. (In fact, it may mean that a single server should be able to handle the whole load, although I am hesitant to remove a Canadian server.) I will probably add a server when the volume of requests triples.

Because of the way our template-to-SVGO compression works, the SVG output is less optimal on the social badge by a single byte. (It could be fixed, but that micro-optimization is silly. It comes from a <rect> converted to a <path>; in the optimal output, the path is closed with H72Z (72 being the width of the left-hand side of the badge plus six pixels) instead of h-.5z: using an absolute position instead of a relative one shaves one byte for left-hand-side widths below 100 pixels.)

Apart from that, badges now have to explicitly ask for xmlns:xlink, which adds 43 bytes to the output. It can be fixed by explicitly omitting it in the templates when we don't have links. I'll probably do so tomorrow.

Yeaaahhh! It works for me!

Thank you so much @espadrine for your amazing job!

✌️

Thanks! Your support warms my heart!

Maybe you should set a default maxAge too? It's not practical to put &maxAge=86400 on all projects.

By default it's `cache-control: no-cache, no-store, must-revalidate`.

Actually GitHub also does some weird things: it wraps those URLs with its CDN, for example https://camo.githubusercontent.com/083dff9a2fe3c003685c948b78a4d41a20a5868f/68747470733a2f2f696d672e736869656c64732e696f2f7472617669732f636175622f6d6f6e676f2d6c617a792d636f6e6e6563742e7376673f7374796c653d666c61742d737175617265266d61784167653d3836343030 and those URLs can fail.

For the current discussion, see #1568!
