StreetComplete: Background map is incredibly slow

Created on 5 Feb 2020 · 15 comments · Source: westnordost/StreetComplete

I have a hard time using SC at the moment, because the map layer either isn't loading at all or only loads after an extreme lag.

I was in a city center where I haven't been for a while, so SC had no cached tiles.

I had a good 3G+ connection and I don't have any bandwidth limit.

I opened the app (it took 45 seconds to load the first time):
https://youtu.be/PwjUSbWKU-U

I moved the map a bit, closed the app and opened it again (I waited four minutes and gave up):
https://youtu.be/RuH9JCVXTbo

How to Reproduce
Open the app and scroll to a location which has no cached map.

Versions affected
17

The map is also pretty outdated, as pointed out here:
https://github.com/westnordost/StreetComplete/issues/1700

Most helpful comment

The region 6/34/22 (Munich and most of Austria) seems to be the region with the most problems; I have to generate statistics per region to see if it is also the most used. From what I can see, in most regions the time to the first tile at a newly rendered place is mostly 1-2 seconds; I will try to generate statistics on that.

@RubenKelevra thanks for the caching tips, most of them are used already. I think the long timeouts are not a good idea, because that is what generates your current problem: there are more long-running tiles in the queue to process before your request times out. I am actually thinking about reducing the timeout and handling the long-rendering tiles in a separate queue (and maybe adding rate limits based on IPs).

Generally, @ftsell and I are working on improvements, but it will take some weeks.

All 15 comments

I'd agree with @RubenKelevra, it does feel like it has got slower recently.

cc @Akasch

At the moment the region 6/34/22 (https://a.tile.openstreetmap.org/6/34/22.png, Munich and most of Austria) is overloaded. Because of the timeouts the server is getting more and more requests and can handle them even less. I'll try to find some time later today or tomorrow to get it back to normal.


I'm looking at https://a.tile.openstreetmap.org/5/15/10.png @Akasch (sorry, I should have clarified the location before). Should that have the same issues?

I do not see a high load there, but as it is on the same physical server, that could explain why it is slow.

At the moment I am generating the tiles in the Austria region which often run into timeouts, and will serve them as static files. This should make the whole process faster again and fix the problems in this region.

For the interested: code to automate the pre-generation of complex tiles is in the works at https://github.com/Map-Data/tileserver-mapping/pull/1

The slow tiles of 6/34/22 are now served pre-generated as static files from the web server. Should be much better now.

Turns out the app did not cache anything anymore. Will be fixed in the next version.

@westnordost thanks for the fast fix!

You're killing it :)

> The slow tiles of 6/34/22 are now served pre-generated as static files from the web server. Should be much better now.

Have a look at this comment; you can automate the caching and updating with some nginx cache settings.

Nginx will serve outdated tiles and update them in the background, instead of making the client wait for the newly generated tile from the backend, which might run into a timeout on the client side.

This should give a good performance boost :)

https://github.com/westnordost/StreetComplete/issues/1700#issuecomment-582498486
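
A minimal sketch of that stale-while-revalidate idea (the cache path, zone name, and tileserver upstream are placeholders for illustration, not the project's actual config):

proxy_cache_path /var/cache/nginx/tiles keys_zone=tiles:100m max_size=10g inactive=720h;

server {
  location / {
    proxy_cache tiles;
    proxy_cache_valid 200 48h;
    # hand out a stale tile right away while it is being refreshed,
    # and also on backend errors or timeouts...
    proxy_cache_use_stale updating error timeout;
    # ...and fetch the new tile in the background
    proxy_cache_background_update on;
    proxy_pass http://tileserver;
  }
}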

@RubenKelevra we are using the nginx cache settings. The problem is tiles that take a really long time to generate: they block the worker and often do not complete due to timeouts, so the client requests them again and it fails again, draining resources on the server. There are tiles that take in the region of 3 minutes to generate if the database is on an HDD. For this region I have now generated the tiles manually, without timeouts, in an extra worker, and saved the results for all tiles that hit the 90-second timeout more than 5 times yesterday.


Did you set proxy_cache_lock in nginx? This will keep a lock on a cache miss until the tile has been delivered by the backend. You can increase the backend timeout (in nginx) to 1-2 hours if you know that it will eventually return a tile. This removes any duplicate requests to the backend.

You need to raise the default proxy_cache_lock_age as well as proxy_cache_lock_timeout (the default for both is just 5 seconds) to larger values, matching the read timeout of the backend requests.

The backend read timeout can be set with the proxy_read_timeout directive; the default is 60 seconds.

Also make sure to set inactive to something like 30 days, because that's the maximum time an element will be held in the cache, regardless of how 'fresh' it is.

To help application developers understand when they got a stale tile, you can fill an X-Cache-Status header with the cache status from the $upstream_cache_status variable.

In sum, my recommendation for you would be:

proxy_cache_use_stale error timeout invalid_header updating http_500 http_502 http_503 http_504 http_403 http_404;
proxy_cache_valid 200 48h;
proxy_cache_valid 302 10m;
proxy_cache_valid 301 24h;
proxy_cache_valid any 1m;
proxy_read_timeout 2h;
proxy_send_timeout 2h;
proxy_socket_keepalive on;
proxy_connect_timeout 30s;
proxy_http_version 1.1;
proxy_cache_background_update on;
proxy_cache_convert_head on;
proxy_cache_key $request_uri;
proxy_cache_lock on;
proxy_cache_lock_age 2h;
proxy_cache_lock_timeout 4h;
proxy_cache_methods GET;
proxy_cache_min_uses 1;
proxy_buffering on;
proxy_cache tile_cache;
proxy_ignore_client_abort on;
proxy_cache_revalidate on;
proxy_cache_path /path/to/nginx/tiles/cache/ levels=2:2:2 manager_sleep=1s manager_files=10000 max_size=30g loader_files=50000 loader_threshold=4s loader_sleep=500ms manager_threshold=3s inactive=720h use_temp_path=off keys_zone=tile_cache:1200m;
proxy_ignore_headers "X-Accel-Buffering" "X-Accel-Limit-Rate" "X-Accel-Expires" "Expires" "Cache-Control" "Set-Cookie" "Vary";
proxy_request_buffering on;
proxy_pass http://tileserver;
add_header X-Cache-Status $upstream_cache_status;
expires 2d;
add_header Cache-Control "public, no-transform";
proxy_set_header Connection "";
etag off;

Basically:
It will store up to ~8 million tiles, for 30 days, in the cache folder. If the cache folder gets bigger than 30 gigabytes, the oldest files will be removed.
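
For scale (using the nginx-documented figure of roughly 8,000 keys per megabyte of keys_zone): keys_zone=tile_cache:1200m gives room for around 9-10 million keys, and max_size=30g spread over ~8 million tiles works out to an average tile size of roughly 4 KB.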

A cached file is considered "fresh" for 2 days. If a client requests a cached file which was stored more than 2 days ago, the client still gets the cached file and the proxy adds it to a list of files to be refreshed. The new file will be requested from the backend, and when the next client wants the file, it will be fresh again.

Backend connections will be reused (if possible) after a request, and concurrent requests are limited to the number of connections allowed for the backend.

The proxy will wait up to 2h for a tile to be completed (read timeout); if the update is not completed after 2h, the cache lock will allow the next request to be passed to the backend (if a client requests again).

If the backend becomes unresponsive, the enabled TCP keepalive should drop all connections. The proxy will try to reconnect and keep refreshing tiles when possible.

All cache settings from your backend will be ignored (I don't know if you set any) and only the settings in this config will be applied. proxy_cache_revalidate on; only helps if your backend can answer conditional requests with an If-Modified-Since header; if your backend is not able to respond to them, delete that line.

Also: if a client requests a tile and the backend doesn't respond in time, the client might close the connection. The proxy_ignore_client_abort setting makes the proxy ignore this abort from the client side, keep waiting for the response from the backend, and save that response in the cache.

It also makes sense to add connection and request limits for the clients, plus some other settings (when not already set in http or server):

in main:

worker_processes 1;
worker_rlimit_nofile 80000;

events {
  multi_accept on;
  use epoll;
  worker_connections 40000;
}

thread_pool default threads=4096 max_queue=122880;

in http:

limit_conn_zone $binary_remote_addr zone=tileconns:50m;
limit_req_zone $binary_remote_addr zone=tilereq:50m rate=15r/s;

upstream tileserver {
  server 127.0.0.1:12345 max_conns=20;
  keepalive 5;
  keepalive_timeout 300s;
  keepalive_requests 10000;
  queue 25000 timeout=48h;
}


In the location:

sendfile on;
aio threads=default;

limit_conn tileconns 10;
limit_conn_status 429;
limit_conn_log_level warn;

limit_req zone=tilereq burst=250 delay=40;
limit_req_log_level warn;
limit_req_status 429;

client_max_body_size 8k;
if_modified_since before;
keepalive_disable none;
keepalive_timeout 160s;
keepalive_requests 2000;
limit_except GET {
  deny all;
}
msie_padding off;
send_timeout 160s;
server_tokens off;
tcp_nodelay on;
tcp_nopush on;
postpone_output 1150;

The reduction in the client's max body size prevents malicious clients from sending large files to your server, which would need to be buffered as a file. Since we expect and allow only GETs, 8k should be more than enough and is the standard limit before a file will be created.

This will allow 40 tiles to be downloaded immediately by a client, after which a rate limit of 15 tiles per second is enforced. So after the initial burst, at most one tile will be delivered every 66 ms to each IP.

If the client IP is idle for a while, e.g. while looking at the map, the 40 initial immediate-download slots are freed up again, at the same rate of one per 66 ms.

This allows fast downloads when zooming in/out, but prevents a single IP from overloading the server with requests for an uncached area, and keeps download scripts from pulling huge amounts of data.

Once the queue per IP exceeds 250, every additional request from that IP will be answered with HTTP 429.

If an IP opens more than 10 connections, the 11th connection will be dropped after HTTP 429 is returned to the client. This also applies to concurrent requests through a multiplexed connection, like HTTP/2 or SPDY.
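
As a worked example of these limits together (hypothetical traffic): a client that suddenly fires off 300 tile requests would get roughly the first 40 served immediately (delay=40), the next ~210 queued and drained at 15 r/s, i.e. one every ~66 ms for about 14 seconds (burst=250 caps the queue), and the remaining ~50 rejected with HTTP 429.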


Note:

It's pretty late here and I didn't want to test this; there might be some typos etc. in the config, so don't copy and paste it into a production environment 😉

Remove everything you cached previously before using this config; I changed the default cache key, which means all old cached files are useless.

The nginx documentation isn't perfectly clear on whether queue under upstream is now available in the free version; I haven't checked the source code.

You might want to run this nginx on an additional box with an SSD and memory that is independent from the database's memory.

Knobs you might want to tweak:

In upstream tileserver you can set the number of concurrent requests to the backend. Everything above that will be put in a queue and kept for up to 48h.

threads=4096 for the default thread_pool can be changed depending on the I/O capabilities of the system. The number of threads defines how many files nginx will concurrently read from/write to disk. 4096 is probably too high for a spinning hard drive.

These settings are optimized for a system with a lot of spare RAM, since I don't know your specifications.

@westnordost

I've retested today, and the app's caching does hide most of the server slowness, but this is what a new user would see if the server has no cached tiles for their location:

https://youtu.be/IHi2TPFJgLI

I think we can't expect a new user to stick around for more than 7 minutes with an empty map.

I think we have to change the map provider.

The region 6/34/22 (Munich and most of Austria) seems to be the region with the most problems; I have to generate statistics per region to see if it is also the most used. From what I can see, in most regions the time to the first tile at a newly rendered place is mostly 1-2 seconds; I will try to generate statistics on that.

@RubenKelevra thanks for the caching tips, most of them are used already. I think the long timeouts are not a good idea, because that is what generates your current problem: there are more long-running tiles in the queue to process before your request times out. I am actually thinking about reducing the timeout and handling the long-rendering tiles in a separate queue (and maybe adding rate limits based on IPs).

Generally, @ftsell and I are working on improvements, but it will take some weeks.

> The region 6/34/22 (Munich and most of Austria) seems to be the region with the most problems; I have to generate statistics per region to see if it is also the most used. From what I can see, in most regions the time to the first tile at a newly rendered place is mostly 1-2 seconds; I will try to generate statistics on that.

The map was centered at 51.1529377, 7.3405633 for the recording. The app had just been launched, so no tiles had been requested that day from this IP.

> @RubenKelevra thanks for the caching tips, most of them are used already. I think the long timeouts are not a good idea, because that is what generates your current problem: there are more long-running tiles in the queue to process before your request times out. I am actually thinking about reducing the timeout and handling the long-rendering tiles in a separate queue (and maybe adding rate limits based on IPs).

If you abort the generation of tiles, you discard work which has already been done, increasing the overall processing time (you basically spend the timeout on top of the long-to-process tiles).

Without a look at your current config, my guess is that the lock is released after 5 seconds on a non-cached/out-of-date item. So if a client tries to read again after its request timed out, the tile generation is started a second time for the same tile.

Also, most of the default settings are tuned for a larger average file size and far fewer files in the cache. That's why I recommended so many changes to the default config.
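
For reference, these are the relevant lines from the config sketch above; the point is simply that both lock directives default to 5 seconds:

proxy_cache_lock on;
# both default to 5s; raising them stops a retrying client
# from triggering a second generation of the same slow tile
proxy_cache_lock_age 2h;
proxy_cache_lock_timeout 4h;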

> Generally, @ftsell and I are working on improvements, but it will take some weeks.

Sure, but SC shouldn't use a testbed environment for its map layer. @westnordost, isn't there anything more stable which we can use as a drop-in until this service is more mature?

Or at least as a default, to reduce the stress on this service?

