Osrm-backend: Move shared memory block swapping control to osrm-datastore

Created on 21 Jun 2016  路  9Comments  路  Source: Project-OSRM/osrm-backend

OSRM supports a hot-swap feature - you can use osrm-datastore to replace an old dataset with a new one while osrm-routed (or node-osrm) remains running, with only a very brief hitch in request handling performance.

However, the design is a bit ugly - osrm-datastore simply loads the new data, but it's actually up to the reader process (osrm-routed) to perform the data exchange, mark the new block in use and release the old memory.

This has a few problems:

  • the locking logic is hard to grasp
  • you can't have multiple readers of the shared memory block, _and_ support hot-swapping
  • readers need write-access to the block

We should change this. I picture it working as follows:

  • shared memory segments must be named via the command-line, no more "magic numbers"
  • users of a shared memory block register their interest via a semaphore
  • osrm-datastore loads new data, then signals all registered listeners that new data is available, then waits
  • listeners swap to the new data address, and release interest in the old one
  • osrm-datastore removes the old data once all listeners have moved, then exits

There would be some additional locking required to avoid problems with running additional osrm-datastore during an exchange, and to handle the startup/shutdown of listeners during an exchange.

Feature Request Refactor

Most helpful comment

Is definitely a bottleneck, but given our usual request time not a big problem to me. We don't loose much here.

We sometimes hit outliers on our production machines where requests take several seconds and clog the pipeline. This makes switching dataset times quite non-deterministic. Combine this with the fact that datastore doesn't actually wait until the data is completely switched over, but just exists after the shared memory region is written, we have hacks in place just to ensure that the data correctly loaded.

What I am more concerned about is the complexity of the solution.

This is actually less complex then our current locking setup, because we needed to add several locks to work around edge cases (concurrent datastore runs, non-commited data changes, several race conditions between the two). And multiple clients are just not supported at all which is a major downside for a shared-memory based system.

We can replace that haywire of mutexes using established multi-processing idioms. This will make the code both easier to understand and less error prone. And as cherry on top we will get support for multiple clients.

Could we do better if we target this for a 6.0 release and make it an API change?

As I outlined above this is a strict feature subset of something that would require an API change. This approach can be extended to use named datastores, but that is actually completely independent on how we do the locking.

This multiplies memory requirements, or am I missing something here?

Both yes and no. It multiplies the memory requirement for storing a SharedDataFacade object, which can now have two instances (old and new dataset). However since that object does not actually own any of the significant memory but is only a collection of pointers and wrappers this will not impact our memory usage.
As @MoKob already pointed out, when using shared memory there are always two complete datasets in shared memory at the time of switching. So this does not change that part of our memory requirements.

All 9 comments

related #1221

As a first step I would like to implement this without having named memory segments, since this would definitely require an API change. I believe to get this right we cannot operate on process granularity but need to consider every thread.

Currently we have a piece of code that waits until all requests are finished before it swaps the data. However that is unnecessary, what we really want to do is to wait until all requests that referenced the old dataset are finished, before we deallocate that shared memory region.

To allow this we would need to have multiple versions of the shared-memory data facade in memory. Every request holds shared ownership of that data facade. If all ownerships are given up the facade is deallocated, and we can signal for the release of the shared memory region.

Things that need to change to enable this:

  • [x] All stored references to data facades (e.g. for routing algorithms or inside the plugins) need to be changed to non-owning references when the request is executed
  • [x] The CheckAndReleadFacade() logic can't live in the shared dataface anymore, because every datafacade is immutable
  • [x] Engine will store a shared_ptr to the current datafacade, which will be protected by a (process local) read-writer lock. Every time a new facade is create a write-lock will need to be aquired.
  • [x] Every request will first copy (acquire&release read lock) that shared_ptr and hold it for the duration of the whole request.
  • [x] An IPC reader/writer lock for the every dataset in use. Every datafacade on creation enters the critical section as reader and leaves the critical section when the destructor is called.
  • [x] An IPC read/writer lock for the shared memory section that contains information about the newest dataset.

@daniel-j-h do you think this sounds sane at all?

First of all, I agree with the assessment in https://github.com/Project-OSRM/osrm-backend/issues/2570#issuecomment-251808503.

Still I'd like to raise the question if we are implementing a complex solution here that might be based on not-changing the API and/or unnecessarily complicated.

Currently we have a piece of code that waits until all requests are finished before it swaps the data. However that is unnecessary, what we really want to do is to wait until all requests that referenced the old dataset are finished, before we deallocate that shared memory region.

Is definitely a bottleneck, but given our usual request time not a big problem to me. We don't loose much here.

What I am more concerned about is the complexity of the solution. Could we do better if we target this for a 6.0 release and make it an API change? Are we doing a major change here to work around an API change?

@TheMarex sounds good overall, shared ownership makes sense.

Currently we have a piece of code that waits until all requests are finished before it swaps the data. However that is unnecessary, what we really want to do is to wait until all requests that referenced the old dataset are finished, before we deallocate that shared memory region.

The issue I'm seeing here is the following: at the moment we only ever have one dataset loaded. With your proposed design we potentially have multiple datasets loaded to memory as long as there are requests holding onto it. This multiplies memory requirements, or am I missing something here?

For the record, I played around with shared memory reader / writer locks for hotswapping over at
https://github.com/daniel-j-h/shm-hotswap

The issue I'm seeing here is the following: at the moment we only ever have one dataset loaded. With your proposed design we potentially have multiple datasets loaded to memory as long as there are requests holding onto it. This multiplies memory requirements, or am I missing something here?

Is this not already the case? Loading a dataset takes quite some time. So I guess the hot-swap would be a real cold swap if we don't already load two sets into memory.

Is definitely a bottleneck, but given our usual request time not a big problem to me. We don't loose much here.

We sometimes hit outliers on our production machines where requests take several seconds and clog the pipeline. This makes switching dataset times quite non-deterministic. Combine this with the fact that datastore doesn't actually wait until the data is completely switched over, but just exists after the shared memory region is written, we have hacks in place just to ensure that the data correctly loaded.

What I am more concerned about is the complexity of the solution.

This is actually less complex then our current locking setup, because we needed to add several locks to work around edge cases (concurrent datastore runs, non-commited data changes, several race conditions between the two). And multiple clients are just not supported at all which is a major downside for a shared-memory based system.

We can replace that haywire of mutexes using established multi-processing idioms. This will make the code both easier to understand and less error prone. And as cherry on top we will get support for multiple clients.

Could we do better if we target this for a 6.0 release and make it an API change?

As I outlined above this is a strict feature subset of something that would require an API change. This approach can be extended to use named datastores, but that is actually completely independent on how we do the locking.

This multiplies memory requirements, or am I missing something here?

Both yes and no. It multiplies the memory requirement for storing a SharedDataFacade object, which can now have two instances (old and new dataset). However since that object does not actually own any of the significant memory but is only a collection of pointers and wrappers this will not impact our memory usage.
As @MoKob already pointed out, when using shared memory there are always two complete datasets in shared memory at the time of switching. So this does not change that part of our memory requirements.

I found a limitation with the design:

  1. Start osrm-routed with shared memory
  2. Run osrm-datastore with another dataset
  3. Observe that osrm-datastore is blocked because it can't free the old dataset

Even though there are no queries referencing the old dataset, osrm-datastore is blocked. This is because the Engine stores a reference to the dataset, to be able to serve queries from it. If there are no queries, there is no code in Engine being executed: there is no way for it to detect a new dataset. So osrm-datastore actually waiting for the next query to _arrive_ to detect a new dataset and do the switch (and by that releasing ownership of the old dataset).

This of course breaks our test setup, where we start an osrm-routed instance, wait for osrm-datastore to finish and _then_ send queries.

We could side-step that problem by not needing any client intervention to switch datasets. That would mean we don't keep state (e.g. safe the current SharedDataFacade) but always create it new for every request. However right now creating a new data facade is a rather expensive operation involving the creation of a lot of wrapper objects and mapping a shared memory region (that is at least one syscall, probably more).

I'm going to benchmark this to see what we would be looking at here in terms of overhead. We could potentially safe some overhead with keeping the shared memory mapped at constant locations (hence a syscall less). Downside there would probably be that we have a constant requirement of datasize*2 (not just during updates). That could be improved just switching the memory lock on/off., hence letting the OS do its job and just page it out.

I did a quick benchmark on this and creating SharedDataFacade objects is surprisingly expensive, about half a millisecond. As such I think adapting the design in the following way should get us closest to where we want to go:

  1. osrm-datastore always tried to overwrite the oldest dataset. It can never overwrite the most recent dataset. osrm-datastore runs can't happen in parallel.
  2. If the oldest dataset is still in use (read lock) it blocks and waits for the requests to finish before updating.
  3. A read lock on a shared memory region is only maintained during query (not bound to the lifetime of a ShareDataFacade)
  4. Every time a SharedDataFacade is deallocated it deallocates the shared memory it is pointing to if:

    • the shared memory is not the newest memory

    • there is no lock on it (still in use, or in use again)

I have this design now finally working. Only remaining issues are around hardening: How are we going to detect and recover from dangling locks. This might happen if either osrm-routed or osrm-datastore crash in a critical section of a shared memory lock. We have a tool for that osrm-unlock-all but some resilience for server deployments is important.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

freenerd picture freenerd  路  4Comments

zifeo picture zifeo  路  3Comments

brunosan picture brunosan  路  5Comments

pat841 picture pat841  路  4Comments

ifle picture ifle  路  5Comments