Mattermost-server: Use sync.Pool for msgp serialization

Created on 14 Aug 2020  路  27Comments  路  Source: mattermost/mattermost-server

After moving to a serialized cache, we are now serializing and de-serializing every time we want to read/write from the cache.

This forces us to do lot of allocations, results in a lot of CPU time spent in mark and sweep.

Fortunately, the msgp library works on struct methods directly and therefore we can use a sync.Pool to give back model objects to the pool once we are done with it.

This ticket is to investigate if it is possible to know when the objects retrieved from the cache is no longer needed, and therefore put them back to the pool. The challenge here is to identify how to do that.

We can apply this for model.User and model.Session. So we need to look through the codebase and see all places where we get user and session objects and see how they are used and when they can be released.

As an example, let us consider GetProfileByIds. This returns a cached set of users from the cache, and in the *App.createGroupChannel method, we get the profiles, and create a group channel from those and don鈥檛 use that anymore. So when the method returns, we could put them back in the pool. However, this is just one example, and we need to obviously cover the hot paths so that a sync.Pool actually gives us a benefit.

This also needs some changes in the cache layer so that the pool can be shared globally from the app layer to the cache layer.


If you're interested please comment here and come join our "Contributors" community channel on our daily build server, where you can discuss questions with community members and the Mattermost core team. For technical advice or questions, please join our "Developers" community channel.

New contributors please see our Developer's Guide.

JIRA: https://mattermost.atlassian.net/browse/MM-27471

ArePerformance Hard Help Wanted PR Exists TecGo

All 27 comments

@agnivade Hey 馃憢 , I'm interested to investigate.

That sounds amazing @RajatVaryani. Please let me know if there's anything to help with. I have just given a cursory overview in the issue, but the general idea is to use a sync.Pool to store the model.User and model.Session objects which are backed by the LRU cache. It might be a good idea to focus on just one object type at a time and send PR for that one.

Hi @RajatVaryani - how is it going with this?

@agnivade Hey, please feel free to assign it to someone else.

hi @agnivade can I pick this up ?

Sure!

@agnivade I had a look at the code during some spare time I got, and followed the execution path. Just want to run some things through you based on your original comment

This ticket is to investigate if it is possible to know when the objects retrieved from the cache is no longer needed

If we take the model.User object, wouldn't these be at the api4 layer. If we take a look at the route api.BaseRoutes.User.Handle("", api.ApiSessionRequired(getUser)).Methods("GET"), the method getUser is the location where the object is written to the response and here we can potentially return the object to the sync.Pool somewhere here https://github.com/mattermost/mattermost-server/blob/master/api4/user.go#L214 ?

As you have said we of course need to expose the cache layer to the app layer.

Yep, you are on the right track!

The main thing to focus on is the a.Srv().Store.User().Getcall. Now the getUser method does call (*App).GetUser which eventually calls the store method, but there are other places in the app layer which directly call the store method and use the user object. And there are places in the api4 layer too where (*App).GetUser is called. We would need to look through all of those, and possibly add a defer block to all those methods immediately after we get the user, to return it back to the pool.

I'd also suggest that you can change all direct invocations of a.Srv().Store.User().Get call to go via (*App).GetUser if it makes things easier for you.

And then in the LRU cache, where we do:

        var u model.User
        _, err := u.UnmarshalMsg(e.value)
        *v = &u
        return err

You would need to get the user object from the pool and Unmarshal the byte slice into it.

Thank you for your time! I realize this is a big task, but from my measurements, a big chunk of CPU time comes from just these 2 objects (user and session). So it will be a considerable improvement, if we are able to pull this off.

@agnivade I assume this also has to apply to hot structs as well (https://github.com/mattermost/mattermost-server/blob/master/services/cache/lru.go#L150) ?

One more thing, going through the code I noticed the method signatures up-to the cache layer take a pointer reference to a variable to populate the read bytes to. Are you suggesting to change the method signatures in these ?

Another option would be to get the object from the pool here for example https://github.com/mattermost/mattermost-server/blob/master/store/localcachelayer/user_layer.go#L141. We of-course would have to modify the lru.go file to call UnmarshalMsg from this passed in method. Let me know your feedback on this.

In terms of where defer should be added I assume that would have to be at the api layer ?

@arjunagl

Yes, this will apply _only_ to hot structs. Because we need to apply this to specific structs. Each pool will be for each struct.

Are you suggesting to change the method signatures in these ?

No, these signatures won't change. What we want is to get the user object from the pool and then do the unmarshaling into it.

Another option would be to get the object from the pool here

That's a great idea actually.

In terms of where defer should be added I assume that would have to be at the api layer ?

Yes, wherever the objects are used last. It can be in the api layer or in the app layer.

But I am now realizing that this will make the code very unmaintainable and error-prone in the future. Mainly because of 2 things-

  • The Put and Get operations with the pool are very far together. So it's very difficult to understand what's going on.
  • New developers adding code, getting a session and user object have to remember to put the object back into the pool, which is not very straightforward.

Do you see a simple way to tackle this without such complications?

@Leryan -

Glad to see you here!

My apologies for not giving enough context on the issue. Please feel free to DM me directly if you want to get more details.

Yes, we weren't serializing at all earlier. It was just a cache of pointers. We would do some copying in a few places, but it was half-baked. So another goroutine might still be holding on to the older object just by getting it's pointer, and then lots of bugs would cause due to this.

@Leryan -

[鈥 Please feel free to DM me directly if you want to get more details.

I figured that out and deleted my initial post, sorry for the noise sincerely :)

Yes, we weren't serializing at all earlier. It was just a cache of pointers. We would do some copying in a few places, but it was half-baked. So another goroutine might still be holding on to the older object just by getting it's pointer, and then lots of bugs would cause due to this.

Oh i understand, thanks for the update.

_flying away from the issue 馃槃_

@agnivade

But I am now realizing that this will make the code very unmaintainable and error-prone in the future. Mainly because of 2 things-

I understand what you are saying and I had this concern as well. I will have a think/search on this and get back to you.

@agnivade I have the following two options

Option 1: We use the following (I haven't figured out the details) https://godoc.org/runtime#SetFinalizer, however I'm not convinced this is a good solution

Option 2: We modify the method signatures right from the api layer to get from the sync.pool and pass it all the way down. This was the get and put will be in the api layer. I personally prefer this approach since the get and put are now centralised and whatever being done is explicit in the code.

What are your thoughts on this ?

Thanks for spending time on this @arjunagl.

A SetFinalizer will only be called during garbage collection. So that's not really a sync.Pool because until the GC kicks in, we won't be able to put it back to the pool.

Option 2 is better, but now every layer will be made aware of the fact that there is a sync.Pool being passed, but it will only be used for 2 structs, which would make it a bit strange. And also, this does not fix the issue that developers still have to remember to put the object back into the pool whenever new code gets added. That is still a risk we have.

It's a bit tricky to implement this nicely I think. While I personally would have liked to have this, I see now that it would make the codebase much more complicated and difficult to maintain. The costs seem to outweigh the gains.

thanks @agnivade I guess this issue will be closed and will no longer be worked on.

Thank you @arjunagl and I apologize that it didn't come out as a success. This was an investigative issue anyways and we weren't sure if this was going to work or not. I do hope this does not discourage you from contributing to more issues :)

@arjunagl - Actually, thinking more on this, I think we should be okay doing this only for Session objects. They are called in fairly limited places throughout the codebase, and there are fewer chances of someone getting new session objects in new places.

A quick search gives this:

$git grep "\.GetSession(" | grep -v "_test"
app/oauth.go:   session, _ := a.GetSession(token)
app/opentracing/opentracing_layer.go:   resultVar0, resultVar1 := a.app.GetSession(token)
app/plugin_requests.go:     session, err := a.GetSession(token)
app/web_conn.go:        session, err := wc.App.GetSession(wc.GetSessionToken())
app/web_conn.go:        hasReadPrivateDataPermission = model.NewBool(wc.App.RolesGrantPermission(wc.GetSession().GetUserRoles(), model.PERMISSION_MANAGE_SYSTEM.Id))
app/web_conn.go:            hasReadPrivateDataPermission = model.NewBool(wc.App.RolesGrantPermission(wc.GetSession().GetUserRoles(), model.PERMISSION_MANAGE_SYSTEM.Id))
app/web_conn.go:    if wc.GetSession().Props[model.SESSION_PROP_IS_GUEST] == "true" {
app/web_conn.go:    currentSession := wc.GetSession()
app/web_conn.go:        session, err := wc.App.GetSession(wc.GetSessionToken())
app/websocket_router.go:        session, err := wr.app.GetSession(token)
plugin/api_timer_layer_generated.go:    _returnsA, _returnsB := api.apiImpl.GetSession(sessionId)
plugin/client_rpc_generated.go:     returns.A, returns.B = hook.GetSession(args.A)
web/handlers.go:        session, err := c.App.GetSession(token)
wsapi/websocket_handler.go: session, sessionErr := wh.app.GetSession(conn.GetSessionToken())

Would you still be interested in picking this up just for the session objects? Then we can re-open this issue.

@agnivade sure, I would like to pick this up. Looking at the above search results, I would suggest creating a pool that can be accessed across all the files. Let me know if you have any concerns on this. Plus we would still have the problem where the object is taken from the pool from one location and put back to the pool in another location. I assume you are ok with this?

Thanks @arjunagl! Yes, I am ok with that. We can create the pool in the app layer and globally access that everywhere else.

@agnivade just to update you, I have started on the code changes and will try and submit a draft PR (if its allowed from forks) just to make sure we are on the same page.

Amazing. Thank you for the update! Yes draft PRs are allowed on forks.

@agnivade 鈽濓笍 is the link to the draft PR, please have a look. I have added the pool into the model layer in order to minimise
imports. Let me know if you prefer it in the app layer, however this would mean changes to the imports.

@agnivade I have created a draft PR based on your comments on putting the Sync.Pool into the app layer. Please have a look and let me know if this is heading in the direction you expect.

On a side note I don't think the code block if msgpVal, ok := value.(msgp.Unmarshaler); ok evaluates to true currently for the Session object since value is actually **Session. Let me know if you want me to change this as well.

Yep, this is the right direction.

On a side note I don't think the code block if msgpVal, ok := value.(msgp.Unmarshaler); ok evaluates to true currently for the Session object since value is actually **Session.

That's because you are passing a pointer to a pointer again here: sessionCache.Get(token, &session). We need to change that to just pass session.

Great, thanks, I will continue my work then,

That's because you are passing a pointer to a pointer again here: sessionCache.Get(token, &session). We need to change that to just pass session.

Actually before my change as well it was passing a pointer to a pointer, not sure how that worked, anyways I can change it.

Actually before my change as well it was passing a pointer to a pointer, not sure how that worked,

Yes that was the reason for the whole casting to ** business. It was working because we were just passing it as interface{} and the serialization library will accept anything you give to it.

@agnivade I have submitted a PR for this, could you please have a look when you get the time

https://github.com/mattermost/mattermost-server/pull/16103

Was this page helpful?
0 / 5 - 0 ratings