Go: encoding/json: memoize strings during decode

Created on 25 Jun 2019 · 50 comments · Source: golang/go

Part of the motivation for #32593 is the observation in json-iterator/go#376 that when you json decode, many of the same strings appear over and over in the JSON and get copied and reconstructed in the result over and over as well.

We do _not_ want to make the JSON decoder coalesce all the allocated strings into one giant buffer, because then holding on to just one of them would keep the entire buffer alive.

But it might make sense to add to the decoder state a simple map[[]byte]string and remember which specific byte slices we've already converted to string and reuse those allocated strings instead of doing the same conversions again in a particular Decode or Unmarshal operation.

It's unclear to me whether the map in a Decoder should persist between Decode operations. Probably? That will affect users who put Decoders in a pool or something like that, though. (Solution: don't put decoders in pools.)

Proposal Proposal-Accepted

Most helpful comment

@rsc I think we should answer the two questions I raised again in https://github.com/golang/go/issues/32779#issuecomment-609479347 before getting into a CL. To repeat my current understanding:

1) Would this only affect map keys, or all strings? This is unclear, and IMO should be decided before we jump to an implementation.

2) The change, as proposed right now, would silently introduce memory leaks in programs that currently keep a Decoder alive for a long time. You say:

If you want to avoid the memoization, you could use multiple json.Decoders.

In my opinion, this is a subtle and dangerous breaking change. Even if we strongly marked this change in behavior in the release notes, discouraging the use of global or long-lived decoders, it still seems like a change that could bring pain to our users since it's easy not to notice a leak until a service runs out of memory a week later.

All 50 comments

A while ago I wrote some code to test this exact same thing with the json decoder, but could not detect any significant difference between the memory usage of the decoder with memoized strings and the regular json decoder, which made me think maybe this was already done somewhere. Apparently not. I can try to dig up my test code...

Here's a link to my test code that does this for decoding json into interface{}:

https://github.com/bserdar/jsdecodertest

My initial comment was wrong, it does make a difference to store strings in a map. My test code stores only field keys in a map because those are the strings that are likely to repeat.

The json/ package is a copy of the stdlib encoding/json package modified to include a map[string]string in the decoder state.

Nit: I assume the proposal meant to use a map[string]string with map[string([]byte)] lookup as map[[]byte]string is not a supported map type.
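For reference, a minimal sketch of that lookup pattern (the names here are illustrative, not proposed API); the gc compiler avoids allocating for the string(b) conversion when it appears directly in a map index expression:

type stringTable map[string]string

func (t stringTable) intern(b []byte) string {
	// t[string(b)] lookups do not allocate a new string for the key.
	if s, ok := t[string(b)]; ok {
		return s
	}
	s := string(b) // allocate once per distinct string
	t[s] = s
	return s
}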

Also cc @josharian, who had a very similar idea a while ago. That was briefly discussed in #5160.

As a thought, we probably don't want to reuse all strings. For example, in a map the keys probably tend to repeat much more often than the values.

Also, do we want to make this configurable? Someone who writes a program likely knows whether they'll read somewhat static/repeated data or not.

Change https://golang.org/cl/188500 mentions this issue: encoding/json: memoize keys during decode

@mvdan, are you concerned about the overhead of the map lookup in your Jun 26 comment? From the outside there can be no possible downside to reusing all strings that are constructed.

The CL mentioned above has an implementation of this for the JSON decoder. It introduces some overhead when decoding to an interface{}, because that's the only case where the keys read during decoding will be kept after decoding is done.

However, I think there is a better way to do this. What I propose is to write an encoding/LiteralStore type (a map[string]string) with a Get(string) string method to retrieve the shared instance of a string. Then this can be used from the decoders in encoding/xml and encoding/json. To use the LiteralStore, the caller has to assign an instance to the JSON/XML decoder. This way, the caller has the option to use a literal store for large input where it makes a difference, use the store for multiple docs, or avoid it completely.

What do you think?

This is one of the conventional use-cases for weak references in GC'd languages.

Ideally, we should reuse strings that remain reachable from the last Decode invocation, and let the collector clean up any strings that have since become completely unreachable. That way, the frequently-used strings will remain in the map, but the infrequent strings can still be collected (so that the map doesn't grow without bound if the same Decoder is used on many different inputs over time).

Alternatives that do not require weak references include:

  • Make the interning behavior an explicit opt-in using a Decoder method (perhaps func (*Decoder) InternStrings(func([]byte)bool)).
  • Make the interning behavior an explicit annotation on the struct field tag.

    • But this gets awkward to describe the behavior for map keys vs. map values.

@bcmills, I propose to make the interning behavior explicit opt-in using a Decoder field instead of a method. After reading #32593, I believe one way to make this acceptable in terms of performance and usage is this:

  • Add a strings/LiteralStore type (a map[string]string), with an Add(string) string method that will return a shared copy of the string
  • Add Decoder.KeyLiterals field of type LiteralStore to both json and xml decoders.
  • Use the KeyLiterals if it is non-nil during decoding to store keys

This implementation does not change the existing behavior and makes the use of a literal store explicit opt-in.

A literal store like this is only useful for larger json/xml input, and only if it is manually decoded, or unmarshaled into an interface{}. If the input is unmarshaled/decoded into a struct, then there is no need to keep the keys as they will be lost.
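To make the shape of that opt-in concrete, here is a rough sketch; LiteralStore, Add, and KeyLiterals are hypothetical names from this proposal, not existing encoding/json or encoding/xml API:

// Hypothetical LiteralStore as proposed above.
type LiteralStore map[string]string

// Add returns the shared copy of s, storing s on first use.
func (ls LiteralStore) Add(s string) string {
	if shared, ok := ls[s]; ok {
		return shared
	}
	ls[s] = s
	return s
}

// Hypothetical caller-side opt-in:
//
//	store := make(LiteralStore)
//	dec := json.NewDecoder(r)
//	dec.KeyLiterals = store // nil (the default) means no interning
//	err := dec.Decode(&v)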

Re: weak references, afaik there is no way to implement this using finalizers, am I wrong?

@bserdar I don't think there is any way to do this with finalizers because existing users will be using string values, not pointers, so there is nothing to hang a finalizer on. For that matter weak references wouldn't work either: you would have to change all the users to use the weak reference type rather than string.

In general finalizers and weak references are equivalent in power, so anything you can do with one you can do with the other. But you have to be able to control the types that people will use.

@ianlancetaylor thanks for the clarification.

If there is interest in this, I can modify https://golang.org/cl/188500 to implement this. I already have some wrapper code around json/xml decoder to do this because reusing just the key strings makes a lot of difference for large docs.

cc @mvdan

For that matter weak references wouldn't work either: you would have to change all the users to use the weak reference type rather than string.

No? We would need some sort of language-supported “weak string” type, but then only the json.Decoder would hold a weak reference — the decoded messages would use a regular string so that the data doesn't disappear out from beneath them.

(But that assumes a language change that seems pretty unlikely to ever happen.)

In general finalizers and weak references are equivalent in power, so anything you can do with one you can do with the other. But you have to be able to control the types that people will use.

Ah, I think I see what you're saying. If we could control the types in use, then we could add a pointer indirection, and require users to keep the pointer alive in order to retain the string value in the decoder's cache.

But then you can't use those pointers in the same way as ordinary strings: they can't be map keys, they won't work with the normal built-in string operations, and they can't be passed to the strings package without falling back to manual memory management (via runtime.KeepAlive).

It may be true that they are _technically_ equivalent in power, but they are not equivalently usable.

@bserdar, strings.LiteralStore seems unfortunate given that we're also discussing adding generics to the language, because there is a much more general “interning table” API that can be used with any type that supports hashing and equality.

The general form could have an API like:

package intern

type Table(type T HashEqualer) […]

func (t Table(T)) Get(x T) (canonicalX T) {
    […]
}

Such an API is also useful for building things like abstract syntax trees, where it can be useful to compress trees by coalescing occurrences of a common subexpression to the same canonical instance.

(A long time ago I maintained a server that evaluated large numbers of boolean expressions for ad targeting, and it used a similar approach — to great effect — to compress the targeting expressions, including in our wire protocol.)
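(With the generics syntax that eventually shipped, a minimal sketch of such a table might look like the following; intern.Table is hypothetical, and comparable stands in for the HashEqualer constraint above.)

package intern

// Table canonicalizes values of any comparable type.
type Table[T comparable] struct {
	m map[T]T
}

// Get returns the canonical instance of x, remembering x if it is new.
func (t *Table[T]) Get(x T) (canonicalX T) {
	if t.m == nil {
		t.m = make(map[T]T)
	}
	if c, ok := t.m[x]; ok {
		return c
	}
	t.m[x] = x
	return x
}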

@bcmills, interning table would be nice, but it won't be available any time soon. Until that becomes available, we can keep interning internal to the Decoder, but expose an InternKeys(bool) method for explicit opt-in. Or, implement the strings.LiteralStore, and change it to use the intern package when generics are done.

xml.Decoder can benefit from interning strings as well. What is the opinion about that? Decoder.InternKeys() would be the explicit opt-in for interning keys during decode. Something similar to this: https://go-review.googlesource.com/c/go/+/188500

There's been a lot of discussion here but I haven't seen any objections to simply putting the map I suggested into Decoder and not trying to reuse it any more than that.

I've seen no rationale for adding new API. Just do it all the time. Am I missing something?

There is a small performance hit if it is used all the time. Measurements are here: https://go-review.googlesource.com/c/go/+/188500

CodeDecoder-8 9.87ms ± 2%
CodeDecodeToInterfaceNoIntern-8 10.9ms ± 3%
CodeDecodeToInterfaceIntern-8 11.6ms ± 4%

If there is interest in this, I can fix that CL

I'd gladly take that performance hit over new API. Most JSON has many repeated strings so as long as we can show a lowered memory usage I think that's a clear win.

This seems worth doing, and it seems like a minor optimization, not a visible API change.
It's unclear this really needs to be in the proposal process at all,
but since it's here, based on the discussion above and how little is affected,
it seems like a likely accept.

I'd say that the memory comments are the other way around?

@mvrhov, sorry, I don't quite understand what you mean.

Well, the numbers for interned strings are larger... TotalAlloc by 30%, the number of mallocs by 50%, frees by 60%. The number of allocs is smaller by 2%.

Allocs are smaller, meaning there's less memory allocated in the end. TotalAlloc is higher because of the interning. Frees are larger because most of the interned string copies that would've been in the unmarshaled object are collected. Overall, with interning, when everything is said and done there's less memory allocated.
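(For reference, those counters map onto runtime.MemStats fields; a quick way to inspect them, using only standard library API:)

package main

import (
	"fmt"
	"runtime"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// TotalAlloc and Mallocs/Frees are cumulative; HeapAlloc is what is
	// currently live on the heap (roughly TotalAlloc minus freed bytes).
	fmt.Println("TotalAlloc:", m.TotalAlloc, "HeapAlloc:", m.HeapAlloc,
		"Mallocs:", m.Mallocs, "Frees:", m.Frees)
}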

@mvdan, are you concerned about the overhead of the map lookup in your Jun 26 comment? From the outside there can be no possible downside to reusing all strings that are constructed.

@rsc sorry that I missed your comment for this long. I could have sworn I had replied.

My concern is about keeping strings alive for longer than needed. That's very much related to one of your original points:

It's unclear to me whether the map in a Decoder should persist between Decode operations. Probably?

For example, imagine if I have a server that receives JSON objects via a network stream, unmarshals them, writes them to a database, and discards them. Right now, I would write this program in a way that the "incoming objects" goroutine keeps a single decoder alive forever.

If the decoder starts memoizing strings forever (since the decoder persists forever), my memory usage could increase over time if I keep seeing new/unique strings. For example, it seems pretty common to represent UUIDs and hashes as strings in JSON, so I'm a bit worried that users could easily run into this kind of "memory leak".

@mvdan, is this still a concern if only the JSON keys are memoized? Keys are repetitive. The memory numbers I pasted a while ago above are from memoizing only the JSON keys, not the field values.

Also, memoizing keys is only meaningful when decoding to a map.

It's unclear to me if the current proposal is to memoize all strings, or just keys when decoding into map[string]T. I imagine most cases of unique/changing strings don't happen in map keys, but I still find that a possible scenario. For example, a map where the keys are unique IDs or hashes seems pretty reasonable, if one wants to then do quick lookups after unmarshaling.

So, my concern is probably smaller if we only memoize map keys, but even then, I'm still a bit concerned.

The proposal is to only memoize in a single json.Decoder, not globally.
If you want to avoid the memoization, you could use multiple json.Decoders.
Note that each call to json.Unmarshal uses its own Decoder, so all those would be independent.
(If you decoded an object list into a []map[string]int, then all the map keys in one decoding could be shared, but not across decodings.)

Another possibility is to do the memoization by default but allow a hook in the Decoder (a method SetSaveString(func([]byte) string) or something like that) that would let you override the policy to have a global pool, or a pool only for small strings.
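A sketch of a policy a caller might pass to such a hook (SetSaveString is only the hypothetical name above; the size cutoff is arbitrary):

// newSaveString returns a policy that interns only short strings
// in a caller-owned pool; everything else is copied as usual.
func newSaveString() func([]byte) string {
	pool := make(map[string]string)
	return func(b []byte) string {
		if len(b) > 16 {
			return string(b) // too long to be worth interning
		}
		if s, ok := pool[string(b)]; ok {
			return s
		}
		s := string(b)
		pool[s] = s
		return s
	}
}

// Hypothetical use: dec.SetSaveString(newSaveString())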

But it seems like if there's a win-win here, then we should take that win by default.

Does anyone object to marking this accepted assuming that there is a win-win?
Obviously if the benchmarks say there isn't a win then we wouldn't take the CL.

I think this still seems like a likely accept, but I'm leaving it for another week to make sure I'm reading the thread right.

If you want to avoid the memoization, you could use multiple json.Decoders.

My point is precisely that a program keeping a long-lived json.Decoder isn't doing anything wrong, but could silently gain a "memory leak" of sorts when upgrading its Go version. That's what I tried to explain in my earlier comment. I guess the changelog could tell them to redo their code to not reuse a decoder forever, but that seems a bit weird.

Also, even though this proposal is marked as a likely accept, I'm still not clear on this detail which I think is pretty important:

It's unclear to me if the current proposal is to memoize all strings, or just keys when decoding into map[string]T.

I don't object to the proposal being accepted per se, but I don't support it either, given that for now it's just a "maybe, we'll see how the CL turns out" :)

Would adding an API to json.Decoder to disable (or enable) interning help? There's already a CL (though quite old) with optional interning.

Just as a data point, I actually included an experience report of when we did this via josharian/intern in https://github.com/golang/go/issues/5160#issuecomment-424988628 (at the time it was just about reducing memory usage and not memory allocations, since allocations happened during unmarshaling - i.e. before we could dedup the strings)

In the same comment I actually proposed as well to make it configurable on a per-field basis using the json field tag, so +1 from me for making it configurable:

type Collection struct {
  Objects []Object `json:"objects"`
}

type Object struct {
  GUID string `json:"guid"`
  ParentGUID string `json:"parent_guid,intern"` // perform interning during unmarshaling
  // ...
}

FWIW I'm implementing proper support for this - with the syntax above - in https://github.com/mailru/easyjson/pull/279.

No change in consensus here, so accepting.

Let's start with the simple per-Decoder interning on by default and go from there.
But let's plan to land this at the start of a cycle, so not for Go 1.15.
That will give time to soak and maybe discover reasons it's a bad idea.

I can rebase this old CL unless there are other plans: https://go-review.googlesource.com/c/go/+/188500

@rsc I think we should answer the two questions I raised again in https://github.com/golang/go/issues/32779#issuecomment-609479347 before getting into a CL. To repeat my current understanding:

1) Would this only affect map keys, or all strings? This is unclear, and IMO should be decided before we jump to an implementation.

2) The change, as proposed right now, would silently introduce memory leaks in programs that currently keep a Decoder alive for a long time. You say:

If you want to avoid the memoization, you could use multiple json.Decoders.

In my opinion, this is a subtle and dangerous breaking change. Even if we strongly marked this change in behavior in the release notes, discouraging the use of global or long-lived decoders, it still seems like a change that could bring pain to our users since it's easy not to notice a leak until a service runs out of memory a week later.

Maybe this is a dumb question, but would the strings being memoized come from the types, or strings that just flew by? If it's the types themselves, I would be comfortable running this in prod, but if it's strings that happen to be passed from a user (like the values in a map or in a list, or "unknown" keys to a struct), this sounds like a possible DoS.

I don't know how you could keep a decoder around for a long time in a web context other than maybe web sockets?

@mvdan, what if we added some sort of decay behavior or LRU caching, so that infrequently used strings would expire instead of being memoized forever?

That would at least prevent _unbounded_ memory leaks for infrequent keys, while still providing most of the memory reduction of memoization.

@mvdan, what if we added some sort of decay behavior or LRU caching, so that infrequently used strings would expire instead of being memoized forever?

That would at least prevent _unbounded_ memory leaks for infrequent keys, while still providing most of the memory reduction of memoization.

Agreed, this is similar in spirit to what you get with josharian/intern, and it worked well in practice for us. Worth noting that we were interning only the field values, not the field keys (we didn't have maps in the decoded struct, but we had quite some string fields whose values were likely to repeat).

Would this only affect map keys, or all strings?

Given what I wrote above, I think we should do the latter. If the worry is keeping too many strings alive it should be easy enough to cap the size of the interning map (even just random replacement would likely be enough, and trivial to implement; as the current proposal is to have an interning map per Decoder, it will likely not make much of a difference to implement a better eviction policy since multiple Decoders will allocate the same strings anyway).
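A minimal sketch of that capped-map idea (the cap and names are illustrative only):

// Hypothetical capped intern map with (pseudo-)random replacement.
const maxInterned = 1024

type cappedIntern map[string]string

func (c cappedIntern) intern(b []byte) string {
	if s, ok := c[string(b)]; ok {
		return s
	}
	if len(c) >= maxInterned {
		// Evict an arbitrary entry; map iteration order is randomized,
		// so this behaves roughly like random replacement.
		for k := range c {
			delete(c, k)
			break
		}
	}
	s := string(b)
	c[s] = s
	return s
}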

Some sort of LRU, or mechanism to limit the amount of memory we hold onto at once, seems like an OK tradeoff to me. I guess we can work this out on the CL itself, but it seemed clearly backwards incompatible to me to make this entirely unbounded by design.

I've been experimenting with several possible implementations of this.

Here are some of my thoughts:

  • As a string gets longer, the probability of a cache hit drops and the cache lookup cost grows. This makes sense since longer strings have an exponentially larger number of possible representations and also take longer to compare or hash. Thus, we should only do caching for small strings (e.g., <= 16B).
  • The Go memory allocator is already fairly fast in allocating small strings. For example, a 16B string takes 35ns on my machine. Granted, this is the up-front CPU cost, and doesn't include the asynchronous costs associated with GC heap scanning and memory freeing. Any cache lookup cost should be comparable to or faster than 35ns.
  • For most realistic JSON data, strings in JSON object names have a high degree of locality. That is, after seeing a specific object name there is a high probability of seeing it again within the next few dozen object names. This is especially true for a JSON array containing homogeneous JSON objects, but is also true for tree-like representations of JSON objects where each node has a similar set of fields. Thus, a small cache can be highly beneficial for JSON object names.
  • For most realistic JSON data, identical strings in general JSON values exhibit poor locality. Repeated strings in data tend to occur either for enum-like values (e.g., "SUCCESS") or for what are functionally "pointers" that create a graph out of JSON data (e.g., an ID reference to some other part of the JSON data). In order to detect repeated usages, we would need a relatively large cache. Also, since most strings will be a cache miss, there is the wasted cost of looking up a string in the common case where there is no benefit.
  • A naive map[string]string is not a good data structure for a string cache since it 1) makes it hard to prevent unbounded growth, 2) has double memory use (one slot for the key and one for the value, even though they hold the same string), and 3) holds a reference to a string value that may no longer be used by the application.

Here's a simple data structure that I tried that does seem sufficiently effective to use by default:

type stringCache [64]string

func (c *stringCache) make(b []byte) string {
    if len(b) < 2 || len(b) > 16 {
        return string(b)
    }
    // hashBytes is a fast hash of b; assumed here to return a uint64.
    h := hashBytes(b) % uint64(len(*c))
    if s := (*c)[h]; s == string(b) {
        return s
    }
    s := string(b)
    (*c)[h] = s
    return s
}

It's a small, fixed-size cache. Thus it avoids any concerns about unbounded memory growth or overtly wasting too much space holding onto old, unused strings. At worst, the memory usage for this is len(stringCache)*(unsafe.Sizeof(string) + maxStringSize) where maxStringSize is the maximum string length we try caching (16B in the above implementation). The effectiveness of this is highly dependent on the speed of hashBytes. The size of stringCache should be determined based on how often we expect to see the same strings repeat themselves and according to the birthday problem.
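As a rough worked example (assuming 16-byte string headers on a 64-bit platform), a 64-entry cache with a 16B maximum string length is bounded by 64*(16+16) = 2048 bytes, about 2 KiB per Decoder.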

Using runtime.memhash as the implementation (see #42710) of hashBytes, I get the following numbers:

BenchmarkMalloc/FrequentNames       48031    25150 ns/op     9600 B/op  1000 allocs/op
BenchmarkCache/FrequentNames        66622    18019 ns/op       96 B/op    10 allocs/op // 0.7x slower than Malloc
BenchmarkMap/FrequentNames          40252    31818 ns/op      678 B/op    11 allocs/op // 1.3x slower than Malloc

BenchmarkMalloc/PathalogicalNames   76096    15348 ns/op     8000 B/op   500 allocs/op
BenchmarkCache/PathalogicalNames    48894    24289 ns/op     8000 B/op   500 allocs/op // 1.6x slower than Malloc
BenchmarkMap/PathalogicalNames      10000   123864 ns/op    91250 B/op   524 allocs/op // 8.1x slower than Malloc

This tests three different approaches:

  • Malloc is using a naive []byte to string conversion.
  • Cache is using stringCache.
  • Map is using a map[string]string as the cache.

This tests two extreme sets of data:

  • FrequentNames represents the best-case scenario where the same 10 strings occur over repeatedly.
  • Pathological represents the worst-case scenario where every string is exactly 16B long, but all different, resulting in a 100% cache miss. This is pathological for several reasons: 1) every entry in the cache will be populated, 2) the string in every entry is also 16B, so we must do a byte-for-byte comparison, 3) only the last few bytes of the string differ, 4) every entry results in a miss. Thus, for every string we do O(n) work to hash the string, and O(n) work to compare the string (where n is the string length).

Some notes:

  • A map[string]string rarely out-performs malloc and seems to be a net loss.
  • Assuming that most data look more like FrequentNames than PathologicalNames, the use of stringCache is probably a net win.
  • At least the pathological case for stringCache is only 1.3x slower (surprisingly 8x slower for map[string]string).

@dsnet nice analysis. One more thing to try: use a fixed size array of strings, iterate instead of hashing for lookups, and do random replacement. Something like https://github.com/josharian/go/commit/ab074c9fa67a227ad89565941c501fda18979a88. (There are many variations on this theme available.)

Move-to-front is also a good caching scheme when there are a few very common keys.
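A rough sketch combining the fixed-size linear-search cache with move-to-front (the entry count is arbitrary):

// Hypothetical fixed-size cache with linear search and move-to-front.
type mtfCache [16]string

func (c *mtfCache) make(b []byte) string {
	for i, s := range c {
		// The gc toolchain avoids allocating for string(b) in a comparison.
		if s == string(b) {
			// Move-to-front so frequent strings stay cheap to find.
			copy(c[1:i+1], c[:i])
			c[0] = s
			return s
		}
	}
	s := string(b)
	// Miss: shift everything down by one and evict the last entry.
	copy(c[1:], c[:len(c)-1])
	c[0] = s
	return s
}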

Here's a realistic JSON data set:

https://storage.googleapis.com/synthea-public/synthea_sample_data_fhir_r4_sep2019.zip

This data set contains many large JSON documents of generated health data. These are the kind of documents for which caching keys can make a real difference.

Several observations:

  • There is little point in caching values of key:value pairs in these docs. However, caching keys makes a significant difference.
  • There are many unique keys.

    • Note that a map[string]string still holds only a single copy of each string: map[s] = s.
    • The problem of unbounded growth can be dealt with by using a separate string cache for each JSON document. Once the document ends, the cache can be forgotten.

@josharian and @randall77. I tried a simple linear search in some of my earlier approaches, and it does decently when there is a hit. However, I felt like it performed poorly in the cases where there were never any matches, since the miss cost was proportional to the maximum number of elements that we linearly search over.

One advantage of a linear search (with move-to-front) is that it provides a heuristic for which entry to evict. This avoids the collision problem that my naive hashing approach exhibits (due to the birthday problem). In order to have a decent hit rate on a list of strings with only N unique entries, the cache size needs to be approximately 2*N.

We could consider any the frequent item algorithms described here: http://dimacs.rutgers.edu/~graham/pubs/papers/freq.pdf

It might be worth dividing the set of keys up into buckets, hashing to the bucket, and performing one of the above algorithms on a single bucket.

...with the cheapest possible “hash” being string length, since we are only considering lengths 2..16.

@randall77 and @josharian. Thanks for the idea. I haven't yet applied any of the specific algorithms in the paper you referenced, but I suspect they won't perform well in this specific application since allocation of small strings is already so dang fast that any book-keeping is often just as expensive as malloc itself (defeating the point of string interning).

I incorporated some of the ideas above and some new ideas:

  • Using the data provided by @bserdar, I computed a histogram of the string lengths of all unique object names:
 1 |> 0
 2 |=> 1
 3 |======> 6
 4 |==========> 10
 5 |===========> 11
 6 |====================> 20
 7 |==============> 14
 8 |================> 16
 9 |===============> 15
10 |=========> 9
11 |========> 8
12 |========> 8
13 |=====> 5
14 |=======> 7
15 |=======> 7
16 |===> 3
  • Combining the above data with the idea to bucketize based on the length, it seems that 8 buckets using the lowest 3 bits of the length might provide an okay uniform distribution across all buckets (i.e., we bucketize 8B and 16B together, 1B and 9B together, 2B and 10B together, etc). This produces a histogram as follows:
 1, 9 |===============> 15
 2,10 |==========> 10
 3,11 |==============> 14
 4,12 |==================> 18
 5,13 |================> 16
 6,14 |===========================> 27
 7,15 |=====================> 21
 8,16 |===================> 19

Bucketizing based on length has the advantage of isolating the effects of a given string length to its own bucket. However, it has the disadvantage that certain lengths may be under-utilized (e.g., the 2B/10B bucket for the specific data I'm playing with), leading to wasted space.

  • I also observed that most general-purpose hashing algorithms perform slowly for small strings since 1) they are often not inlineable, 2) may not benefit from any wide-integer instructions (at least from my observation). Thus, I tried performing a hash of a short string by relying on wide-integer loads and hashing the resulting integers.

The current algorithm I came up with is:

type stringCache [256]string

func (c *stringCache) make(b []byte) string {
    if len(b) < 2 || len(b) > 16 {
        return string(b)
    }

    var h uint64
    {
        // Read the string in two parts using wide-integer loads.
        // The prefix and suffix may overlap, which is fine.
        switch {
        case len(b) >= 8:
            h ^= uint64(binary.LittleEndian.Uint64(b[:8]))
            h *= 0x00000100000001B3 // inspired by FNV-64
            h ^= uint64(binary.LittleEndian.Uint64(b[len(b)-8:]))
        case len(b) >= 4:
            h ^= uint64(binary.LittleEndian.Uint32(b[:4]))
            h *= 0x01000193 // inspired by FNV-32
            h ^= uint64(binary.LittleEndian.Uint32(b[len(b)-4:]))
        default:
            h ^= uint64(binary.LittleEndian.Uint16(b[:2]))
            h *= 0x010f // inspired by hypothetical FNV-16
            h ^= uint64(binary.LittleEndian.Uint16(b[len(b)-2:]))
        }

        // Collapse a 64-bit, 32-bit, or 16-bit hash into an 8-bit hash.
        h ^= h >> 32
        h ^= h >> 16
        h ^= h >> 8

        // The cache has 8 buckets based on the lower 3-bits of the length.
        h = ((h << 3) | uint64(len(b)&0b111)) % uint64(len(*c))
    }

    if s := (*c)[h]; s == string(b) {
        return s
    }
    s := string(b)
    (*c)[h] = s
    return s
}

Relative to my earlier algorithm, this is decently faster and results in fewer misses:

BenchmarkMalloc/Best    41395   26997 ns/op 9600 B/op   1000 allocs/op
BenchmarkCache1/Best    66477   18359 ns/op   96 B/op     10 allocs/op // 1.47x faster
BenchmarkCache2/Best    88276   13699 ns/op   96 B/op     10 allocs/op // 1.97x faster
BenchmarkMap/Best       39099   30543 ns/op  678 B/op     11 allocs/op // 0.88x faster

BenchmarkMalloc/Real    2503    505818 ns/op    161456 B/op 18529 allocs/op
BenchmarkCache1/Real    2769    440636 ns/op     28960 B/op  2475 allocs/op // 1.15x faster
BenchmarkCache2/Real    4138    302386 ns/op     17712 B/op  1139 allocs/op // 1.67x faster
BenchmarkMap/Real       1668    741258 ns/op     31706 B/op   467 allocs/op // 0.68x faster

BenchmarkMalloc/Worst   72957    16689 ns/op     8000 B/op  500 allocs/op
BenchmarkCache1/Worst   46144    26646 ns/op     8000 B/op  500 allocs/op // 1.60x slower
BenchmarkCache2/Worst   49250    23897 ns/op     8000 B/op  500 allocs/op // 1.43x slower
BenchmarkMap/Worst      10000   135081 ns/op    91233 B/op  524 allocs/op // 8.09x slower

where Cache2 is the algorithm in this post, and Cache1 is the earlier algorithm. The Real benchmark was switched to using object names from the dataset provided by @bserdar.

Change https://golang.org/cl/277376 mentions this issue: WIP: runtime: add string interning profile
