Badger's Value call, used to obtain the buffer with the content of a Get query, indicates that the buffer's content is valid only within the domain of the transaction where the query was performed. This causes IPFS to (correctly) copy the value in order to decouple the datastore (which is tied to the Badger transaction model) from the upper layers.
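A minimal sketch of the pattern being described, assuming the badger v1-era API in which Item.Value returns a slice that is only valid while the transaction is open; the actual go-ds-badger code may differ in detail:

```go
package main

import (
	badger "github.com/dgraph-io/badger"
)

func getCopied(db *badger.DB, key []byte) ([]byte, error) {
	var out []byte
	err := db.View(func(txn *badger.Txn) error {
		item, err := txn.Get(key)
		if err != nil {
			return err
		}
		val, err := item.Value() // only valid inside this transaction
		if err != nil {
			return err
		}
		// The decoupling copy discussed in this issue: the upper layers get
		// a buffer they own, at the cost of an allocation plus a memmove.
		out = make([]byte, len(val))
		copy(out, val)
		return nil
	})
	return out, err
}
```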
The result is a severe performance impact (hence the high priority I'm assigning to this issue; my assessment could be exaggerated and the priority might be adjusted later). The CPU time spent in memmove_amd64.s is of the same order of magnitude as (and sometimes even bigger than) the get operation itself in the levelHandler, which should be the core consumer of CPU time at this point in the program. I'll report some concrete pprof statistics to quantify this impact.
I need to research this further with Badger's people to understand why they recycle buffer slices instead of relying on Go's GC (the obvious advantage seems to be avoiding re-allocations, but there may be a more fundamental reason), and to see whether the API could be generalized to let the user take ownership of the buffer it receives; the restriction of operating only within the domain of the current transaction seems a bit too severe.
Compare this situation with the current default flatfs datastore, which just returns the buffer received from ReadFile without making a decoupling copy: that buffer was created specifically for the read operation in readAll, so there is no extra performance penalty for using it outside the datastore context (flatfs is still slower than badger for other reasons related to its disk write performance).
I think the right way forward with this is to change the datastore interfaces to allow the caller to operate on the data within a closure. This way we could unmarshal our objects directly from the underlying buffer, instead of making a copy that we only use to call unmarshal with.
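Something like the following hypothetical interface, for illustration only (the name and shape are not an existing go-datastore API):

```go
import ds "github.com/ipfs/go-datastore"

// ViewDatastore is a hypothetical extension of the datastore interface: the
// value is lent to a closure instead of being returned as an owned copy.
type ViewDatastore interface {
	// View invokes fn with the raw value stored under key. The slice is only
	// valid for the duration of fn, so callers can unmarshal directly from it
	// and the datastore never needs to make a decoupling copy.
	View(key ds.Key, fn func(value []byte) error) error
}
```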
That aside, I'm curious to see how the CPU usage of badger and flatfs compares for similar operations. I'm thinking it will still be much better despite the extra copy.
As far as I know, the reason they do that is that the data you get is not copied: it is a slice backed by an mmap of a segment of a file. They don't mmap whole files, nor all of them. After the closure is done they can change the mapping to a different segment or file.
This also means that they do not allocate slices, they just allocate a slice header, and it is possible that those headers are allocated on the stack (no GC pressure) if everything is done correctly.
But I am not 100% sure about that.
@Kubuxu Thanks for pointing me to mmap. What I was seeing as a CPU-intensive memory copy was actually the kernel reading the file and loading those memory pages during the copy operation: the pprof statistics pointed to the VMOVDQU (SI), Y0 instruction (and related ones), in charge of loading the source memory, as the most expensive, and pprof has no way to know that it is actually the kernel at work (I should have known better). I'm removing the P1 tag. The copy is still an expensive operation (especially the malloc'ing of new buffers), but it is no longer so easy to distinguish its real cost (if I profile a DB with no mmap, the alloc/copy operation's CPU time is still about 1/5th of the file-reading calls, so it's definitely not negligible).
With respect to buffer ownership, having slices pointing to the same underlying array as the fmap of a value log file doesn't seem like enough of a constraint: at the next mmap a new array will be created, and the buffer returned by Get will still point to the old (valid) one; Go's GC can take care of when to reclaim it. The mapped regions are read-only (as far as I've seen), so the array won't be modified (except by other consumers of Get, but that's another issue -- that should be documented in Badger's API). The downside I can see is that (for example) a 2 MB mapped region may remain in memory just because someone is holding a 1 KB slice of it, which could potentially cause some memory bloat.
As you said, the fmap slice is created from an mmap pointer through the unsafe API, but I'm guessing the GC will still take care of that pointer even when converted to a slice.
https://github.com/golang/sys/blob/0deb464c5a1b48110355376f934e733e43b483bf/unix/syscall_unix.go#L77-L91
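For reference, a minimal sketch of the mechanism being discussed, using golang.org/x/sys/unix (the file name is a placeholder; Badger maps segments of its value log files and its code differs in detail):

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	f, err := os.Open("value.log") // placeholder for a value log file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		panic(err)
	}

	// Map the file read-only. The returned []byte is backed by the mapped
	// pages, not by a Go-allocated array: slicing it allocates only a slice
	// header, which is why no copy (and no GC pressure) is involved.
	data, err := unix.Mmap(int(f.Fd()), 0, int(fi.Size()), unix.PROT_READ, unix.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer unix.Munmap(data) // once unmapped, any retained sub-slice is invalid

	fmt.Println(len(data))
}
```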
@whyrusleeping
That aside, I'm curious to see how the CPU usage of badger and flatfs compares for similar operations. I'm thinking it will still be much better despite the extra copy.
Yes, badger is clearly much more efficient than flatfs (which is reading all around the file system for the scattered files/blocks); I only mentioned it to illustrate the difference in buffer ownership. Despite all these (minor) analyses, badger is definitely the way to go.
I think @whyrusleeping's suggestion is the right way to approach the problem.
Why make a copy if we are only going to unmarshal data out of it? Better to just pass a closure to operate on the buffer directly.
Yes, from what I'm seeing in the PBNode.Unmarshal function a new array is created for the extracted data, so there would be buffer isolation (making the previous copy unnecessary), and I'm assuming the whole protobuf family of unmarshalers works the same way. Still, I would like a clear API boundary for the ownership of the buffers being handled here.
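Tying the two points together, a sketch under the same assumptions as the hypothetical ViewDatastore above (pb.PBNode stands in for the protobuf node type being discussed; names are illustrative):

```go
func getNode(d ViewDatastore, key ds.Key) (*pb.PBNode, error) {
	node := new(pb.PBNode)
	err := d.View(key, func(val []byte) error {
		// Unmarshal copies the fields it keeps into node's own memory, so
		// the borrowed buffer does not escape the closure.
		return node.Unmarshal(val)
	})
	return node, err
}
```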
@schomatis Yeah, I've been wanting to do a refactor for a while to make this better: https://github.com/ipfs/go-datastore/issues/64
With respect to buffer ownership, having slices pointing to the same underlying array as the fmap of a value log file doesn't seem like enough of a constraint
This is true unless badger closes the underlying file. We should check this.
This is true unless badger closes the underlying file. We should check this.
Yes, I'm raising an issue in Badger to check with them and discuss this further.
@schomatis AFAIK, their structure doesn't allow them to write to the same table (that is what they call mmap'able segments) if there are readers still open (i.e., still-valid buffers).
@Kubuxu Could you elaborate a bit more on that please? I'm not sure I'm understanding your last comment.
There are Write and Get operations (or some other names in badger). While a Get operation is working on some table (an internal shard in badger), other Get operations can also use that table, but a Write into the table has to wait until all Get operation closures have finished before it is allowed to proceed. This is how I understood it.
While writing the issue for Badger I realized that the dependency is not so much on the underlying array itself as on the mapped address space of the log file the array points into. Closing the file itself is not a problem, but Badger eventually unmaps that address space, causing invalid references, so the buffer can't be held indefinitely by the upper layers. @Kubuxu mentioned something of the sort about mapped regions being reused, but I didn't worry because the addr passed to mmap was always NULL; still, the regions will always eventually be unmapped (which is the correct behavior).
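To illustrate the hazard in code (a sketch with placeholder arguments, not Badger's actual code path):

```go
import "golang.org/x/sys/unix"

// danglingSlice shows why a buffer backed by a mapped value log segment
// cannot be held indefinitely: the Go GC never owned those pages, so once
// the region is unmapped the retained slice points at invalid memory.
func danglingSlice(fd, size, off, n int) {
	data, _ := unix.Mmap(fd, 0, size, unix.PROT_READ, unix.MAP_SHARED)
	held := data[off : off+n] // slice handed out to an upper layer
	_ = unix.Munmap(data)     // Badger eventually does the equivalent of this
	_ = held[0]               // faults (SIGSEGV/SIGBUS); the GC cannot prevent it
}
```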
This doesn't seem to be a problem when accessing log files through standard I/O (where buffers are created specifically for the read operation), but I understand that performs much worse than memory-mapping, defeating the initial purpose of this issue.
I'm closing this issue in favor of @whyrusleeping's suggestion to use closures, as Unmarshal will always force a copy anyway.