As part of our overall project to improve on-disk buffering, we need to collect data to inform our decisions. I'd like to improve our benchmarks to collect more performance data on various on-disk buffering scenarios. For example, we should have a benchmark that reproduces #1179.
The end goal should help us answer questions like:
Current implementation

Currently, vector has two types of buffers available to users. The first is an in-memory buffer that holds up to a fixed number of events (the default is 500). The second is a disk-based buffer backed by leveldb, an embedded key-value database. Implementation-wise, both buffers basically act like a multi-producer single-consumer channel: each provides a reader and writer end, where the writer is clonable. The memory buffer is just a futures::sync::mpsc channel, while the disk-backed buffer is a custom implementation that can be found in src/buffers/disk.rs.
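To make that shared shape concrete, here is a minimal sketch (not Vector's actual code) of the channel-style interface, using the futures 0.1 futures::sync::mpsc channel that backs the in-memory buffer. The Event type here is a stand-in:

```rust
// Assumes futures = "0.1" in Cargo.toml.
use futures::sync::mpsc;
use futures::{Future, Sink, Stream};

// Stand-in for Vector's `Event` type.
#[derive(Debug)]
struct Event(String);

fn main() {
    // Bounded channel, mirroring the in-memory buffer's default capacity of 500.
    let (writer, reader) = mpsc::channel::<Event>(500);

    // Multi-producer: the writer end can be cloned for each upstream component.
    let writer2 = writer.clone();

    // `send` consumes the sender and returns it once the event is accepted.
    writer.send(Event("first".into())).wait().unwrap();
    writer2.send(Event("second".into())).wait().unwrap();

    // Single consumer: the reader end drains events in order.
    for event in reader.take(2).wait() {
        println!("{:?}", event.unwrap());
    }
}
```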
The performance of the memory buffer is about 80% better than that of the disk buffer. This is mainly due to the implementation of the disk buffer: the current implementation needs to encode the event, write it to the database, and then trigger the reader task to read the next event. Once the event has been read from the database, we decode it and send it to the sink. This process is quite expensive, which explains why the disk buffer's performance is poor.
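Roughly, every event pays for an encode, a database write, a task wakeup, a database read, and a decode on the critical path. The following is a hypothetical, simplified sketch of that sequence (a BTreeMap stands in for leveldb, and the encode/decode helpers are illustrative, not Vector's actual ones):

```rust
use std::collections::BTreeMap;

type Key = u64;

fn encode(event: &str) -> Vec<u8> {
    event.as_bytes().to_vec() // the real buffer uses a protobuf-style encoding
}

fn decode(bytes: &[u8]) -> String {
    String::from_utf8(bytes.to_vec()).unwrap()
}

fn main() {
    let mut db: BTreeMap<Key, Vec<u8>> = BTreeMap::new();
    let mut next_key: Key = 0;

    // Writer side: encode and persist the event...
    let encoded = encode("hello world");
    db.insert(next_key, encoded);
    next_key += 1;
    // ...then wake the reader task (in the real implementation this is a
    // futures task notification; omitted here).

    // Reader side (normally a separate task): fetch the next event and
    // decode it before handing it to the sink.
    if let Some((key, bytes)) = db.iter().next() {
        let event = decode(bytes);
        println!("read event {} -> {:?}", key, event);
    }
}
```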
Another thing that we have not clearly defined in our docs is what our disk buffer's durability guarantees are. Currently, we buffer at least 100 events in memory that are already encoded. Once we have added the 100th event, we will then write these events to leveldb. This in turn actually only writes the batch to the operating system's memory, which then asynchronously writes it to durable disk storage. Because of this, we have no guarantees about when data becomes durable with respect to a machine crash. That said, we are still durable across process crashes and panics.
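For context on that durability point, here is a hedged sketch of a batched, non-synced leveldb write using the leveldb and db-key crates (the API details here are my assumption, not lifted from src/buffers/disk.rs). With sync left at false, the write call returns once the batch reaches the OS page cache, which is why it survives a process crash but not necessarily a machine crash:

```rust
// Assumes leveldb = "0.8" and db-key (which implements `Key` for i32).
use leveldb::database::batch::{Batch, Writebatch};
use leveldb::database::Database;
use leveldb::options::{Options, WriteOptions};
use std::path::Path;

fn main() {
    let mut options = Options::new();
    options.create_if_missing = true;
    let db: Database<i32> =
        Database::open(Path::new("/tmp/buffer-db"), options).unwrap();

    // Accumulate up to 100 already-encoded events into one batch...
    let mut batch = Writebatch::new();
    for key in 0..100 {
        batch.put(key, b"encoded event bytes");
    }

    // ...then hand the whole batch to leveldb. With `sync: false` (the
    // default) this only reaches the operating system's memory; the
    // kernel flushes it to durable storage asynchronously.
    let write_opts = WriteOptions::new();
    db.write(write_opts, &batch).unwrap();
}
```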
In the end, I don't really feel like it's vector's job to be a durable store for log data. Instead, this job should be offloaded to something like kafka or s3.
Our current implementation of the disk buffer is decently simple and easy to maintain. Through my testing of the benchmarks and tests, I've noticed that it is actually quite stable in its current state and works as expected across all three of our supported operating systems. Both of the issues that led us to think we needed to fix the disk buffer turned out not to be directly related to our use of leveldb. This leads me to suggest that we don't replace our current disk buffer implementation. I don't think we have seen many users (maybe I'm the only one? 🙂) complain about its performance, so it doesn't make much sense right now to invest heavily in fixing it.
Possible Solutions

Even though we may not want to change our disk buffer right now, that doesn't mean we won't want to in the future. One idea might be to offload our disk buffer work to a background task: create a variant of the in-memory buffer that uses the disk buffer as extra storage space for events instead of applying back pressure up the topology path. The goal is to optimize for the happy path, where we still write to disk but, in the critical path of sending events, only need to do the same work as the memory buffer. Since we don't actually provide any durability guarantees, writing to disk asynchronously in a background task would not change how we present the disk buffer to users. This means that we can remove encoding, writing to the database, switching tasks, and decoding entirely from the critical poll path of our sinks. That said, this implementation comes with much more complexity and would need to be heavily tested, and the additional complexity may not be worth it. A rough sketch of the idea follows.
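This is a hypothetical sketch of the "background disk writer" idea, not a proposed implementation: the critical path only does the work of the in-memory buffer (a channel send), while a background task drains a copy of each event to disk. All names here are illustrative, and the actual leveldb write is elided:

```rust
use std::sync::mpsc;
use std::thread;

#[derive(Clone, Debug)]
struct Event(String);

fn encode(event: &Event) -> Vec<u8> {
    event.0.as_bytes().to_vec()
}

fn main() {
    // Channel feeding the sink: this is the critical poll path.
    let (sink_tx, sink_rx) = mpsc::sync_channel::<Event>(500);
    // Channel feeding the background disk writer.
    let (disk_tx, disk_rx) = mpsc::channel::<Event>();

    // Background task: encoding and database writes happen off the
    // critical path.
    let writer = thread::spawn(move || {
        for event in disk_rx {
            let _bytes = encode(&event); // would be batched into leveldb here
        }
    });

    for i in 0..3 {
        let event = Event(format!("event {}", i));
        disk_tx.send(event.clone()).unwrap(); // fire-and-forget to disk
        sink_tx.send(event).unwrap(); // same cost as the memory buffer
    }

    drop(disk_tx);
    drop(sink_tx);
    for event in sink_rx {
        println!("sink received {:?}", event);
    }
    writer.join().unwrap();
}
```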
Another possible solution would be to adopt an async implementation of https://github.com/postmates/hopper, which has a slightly different design than the one mentioned above. Instead, hopper starts to fill the disk once the in-memory buffer is full. This means it can't handle storing all events across a restart the way vector currently can. For now this approach seems like it would decrease the benefits of vector, and it could provide inconsistent performance since it is a hybrid approach.
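To illustrate the contrast, here is a toy sketch of that spill-over design (my own simplification of hopper's idea, not its actual code): events live in memory until a threshold and only then spill to disk, so whatever is still in memory is lost on restart.

```rust
use std::collections::VecDeque;

const MEM_CAPACITY: usize = 4;

struct HybridBuffer {
    memory: VecDeque<String>, // fast path, lost on restart
    disk: Vec<String>,        // stand-in for an on-disk queue, survives a restart
}

impl HybridBuffer {
    fn push(&mut self, event: String) {
        if self.memory.len() < MEM_CAPACITY {
            self.memory.push_back(event);
        } else {
            self.disk.push(event); // only spill once memory is full
        }
    }
}

fn main() {
    let mut buf = HybridBuffer { memory: VecDeque::new(), disk: Vec::new() };
    for i in 0..6 {
        buf.push(format!("event {}", i));
    }
    println!("{} in memory, {} spilled to disk", buf.memory.len(), buf.disk.len());
}
```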
I am going to close this issue as all points have been resolved.
Thanks for writing this up! Just one small thing I noticed:
Currently, we buffer at least 100 events in memory that are already encoded. Once we have added the 100th event, we will then write these events to leveldb.
This is actually the maximum amount we'll batch into a single write, and batches will be written almost immediately if our input isn't saturated. If the input is saturated, we'll write batches of 100.
@lukesteensen ah you're right, the Forward future will poll_complete on a pending event if it's not saturated. Good catch!