Hello Community,
I have been diving into memory/span lately, due in part to the great article here by @aalmada.
I took the code found in the article and made my own fork and branch focusing on enumerations, varying the ItemsBufferCount property, with the following results:
BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-4820K CPU 3.70GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=2.1.401
[Host] : .NET Core 2.1.3 (CoreCLR 4.6.26725.06, CoreFX 4.6.26725.05), 64bit RyuJIT
Job=ShortRun Toolchain=InProcessToolchain IterationCount=5
LaunchCount=1 WarmupCount=3
| Method | ItemsCount | ItemsBufferCount | Mean | Error | StdDev | Scaled | ScaledSD | Gen 0 | Allocated |
|-------------- |----------- |----------------- |---------:|----------:|----------:|-------:|---------:|-------:|----------:|
| RawStackalloc | 1000000 | 64 | 1.985 ms | 0.0394 ms | 0.0102 ms | 1.00 | 0.00 | - | 0 B |
| RawForIndexer | 1000000 | 64 | 1.727 ms | 0.0954 ms | 0.0248 ms | 0.87 | 0.01 | - | 1048 B |
| | | | | | | | | | |
| RawStackalloc | 1000000 | 128 | 1.845 ms | 0.0556 ms | 0.0144 ms | 1.00 | 0.00 | - | 0 B |
| RawForIndexer | 1000000 | 128 | 1.965 ms | 0.0653 ms | 0.0170 ms | 1.06 | 0.01 | - | 2072 B |
| | | | | | | | | | |
| RawStackalloc | 1000000 | 256 | 1.822 ms | 0.0583 ms | 0.0151 ms | 1.00 | 0.00 | - | 0 B |
| RawForIndexer | 1000000 | 256 | 1.680 ms | 0.0555 ms | 0.0144 ms | 0.92 | 0.01 | - | 4120 B |
| | | | | | | | | | |
| RawStackalloc | 1000000 | 512 | 1.705 ms | 0.0551 ms | 0.0143 ms | 1.00 | 0.00 | - | 0 B |
| RawForIndexer | 1000000 | 512 | 1.687 ms | 0.0839 ms | 0.0218 ms | 0.99 | 0.01 | - | 8216 B |
| | | | | | | | | | |
| RawStackalloc | 1000000 | 768 | 1.777 ms | 0.3436 ms | 0.0893 ms | 1.00 | 0.00 | - | 0 B |
| RawForIndexer | 1000000 | 768 | 1.822 ms | 0.1926 ms | 0.0500 ms | 1.03 | 0.05 | 1.9531 | 12312 B |
| | | | | | | | | | |
| RawStackalloc | 1000000 | 1024 | 1.834 ms | 0.0887 ms | 0.0230 ms | 1.00 | 0.00 | - | 0 B |
| RawForIndexer | 1000000 | 1024 | 1.911 ms | 0.3025 ms | 0.0786 ms | 1.04 | 0.04 | 1.9531 | 16408 B |
My question is: which result is the "winner" here? It would seem that RawForIndexer @ 256 is the best, but it also creates allocations. My understanding is that allocations are bad because they add work for the garbage collector, which eventually has to run and imposes a cost at a later time. So, is it really the "winner" here or is there another selection (or obvious consideration) that I am overlooking?
I guess another aspect to my inquiry here is: do the results above include GC time incurred? That is, is GC "cost" included in these results?
The other question of course is, why is RawStackalloc not faster? :) These methods are identical in function with the sole exception that RawStackalloc uses stackalloc while RawForIndexer uses a new array allocation. From my understanding, since this array is not on the heap but rather on the stack with RawStackalloc that should see better performance.
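For reference, the difference being asked about can be sketched as follows. This is a minimal illustration and not the code from the fork: the class and method names are hypothetical, and `long` is an assumed element type.

```csharp
using System;

public static class BufferSketch // hypothetical name, not from the article or the fork
{
    // The scratch buffer lives in the current stack frame; nothing is handed to the GC.
    public static long SumWithStackalloc(int bufferCount)
    {
        Span<long> buffer = stackalloc long[bufferCount];
        return FillAndSum(buffer);
    }

    // The scratch buffer is a managed array on the heap; it becomes garbage after the call.
    public static long SumWithArray(int bufferCount)
    {
        Span<long> buffer = new long[bufferCount];
        return FillAndSum(buffer);
    }

    // The work done on the buffer is identical in both cases.
    private static long FillAndSum(Span<long> buffer)
    {
        long sum = 0;
        for (int i = 0; i < buffer.Length; i++)
        {
            buffer[i] = i;
            sum += buffer[i];
        }
        return sum;
    }
}
```

Both methods do the same work once the buffer exists, so the only difference is the cost of obtaining it. One plausible reading of the near-identical timings above is that a single small buffer per invocation is cheap next to a loop over 1,000,000 items.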
Any insight here would be greatly appreciated. Thank you in advance!
My question is: which result is the "winner" here?
It depends on what your goal is. Do you want to make sure that there will be no GC so your code always executes in the same amount of time (low latency)? Or a GC.Collect from time to time is ok because you optimize for the maximum Throughput?
do the results above include GC time incurred? That is, is GC "cost" included in these results?
I have described this on my blog. I can see that you are using InProcessToolchain here, which can't isolate a benchmark from the side effects of previous benchmarks that allocated memory. My personal recommendation for InProcessToolchain is to use it only when your benchmarks have no side effects (like memory allocation); an example would be a simple CPU benchmark.
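As a rough sketch of that recommendation, a benchmark class that allocates can simply leave the in-process toolchain out, so that BenchmarkDotNet builds a separate project and runs each benchmark in its own process. The class name and method bodies below are hypothetical placeholders, and the attribute names assume a reasonably recent BenchmarkDotNet version.

```csharp
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// No in-process toolchain is configured, so the default (out-of-process) toolchain is used.
[MediumRunJob]
[MemoryDiagnoser] // reports the Gen 0/1/2 and Allocated columns
public class BufferBenchmarks // hypothetical name
{
    [Params(64, 256, 1024)]
    public int ItemsBufferCount { get; set; }

    [Benchmark(Baseline = true)]
    public long RawStackalloc()
    {
        // Placeholder body: the real benchmark enumerates items through this buffer.
        Span<long> buffer = stackalloc long[ItemsBufferCount];
        buffer[0] = 1;
        return buffer[0];
    }

    [Benchmark]
    public long RawForIndexer()
    {
        // Placeholder body: same work, but the buffer is a heap-allocated array.
        long[] buffer = new long[ItemsBufferCount];
        buffer[0] = 1;
        return buffer[0];
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<BufferBenchmarks>();
}
```

Running each benchmark in a fresh process means memory allocated by one benchmark cannot trigger a collection in the middle of the next one.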
From my understanding, since this array is not on the heap but rather on the stack with RawStackalloc
The best answer you can get is "What's the Difference Between Stack and Heap?" paragraph from the Pro .NET Performance book. I highly recommend to read the entire book, it's the best book about .NET Performance.
I can see that you are using InProcessToolchain here,
Ah that is a good point, @adamsitnik! I should have known better than to run InProcess. Here it is in a compiled context, via MediumRun:
// * Summary *
BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-4820K CPU 3.70GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=2.1.401
[Host] : .NET Core 2.1.3 (CoreCLR 4.6.26725.06, CoreFX 4.6.26725.05), 64bit RyuJIT
MediumRun : .NET Core 2.1.3 (CoreCLR 4.6.26725.06, CoreFX 4.6.26725.05), 64bit RyuJIT
Job=MediumRun IterationCount=5 LaunchCount=2
WarmupCount=10
| Method | ItemsCount | ItemsBufferCount | Mean | Error | StdDev | Scaled | ScaledSD | Allocated |
|-------------- |----------- |----------------- |---------:|----------:|----------:|-------:|---------:|----------:|
| RawStackalloc | 1000000 | 64 | 2.140 ms | 0.0773 ms | 0.0511 ms | 1.00 | 0.00 | 0 B |
| RawForIndexer | 1000000 | 64 | 2.126 ms | 0.0267 ms | 0.0176 ms | 0.99 | 0.02 | 1048 B |
| | | | | | | | | |
| RawStackalloc | 1000000 | 128 | 1.972 ms | 0.0448 ms | 0.0296 ms | 1.00 | 0.00 | 0 B |
| RawForIndexer | 1000000 | 128 | 2.084 ms | 0.0381 ms | 0.0252 ms | 1.06 | 0.02 | 2072 B |
| | | | | | | | | |
| RawStackalloc | 1000000 | 256 | 1.978 ms | 0.0527 ms | 0.0349 ms | 1.00 | 0.00 | 0 B |
| RawForIndexer | 1000000 | 256 | 1.986 ms | 0.0506 ms | 0.0335 ms | 1.00 | 0.02 | 4120 B |
| | | | | | | | | |
| RawStackalloc | 1000000 | 512 | 1.845 ms | 0.0320 ms | 0.0212 ms | 1.00 | 0.00 | 0 B |
| RawForIndexer | 1000000 | 512 | 1.916 ms | 0.0933 ms | 0.0617 ms | 1.04 | 0.03 | 8216 B |
| | | | | | | | | |
| RawStackalloc | 1000000 | 768 | 1.831 ms | 0.0522 ms | 0.0346 ms | 1.00 | 0.00 | 0 B |
| RawForIndexer | 1000000 | 768 | 1.816 ms | 0.0570 ms | 0.0377 ms | 0.99 | 0.03 | 12312 B |
| | | | | | | | | |
| RawStackalloc | 1000000 | 1024 | 1.855 ms | 0.1060 ms | 0.0701 ms | 1.00 | 0.00 | 0 B |
| RawForIndexer | 1000000 | 1024 | 1.831 ms | 0.0534 ms | 0.0353 ms | 0.99 | 0.04 | 16408 B |
It's a little more in line with expectations, but RawForIndexer still squeaks ahead a few times, which is still unexpected.
Or a GC.Collect from time to time is ok because you optimize for the maximum Throughput?
I guess that depends on the performance impact of the GC sweep, which is what I am trying to ascertain (more below).
I have described this on my blog.
I was actually going to make mention of that but am now kicking myself for not doing so. I do see "inclusive" but if I understand correctly that is in the context of _allocations_ (that is, the cost in _putting_ objects _on_ the heap) rather than the context of GC (that is, the cost of _taking_ objects _off_ the heap), correct? What I am interested in knowing is whether the reported results factor in the GCs that happen or whether GC is not factored into the results at all.
I highly recommend to read the entire book, it's the best book about .NET Performance.
I also saw your recommendation for this book on your blog. I guess I just walked myself into getting a copy of my own now. 😆 Thanks again for your patience and assistance!
By default, the BDN engine induces GC.Collect before and after every iteration. This is the MemoryCleanup in the pseudocode from our docs:
// every iteration invokes the method (invokeCount / unrollFactor) times
Measurement RunIteration(Method method, long invokeCount, long unrollFactor)
{
    IterationSetup();
    MemoryCleanup();
    var clock = Clock.Start();
    for (long i = 0; i < invokeCount / unrollFactor; i++)
    {
        // we perform manual loop unrolling!!
        method(); // 1st call
        method(); // 2nd call
        method(); // (unrollFactor - 1)'th call
        method(); // unrollFactor'th call
    }
    var clockSpan = clock.GetElapsed();
    IterationCleanup();
    MemoryCleanup();
    return Measurement(clockSpan);
}
If the GC decides to collect memory during the actual workload (between Clock.Start and clock.GetElapsed), then it is included in the final results and the Gen X columns appear in the results. So in that case, the benchmark includes both the allocation and the deallocation cost. If a given benchmark iteration performs too few invocations and the GC does not clean up the memory, then the result does not contain the cost of deallocation. However, BDN typically performs so many invocations per iteration that it's almost never an issue.
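The same idea can be observed outside of BDN with a small sketch. This is not how BenchmarkDotNet is implemented; the class and method names are hypothetical, and it only assumes the standard `Stopwatch` and `GC.CollectionCount` APIs.

```csharp
using System;
using System.Diagnostics;

public static class GcInsideTimedWindow // hypothetical name
{
    // Returns the elapsed time, the number of Gen 0 collections that happened while
    // the stopwatch was running, and a checksum so the loop has an observable result.
    public static (TimeSpan Elapsed, int Gen0Collections, long Checksum) Measure(int invocations)
    {
        // Stand-in for MemoryCleanup: start the timed window from a clean heap.
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        int gen0Before = GC.CollectionCount(0);
        var stopwatch = Stopwatch.StartNew();

        long checksum = 0;
        for (int i = 0; i < invocations; i++)
        {
            long[] buffer = new long[256]; // heap allocation on every invocation
            buffer[0] = i;
            checksum += buffer[0];
        }

        stopwatch.Stop();
        int gen0During = GC.CollectionCount(0) - gen0Before;
        return (stopwatch.Elapsed, gen0During, checksum);
    }
}
```

With enough invocations, collections usually occur inside the timed window, so their cost is part of the elapsed time. With only a handful of invocations the count is likely to stay at zero and the measurement reflects allocation cost only.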
Awesome. Thank you @adamsitnik that was exactly the (very!) valuable information that I needed. Thank you again for taking the time here to help out!
To add some additional context and (hopefully correct) knowledge here.
Consider the following simple class and benchmarks:
public sealed class BasicClass
{
    public BasicClass(object reference) => Reference = reference;
    public object Reference { get; }
}
// ...
[Benchmark]
public void Collect()
{
    GC.Collect(0);
}
[Benchmark]
public void MeasureClass()
{
    new BasicClass(null);
}
The results of which are:
| Method | Mean | Error | StdDev | Gen 0 | Allocated |
|------------- |--------------:|------------:|------------:|----------:|----------:|
| Collect | 21,002.145 ns | 382.8546 ns | 470.1798 ns | 1000.0000 | 0 B |
| MeasureClass | 3.007 ns | 0.0853 ns | 0.0838 ns | 0.0046 | 24 B |
So, if every generation 0 garbage collection takes 21,002.145 ns to complete, and the Gen 0 column (which BDN reports as collections per 1,000 operations) shows 0.0046 for MeasureClass, then 1,000 allocations incur a total of about 21,002.145 x 0.0046 = ~96.61 ns spent in garbage collection.
That further means that there is an approximate "hidden" GC cost of 96.61 / 1,000 = ~0.097 ns per operation which involves an allocation.
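As a sanity check on these figures, and assuming the Gen 0 column is read as collections per 1,000 operations (which is how BenchmarkDotNet's legend describes it), the same numbers also imply how much memory is allocated between two generation 0 collections:

```latex
\begin{align*}
\text{operations per Gen 0 collection} &\approx \frac{1000}{0.0046} \approx 217{,}391 \\
\text{bytes allocated per Gen 0 collection} &\approx 217{,}391 \times 24\ \text{B} \approx 5.2\ \text{MB} \\
\text{amortized GC cost per operation} &\approx \frac{21{,}002.145\ \text{ns}}{217{,}391} \approx 0.097\ \text{ns}
\end{align*}
```

A few megabytes allocated between generation 0 collections seems like a reasonable order of magnitude, which supports that reading of the column. Keep in mind that an induced GC.Collect(0) on a nearly empty heap is only a rough proxy for a collection that has real garbage to process.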
That's my impression, at least. As always, please feel free to correct my math/misunderstanding here.
EDIT: I noticed (naturally, after I posted my results) that the GC.Collect was doing a FULL sweep, including generations 1 and 2. I have updated the numbers to account for generation 0 only.