Stl: Parallel sort algorithm performance on Threadripper

Created on 27 Aug 2020  路  32Comments  路  Source: microsoft/STL

Hi,

I tried to use std::sort(std::execution::par parallel implementation using custom predicate on gigabytes of data.
Unfortunately its result gave no much benefit comparably to single-thread std::sort.
Simple custom implementation of merge-sort gave much better benefit:

        auto TotalThreads = std::thread::hardware_concurrency();

        template<class Iter, class Pred>
        void quick_merge_sort(Iter first, Iter last, Pred pred)
        {
            auto length = last - first;
            auto chunkSize = length / TotalThreads;
            auto chunks = length / chunkSize;
            if (chunkSize > 1)
            {
                auto tail = length - (chunkSize * chunks);
                auto hasTail = tail > 0;
                auto tailStart = first + chunks * chunkSize;
                {
                    std::deque<std::future<void>> tasks;

                    for (auto thread = 0; thread < chunks; ++thread) {
                        auto chunkStart = first + thread * chunkSize;
                        auto chunkEnd = chunkStart + chunkSize;
                        tasks.emplace_back(std::async(std::launch::async, &std::sort<Iter, Pred>, chunkStart, chunkEnd, pred));
                    }

                    if (hasTail)
                        std::sort(tailStart, last, pred);

                    while (chunkSize < length)
                    {
                        auto bigChunkStart = first;
                        auto bigChunkMiddle = bigChunkStart;
                        auto bigChunkEnd = bigChunkStart;
                        for (auto thread = 0; thread < chunks; ++thread) {
                            tasks.pop_front();
                            bigChunkEnd = bigChunkMiddle + chunkSize;
                            if (thread % 2) {
                                tasks.emplace_back(std::async(std::launch::async, &std::inplace_merge<Iter, Pred>, bigChunkStart, bigChunkMiddle, bigChunkEnd, pred));
                                bigChunkStart = bigChunkEnd;
                            }
                            bigChunkMiddle += chunkSize;
                        }

                        chunkSize *= 2;
                        chunks = length / chunkSize;
                    }
                } // ensure tasks completion

                if (hasTail) {
                    std::inplace_merge(first, tailStart, last, pred);
                }
            }
            else if (length > 1) {
                std::sort(first, last, pred);
            }

            assert(std::is_sorted(first, last, pred));
        }
    }

This code still has range-way to improve CPU threads utilization, but it gave much more benefit of parallelization.
How much threads are used to sort with parallel executor?

performance wontfix

All 32 comments

@mnatsuhara

Simple custom implementation of merge-sort gave much better benefit:

Did you try std::stable_sort with execution::par?

Implementation of std::sort with parallel executor: https://github.com/microsoft/STL/blob/392fb6d857573b27c162ebd00d824f7171c4b3c7/stl/inc/execution#L2749

Function that determines the number of threads: https://github.com/microsoft/STL/blob/392fb6d857573b27c162ebd00d824f7171c4b3c7/stl/src/parallel_algorithms.cpp#L26

Can you provide a complete test case demonstrating the performance issue? (With something that generates the data, of course; we don't care about the data or even about what the predicate does, but we need to have an example of both of them to understand how they influence the implementation.)

Hi @StephanTLavavej,

That would be much extra effort.
Do you really need it?
I was hoping you can see the difference using own tests.

Thanks,
Serg

I could launch some tests on my local environment for gathering some statistics if it helps?

Can't speak for STL, but fixing a performance problem without having some representative test case along with compilation flags and stuff can be extremely difficult when there isn't something really obvious going on.

I agree with what @MikeGitb said. We ran performance tests when we first implemented these algorithms, to tune their implementations and to verify that they were worth parallelizing at all - if we had observed what you're observing, we would have addressed the issue. So yes, we do need a test case.

I could launch some tests on my local environment for gathering some statistics if it helps?

Could you run the code with your build options:

#include <iostream>
#include <algorithm>
#include <execution>

int main()
{
  std::cout << __std_parallel_algorithms_hw_threads() << '\n';
}

and could you run your code with std::stable_sort with execution::par instead of std::sort?

Ok, I can do that. I'll let you know.

std::deque of 1513002 272-byte records
__std_parallel_algorithms_hw_threads: 64
std::stable_sort
3131872300 ns
quick_merge_sort
726697900 ns
std::sort
3355967900 ns
std::stable_sort(std::execution::par, ...
598296600 ns
std::sort(std::execution::par, ...
277424400 ns

Wait, wrong data. That was half-minute. I'll find that sample. The measurement is for release configuration.

Here.

std::deque of 1417301 272-byte records
__std_parallel_algorithms_hw_threads: 64
std::stable_sort
2051602100 ns
quick_merge_sort
109611500 ns
std::sort
489360300 ns
std::stable_sort(std::execution::par, ...
508801200 ns
std::sort(std::execution::par, ...
107548100 ns

sort/sort-par = 489360300 / 107548100 = 4.55 which is a bit far from 64, but it should be near to 128.
Let me try to improve utilization to handling merging ranges.
I'll let you know.

Ok, I did that:

    auto TotalThreads = std::thread::hardware_concurrency();

    template<class Iter, class Pred>
    void quick_merge_sort(Iter first, Iter last, Pred pred)
    {
        auto length = last - first;
        auto chunkSize = length / TotalThreads;
        auto chunks = length / chunkSize;
        if (chunkSize > 1)
        {
            auto tail = length - (chunkSize * chunks);
            auto hasTail = tail > 0;
            auto tailStart = first + chunks * chunkSize;
            {
                std::list<
                    std::pair<
                        std::future<void>,
                        std::pair<
                            Iter, // task chunk start
                            Iter> // task chunk end
                        >
                    > tasks;

                for (auto thread = 0; thread < chunks; ++thread) {
                    auto chunkStart = first + thread * chunkSize;
                    auto chunkEnd = chunkStart + chunkSize;
                    tasks.emplace_back(
                        std::async(std::launch::async, &std::sort<Iter, Pred>, chunkStart, chunkEnd, pred),
                        std::make_pair(chunkStart, chunkEnd)
                        );
                }

                if (hasTail)
                    tasks.emplace_back(
                        std::async(std::launch::async, &std::sort<Iter, Pred>, tailStart, last, pred),
                        std::make_pair(tailStart, last)
                    );

                while (tasks.size()>1)
                {
                    bool ready = {};
                    auto it = tasks.begin(), prev = it;
                    auto e = tasks.end();
                    auto nonReadyYet = true;
                    while (it != e) {
                        auto thisTaskReady = it->first.valid()
                            && it->first.wait_for(std::chrono::nanoseconds()) == std::future_status::ready;
                        if (thisTaskReady && ready) {
                            prev->first = std::async(std::launch::async, &std::inplace_merge<Iter, Pred>, prev->second.first, prev->second.second, it->second.second, pred);
                            prev->second.second = it->second.second;
                            it = tasks.erase(it);
                            nonReadyYet = ready = {};
                        }
                        else {
                            prev = it++;
                            ready = thisTaskReady;
                        }
                    }
                    if (nonReadyYet && tasks.size() > 3) {
                        std::this_thread::yield();
                    }
                }
            }
        }
        else if (length > 1) {
            std::sort(first, last, pred);
        }

        assert(std::is_sorted(first, last, pred));
    }
}

std::deque of 1417301 272-byte records
__std_parallel_algorithms_hw_threads: 64
quick_merge_sort
76938400 ns
std::stable_sort
2033208700 ns
std::sort
485405800 ns
std::stable_sort(std::execution::par, ...
510451700 ns
std::sort(std::execution::par, ...
109773900 ns

Now large data...

std::deque of 33587200 272-byte records
__std_parallel_algorithms_hw_threads: 64
quick_merge_sort
55331423400 ns
std::stable_sort(std::execution::par, ...
145887763600 ns
std::sort(std::execution::par, ...
18138303100 ns

That is all. I hope this helps.

sort/sort-par = 489360300 / 107548100 = 4.55 which is a bit far from 64, but it should be near to 128

The current implementation is quicksort, and we effectively "fork" to recruse into each partition, so our maximum fanout is proportional to lg n (assuming we get good pivots). Sort is not an "embarrassingly parallel" problem; one should not expect a 64x or 128x speedup. Libraries that report huge speedups (e.g. Thrust) tend to report using integers, where they implement a radix sort, which is more parallelizable.

Your algorithm is to static partition the input, plain sort each partition, then bottom-up inplace_merge the partitions together. In that sense it is similar to how our stable_sort works, except our stable_sort always uses initial partitions of 16 elements we insertion sort, while your algorithm starts with partitions of N/T elements initially std::sorted.

I monospaced your results and lined up the least significant digits to make it easier to compare:

std::deque of 1417301 272-byte records
__std_parallel_algorithms_hw_threads: 64
quick_merge_sort
  76938400 ns
std::stable_sort
2033208700 ns
std::sort
 485405800 ns
std::stable_sort(std::execution::par, ...
 510451700 ns
std::sort(std::execution::par, ...
 109773900 ns

Did I misread this? It says that our serial std::sort beats the above parallel algorithm by ~2x and sort(par wins by around 8x. That would suggest that our current partition-and-fork strategy is better than the static-partition-and-bottom-up-merge strategy proposed.

std::deque of 33587200 272-byte records
__std_parallel_algorithms_hw_threads: 64
quick_merge_sort
 55331423400 ns
std::stable_sort(std::execution::par, ...
145887763600 ns
std::sort(std::execution::par, ...
 18138303100 ns

This still has sort(par winning by ~5x.

@BillyONeal : I think you made a mistake when counting the number of digits. The proposed quick merge is about 77ms (not 770) and parallel std::sort is about 109 ms:

quick_merge_sort       76938400 ns 
std::stable_sort     2033208700 ns 
std::sort.            485405800 ns
std::stable_sort(par) 510451700 ns 
std::sort(par)        109773900 ns

@MikeGitb Ah, yes. With the 1417301 records one I'm off by a digit so it's a bit faster instead of 1/8th as fast, but with the 33587200 records one I'm not.

The proposed alternate strategy is interesting in that it would let us get rid of the work stealing infrastructure as sort is the only algorithm that uses it.

I'm still not super convinced making the trade of additional data movement in exchange for better fanout, as this does, is a worthwhile one. Particularly because it will lose badly when you have ~4 cores instead of 60.

Thinking about this again I think a bigger problem with the proposed algorithm is that it makes the sort nondeterministic. That is, equivalent keys will end up in different orders if you sort them on machines with different numbers of cores. It might be worth doing that if we got a huge speedup over it, but IMO we don't have enough evidence presented here of it.

I suppose the decision here ultimately lands on the current maintainers, most probably @mnatsuhara

(I also note that the proposed algorithm could be faster if we applied the better constant factors of the way our parallel stable_sort works; say we might cut element movement in half, so it would be better if we productized it. So the comparison of our current fairly optimized implementation against this 'demonstrate another way' implementation might not be entirely fair)

@BillyONeal Thanks for the background and analysis!

Given that the benchmarks show a modest performance improvement from the alternate approach in one case and our current implementation still wins in the case with bigger data, I'm not inclined to make changes to the current approach. Any change carries some level of risk and overhead, and the bar for making changes for improved performance (rather than addressing a correctness issue) is high.

Thank you @ohhmm for bringing this up and building up the test cases!

Thank you Everyone,

The most important thing to me here in this context is that we have 64 threads working instead of 128 hardware available threads.

Thanks,
Serg

According to my understanding, the 64-thread limitation is due to the Windows threadpool that we're using - special Windows APIs are required to access groups of more than 64 threads at a time.

Note that this 64 limitation is also the limitation of all our components, including std::thread (which models CreateThread), and std::async (which models TP_WORK); these all only create threads in the process' default cpu group. This is an operating system limitation; targeting more than 64 hardware threads requires explicitly creating threads according to the hardware topology in the thread groups you desire with CreateRemoteThreadEx (and those threads are scheduled only in those respective groups). We want to use the system thread pool TP_WORK in the parallel algorithms because we value the thread pool's special deal with the kernel over the ability to target more than 64 hardware threads. Particularly for an algorithm like sort that typically only achieves a fanout of ~6.

Hi @StephanTLavavej & @BillyONeal

How to sort with this STL using par executioner with >64 cores?

Thanks,
Serg

Thanks @BillyONeal

I watched the video and I do agree that std::async definitely had to use OS thread pool for Windows implementations.

Thanks,
Serg

How to sort with this STL using par executioner with >64 cores?

There is no way to do that at present time. Even if there was a standard mechanism to ask for it, we don't have a sort algorithm which can effectively scale to that. Our current algorithm's fanout is limited by the number of partitions generated by quicksort. Your proposed algorithm generates more partitions but the running time becomes dominated by the merge steps for which a truly parallel solution is not known.

My key notes:

  • I used only 64 threads in quick_merge_sort too which is only half of hardware, @mnatsuhara
  • The merge steps are easily parallelable with GPU, @BillyONeal
  • std::async may be reimplemented to make use of all the 128 thread using the OS thread pool or DPC queue, @StephanTLavavej

Regards,
Serg

Unless I'm mistaken, the OS thread pool is not process group aware either, for backwards compatibility concerns.

GPU offloading is a bit too complex for the STL considering the vast range of hardware it must support.

@sylveon,
It is trivial to use 128 threads and OpenCL.

I used only 64 threads in quick_merge_sort too which is only half of hardware,

I think you missed the point that your quick_merge_sort was no faster than the implementation we are shipping. Sort is not an "embarrassingly parallel" problem and you should not expect perfect scaling with cores.

The merge steps are easily parallelable with GPU, @BillyONeal

We don't get to assume that the system has a programmable GPU and no implementation we presently target targets a GPU.

std::async may be reimplemented to make use of all the 128 thread using the OS thread pool or DPC queue, @StephanTLavavej

The OS thread pool is similarly limited to one CPU group (this is in fact why the parallel algorithms library is limited to 64 logical CPUs, because we use the OS thread pool).

@sylveon,
It is trivial to use 128 threads and OpenCL.

We don't target any implementations of OpenCL.

It is still trivial tasks.
It is trivial to use 128 hardware threads.
There's no good reason to ignore this fact.

It is still trivial tasks.

Parallel sort is not "trivial", and your implementation is not faster than ours despite using more threads.

It is trivial to use 128 hardware threads.

If someone supplies a parallel sort algorithm which can usably scale to 128 threads, we might have to cross that line. At this time no such algorithm presently exists.

There's no good reason to ignore this fact.

I believe "the algorithm you're asking for does not exist" is plenty reason to "ignore" this fact.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

ohhmm picture ohhmm  路  16Comments

StephanTLavavej picture StephanTLavavej  路  10Comments

StephanTLavavej picture StephanTLavavej  路  11Comments

AlexGuteniev picture AlexGuteniev  路  18Comments

Andersama picture Andersama  路  16Comments