Clickhouse: Quantile - different methodology - MS Excel PERCENTILE.EXC

Created on 5 Jul 2019  Â·  2Comments  Â·  Source: ClickHouse/ClickHouse

Hi!

quantileExact currently has 1 specific methodology of calculation, that is the one of MS Excel's PERCENTILE.INC. Is there a way to calculate this differently?

Specyfically, I am looking for recreating MS Excel's PERCENTILE.EXC function.

Many thanks!
Maciek

question

Most helpful comment

I've modified quantileExact function to be consistent with R-6 from here https://en.wikipedia.org/wiki/Quantile. Could you review and add to next release?

#pragma once

#include <Common/PODArray.h>
#include <Common/NaNUtils.h>
#include <Core/Types.h>
#include <IO/WriteBuffer.h>
#include <IO/ReadBuffer.h>
#include <IO/VarInt.h>


namespace DB
{

namespace ErrorCodes
{
    extern const int NOT_IMPLEMENTED;
}

/** Calculates quantile by collecting all values into array
  *  and applying n-th element (introselect) algorithm for the resulting array.
  *
  * It uses O(N) memory and it is very inefficient in case of high amount of identical values.
  * But it is very CPU efficient for not large datasets.
  */
template <typename Value>
struct QuantileEXC
{
    /// The memory will be allocated to several elements at once, so that the state occupies 64 bytes.
    static constexpr size_t bytes_in_arena = 64 - sizeof(PODArray<Value>);
    using Array = PODArrayWithStackMemory<Value, bytes_in_arena>;
    Array array;

    void add(const Value & x)
    {
        /// We must skip NaNs as they are not compatible with comparison sorting.
        if (!isNaN(x))
            array.push_back(x);
    }

    template <typename Weight>
    void add(const Value &, const Weight &)
    {
        throw Exception("Method add with weight is not implemented for QuantileEXC", ErrorCodes::NOT_IMPLEMENTED);
    }

    void merge(const QuantileEXC & rhs)
    {
        array.insert(rhs.array.begin(), rhs.array.end());
    }

    void serialize(WriteBuffer & buf) const
    {
        size_t size = array.size();
        writeVarUInt(size, buf);
        buf.write(reinterpret_cast<const char *>(array.data()), size * sizeof(array[0]));
    }

    void deserialize(ReadBuffer & buf)
    {
        size_t size = 0;
        readVarUInt(size, buf);
        array.resize(size);
        buf.read(reinterpret_cast<char *>(array.data()), size * sizeof(array[0]));
    }

    /// Get the value of the `level` quantile. The level must be between 0 and 1.
    Value get(Float64 level)
    {
        if (!array.empty())
        {
            size_t n = level < 1
                ? level * (array.size() + 1)
                : level / 100 * (array.size() + 1);

                Float64 dec = n - floor(n);
                n = floor(n);

                std::nth_element(array.begin(), array.begin() + n, array.end());
                std::nth_element(array.begin() + n + 1, array.begin() + n + 1, array.end());

            return array[n] + dec * ( array[n + 1] - array[n]);
        }

        return std::numeric_limits<Value>::quiet_NaN();
    }


    /// The same, but in the case of an empty state, NaN is returned.
    Float64 getFloat(Float64) const
    {
        throw Exception("Method getFloat is not implemented for QuantileEXC", ErrorCodes::NOT_IMPLEMENTED);
    }

    void getManyFloat(const Float64 *, const size_t *, size_t, Float64 *) const
    {
        throw Exception("Method getManyFloat is not implemented for QuantileEXC", ErrorCodes::NOT_IMPLEMENTED);
    }
};

}

All 2 comments

I've modified quantileExact function to be consistent with R-6 from here https://en.wikipedia.org/wiki/Quantile. Could you review and add to next release?

#pragma once

#include <Common/PODArray.h>
#include <Common/NaNUtils.h>
#include <Core/Types.h>
#include <IO/WriteBuffer.h>
#include <IO/ReadBuffer.h>
#include <IO/VarInt.h>


namespace DB
{

namespace ErrorCodes
{
    extern const int NOT_IMPLEMENTED;
}

/** Calculates quantile by collecting all values into array
  *  and applying n-th element (introselect) algorithm for the resulting array.
  *
  * It uses O(N) memory and it is very inefficient in case of high amount of identical values.
  * But it is very CPU efficient for not large datasets.
  */
template <typename Value>
struct QuantileEXC
{
    /// The memory will be allocated to several elements at once, so that the state occupies 64 bytes.
    static constexpr size_t bytes_in_arena = 64 - sizeof(PODArray<Value>);
    using Array = PODArrayWithStackMemory<Value, bytes_in_arena>;
    Array array;

    void add(const Value & x)
    {
        /// We must skip NaNs as they are not compatible with comparison sorting.
        if (!isNaN(x))
            array.push_back(x);
    }

    template <typename Weight>
    void add(const Value &, const Weight &)
    {
        throw Exception("Method add with weight is not implemented for QuantileEXC", ErrorCodes::NOT_IMPLEMENTED);
    }

    void merge(const QuantileEXC & rhs)
    {
        array.insert(rhs.array.begin(), rhs.array.end());
    }

    void serialize(WriteBuffer & buf) const
    {
        size_t size = array.size();
        writeVarUInt(size, buf);
        buf.write(reinterpret_cast<const char *>(array.data()), size * sizeof(array[0]));
    }

    void deserialize(ReadBuffer & buf)
    {
        size_t size = 0;
        readVarUInt(size, buf);
        array.resize(size);
        buf.read(reinterpret_cast<char *>(array.data()), size * sizeof(array[0]));
    }

    /// Get the value of the `level` quantile. The level must be between 0 and 1.
    Value get(Float64 level)
    {
        if (!array.empty())
        {
            size_t n = level < 1
                ? level * (array.size() + 1)
                : level / 100 * (array.size() + 1);

                Float64 dec = n - floor(n);
                n = floor(n);

                std::nth_element(array.begin(), array.begin() + n, array.end());
                std::nth_element(array.begin() + n + 1, array.begin() + n + 1, array.end());

            return array[n] + dec * ( array[n + 1] - array[n]);
        }

        return std::numeric_limits<Value>::quiet_NaN();
    }


    /// The same, but in the case of an empty state, NaN is returned.
    Float64 getFloat(Float64) const
    {
        throw Exception("Method getFloat is not implemented for QuantileEXC", ErrorCodes::NOT_IMPLEMENTED);
    }

    void getManyFloat(const Float64 *, const size_t *, size_t, Float64 *) const
    {
        throw Exception("Method getManyFloat is not implemented for QuantileEXC", ErrorCodes::NOT_IMPLEMENTED);
    }
};

}

Hi,

Sorry for the late response but I’m on vacation right now. It was a part of the equation to calculate those percentiles or simply I made a mistake but I don’t suppose.

I believe this could give you an idea what I meant:

http://www.utdallas.edu/~serfling/3332/ComputingPercentiles.pdf

Regards,

Luke

From: dimarub2000 [mailto:[email protected]]
Sent: Tuesday, August 13, 2019 1:05 PM
To: yandex/ClickHouse
Cc: LukeCzmoch; Mention
Subject: Re: [yandex/ClickHouse] Quantile - different methodology - MS Excel PERCENTILE.EXC (#5885)

@LukeCzmoch https://github.com/LukeCzmoch : level / 100 * (array.size() + 1); why is this needed?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/yandex/ClickHouse/issues/5885?email_source=notifications&email_token=AMTXYHOUWB54RO5RJWOU7A3QEKIO7A5CNFSM4H6J7BO2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4FKDJQ#issuecomment-520790438 , or mute the thread https://github.com/notifications/unsubscribe-auth/AMTXYHPMPST4ZVI4RXUNWX3QEKIO7ANCNFSM4H6J7BOQ . https://github.com/notifications/beacon/AMTXYHPA5BV246STH24PTWTQEKIO7A5CNFSM4H6J7BO2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4FKDJQ.gif


Ta wiadomość została sprawdzona na obecność wirusów przez oprogramowanie antywirusowe Avast.
https://www.avast.com/antivirus

Was this page helpful?
0 / 5 - 0 ratings