Hi!
quantileExact currently has 1 specific methodology of calculation, that is the one of MS Excel's PERCENTILE.INC. Is there a way to calculate this differently?
Specyfically, I am looking for recreating MS Excel's PERCENTILE.EXC function.
Many thanks!
Maciek
I've modified quantileExact function to be consistent with R-6 from here https://en.wikipedia.org/wiki/Quantile. Could you review and add to next release?
#pragma once
#include <Common/PODArray.h>
#include <Common/NaNUtils.h>
#include <Core/Types.h>
#include <IO/WriteBuffer.h>
#include <IO/ReadBuffer.h>
#include <IO/VarInt.h>
namespace DB
{
namespace ErrorCodes
{
extern const int NOT_IMPLEMENTED;
}
/** Calculates quantile by collecting all values into array
* and applying n-th element (introselect) algorithm for the resulting array.
*
* It uses O(N) memory and it is very inefficient in case of high amount of identical values.
* But it is very CPU efficient for not large datasets.
*/
template <typename Value>
struct QuantileEXC
{
/// The memory will be allocated to several elements at once, so that the state occupies 64 bytes.
static constexpr size_t bytes_in_arena = 64 - sizeof(PODArray<Value>);
using Array = PODArrayWithStackMemory<Value, bytes_in_arena>;
Array array;
void add(const Value & x)
{
/// We must skip NaNs as they are not compatible with comparison sorting.
if (!isNaN(x))
array.push_back(x);
}
template <typename Weight>
void add(const Value &, const Weight &)
{
throw Exception("Method add with weight is not implemented for QuantileEXC", ErrorCodes::NOT_IMPLEMENTED);
}
void merge(const QuantileEXC & rhs)
{
array.insert(rhs.array.begin(), rhs.array.end());
}
void serialize(WriteBuffer & buf) const
{
size_t size = array.size();
writeVarUInt(size, buf);
buf.write(reinterpret_cast<const char *>(array.data()), size * sizeof(array[0]));
}
void deserialize(ReadBuffer & buf)
{
size_t size = 0;
readVarUInt(size, buf);
array.resize(size);
buf.read(reinterpret_cast<char *>(array.data()), size * sizeof(array[0]));
}
/// Get the value of the `level` quantile. The level must be between 0 and 1.
Value get(Float64 level)
{
if (!array.empty())
{
size_t n = level < 1
? level * (array.size() + 1)
: level / 100 * (array.size() + 1);
Float64 dec = n - floor(n);
n = floor(n);
std::nth_element(array.begin(), array.begin() + n, array.end());
std::nth_element(array.begin() + n + 1, array.begin() + n + 1, array.end());
return array[n] + dec * ( array[n + 1] - array[n]);
}
return std::numeric_limits<Value>::quiet_NaN();
}
/// The same, but in the case of an empty state, NaN is returned.
Float64 getFloat(Float64) const
{
throw Exception("Method getFloat is not implemented for QuantileEXC", ErrorCodes::NOT_IMPLEMENTED);
}
void getManyFloat(const Float64 *, const size_t *, size_t, Float64 *) const
{
throw Exception("Method getManyFloat is not implemented for QuantileEXC", ErrorCodes::NOT_IMPLEMENTED);
}
};
}
Hi,
Sorry for the late response but I’m on vacation right now. It was a part of the equation to calculate those percentiles or simply I made a mistake but I don’t suppose.
I believe this could give you an idea what I meant:
http://www.utdallas.edu/~serfling/3332/ComputingPercentiles.pdf
Regards,
Luke
From: dimarub2000 [mailto:[email protected]]
Sent: Tuesday, August 13, 2019 1:05 PM
To: yandex/ClickHouse
Cc: LukeCzmoch; Mention
Subject: Re: [yandex/ClickHouse] Quantile - different methodology - MS Excel PERCENTILE.EXC (#5885)
@LukeCzmoch https://github.com/LukeCzmoch : level / 100 * (array.size() + 1); why is this needed?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/yandex/ClickHouse/issues/5885?email_source=notifications&email_token=AMTXYHOUWB54RO5RJWOU7A3QEKIO7A5CNFSM4H6J7BO2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4FKDJQ#issuecomment-520790438 , or mute the thread https://github.com/notifications/unsubscribe-auth/AMTXYHPMPST4ZVI4RXUNWX3QEKIO7ANCNFSM4H6J7BOQ . https://github.com/notifications/beacon/AMTXYHPA5BV246STH24PTWTQEKIO7A5CNFSM4H6J7BO2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4FKDJQ.gif
Ta wiadomość została sprawdzona na obecność wirusów przez oprogramowanie antywirusowe Avast.
https://www.avast.com/antivirus
Most helpful comment
I've modified quantileExact function to be consistent with R-6 from here https://en.wikipedia.org/wiki/Quantile. Could you review and add to next release?