i see the discussion of treating `median` and `mean` identically re. NaN. has it also been discussed whether they should treat empty arrays identically? currently they do not:
julia> mean(Float64[])
NaN
julia> median(Float64[])
ERROR: ArgumentError: median of an empty array is undefined, Float64[]
in median!(::Array{Float64,1}) at ./statistics.jl:565
in median(::Array{Float64,1}) at ./statistics.jl:582
really simple change to make `median` return NaN. all that has to change is the first line.
The difference here makes sense. The median is by definition either a data value (if your list is of odd length) or the arithmetic mean of the middle two values (if your list is of even length). This definition assumes that the length is either even or odd, so if you have a length of zero, there's no reasonable definition.
Mean, on the other hand, is the sum of the elements in the list divided by the number of elements. If there are 0 elements, regardless of how you define the sum, the division will produce `NaN`.
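To make the two definitions above concrete, here is a minimal sketch that follows them literally. The names `naive_mean` and `naive_median` are hypothetical helpers written for illustration; they are not the implementations in Base or Statistics.

```julia
# Hypothetical helpers that follow the textbook definitions literally.
naive_mean(v) = sum(v) / length(v)

function naive_median(v)
    n = length(v)
    # The definition only covers odd or even positive lengths.
    n == 0 && throw(ArgumentError("median of an empty array is undefined"))
    s = sort(v)
    mid = div(n + 1, 2)
    # Odd length: the middle value. Even length: mean of the two middle values.
    return isodd(n) ? s[mid] : (s[mid] + s[mid + 1]) / 2
end

naive_mean(Float64[])               # 0.0 / 0, which is NaN in floating point
naive_median([3.0, 1.0, 2.0])       # middle value of the sorted data
naive_median([4.0, 1.0, 3.0, 2.0])  # mean of the two middle values
```

For `Float64` input the empty sum is `0.0`, so the mean falls out as `0.0 / 0`, while the median has no branch that applies to a length of zero.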
Hello,
I think `median` should give `NaN` for empty sets, because `median` should match `mean`. My reasoning is that the median and mean are strongly analogous, including in the case of an empty set. Conceptually for me,
When the set is empty, there can be no way of preferring any particular location for the centre, so the answer is undefined (in both cases). For me, allowing a generalised definition for this boundary case, or restricting the function domain by returning an error, should be consistent.
`NaN` only exists for floating point numbers, so we would still need to raise an error for other return types. That would introduce another inconsistency; not sure which one is the least problematic.
We already have that inconsistency for `mean`:
julia> mean(Float64[])
NaN
julia> mean(Rational{Int64}[])
ERROR: DivideError: integer division error
in //(::Rational{Int64}, ::Rational{Int64}) at ./rational.jl:33
in mean(::Array{Rational{Int64},1}) at ./statistics.jl:28
Ref #5234
In my opinion this is a great use case for a Nullable (or a Result type).
Adding to the list of things that don't work:
julia> var([])
ERROR: MethodError: no method matching zero(::Type{Any})
@bjarthur Do you have a use case where it would be convenient to handle the `NaN`s instead of the exception?
@johnmyleswhite, `var([])` gives a MethodError for the same reason that `sum([])` does. If you do `var(Float64[])`, it gives `NaN`.
I don't like the idea of using Nullable here; there is a big difference between a number not being available (Nullable) versus the answer being undefined (an exception or NaN), and we don't use Nullable for this purpose anywhere else (e.g. `sqrt(-1)`). Nullable would make this function much harder to use. Throwing an exception rather than returning NaN makes sense to me, however.
I agree with those issues, but that still leaves a choice between a result type (which I believe several folks have seriously considered building into the language) and throwing an exception. The issue is IMO a design decision about whether you want control flow to happen automatically or whether you want people to get a wrapper object and then decide whether to throw based on the presence or absence of a valid result in the wrapper.
Using a Nullable/Result type here would be a major PITA because it would force you to call `get` on the result, when in (I suspect) 99.99% of real applications the array will never be empty. That's why an exception seems more appropriate.
i stumbled across this difference between `mean` and `median` when trying to make an analysis pipeline more robust to outliers by exchanging the former for the latter. it was easy enough to just add some logic to check for an empty vector and then act accordingly.
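A caller-side check of the kind described might look like the sketch below. The helper name `median_or_nan` is hypothetical, and returning `NaN` for the empty case is only one possible way to "act accordingly"; it was chosen here because it mirrors what `mean(Float64[])` returns.

```julia
using Statistics

# Hypothetical helper: NaN for empty input, otherwise the ordinary median.
median_or_nan(v::AbstractVector{<:Real}) = isempty(v) ? NaN : median(v)

median_or_nan(Float64[])           # NaN instead of an ArgumentError
median_or_nan([1.0, 5.0, 100.0])   # the outlier does not move the result
```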
were a change made, i'd prefer having `mean` throw an exception. being told that you're trying to do something strange (i.e. taking the mean of an empty vector) would help make the code more accurate.
Throwing an error from `mean`, `var`, etc. seems reasonable to me.
A situation where throwing an error is really annoying is when computing summary statistics over groups in the presence of missing values. See this Discourse post for an illustration. That can also happen if there are empty groups.
Now that we have got rid of `Nullable`, we could imagine returning `missing` for empty inputs. That would be strictly more convenient than throwing an error, since at worst you'd get an error later. But to be consistent with `mean` we should return `NaN` instead. Also I'm not sure it's a good use case for `missing` in terms of semantics (nothing is unknown in this case, it's just undefined); but FWIW that's what R does when asked to skip missing values and all entries are missing.
We could also simply add an argument to specify the return value you want if the input is empty, just like for reductions. Anyway I think we should do something about this as it's a painful difference for users coming from other languages (R, Pandas...), which don't throw an error.
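As a rough sketch of that last idea, assuming a keyword named `default` (a made-up name, not an existing argument of `median`), a wrapper could behave like the `init` keyword of `reduce`. The function `median_or` and the example groups are hypothetical and only illustrate the grouped, all-missing case mentioned earlier.

```julia
using Statistics

# For comparison: reductions already accept a value to use for empty input.
reduce(max, Float64[]; init=-Inf)

# Hypothetical wrapper, not part of Statistics: `default` is returned
# when nothing is left to take the median of.
function median_or(itr; default)
    v = collect(itr)
    return isempty(v) ? default : median!(v)
end

groups = Dict(
    "a" => Union{Float64,Missing}[1.0, missing, 3.0],
    "b" => Union{Float64,Missing}[missing, missing],  # empty once missings are skipped
)

# One summary value per group; the caller decides what "empty" maps to.
medians = Dict(k => median_or(skipmissing(v); default=missing) for (k, v) in groups)
```

Passing `default=NaN` instead would give the behaviour that matches `mean`, so the same wrapper covers both of the options discussed here.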
this would be a breaking change and have to wait until 2.0, no?
No, AFAIK turning an error into something else is allowed.