Julia: median of an empty array is undefined

Created on 28 Nov 2016  ·  15 Comments  ·  Source: JuliaLang/julia

I see the discussion of treating median and mean identically with respect to NaN. Has it also been discussed whether they should treat empty arrays identically? Currently they do not:

julia> mean(Float64[])
NaN

julia> median(Float64[])
ERROR: ArgumentError: median of an empty array is undefined, Float64[]
 in median!(::Array{Float64,1}) at ./statistics.jl:565
 in median(::Array{Float64,1}) at ./statistics.jl:582

It would be a really simple change to make median return NaN: all that has to change is the first line.

All 15 comments

The difference here makes sense. The median is by definition either a data value (if your list is of odd length) or the arithmetic mean of the middle two values (if your list is of even length). This definition assumes that the length is either even or odd, so if you have a length of zero, there's no reasonable definition.
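
To make the definition above concrete, here is a minimal sketch that follows it literally. The name naive_median is hypothetical and this is not the implementation in Base; it only illustrates that the odd/even split leaves nothing sensible to return for a length of zero, so the sketch throws.

```julia
# Hypothetical helper that follows the textbook definition directly.
function naive_median(v::AbstractVector{<:Real})
    n = length(v)
    # Neither branch of the definition applies to an empty input.
    n == 0 && throw(ArgumentError("median of an empty array is undefined"))
    s = sort(v)               # sort a copy; the input is left untouched
    mid = div(n + 1, 2)       # index of the (lower) middle element
    # Odd length: the middle value. Even length: mean of the two middle values.
    return isodd(n) ? s[mid] : (s[mid] + s[mid + 1]) / 2
end

naive_median([3, 1, 2])       # 2
naive_median([4, 1, 3, 2])    # 2.5
```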

Mean on the other hand is the sum of the elements in the list divided by the number of elements. If there are 0 elements, regardless of how you define the sum, the division will produce NaN.
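
For floating point elements, the arithmetic described above can be spelled out step by step. This is only an illustration of the sum-divided-by-count view, not a claim about how mean is implemented internally.

```julia
v = Float64[]
total = sum(v)       # 0.0: the empty sum is the additive identity for Float64
n = length(v)        # 0
ratio = total / n    # 0.0 / 0 evaluates to NaN in IEEE 754 arithmetic
isnan(ratio)         # true
```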

Hello,

I think median should give NaN for empty sets, because median should match mean. My reasoning is that the median and mean are strongly analogous, including in the case of an empty set. Conceptually for me,

  • The mean is the centre of mass of a set.
  • The median is a kind of 'centre of ranking' of the set (the centre of mass when the set is considered as an ordered set of points without weights).

When the set is empty, there can be no way of preferring any particular location for the centre, so the answer is undefined (in both cases). For me, allowing a generalised definition for this boundary case, or restricting the function domain by returning an error, should be consistent.

NaN only exists for floating point numbers, so we would still need to raise an error for other return types. That would introduce another inconsistency; not sure which one is the least problematic.

We already have that inconsistency for mean:

julia> mean(Float64[])
NaN

julia> mean(Rational{Int64}[])
ERROR: DivideError: integer division error
 in //(::Rational{Int64}, ::Rational{Int64}) at ./rational.jl:33
 in mean(::Array{Rational{Int64},1}) at ./statistics.jl:28

Ref #5234

In my opinion this is a great use case for a Nullable (or a Result type).

Adding to the list of things that don't work:

julia> var([])
ERROR: MethodError: no method matching zero(::Type{Any})

@bjarthur Do you have a use case where it would be convenient to handle the NaNs instead of the exception?

@johnmyleswhite, var([]) gives a MethodError for the same reason that sum([]) does. If you do var(Float64[]), it gives NaN.
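
A small sketch of why the element type matters here: an untyped empty literal has element type Any, and there is no zero value for Any to start a reduction from. The variable name zero_of_any is just for illustration.

```julia
eltype([])           # Any: an untyped empty literal carries no useful element type
eltype(Float64[])    # Float64

zero(Float64)        # 0.0, so a reduction over Float64[] has a starting value

# Asking for the zero of Any fails; capture the exception instead of aborting.
zero_of_any = try
    zero(Any)
catch err
    err
end
zero_of_any isa MethodError   # true
```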

I don't like the idea of using Nullable here; there is a big difference between a number not being available (Nullable) versus the answer being undefined (an exception or NaN), and we don't use Nullable for this purpose anywhere else (e.g. sqrt(-1)). Nullable would make this function much harder to use. Throwing an exception rather than returning NaN makes sense to me, however.

I agree with those issues, but that still leaves a choice between a result type (which I believe several folks are seriously considering building into the language) and throwing an exception. The issue is IMO a design decision about whether you want control flow to happen automatically or whether you want people to get a wrapper object and then decide whether to throw based on the presence or absence of a valid result in the wrapper.

Using a Nullable/Result type here would be a major PITA because it would force you to call get on the result, when in (I suspect) 99.99% of real applications the array will never be empty. That's why an exception seems more appropriate.

I stumbled across this difference between mean and median when trying to make an analysis pipeline more robust to outliers by exchanging the former for the latter. It was easy enough to just add some logic to check for an empty vector and then act accordingly.

Were a change made, I'd prefer having mean throw an exception. Being told that you're trying to do something strange (i.e. taking the mean of an empty vector) would help make the code more accurate.
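
The two options mentioned in this comment can be sketched as user-level wrappers. Both names (median_or_nan and strict_mean) are hypothetical and not part of the Statistics standard library; the first assumes floating point elements so that NaN is representable.

```julia
using Statistics

# Hypothetical helper: mirror mean's behaviour by returning NaN for empty input.
median_or_nan(v::AbstractVector{<:AbstractFloat}) =
    isempty(v) ? eltype(v)(NaN) : median(v)

# Hypothetical helper: the opposite choice, a mean that refuses empty input.
function strict_mean(v::AbstractVector{<:Real})
    isempty(v) && throw(ArgumentError("mean of an empty array is undefined"))
    return mean(v)
end

median_or_nan(Float64[])          # NaN
median_or_nan([1.0, 3.0, 2.0])    # 2.0
strict_mean([1, 2, 3])            # 2.0
```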

Throwing an error from mean, var, etc. seems reasonable to me.

A situation where throwing an error is really annoying is when computing summary statistics over groups in the presence of missing values. See this Discourse post for an illustration. That can also happen if there are empty groups.

Now that we have got rid of Nullable, we could imagine returning missing for empty inputs. That would be strictly more convenient than throwing an error, since at worst you'd get an error later. But to be consistent with mean we should return NaN instead. Also I'm not sure it's a good use case for missing in terms of semantics (nothing is unknown in this case, it's just undefined); but FWIW that's what R does when asked to skip missing values and all entries are missing.
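
As a rough illustration of "at worst you'd get an error later": missing propagates through arithmetic and comparisons, and only fails once it reaches a place that needs a definite Bool. The variable names below are just for the example.

```julia
shifted = missing + 1.0    # missing: arithmetic propagates it silently
compared = missing > 0     # missing: comparisons propagate it as well

# Using the comparison as a condition is where the deferred error shows up.
later_error = try
    missing > 0 ? "positive" : "not positive"
catch err
    err
end
later_error isa TypeError  # true: a non-Bool was used in a Boolean context
```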

We could also simply add an argument to specify the return value you want if the input is empty, just like for reductions. Anyway I think we should do something about this as it's a painful difference for users coming from other languages (R, Pandas...), which don't throw an error.
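
A sketch of what such an argument might look like, by analogy with the init keyword that reduce already accepts. The function median_or and its default keyword are hypothetical; Statistics.median itself has no such keyword.

```julia
using Statistics

# Reductions already let the caller supply the value for the empty case.
empty_sum = reduce(+, Int[]; init = 0)    # 0

# Hypothetical analogue for median: the caller chooses the empty-input result.
median_or(v; default) = isempty(v) ? default : median(v)

a = median_or(Float64[]; default = NaN)       # NaN, consistent with mean
b = median_or(Int[]; default = missing)       # missing, closer to what R does
c = median_or([1, 5, 3]; default = missing)   # 3.0: non-empty input ignores default
```

One appeal of this design is that the existing error can stay as the behaviour when no default is given, so callers who never expect empty input are not affected.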

This would be a breaking change and would have to wait until 2.0, no?

No, AFAIK turning an error into something else is allowed.
