Julia: count(f, x) should not be equivalent to sum(f, x)

Created on 2 Feb 2017 · 22Comments · Source: JuliaLang/julia

I took a look at the count implementation for #20403 (cc @cossio), and was surprised to discover that the current definition is nearly identical to sum (albeit with a type instability):

function count(pred, itr)
    n = 0
    for x in itr
        n += pred(x)
    end
    return n
end

For example, count(sqrt, [1,2,3,5]) returns 6.382332347441762.

We could disallow this case entirely, but one possibility would be to make count(f, x) equivalent to countnz for non-boolean data. Indeed, I'm not sure why we have both countnz and count — couldn't we merge the two functions into a single count function?

collections

Source

stevengj

👍4

Most helpful comment

Because it's unclear. I think being more explicit and saying count(!iszero, x) or find(!iszero, x) makes the intention completely obvious. IMO the one-argument versions should stick to boolean data, where there's no room for confusion, and the two-argument version gives you the ability to supply any kind of data with an appropriate predicate. I think we should be striving for clarity here rather than terseness.

ararslan on 2 Feb 2017

❤2

All 22 comments

i.e. define

iszero(b::Bool) = !b # maybe more efficient than the fallback b == zero(b) definition
function count{F}(pred::F, itr)
    n = 0
    for x in itr
        n += !iszero(pred(x))
    end
    return n
end
count(itr) = count(identity, itr)
@deprecate countnz(itr) count(itr)

stevengj on 2 Feb 2017

👎1 👍1

+1, that would be more consistent with find.

nalimilan on 2 Feb 2017

I think I'd prefer something like

function count(f, itr)
    n = 0
    for x in itr
        if x
            n += 1
        end
    end
    return n
end
count(itr) = count(identity, itr)
@deprecate countnz(itr) count(!iszero, itr)

This has the benefit of throwing an error for non-boolean data when using count directly.

ararslan on 2 Feb 2017

@ararslan, that might actually be slower because it forces a branch. One could do n += f(x)::Bool, I suppose.

However, I'd prefer to have one function that does more rather than restrict the functionality to Bool here… I'm not sure that throwing an error for non-boolean data is a feature. And, as @nalimilan says, this behavior is consistent with the find functions.

stevengj on 2 Feb 2017

I'm not sure that throwing an error for non-boolean data is a feature

I disagree because without a predicate for non-boolean data, the name count doesn't mean much. What is it counting? I have similar issues with find, and IIRC there's a proposal somewhere to improve the consistency and clarity of find and related functions.

Having f(x)::Bool seems fine to me, especially if a branch is costly, though I'd be curious to see whether the branch makes an appreciable difference in benchmarks.

ararslan on 2 Feb 2017

❤1

@ararslan, your version with the branch (corrected to call if f(x)) is more than 2x slower than Base.count for @btime count(iseven, 1:10^6) evals=1.

stevengj on 2 Feb 2017

Dang. f(x)::Bool it is, then, in my proposal.

ararslan on 2 Feb 2017

I disagree because without a predicate for non-boolean data, the name count doesn't mean much. What is it counting? I have similar issues with find, and IIRC there's a proposal somewhere to improve the consistency and clarity of find and related functions.

That's https://github.com/JuliaLang/Juleps/blob/master/Find.md, but it doesn't really address the question of whether find should return non-zero entries or true entries. I agree the former behavior can be surprising at first, but it's also more general and potentially useful, so why not keep it?

nalimilan on 2 Feb 2017

ararslan on 2 Feb 2017

❤2

Now that we can do !iszero, I'd tend to agree with you.

nalimilan on 2 Feb 2017

Probably should continue to have a countnz function if the alternative is to require everyone to do count(!iszero, x). e.g. we want specialized versions for sparse arrays, and I don't know if that is possible with !iszero, because the ! will generate a specialized anonymous function for each call and hence defining a specialized count(::typeof(!iszero), x) method won't work. Of course, we could define a specialized !(::typeof(iszero)) = notiszero too, with notiszero(x) = !iszero(x), I guess, but that seems to be getting messy (and is harder for external packages to overload).

stevengj on 2 Feb 2017

For sparse there's nnz, which should arguably be renamed, though that's somewhat tangential. I did a bit of playing around and it seems that you actually _can_ specialize on typeof(!iszero), which is super cool. So count(::typeof(!iszero), x) is a valid overloadable method.

ararslan on 2 Feb 2017

There's an important difference between nnz and countnz though, the former is a structural property whereas the latter tests for and does not include stored zeros. I can't remember if there's a good reason countnz and count are separate though.

tkelman on 2 Feb 2017

I'd imagine most uses of count would be for the boolean case, in which case sum works just fine. If someone wants the number of non-zero values, why not just sum a generator: sum(!iszero(x) for x in X)?

bramtayl on 3 Feb 2017

👎1

@ararslan Why do you think having count even if it does the same thing as sum but only for boolean is a good idea? For safety?

nalimilan on 3 Feb 2017

👍1

@nalimilan Good point. I think count only makes sense if it does something different. (And the difference cannot be that count throws an error with non-booleans while sum doesn't).

cossio on 3 Feb 2017

👎1

@nalimilan I think it's a good idea because it provides both clarity and safety in terms of knowing exactly what you're getting. I think counting nonzeros is a rather odd and surprising default. I assume the reason that we don't currently have a one-argument count method is that it's equivalent to sum in that case, but should we make Bool not a subtype of Number (see #19168), it makes more sense to count non-numbers than to sum them, even if we do still define arithmetic on Bools.

ararslan on 3 Feb 2017

To solve the speed point, just add @simd in front of the loop:

function fast_count(pred, itr)
    n = 0
    @simd for x in itr
        n += pred(x)::Bool
    end
    return n
end

a = randn(1000000)
@benchmark fast_count(x->x>1.96, a) # median time:      535.944 μs (0.00% GC)
@benchmark sum(x->x>1.96, a)        # median time:      565.832 μs (0.00% GC)
@benchmark count(x->x>1.96, a)      # median time:        1.370 ms (0.00% GC)