Julia: count(f, x) should not be equivalent to sum(f, x)

Created on 2 Feb 2017  Â·  22Comments  Â·  Source: JuliaLang/julia

I took a look at the count implementation for #20403 (cc @cossio), and was surprised to discover that the current definition is nearly identical to sum (albeit with a type instability):

function count(pred, itr)
    n = 0
    for x in itr
        n += pred(x)
    end
    return n
end

For example, count(sqrt, [1,2,3,5]) returns 6.382332347441762.

We could disallow this case entirely, but one possibility would be to make count(f, x) equivalent to countnz for non-boolean data. Indeed, I'm not sure why we have both countnz and count — couldn't we merge the two functions into a single count function?

collections

Most helpful comment

Because it's unclear. I think being more explicit and saying count(!iszero, x) or find(!iszero, x) makes the intention completely obvious. IMO the one-argument versions should stick to boolean data, where there's no room for confusion, and the two-argument version gives you the ability to supply any kind of data with an appropriate predicate. I think we should be striving for clarity here rather than terseness.

All 22 comments

i.e. define

iszero(b::Bool) = !b # maybe more efficient than the fallback b == zero(b) definition
function count{F}(pred::F, itr)
    n = 0
    for x in itr
        n += !iszero(pred(x))
    end
    return n
end
count(itr) = count(identity, itr)
@deprecate countnz(itr) count(itr)

+1, that would be more consistent with find.

I think I'd prefer something like

function count(f, itr)
    n = 0
    for x in itr
        if x
            n += 1
        end
    end
    return n
end
count(itr) = count(identity, itr)
@deprecate countnz(itr) count(!iszero, itr)

This has the benefit of throwing an error for non-boolean data when using count directly.

@ararslan, that might actually be slower because it forces a branch. One could do n += f(x)::Bool, I suppose.

However, I'd prefer to have one function that does more rather than restrict the functionality to Bool here… I'm not sure that throwing an error for non-boolean data is a feature. And, as @nalimilan says, this behavior is consistent with the find functions.

I'm not sure that throwing an error for non-boolean data is a feature

I disagree because without a predicate for non-boolean data, the name count doesn't mean much. What is it counting? I have similar issues with find, and IIRC there's a proposal somewhere to improve the consistency and clarity of find and related functions.

Having f(x)::Bool seems fine to me, especially if a branch is costly, though I'd be curious to see whether the branch makes an appreciable difference in benchmarks.

@ararslan, your version with the branch (corrected to call if f(x)) is more than 2x slower than Base.count for @btime count(iseven, 1:10^6) evals=1.

Dang. f(x)::Bool it is, then, in my proposal.

I disagree because without a predicate for non-boolean data, the name count doesn't mean much. What is it counting? I have similar issues with find, and IIRC there's a proposal somewhere to improve the consistency and clarity of find and related functions.

That's https://github.com/JuliaLang/Juleps/blob/master/Find.md, but it doesn't really address the question of whether find should return non-zero entries or true entries. I agree the former behavior can be surprising at first, but it's also more general and potentially useful, so why not keep it?

Because it's unclear. I think being more explicit and saying count(!iszero, x) or find(!iszero, x) makes the intention completely obvious. IMO the one-argument versions should stick to boolean data, where there's no room for confusion, and the two-argument version gives you the ability to supply any kind of data with an appropriate predicate. I think we should be striving for clarity here rather than terseness.

Now that we can do !iszero, I'd tend to agree with you.

Probably should continue to have a countnz function if the alternative is to require everyone to do count(!iszero, x). e.g. we want specialized versions for sparse arrays, and I don't know if that is possible with !iszero, because the ! will generate a specialized anonymous function for each call and hence defining a specialized count(::typeof(!iszero), x) method won't work. Of course, we could define a specialized !(::typeof(iszero)) = notiszero too, with notiszero(x) = !iszero(x), I guess, but that seems to be getting messy (and is harder for external packages to overload).

For sparse there's nnz, which should arguably be renamed, though that's somewhat tangential. I did a bit of playing around and it seems that you actually _can_ specialize on typeof(!iszero), which is super cool. So count(::typeof(!iszero), x) is a valid overloadable method.

There's an important difference between nnz and countnz though, the former is a structural property whereas the latter tests for and does not include stored zeros. I can't remember if there's a good reason countnz and count are separate though.

I'd imagine most uses of count would be for the boolean case, in which case sum works just fine. If someone wants the number of non-zero values, why not just sum a generator: sum(!iszero(x) for x in X)?

@ararslan Why do you think having count even if it does the same thing as sum but only for boolean is a good idea? For safety?

@nalimilan Good point. I think count only makes sense if it does something different. (And the difference cannot be that count throws an error with non-booleans while sum doesn't).

@nalimilan I think it's a good idea because it provides both clarity and safety in terms of knowing exactly what you're getting. I think counting nonzeros is a rather odd and surprising default. I assume the reason that we don't currently have a one-argument count method is that it's equivalent to sum in that case, but should we make Bool not a subtype of Number (see #19168), it makes more sense to count non-numbers than to sum them, even if we do still define arithmetic on Bools.

To solve the speed point, just add @simd in front of the loop:

function fast_count(pred, itr)
    n = 0
    @simd for x in itr
        n += pred(x)::Bool
    end
    return n
end
a = randn(1000000)
@benchmark fast_count(x->x>1.96, a) # median time:      535.944 μs (0.00% GC)
@benchmark sum(x->x>1.96, a)        # median time:      565.832 μs (0.00% GC)
@benchmark count(x->x>1.96, a)      # median time:        1.370 ms (0.00% GC)

I wouldn't have expected @simd to make a difference here. I thought it mainly allowed floating point optimizations?

There are many SIMD instructions for integers.

Yes, but I would expect them to be enabled by default. Do they change the behavior of the code?

Perhaps some aliasing checks are turned off. I'm not sure, I don't get SIMD without the macro at least.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

dpsanders picture dpsanders  Â·  3Comments

omus picture omus  Â·  3Comments

StefanKarpinski picture StefanKarpinski  Â·  3Comments

ararslan picture ararslan  Â·  3Comments

tkoolen picture tkoolen  Â·  3Comments