I took a look at the count
implementation for #20403 (cc @cossio), and was surprised to discover that the current definition is nearly identical to sum
(albeit with a type instability):
function count(pred, itr)
n = 0
for x in itr
n += pred(x)
end
return n
end
For example, count(sqrt, [1,2,3,5])
returns 6.382332347441762.
We could disallow this case entirely, but one possibility would be to make count(f, x)
equivalent to countnz
for non-boolean data. Indeed, I'm not sure why we have both countnz
and count
— couldn't we merge the two functions into a single count
function?
i.e. define
iszero(b::Bool) = !b # maybe more efficient than the fallback b == zero(b) definition
function count{F}(pred::F, itr)
n = 0
for x in itr
n += !iszero(pred(x))
end
return n
end
count(itr) = count(identity, itr)
@deprecate countnz(itr) count(itr)
+1, that would be more consistent with find
.
I think I'd prefer something like
function count(f, itr)
n = 0
for x in itr
if x
n += 1
end
end
return n
end
count(itr) = count(identity, itr)
@deprecate countnz(itr) count(!iszero, itr)
This has the benefit of throwing an error for non-boolean data when using count
directly.
@ararslan, that might actually be slower because it forces a branch. One could do n += f(x)::Bool
, I suppose.
However, I'd prefer to have one function that does more rather than restrict the functionality to Bool
here… I'm not sure that throwing an error for non-boolean data is a feature. And, as @nalimilan says, this behavior is consistent with the find
functions.
I'm not sure that throwing an error for non-boolean data is a feature
I disagree because without a predicate for non-boolean data, the name count
doesn't mean much. What is it counting? I have similar issues with find
, and IIRC there's a proposal somewhere to improve the consistency and clarity of find
and related functions.
Having f(x)::Bool
seems fine to me, especially if a branch is costly, though I'd be curious to see whether the branch makes an appreciable difference in benchmarks.
@ararslan, your version with the branch (corrected to call if f(x)
) is more than 2x slower than Base.count
for @btime count(iseven, 1:10^6) evals=1
.
Dang. f(x)::Bool
it is, then, in my proposal.
I disagree because without a predicate for non-boolean data, the name count doesn't mean much. What is it counting? I have similar issues with find, and IIRC there's a proposal somewhere to improve the consistency and clarity of find and related functions.
That's https://github.com/JuliaLang/Juleps/blob/master/Find.md, but it doesn't really address the question of whether find
should return non-zero entries or true entries. I agree the former behavior can be surprising at first, but it's also more general and potentially useful, so why not keep it?
Because it's unclear. I think being more explicit and saying count(!iszero, x)
or find(!iszero, x)
makes the intention completely obvious. IMO the one-argument versions should stick to boolean data, where there's no room for confusion, and the two-argument version gives you the ability to supply any kind of data with an appropriate predicate. I think we should be striving for clarity here rather than terseness.
Now that we can do !iszero
, I'd tend to agree with you.
Probably should continue to have a countnz
function if the alternative is to require everyone to do count(!iszero, x)
. e.g. we want specialized versions for sparse arrays, and I don't know if that is possible with !iszero
, because the !
will generate a specialized anonymous function for each call and hence defining a specialized count(::typeof(!iszero), x)
method won't work. Of course, we could define a specialized !(::typeof(iszero)) = notiszero
too, with notiszero(x) = !iszero(x)
, I guess, but that seems to be getting messy (and is harder for external packages to overload).
For sparse there's nnz
, which should arguably be renamed, though that's somewhat tangential. I did a bit of playing around and it seems that you actually _can_ specialize on typeof(!iszero)
, which is super cool. So count(::typeof(!iszero), x)
is a valid overloadable method.
There's an important difference between nnz
and countnz
though, the former is a structural property whereas the latter tests for and does not include stored zeros. I can't remember if there's a good reason countnz
and count
are separate though.
I'd imagine most uses of count would be for the boolean case, in which case sum
works just fine. If someone wants the number of non-zero values, why not just sum a generator: sum(!iszero(x) for x in X)
?
@ararslan Why do you think having count
even if it does the same thing as sum
but only for boolean is a good idea? For safety?
@nalimilan Good point. I think count
only makes sense if it does something different. (And the difference cannot be that count
throws an error with non-booleans while sum
doesn't).
@nalimilan I think it's a good idea because it provides both clarity and safety in terms of knowing exactly what you're getting. I think counting nonzeros is a rather odd and surprising default. I assume the reason that we don't currently have a one-argument count
method is that it's equivalent to sum
in that case, but should we make Bool
not a subtype of Number
(see #19168), it makes more sense to count
non-numbers than to sum
them, even if we do still define arithmetic on Bool
s.
To solve the speed point, just add @simd
in front of the loop:
function fast_count(pred, itr)
n = 0
@simd for x in itr
n += pred(x)::Bool
end
return n
end
a = randn(1000000)
@benchmark fast_count(x->x>1.96, a) # median time: 535.944 μs (0.00% GC)
@benchmark sum(x->x>1.96, a) # median time: 565.832 μs (0.00% GC)
@benchmark count(x->x>1.96, a) # median time: 1.370 ms (0.00% GC)
I wouldn't have expected @simd
to make a difference here. I thought it mainly allowed floating point optimizations?
There are many SIMD instructions for integers.
Yes, but I would expect them to be enabled by default. Do they change the behavior of the code?
Perhaps some aliasing checks are turned off. I'm not sure, I don't get SIMD without the macro at least.
Most helpful comment
Because it's unclear. I think being more explicit and saying
count(!iszero, x)
orfind(!iszero, x)
makes the intention completely obvious. IMO the one-argument versions should stick to boolean data, where there's no room for confusion, and the two-argument version gives you the ability to supply any kind of data with an appropriate predicate. I think we should be striving for clarity here rather than terseness.