Wemake-python-styleguide: Forbid slow `in` inside comprehensions

Created on 9 Oct 2019  ·  16Comments  ·  Source: wemake-services/wemake-python-styleguide

Rule request

Thesis

len with comprehensions

Iterators have no len, and sometimes I forgetting it.

Bad:

len(1 for el in a if el in b)

Better:

len([1 for el in a if el in b])

in inside comprehensions

Good:

sum(1 for el in a if el in b)

Twice slower, but also ok:

sum(el in b for el in a)

Reasoning

Detect runtime TypeError in advance. We could also detect len from yield-like iterators, but resolving symbols in python always is a non-trivial thing, unfortunately.

Hacktoberfest help wanted starter rule request

Most helpful comment

Would like to take this! Can u plz assign this to me

All 16 comments

Good idea!

But, we don't assume types as the best practice. mypy catches this case perfectly:

# ex.py
len(1 for el in a if el in b)

Output:

ex.py:1: error: Argument 1 to "len" has incompatible type "Generator[int, None, None]"; expected "Sized"

But, sum is the whole new case. That's about performance and the best practice.
That's something I want to have.

We can forbid to use x in y inside value part of the comprehension. And force people to write 1 or True and if x in y.

We need to measure the speed for different types with timeit
And make a decision based on the results.

From worst to best:

from random import randint 
elements = list(randint(-100000, 100000) for _ in range(1000000))  

%timeit sum(a > 0 for a in elements) 
# 130 ms ± 3.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit sum(True for a in elements if a > 0) 
# 92.1 ms ± 2.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
s = 0
for a in elements:
  if a > 0:
    s += 1
# 73.1 ms ± 714 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit sum(1 for a in elements if a > 0)  
# 71.8 ms ± 720 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit len([1 for a in elements if a > 0])  
# 62.4 ms ± 9.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The last one has a big std deviation. So, I'm not sure is it really the best or some optimization for repetitive list creation.

Would like to take this! Can u plz assign this to me

From worst to best:

from random import randint 
elements = list(randint(-100000, 100000) for _ in range(1000000))  

%timeit sum(a > 0 for a in elements) 
# 130 ms ± 3.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit sum(True for a in elements if a > 0) 
# 92.1 ms ± 2.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
s = 0
for a in elements:
  if a > 0:
    s += 1
# 73.1 ms ± 714 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit sum(1 for a in elements if a > 0)  
# 71.8 ms ± 720 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit len([1 for a in elements if a > 0])  
# 62.4 ms ± 9.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So, which one should be considered as a best practice??

sum(1 for a in elements if a in b) and len([1 for a in elements if a > 0]) win!

It means that it is better to use sum(1 for a in elements if a in b) then sum(a in b for a in elements). And this rule is perfectly valid.

So, I'm done with the local setup and ready to implement the changes. But, I'm quite confused about how should I approach the issue...
I suppose I should create a new visitor and violation, or is there another simpler way to do this??
Any help is appreciated, Thanks!

@ManishAradwad you need to create a new violation in best_practices, then we can create a new visitor here: https://github.com/wemake-services/wemake-python-styleguide/blob/master/wemake_python_styleguide/visitors/ast/loops.py

We need to visit:

  • ast.ListComp
  • ast.SetComp
  • ast.DictComp
  • ast.GeneratorExp

You can add a new method in the visitor: self._check_slow_in_expression(node)

Here's how our bad node ((a > 0 for a in elements)) looks like (just one example):

GeneratorExp(elt=Compare(left=Name(id='a', ctx=Load(), lineno=1, col_offset=4), ops=[Gt()], comparators=[Num(n=0, lineno=1, col_offset=8)], lineno=1, col_offset=4), generators=[comprehension(target=Name(id='a', ctx=Store(), lineno=1, col_offset=14), iter=Name(id='elements', ctx=Load(), lineno=1, col_offset=19), ifs=[], is_async=0)], lineno=1, col_offset=4)

Then you write the required logic, test it, and submit a PR. I am here to help.

@ManishAradwad sorry for misleading you. This is a refactoring violation.

sum(1 for a in elements if a in b) and len([1 for a in elements if a > 0]) win!

We are using both sum(1 for a in elements if a in b) and len([1 for a in elements if a > 0]) to find the length of the list, right??

Then what do u mean by

It means that it is better to use sum(1 for a in elements if a in b) then sum(a in b for a in elements). And this rule is perfectly valid.

Also are you saying that I should add the function self._check_slow_in_expression(node) in the Wrong ComprehensionVisitor??

We don't check sum or len function. We check a comprehension inside them or inside any other python code: 1 for a in elements if a in b. And yes, we check it inside WrongComprehensionVisitor

Hi @ManishAradwad, how's it going? Do you need any help?

We don't check sum or len function.

I think some clarification on exactly what is forbidden is needed here. I've just read through this twice, and I'm still not clear.

Are we forbidding slow sum/len calls (in which case we do need to check the functions), or something in all comprehensions?

Because some of the bad examples don't make sense to forbid outside of a len/sum call. e.g. [a > 0 for a in elements] might be used as a sequence of True/False values, there's no reason to forbid this unless you know only the True values are actually used (e.g. in sum()).

Was this page helpful?
0 / 5 - 0 ratings