I had a look again at pytest performance, and was struck again by how many stat calls pytest performs. This seems to boil down to using the path classes from py for checking file existence, walking directory structures and such. Unfortunately instances of these are passed to plugins so aren't internal implementation details.
I suggest we think of using plain strings as paths that are passed to plugins. This is obviously a breaking change so can't be done until pytest 6. It's not clear to me how we'd make deprecation warnings for this though :(
To take a concrete example of this being a problem, the test suite for the product I work on calls stat 79k times just for the collect phase. If I monkey patch stat to log the paths there are ~3k unique paths in the output. I can get a little performance boost by monkey patching stat to be cached:
orig_stat = os.stat
cache = {}
def monkey_stat(*args, **kwargs):
a = args[0]
if a in cache:
return cache[a]
r = orig_stat(*args, **kwargs)
cache[a] = r
return r
That this can improve the performance is pretty silly :P
It seems like pytest could just use os.walk once and then use that data for the rest of the run of the program...
Killing the pylib dependency in general is a planned long term change
I suggest we think of using plain strings as paths that are passed to plugins.
We could have objects based on pathlib, with pylib methods mixed in that would issue a DeprecationWarning maybe.
But I am more in favor of just having a hard cut there.
If we have Path objects already it makes sense to pass them through, but using strings would be more future proof.
In the middle of the night I remembered that the hooks have DI parameters, so it should be simple to start passing new arguments (with new names) and if the signature contains the old parameter names, issue the deprecation warning there. I think it would be nice if we split the parameter names for dirs and files too. Currently it's "path" and plugins have to check if it's a file if they are unsure. If it was "dirname" and "filename" it would be clearer.
If we also make the instantiation of the Path objects lazy at the call site and the core code uses pure strings, then the cutoff could be pretty clean and if you have no deprecated usage you would mostly not have the performance hit of the old stuff.
I also had the idea that we could go through the entire filesystem from the very beginning (respecting ignore patterns?) and then look at an in-memory data structure instead of asking the filesystem over and over. I don't think pytest can, and wants to, support the filesystem changing significantly under its feet during a test session anyway.
IIRC we discussed something like this (mainly for tmpdir) at the last pytest sprint in 2016 already. I think there, the plan was to have a wrapper layer over pathlib objects which provides a py.path-like API?
For tmpdir, we ended up having a separate tmp_path fixture instead, also see the existing discussion in #2230, #3985, #3988. I do think that's the better solution, as people coming from outside of pytest will be more used to pathlib rather than py.path.
Of course it remains to be seen if pathlib is more performant than py.path, but I'd bet that it is.
The py.path API isn't very good, isn't well documented and worse than that it's very different from the rest of the python ecosystem, so keeping it seems like a bad idea long term.
@The-Compiler after initial expieriments a few years back im absolutely certain we should let the pylib api die
there wont be a wrapper, just backward compat properties that will eventually go
@boxed
I also had the idea that we could go through the entire filesystem from the very beginning (respecting ignore patterns?) and then look at an in-memory data structure instead of asking the filesystem over and over. I don't think pytest can, and wants to, support the filesystem changing significantly under its feet during a test session anyway.
Yeah, that sounds like a good idea in general (separate issue - please consider creating / owning it then).