The project is already able to run linting in a parallel process.
Wouldn't it make sense to run the fixers in parallel, per file?
(Assuming this would speed up the process considerably.)
What's your opinion?
it's not possible to run the fixers for a single file in parallel, as there are dependencies between fixers; they have to run in the proper order.
it's possible to run files in parallel
That's what I mean: run all files in parallel, but the fixers for each file in the same process.
Would such a thing be welcome?
in general - yes.
but please, before raising a PR with code, raise a technical proposal on how to achieve it ;)
For a first prototype I would use a fixed number of subprocesses and run php-cs-fixer per file, while managing the remaining work and the result printing from a master process.
Just to check how much time parallelization might save.
More or less parallelizing this loop
https://github.com/FriendsOfPHP/PHP-CS-Fixer/blob/41dc9e79779b8a0b89f811f93b077af25613c01f/src/Runner/Runner.php#L131
:-1: for that.
bootstrapping PHP CS Fixer is costly, so if we have 1k files, I don't want to bootstrap PHP CS Fixer 1k times.
also, while fixing one file, we are already linting the next one in the background.
finally, if there are 4 cores, it's pointless to create 1000 processes.
what we need is a WorkerPool / JobPool that would be fed job after job (file after file) and assign each file to one of a fixed number of workers in the pool (== number of cores)
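The pool behavior can be sketched at the shell level with xargs (a stand-alone illustration using echo as a dummy job, not php-cs-fixer itself): `-P` caps the number of concurrent workers, and each worker picks up the next job as soon as it finishes.

```shell
# 12 dummy jobs, at most 4 concurrent workers; xargs hands the next
# job to whichever worker becomes free, so never more than 4 run.
seq 12 | xargs -P 4 -I{} echo "worker finished job {}" | wc -l
# prints 12 (every job processed exactly once)
```

This is exactly the "fixed number of workers, stream of jobs" shape described above, just without the bootstrap-cost problem solved (each job here is still its own process).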
I totally agree, and that's also how I imagine it working in the final version.
for the first shot, I want to get a feeling for how much speedup we can get, without a big investment in code changes.
after we are confident that it's worth it (e.g. a 5x improvement) I would work on something more useful.
then just use the parallel CLI tool ( http://manpages.ubuntu.com/manpages/xenial/man1/parallel.1.html ) and run it against each file individually
some initial numbers:
$ time php ./PHP-CS-Fixer/php-cs-fixer fix ./sabre-dav/lib/
Loaded config default.
1) sabre-dav/lib/CalDAV/CalendarRoot.php
2) sabre-dav/lib/CalDAV/SharedCalendar.php
...
218) sabre-dav/lib/DAV/Exception/TooManyMatches.php
219) sabre-dav/lib/DAV/Exception/NotImplemented.php
220) sabre-dav/lib/DAV/Exception/NotFound.php
221) sabre-dav/lib/DAV/Exception/MethodNotAllowed.php
222) sabre-dav/lib/DAV/Exception/LockTokenMatchesRequestUri.php
223) sabre-dav/lib/DAV/Tree.php
Fixed all files in 10.706 seconds, 16.000 MB memory used
real 0m10.945s
user 0m9.764s
sys 0m0.180s
$ time find ./sabre-dav/lib/ | parallel -j 4 php ./PHP-CS-Fixer/php-cs-fixer fix {}
Fixed all files in 0.016 seconds, 10.000 MB memory used
Loaded config default.
Fixed all files in 0.039 seconds, 10.000 MB memory used
Loaded config default.
...
Fixed all files in 0.012 seconds, 10.000 MB memory used
Loaded config default.
Fixed all files in 0.059 seconds, 10.000 MB memory used
Loaded config default.
real 0m19.342s
user 0m52.620s
sys 0m14.296s
with the following setup:
php -v
PHP 7.0.32-1+ubuntu16.04.1+deb.sury.org+1 (cli) (built: Oct 1 2018 11:45:35) ( NTS )
Copyright (c) 1997-2017 The PHP Group
Zend Engine v3.0.0, Copyright (c) 1998-2017 Zend Technologies
with Zend OPcache v7.0.32-1+ubuntu16.04.1+deb.sury.org+1, Copyright (c) 1999-2017, by Zend Technologies
with blackfire v1.23.1~linux-x64-non_zts70, https://blackfire.io, by Blackfire
on a 4 CPUs ubuntu16 vmware VM
takeaway
=> A) an initial, obvious optimization would be to try to speed up php-cs-fixer for the special case of fixing only 1 file
=> B) search for a way to make it parallel without processes, e.g. coroutines (in-process), threads, or similar
=> A) an initial, obvious optimization would be to try to speed up php-cs-fixer for the special case of fixing only 1 file
=> B) search for a way to make it parallel without processes, e.g. coroutines (in-process), threads, or similar
so, as I said, WorkerPool / JobPool
after having another, more in-depth look at the codebase, I guess we can speed things up by using async IO for some components.
step 1:
the default linter, TokenizerLinter, does blocking IO in lintFile.
if we could read those files in a non-blocking fashion, we might be able to do the tokenizing while files are being read from the filesystem (tokenize/fix file A while file B is read).
please find some initial code showing how an async-IO-based linter could look:
https://github.com/FriendsOfPHP/PHP-CS-Fixer/compare/master...staabm:async1
yes, we need to read the file to lint in TokenizerLinter - but what we just loaded into the internal buffer and are about to lint is what we are going to fix in the next step, so we need the file read from IO anyway
After thinking a bit more about the problem and possible solutions I came to another idea:
I will try to use pcntl_fork, so we don't have to pay the bootstrap cost for each worker.
I will try to use pcntl_fork
Don't waste your time:
Note: This extension is not available on Windows platforms.
_- PCNTL Introduction_
I am fine with a feature-detected parallelization feature (which can work on Windows via the Linux subsystem) - as long as it doesn't require ugly code and leads to a decent perf improvement
I'm picking up on the comment from @staabm https://github.com/FriendsOfPHP/PHP-CS-Fixer/issues/4024#issuecomment-428994403
time find ./sabre-dav/lib/ | parallel -j 4 php ./PHP-CS-Fixer/php-cs-fixer fix {}
The issue here is the massive overhead of spawning a new process for each file 😅 as was pointed out already.
Besides coding a solution, it's possible to tune this better with xargs and see noticeable improvements, but it has drawbacks.
First, for this experiment, php-cs-fixer needs a one-line change, because it currently does not accept multiple paths provided on the command line:
diff --git a/src/Console/ConfigurationResolver.php b/src/Console/ConfigurationResolver.php
index 72352c350..29041bf5e 100644
--- a/src/Console/ConfigurationResolver.php
+++ b/src/Console/ConfigurationResolver.php
@@ -599,7 +599,7 @@ private function computeConfigFiles()
         if ($this->isStdIn() || 0 === \count($path)) {
             $configDir = $this->cwd;
         } elseif (1 < \count($path)) {
-            throw new InvalidConfigurationException('For multiple paths config parameter is required.');
+            $configDir = $this->cwd;
         } elseif (!is_file($path[0])) {
             $configDir = $path[0];
         } else {
This now allows calling php-cs-fixer fix <file1> … <file n>
My sample set (some private project):
$ find app tests -type f -iname \*php | wc -l
2465
Running a single invocation:
$ time find app tests -type f -iname \*php | xargs ~/src/php-cs-fixer/php-cs-fixer fix --using-cache=no
Loaded config default from "/Users/neo/src/company/project/.php_cs.dist".
Paths from configuration file have been overridden by paths provided as command arguments.
Fixed all files in 98.335 seconds, 52.000 MB memory used
real 1m38.607s
user 1m37.876s
sys 0m0.564s
Now, using xargs:
$ time find app tests -type f -iname \*php | xargs -n 620 -P 4 ~/src/php-cs-fixer/php-cs-fixer fix --using-cache=no
Loaded config default from "/Users/neo/src/company/project/.php_cs.dist".
Loaded config default from "/Users/neo/src/company/project/.php_cs.dist".
Loaded config default from "/Users/neo/src/company/project/.php_cs.dist".
Loaded config default from "/Users/neo/src/company/project/.php_cs.dist".
Paths from configuration file have been overridden by paths provided as command arguments.
Paths from configuration file have been overridden by paths provided as command arguments.
Paths from configuration file have been overridden by paths provided as command arguments.
Paths from configuration file have been overridden by paths provided as command arguments.
Fixed all files in 9.588 seconds, 26.000 MB memory used
Fixed all files in 14.088 seconds, 22.000 MB memory used
Fixed all files in 32.412 seconds, 44.000 MB memory used
Fixed all files in 45.755 seconds, 42.000 MB memory used
real 0m46.104s
user 1m41.549s
sys 0m0.812s
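As a stand-alone sanity check on the batching arithmetic (using seq as a dummy input, not the real file list): 2465 inputs split into batches of at most 620 yield ceil(2465/620) = 4 invocations, which matches the four "Loaded config" lines above.

```shell
# 2465 dummy inputs, batches of at most 620 each -> 4 invocations,
# since xargs starts a new command once 620 arguments are consumed.
seq 2465 | xargs -n 620 echo | wc -l
# prints 4
```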
Or going more extreme:
$ time find app tests -type f -iname \*php | xargs -n 310 -P 8 ~/src/php-cs-fixer/php-cs-fixer fix --using-cache=no
…
real 0m28.043s
user 1m45.297s
sys 0m0.990s
or
$ time find app tests -type f -iname \*php | xargs -n 155 -P 16 ~/src/php-cs-fixer/php-cs-fixer fix --using-cache=no
…
real 0m23.520s
user 2m43.397s
The improvements don't scale linearly with the number of processes, but they are still very measurable.
_Note: tests were conducted on a MacBook Pro (16-inch, 2019), 2.4Ghz 8-core i9_
This is due to the diversity of the files to scan, something which was also already pointed out:
so an initial obvious optimization would be to try to speed up php-cs-fixer for the special case of fixing only 1 file
However, I do not agree with the conclusion:
takeaway
…
- doing it in parallel as-is is even slower than doing it linearly
But the answer was given via the follow-up comment https://github.com/FriendsOfPHP/PHP-CS-Fixer/issues/4024#issuecomment-429207767
so, as I said, WorkerPool / JobPool
The xargs attempt is of course a "poor pool emulation": due to the diversity of the source files to scan and the way xargs (not really) distributes the input, we end up with processes receiving the same number of files but far more work to do.
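The distribution problem is easy to see in isolation (dummy items standing in for files of very different sizes): xargs -n splits purely by item count, so a batch of "huge" files gets the same number of items as any other batch, not the same amount of work.

```shell
# Each batch gets exactly 2 items, regardless of how heavy they are:
printf '%s\n' small small HUGE HUGE small small | xargs -n 2 echo batch:
# batch: small small
# batch: HUGE HUGE
# batch: small small
```

A real worker pool avoids this by handing out one file at a time, so a worker stuck on a heavy file simply takes fewer files overall.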
Further:
- one existing approach uses proc_open, so communication happens via stdin/stdout/stderr
- another also spawns a php process per file and uses native php linting, so that solution might not apply here; I don't see pooling there
- I guess some challenges are the flow of data back
PS: any idea about the "For multiple paths config parameter is required" limitation?
any idea about the "For multiple paths config parameter is required" limitation?
git blame to the rescue: https://github.com/FriendsOfPHP/PHP-CS-Fixer/pull/1903#discussion_r114083457
nice reading, @mfn ;)
It was never a big priority for the core project maintainers, as the performance problem usually happens only on the FIRST run of the tool. Then, once the cache file exists, one does not have that many modified files, so there is no deep need for parallelization.
If we were to build in some solution, I would suggest using a 3rd-party solution, so we can leverage the hard work of the open source community rather than coming up with something custom, e.g. Swoole or another lib.
Or maybe we wait for a PHP built-in solution (and maybe have it polyfilled for older PHP runtimes), like https://wiki.php.net/rfc/fiber
Thanks for the follow-up.
We see php-cs-fixer taking several minutes in CI builds - which was the initial motivation for this change.
I think the most straightforward (and least maintenance) way forward would be to allow several file paths as CLI arguments and do the fork/parallel stuff at the bash level with xargs.
Do you guys agree?
We see php-cs-fixer taking several minutes in CI builds
do you have the cache file in your CI? if not, consider adding it
I think the most straightforward (and least maintenance) way forward would be to allow several file paths as CLI arguments
possible since the creation of this GitHub issue. just explicitly point to the config that should be used.
possible since the creation of this GitHub issue. just explicitly point to the config that should be used.
🤦
Now it hit me, I didn't get this from the other issue. I can confirm, this works with an official release:
$ time find app tests -type f -iname \*php | xargs -n 155 -P 16 ./bin/php-cs-fixer-v2.phar --config=.php_cs.dist fix --using-cache=no
"problem solved" ;) then I guess
The drawback here, btw, is that you can't use the "finder" as defined in the .php_cs; I neglected to mention this so far.
I initially tried to come up with some super-smart separation of the finder config from the .php_cs, to first extract the definite list of files (for feeding it back via xargs), but that wouldn't work for projects where the phar is used.
At least I couldn't make it work: the phar can be included, but you can't just expect to have access to its libraries that way, as it always runs the stub.php. But granted, at this point I did not investigate further.
do you have the cache file in your CI? if not, consider adding it
I tried doing so with several different approaches, but it did not work. Maybe the reason is that git does not properly set modification timestamps of files when cloning, or something similar... don't know what's going on.
The drawback here, btw, is that you can't use the "finder" as defined per the .php_cs; I neglected to mention this so far.
maybe it would make sense to add a command which prints a list of files (like e.g. find would), but based on the $finder used in the config, so we can use the result with xargs to parallelize things?
maybe it would make sense to add a command which prints a list of files (like e.g. find would), but based on the $finder used in the config, so we can use the result with xargs to parallelize things?
phpunit supports for example --list-tests/--list-tests-xml; so maybe something like --list-files?
I can't believe I bothered coming up with this hack, but it solves my need for faster fixing in CI (where, as pointed out, the cache does not work reliably; I have the same issues with phpstan and CI caching, FWIW) by re-using the defined "finder":
$ cat .php_cs.finder.php
<?php
declare(strict_types = 1);
$finder = PhpCsFixer\Finder::create()
    ->in(__DIR__)
    ->exclude([
        // etc.
        'tmp',
    ]);

if ('true' === getenv('PHP_CS_FIXER_LIST_FILES')) {
    foreach ($finder as $file) {
        echo $file->getPathname(), "\n";
    }
    exit(0);
}

return $finder;
$ head -n 8 .php_cs.dist
<?php
declare(strict_types = 1);
$finder = include './.php_cs.finder.php';
return PhpCsFixer\Config::create()
    ->setFinder($finder)
$ PHP_CS_FIXER_LIST_FILES=true ./bin/php-cs-fixer-v2.phar fix | xargs -n 171 -P 16 ./bin/php-cs-fixer-v2.phar fix --config=.php_cs.dist
🤦 but works.
I just went ahead and added a new list-files command based on the latest observations in this thread: