Php-cs-fixer: Async fixing

Created on 10 Oct 2018 · 26 Comments · Source: FriendsOfPHP/PHP-CS-Fixer

The project is already able to do linting in a parallel process.

Wouldn't it make sense to run the fixers in parallel per file?
(Assuming this would speed up the process considerably)

What's your opinion?

kind/enhancement kind/question

All 26 comments

it's not possible to run fixers per file in parallel, as there are dependencies between fixers, they have to run in proper order.

it's possible to run files in parallel

That's what I mean. Run all files in parallel, but run the fixers for each file in the same process.

Would such a thing be welcome?

in general - yes.
but please, before raising the PR with code, raise a technical proposal how to achieve it ;)

For a first prototype I would use a fixed number of subprocesses and run php-cs-fixer once per file, while a master process manages the remaining work and prints the results.
The only goal is to check how much time parallelization might save.

More or less parallelizing this loop
https://github.com/FriendsOfPHP/PHP-CS-Fixer/blob/41dc9e79779b8a0b89f811f93b077af25613c01f/src/Runner/Runner.php#L131

:-1: for that.
bootstrapping PHP CS Fixer is costly, so if we have 1k files, I don't want to bootstrap PHP CS Fixer 1k times.

also, during fixing of one file, we are already linting next one in background.

finally, if there are 4 cores, it's pointless to create 1000 processes.

what we need is a WorkerPool / JobPool that is fed with one job (file) after another and assigns each file to one of a fixed number of workers in the pool (== number of cores)
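Sizing the pool to the number of cores can be sketched in shell. This is only an illustration and not part of PHP CS Fixer: `detect_cores` is a hypothetical helper name, and it assumes `getconf _NPROCESSORS_ONLN` is available (as it is on Linux and macOS), falling back to a single worker otherwise.

```shell
# Hypothetical helper: print the number of online CPU cores, or 1 if unknown.
detect_cores() {
    cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || cores=1
    case "$cores" in
        ''|*[!0-9]*) cores=1 ;;   # not a plain number: fall back to one worker
    esac
    echo "$cores"
}

WORKERS=$(detect_cores)
echo "would start $WORKERS workers"
```

The resulting value could then be passed to whatever starts the workers, for example as the `-P` argument of xargs.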

I totally agree, and that's also how I imagine it working in the final version.

for the first shot, I want to get a feeling for how much speedup we can get without a big investment in code changes.
once we are confident that it's worth it (e.g. a 5x improvement) I would work on something more useful.

then, just use parallel cli tool ( http://manpages.ubuntu.com/manpages/xenial/man1/parallel.1.html ) and run it against each file individually

some initial numbers:

$ time php ./PHP-CS-Fixer/php-cs-fixer fix ./sabre-dav/lib/
Loaded config default.
   1) sabre-dav/lib/CalDAV/CalendarRoot.php
   2) sabre-dav/lib/CalDAV/SharedCalendar.php
...
 218) sabre-dav/lib/DAV/Exception/TooManyMatches.php
 219) sabre-dav/lib/DAV/Exception/NotImplemented.php
 220) sabre-dav/lib/DAV/Exception/NotFound.php
 221) sabre-dav/lib/DAV/Exception/MethodNotAllowed.php
 222) sabre-dav/lib/DAV/Exception/LockTokenMatchesRequestUri.php
 223) sabre-dav/lib/DAV/Tree.php

Fixed all files in 10.706 seconds, 16.000 MB memory used

real    0m10.945s
user    0m9.764s
sys     0m0.180s

$ time find ./sabre-dav/lib/ | parallel -j 4 php ./PHP-CS-Fixer/php-cs-fixer fix {}
Fixed all files in 0.016 seconds, 10.000 MB memory used
Loaded config default.

Fixed all files in 0.039 seconds, 10.000 MB memory used
Loaded config default.

...

Fixed all files in 0.012 seconds, 10.000 MB memory used
Loaded config default.

Fixed all files in 0.059 seconds, 10.000 MB memory used
Loaded config default.

real    0m19.342s
user    0m52.620s
sys     0m14.296s

with the following setup:

php -v
PHP 7.0.32-1+ubuntu16.04.1+deb.sury.org+1 (cli) (built: Oct  1 2018 11:45:35) ( NTS )
Copyright (c) 1997-2017 The PHP Group
Zend Engine v3.0.0, Copyright (c) 1998-2017 Zend Technologies
    with Zend OPcache v7.0.32-1+ubuntu16.04.1+deb.sury.org+1, Copyright (c) 1999-2017, by Zend Technologies
    with blackfire v1.23.1~linux-x64-non_zts70, https://blackfire.io, by Blackfire

on a 4-CPU Ubuntu 16 VMware VM

takeaway

  • as you already pointed out, the process overhead seems to be huge
  • doing it in parallel as is is even slower than running sequentially

=> A) so an obvious initial optimization would be to try to speed up php-cs-fixer for the special case of fixing only 1 file
=> B) search for a way to make it parallel without processes, e.g. co-routines (in-process), threads or similar

so, as I said, WorkerPool / JobPool

after another, more in-depth look into the codebase, I guess we can speed things up by using async IO for some components.

step1:
the default Linter TokenizerLinter does blocking IO in lintFile.
if we could read those files in a non-blocking fashion, we might be able to do the tokenizing while files are being read from the filesystem (tokenize/fix file A while file B is read).

please find some initial code showing what an async-IO-based linter could look like:

https://github.com/FriendsOfPHP/PHP-CS-Fixer/compare/master...staabm:async1

yes, we need to read the file to lint it in TokenizerLinter - but what we just loaded into the internal buffer and are about to lint is what we are going to fix in the next step, for which we need to have the file read from IO anyway

After thinking a bit more about the problem and possible solutions I came to another idea:

I will try to use pcntl fork, so we don't have to pay the bootstrap cost for each worker.

I will try to use pcntl fork

Don't waste your time:

Note: This extension is not available on Windows platforms.
_- PCNTL Introduction_

I am fine with a feature-detected parallelization (which can work on Windows via the Linux subsystem) - as long as it doesn't require ugly code and leads to a decent perf improvement

I'm picking up on the comment from @staabm https://github.com/FriendsOfPHP/PHP-CS-Fixer/issues/4024#issuecomment-428994403

time find ./sabre-dav/lib/ | parallel -j 4 php ./PHP-CS-Fixer/php-cs-fixer fix {}

The issue here is the massive overhead of spawning a new process for each file 😅 as was pointed out already.

Besides coding a solution, with xargs it's possible to better tune this and see noticeable improvements but it has drawbacks.

First, for this experiment, php-cs-fixer needs a one-line change because currently it does not accept multiple paths provided on the command line:

diff --git a/src/Console/ConfigurationResolver.php b/src/Console/ConfigurationResolver.php
index 72352c350..29041bf5e 100644
--- a/src/Console/ConfigurationResolver.php
+++ b/src/Console/ConfigurationResolver.php
@@ -599,7 +599,7 @@ private function computeConfigFiles()
         if ($this->isStdIn() || 0 === \count($path)) {
             $configDir = $this->cwd;
         } elseif (1 < \count($path)) {
-            throw new InvalidConfigurationException('For multiple paths config parameter is required.');
+            $configDir = $this->cwd;
         } elseif (!is_file($path[0])) {
             $configDir = $path[0];
         } else {

This now allows calling php-cs-fixer fix <file1> … <file n>

My sample set (some private project):

$ find app tests -type f -iname \*php | wc -l
    2465

Running a single invocation:

$ time find app tests -type f -iname \*php | xargs ~/src/php-cs-fixer/php-cs-fixer fix --using-cache=no
Loaded config default from "/Users/neo/src/company/project/.php_cs.dist".
Paths from configuration file have been overridden by paths provided as command arguments.

Fixed all files in 98.335 seconds, 52.000 MB memory used

real    1m38.607s
user    1m37.876s
sys     0m0.564s

Now, using xargs:

  • assume we want 4 processes
  • note the total number of files, 2465; divided by 4 processes that is about 617, so let's round this up to 620 per process

$ time find app tests -type f -iname \*php | xargs -n 620 -P 4 ~/src/php-cs-fixer/php-cs-fixer fix --using-cache=no
Loaded config default from "/Users/neo/src/company/project/.php_cs.dist".
Loaded config default from "/Users/neo/src/company/project/.php_cs.dist".
Loaded config default from "/Users/neo/src/company/project/.php_cs.dist".
Loaded config default from "/Users/neo/src/company/project/.php_cs.dist".
Paths from configuration file have been overridden by paths provided as command arguments.
Paths from configuration file have been overridden by paths provided as command arguments.
Paths from configuration file have been overridden by paths provided as command arguments.
Paths from configuration file have been overridden by paths provided as command arguments.

Fixed all files in 9.588 seconds, 26.000 MB memory used

Fixed all files in 14.088 seconds, 22.000 MB memory used

Fixed all files in 32.412 seconds, 44.000 MB memory used

Fixed all files in 45.755 seconds, 42.000 MB memory used

real    0m46.104s
user    1m41.549s
sys     0m0.812s
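The batch size passed to `-n` is just the file count divided by the number of processes, rounded up. A minimal sketch of that arithmetic, where `batch_size` is a hypothetical helper name and not something xargs or php-cs-fixer provides:

```shell
# Hypothetical helper: files per process, rounded up, so that xargs
# starts no more than $2 batches for $1 files.
batch_size() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

batch_size 2465 4    # 617
batch_size 2465 8    # 309
batch_size 2465 16   # 155
```

Rounding up a little further, as done with 620 in this comment, still yields 4 batches; it only makes the last batch slightly smaller than the others.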

Or going more extreme:

$ time find app tests -type f -iname \*php | xargs -n 310 -P 8 ~/src/php-cs-fixer/php-cs-fixer fix --using-cache=no
…
real    0m28.043s
user    1m45.297s
sys     0m0.990s

or

$ time find app tests -type f -iname \*php | xargs -n 155 -P 16 ~/src/php-cs-fixer/php-cs-fixer fix --using-cache=no
…
real    0m23.520s
user    2m43.397s

The improvements don't scale linearly with the number of processes, but they are still clearly measurable.

_Note: tests were conducted on a MacBook Pro (16-inch, 2019), 2.4 GHz 8-core i9_

The non-linear scaling is due to the diversity of the files to scan, something which was also already pointed out

so an obvious initial optimization would be to try to speed up php-cs-fixer for the special case of fixing only 1 file

However, I do not agree with the conclusion

takeaway
…

  • doing it in parallel as is is even slower than running sequentially

But the answer was given via the follow-up comment https://github.com/FriendsOfPHP/PHP-CS-Fixer/issues/4024#issuecomment-429207767

so, as I said, WorkerPool / JobPool

The xargs attempt is of course a "poor pool emulation": due to the diversity of the source files to scan and the way xargs (not really) distributes the input, we end up with processes that each get the same number of files but very different amounts of work.
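One way to soften the uneven distribution without writing a real pool is to hand out smaller batches: with `-P`, xargs starts the next batch as soon as one of the running processes exits, so a single slow batch holds up less of the total. The price is more bootstraps of the tool. A sketch under those assumptions, with `run_batched` being a hypothetical wrapper name:

```shell
# Hypothetical wrapper: read paths on stdin and run the given command on
# batches of $1 paths, with at most $2 processes running at the same time.
run_batched() {
    batch=$1
    procs=$2
    shift 2
    xargs -n "$batch" -P "$procs" "$@"
}

# Demo with echo standing in for the fixer:
# 5 paths in batches of 2 result in 3 invocations.
printf '%s\n' a.php b.php c.php d.php e.php | run_batched 2 2 echo
```

For a real run, the file list from find would be piped into something like `run_batched 50 8 <fixer command>`. The batch size of 50 is an arbitrary illustration, and the best trade-off between bootstrap cost and balance would have to be measured per project.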


Further:

  • As can be seen in the test, I disabled caching:

    • first: to get comparable numbers

    • second: I'm not sure whether processes running in parallel this way would interfere with each other's cache, and I wasn't interested in finding this out _yet_

  • there's https://github.com/FriendsOfPHP/PHP-CS-Fixer/issues/2803#issuecomment-437411859 which gives some pointers regarding architecture how to solve this in-code

I guess some challenges are the flow of data back:

  • file processing progress report across the pool of processes
  • collect data back for constructing a full cache
  • probably more I forgot, I hardly know the internals of php-cs-fixer 😄

ps: any idea about the "For multiple paths config parameter is required" limitation?

nice reading, @mfn ;)

It was never a big prio for core project maintainers, as the performance problem usually only occurs on the FIRST run of the tool. After that, once we have .php_cs, one does not have so many modified files that parallelization is badly needed.

If we were to build in a solution, I would suggest either using a 3rd-party solution, so we can build on the hard work of the open source community instead of coming up with something custom, e.g. Swoole or another library,
or waiting for a solution built into PHP (maybe polyfilled for older PHP runtimes), like https://wiki.php.net/rfc/fiber

Thx for the followup.

We see php-cs fixing taking several minutes in CI builds - which was the initial motivation for this change.

I think the most straightforward way forward (and the one with the least maintenance work) would be to allow several file paths as cli arguments and do the fork/parallel stuff at bash level with xargs.

Do you guys agree?

We see php-cs fixing taking several minutes in CI builds

do you have the cache file in your CI? if not, consider adding it

I think the most straightforward way forward (and the one with the least maintenance work) would be to allow several file paths as cli arguments

that has already been possible since this GitHub issue was created. just explicitly point to the config that should be used.

🤦

Now it hit me, I didn't get this from the other issue. I can confirm, this works with an official release:
time find app tests -type f -iname \*php | xargs -n 155 -P 16 ./bin/php-cs-fixer-v2.phar --config=.php_cs.dist fix --using-cache=no

"problem solved" ;) then I guess

The drawback here, btw, is that you can't use the "finder" as defined per the .php_cs; I neglected to mention this so far.

I initially tried to come up with some super smart separation of the finder config from the .php_cs, to first extract the definitive list of files (for feeding it back via xargs), but that wouldn't work for projects where the phar is used.
At least I couldn't make it work: the phar can be included, but you can't just expect to have access to its libraries that way, as it always runs the stub.php. But granted, at this point I did not investigate further.

do you have the cache file in your CI? if not, consider adding it

i tried doing so with several different attempts, but it did not work. Maybe the reason is that git does not properly set the modification timestamps of files when cloning, or something similar.. don't know what's going on..

The drawback here, btw, is that you can't use the "finder" as defined per the .php_cs; I neglected to mention this so far.

maybe it would make sense to add a command which prints a list of files (like e.g. find would) but based on the $finder used in the config, so we can use this result with xargs to parallelize stuff?

phpunit supports for example --list-tests/--list-tests-xml; so maybe something like --list-files?

I can't believe I bothered coming up with this hack, but it solves my need for faster fixing in CI (where, as pointed out, the cache does not work reliably => I have the same issues with phpstan and CI caching, FWIW) by re-using the defined "finder":

$ cat .php_cs.finder.php
<?php

declare(strict_types = 1);

$finder = PhpCsFixer\Finder
  ::create()
  ->in(__DIR__)
  ->exclude([
    // etc.
    'tmp',
  ]);

if ('true' === getenv('PHP_CS_FIXER_LIST_FILES')) {
    foreach ($finder as $file) {
        echo $file->getPathname(), "\n";
    }
    exit(0);
}

return $finder;

$ head -n 8 .php_cs.dist
<?php

declare(strict_types = 1);

$finder = include './.php_cs.finder.php';

return PhpCsFixer\Config::create()
  ->setFinder($finder)

=> PHP_CS_FIXER_LIST_FILES=true ./bin/php-cs-fixer-v2.phar fix | xargs -n 171 -P 16 ./bin/php-cs-fixer-v2.phar fix --config=.php_cs.dist

🤦 but works.

just went ahead and added a new list-files command based on the latest observations within this thread:

https://github.com/FriendsOfPHP/PHP-CS-Fixer/pull/5390
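With such a command, the environment-variable hack above would no longer be needed. A rough sketch of how the pipeline might then look, assuming the installed php-cs-fixer build actually ships the `list-files` command from that PR and accepts `--config` for it. `parallel_fix` is a hypothetical wrapper, and the paths, batch size and process count are placeholders:

```shell
# Assumptions: FIXER is a php-cs-fixer build that includes the list-files
# command from the PR above; CONFIG, BATCH and PROCS are illustrative defaults.
FIXER=${FIXER:-./bin/php-cs-fixer-v2.phar}
CONFIG=${CONFIG:-.php_cs.dist}

parallel_fix() {
    # Let the configured finder produce the file list, then fix it in batches.
    "$FIXER" list-files --config="$CONFIG" \
        | xargs -n "${BATCH:-50}" -P "${PROCS:-4}" \
            "$FIXER" fix --config="$CONFIG" --using-cache=no
}
```

Calling `parallel_fix` from the project root would then replace the manual find invocation, while the list of files still comes from the finder defined in the config.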
