@guolinke I am wondering whether recommending -march=native would yield better performance for those installing directly with install_github (the default in R is -mtune=core2).
For instance, here is the installation log from install_github on Windows. It shows that the build is tuned for the Core 2 architecture:
c:/Rtools/mingw_64/bin/g++ -m64 -std=c++0x -I"C:/PROGRA~1/MIE74D~1/RCLIEN~1/R_SERVER/include" -DNDEBUG -I../..//include -DUSE_SOCKET -I"C:/swarm/workspace/External-R-3.3.2/vendor/extsoft/include" -fopenmp -pthread -std=c++11 -O2 -Wall -mtune=core2 -c lightgbm-all.cpp -o lightgbm-all.o
c:/Rtools/mingw_64/bin/g++ -m64 -std=c++0x -I"C:/PROGRA~1/MIE74D~1/RCLIEN~1/R_SERVER/include" -DNDEBUG -I../..//include -DUSE_SOCKET -I"C:/swarm/workspace/External-R-3.3.2/vendor/extsoft/include" -fopenmp -pthread -std=c++11 -O2 -Wall -mtune=core2 -c lightgbm_R.cpp -o lightgbm_R.o
c:/Rtools/mingw_64/bin/g++ -m64 -shared -s -static-libgcc -o lightgbm.dll tmp.def ./lightgbm-all.o ./lightgbm_R.o -fopenmp -pthread -lws2_32 -liphlpapi -LC:/swarm/workspace/External-R-3.3.2/vendor/extsoft/lib/x64 -LC:/swarm/workspace/External-R-3.3.2/vendor/extsoft/lib -LC:/PROGRA~1/MIE74D~1/RCLIEN~1/R_SERVER/bin/x64 -lR
This would require a note in the README.md of the R-package saying that, to maximize performance, -march=native should be added, but that doing so might break packages.
Regarding -O3 (if we were to push for even more), I know it is refused by CRAN for compatibility issues (some packages are breaking with -O3).
@Laurae2, does that mean we should alter the C++ build rather than just the R libraries? I think we can make that a suggestion rather than a compulsory step.
@Laurae2
I remember the difference between O2 and O3 in LightGBM is very small.
You can try some benchmarks on this.
@chivee no, this would just be a suggestion to users who want better local training speed. I'm not sure it has a major impact though, so I'll test it thoroughly before I make a PR. As @guolinke said, the difference between the O2 and O3 flags alone is very small.
@guolinke when I get time on my server I'll try O3 and march=native to see what happens to the speed. I've been collecting a lot of (long) benchmarks on xgboost and LightGBM since last month to understand how their performance (both ranking quality, measured by AUC, and speed) behaves depending on parameters.
I'll get back here once my new benchmarks are done.
@guolinke Some results here. Not posting the exact details for the benchmark because there will be more at a mini-conference I am doing next month.
Settings:
- `v1` is LightGBM v1
- `v2` is LightGBM v2 @1bf7bbd
- `default` means compiled with `-O2 -mtune=core2`
- `O3` means compiled with `-O3 -march=native`
- `O3-fmath` means compiled with `-O3 -ffast-math -march=native`
- `O2` means compiled with `-O2 -march=native`
- `Os` means compiled with `-Os`
- Best means the best flags for compilation for maximum speed, with default settings overriding all the others if the difference is not significant (<~1%) and not consistent (similar flags giving results off).
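The "Best" rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `best_flag` and the hard 1% cutoff are hypothetical, and the "not consistent" part of the rule is a judgment call that is not modeled here, so borderline rows in the tables may be labeled differently from what this sketch returns.

```python
def best_flag(times, baseline="default", threshold=0.01):
    """Pick the fastest flag set, falling back to the baseline for small gains.

    times maps a flag label to a wall-clock time in seconds.
    threshold is the minimum relative gain over the baseline (assumed 1%).
    """
    fastest = min(times, key=times.get)
    gain = (times[baseline] - times[fastest]) / times[baseline]
    # Gains under the threshold are treated as noise, so the baseline wins.
    return fastest if gain >= threshold else baseline


# Timings in seconds copied from two rows of the tables in this comment.
v1_12t_depth3 = {"default": 724.18, "Os": 903.17, "O2": 725.38,
                 "O3": 729.89, "O3-fmath": 723.23}
v2_1t_depth3 = {"default": 2477.49, "Os": 3316.81, "O2": 2437.84,
                "O3": 2342.69, "O3-fmath": 2412.35}

print(best_flag(v1_12t_depth3))  # default
print(best_flag(v2_1t_depth3))   # O3
```

In the first row, O3-fmath is the fastest but only by about 0.1%, so the default flags are reported as best.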
Summary (tl;dr)
LightGBM v2 benefits from `-O3 -march=native` (specifically from `-O3`). LightGBM v1 shows no visible benefit from any flags other than the current defaults. Depending on the model parameters, different flags give different performance (for example, the LightGBM v2 `-march=native` boost kicks in when building deeper trees, or when the overhead is low/large, as in 1-thread runs).
Therefore, the following recommendations could be made:
- `-O2 -mtune=core2` for LightGBM v1 for maximum performance.
- `-O3 -march=native` for LightGBM v2 for maximum performance.

I will follow up with more in the next month.
Bosch, 12 threads, LightGBM v1:
| Parameters | v1 + default | v1 + Os | v1 + O2 | v1 + O3 | v1 + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| depth=3 | 724.18s | 903.17s | 725.38s | 729.89s | 723.23s | default |
| depth=6 | 579.29s | 685.88s | 584.64s | 584.59s | 583.89s | default |
| depth=9 | 395.23s | 454.56s | 398.25s | 400.50s | 398.93s | default |
| depth=12 | 596.55s | 654.80s | 596.90s | 608.39s | 604.25s | default |
Bosch, 12 threads, LightGBM v2:
| Parameters | v2 + default | v2 + Os | v2 + O2 | v2 + O3 | v2 + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| depth=3 | 873.08s | 1104.39s | 861.57s | 861.99s | 872.17s | O2 |
| depth=6 | 730.06s | 872.77s | 724.59s | 722.88s | 724.98s | O3 |
| depth=9 | 567.59s | 634.52s | 570.66s | 556.12s | 614.80s | O3 |
| depth=12 | 854.97s | 923.84s | 845.12s | 834.60s | 847.38s | O3 |
Bosch, 6 threads, LightGBM v1:
| Parameters | v1 + default | v1 + Os | v1 + O2 | v1 + O3 | v1 + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| depth=3 | 913.44s | 1208.02s | 903.01s | 921.13s | 915.41s | O2 |
| depth=6 | 718.29s | 885.44s | 722.16s | 723.94s | 726.72s | default |
| depth=9 | 449.03s | 533.58s | 451.60s | 455.08s | 452.59s | default |
| depth=12 | 622.24s | 704.10s | 623.36s | 618.28s | 619.96s | O3 |
Bosch, 6 threads, LightGBM v2:
| Parameters | v2 + default | v2 + Os | v2 + O2 | v2 + O3 | v2 + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| depth=3 | 956.25s | 1248.24s | 965.32s | 969.56s | 975.95s | default |
| depth=6 | 787.95s | 952.82s | 795.35s | 782.70s | 788.41s | ??? |
| depth=9 | 548.84s | 639.46s | 546.65s | 547.61s | 547.05s | ??? |
| depth=12 | 770.47s | 862.75s | 766.49s | 773.30s | 762.61s | ??? |
Bosch, 1 thread, LightGBM v1:
| Parameters | v1 + default | v1 + Os | v1 + O2 | v1 + O3 | v1 + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| depth=3 | 2360.10s | 3314.84s | 2389.20s | 2406.67s | 2337.28s | O3-fmath |
| depth=6 | 1757.84s | 2335.01s | 1810.60s | 1816.25s | 1769.16s | default |
| depth=9 | 968.05s | 1250.17s | 994.99s | 1007.10s | 975.83s | default |
| depth=12 | 1202.59s | 1468.61s | 1238.31s | 1246.01s | 1216.62s | default |
Bosch, 1 thread, LightGBM v2:
| Parameters | v2 + default | v2 + Os | v2 + O2 | v2 + O3 | v2 + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| depth=3 | 2477.49s | 3316.81s | 2437.84s | 2342.69s | 2412.35s | O3 |
| depth=6 | 1850.66s | 2334.77s | 1830.01s | 1745.34s | 1799.20s | O3 |
| depth=9 | 1003.35s | 1243.15s | 990.65s | 954.06s | 970.39s | O3 |
| depth=12 | 1236.83s | 1469.03s | 1216.49s | 1159.22s | 1191.33s | O3 |
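To put the single-thread LightGBM v2 numbers in relative terms, here is a small Python sketch that computes how much faster O3 is than the default flags at each depth. The helper name `percent_faster` is made up for illustration, and the timings are copied from the table above.

```python
# Single-thread LightGBM v2 timings in seconds, copied from the table above.
default_s = {3: 2477.49, 6: 1850.66, 9: 1003.35, 12: 1236.83}
o3_s = {3: 2342.69, 6: 1745.34, 9: 954.06, 12: 1159.22}


def percent_faster(baseline, candidate):
    """Relative time saved by candidate versus baseline, in percent."""
    return 100.0 * (baseline - candidate) / baseline


gains = {depth: percent_faster(default_s[depth], o3_s[depth])
         for depth in default_s}

for depth, gain in gains.items():
    print(f"depth={depth}: O3 is {gain:.2f}% faster than default")
```

For these four rows the gain works out to roughly 5 to 6%, which is well above the ~1% significance cutoff used for the Best column.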
@Laurae2 Thanks for your benchmark 👍.
If a change to O3 is needed, you can create a PR for it.
@guolinke I'll open a PR to add a recommendation once I have some good charts ready and the mini-conference material is done (early next month). I'll link to it in the PR.
I also have xgboost benchmarks for comparison, do you want to see them? (I also have results for nthread={1, 2, 3, 4, 5, 6, 12} and depth={3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, but that gets very large for GitHub, so I plan to write a blog post about it instead.)
Sure. The comparison benchmarks are always welcome. It can help to find out which part we can further improve.
@guolinke Here are the results for xgboost:
- xgboost is at commit b4d97d3
- `default` means compiled with `-O2 -mtune=core2`
- `O3` means compiled with `-O3 -march=native -funroll-loops`
- `O3-fmath` means compiled with `-O3 -ffast-math -march=native -funroll-loops`
- `-funroll-loops` is added because it is xgboost's default (actually, I am not even seeing a difference with or without it)
- Because xgboost was "slow", I skipped `-O2 -march=native` and `-Os` (each full benchmark took 2 days per thread count, and the single-threaded run was very long)

To compare xgboost and LightGBM, the best approach is to copy and paste the tables into Excel (or anything similar) and make charts. See the end of this comment for the Excel table example.
[Charts not reproduced here: default run; default flag; -O3 flag; -O3 -ffast-math flag.]
Summary (tl;dr)
Configurations to choose from (the difference can be large depending on the case):
- `-O2 -mtune=core2`
- `-O3 -ffast-math -march=native -funroll-loops`
- `-O3 -march=native -funroll-loops`

See https://github.com/dmlc/xgboost/issues/1950 for more on xgboost implementation details.
More to come soon next month (on 10 May).
Bosch, 12 threads, xgboost depth-wise at b4d97d3:
| Parameters | dw + default | dw + O3 | dw + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: |
| depth=3 | 1049.86s | 1037.48s | 1026.85s | O3-fmath |
| depth=6 | 832.13s | 843.74s | 789.30s | O3-fmath |
| depth=9 | 790.78s | 799.14s | 788.94s | default |
| depth=12 | 1288.12s | 1303.58s | 1323.37s | default |
Bosch, 12 threads, xgboost loss guide at b4d97d3:
| Parameters | lg + default | lg + O3 | lg + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: |
| depth=3 | 1047.75s | 1042.41s | 1030.32s | O3-fmath |
| depth=6 | 844.80s | 841.92s | 838.87s | O3-fmath |
| depth=9 | 799.60s | 802.58s | 797.94s | default |
| depth=12 | 1263.58s | 1292.64s | 1330.31s | default |
Bosch, 6 threads, xgboost depth-wise at b4d97d3:
| Parameters | dw + default | dw + O3 | dw + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: |
| depth=3 | 1222.31s | 1194.40s | 1171.52s | O3-fmath |
| depth=6 | 865.96s | 866.79s | 833.08s | O3-fmath |
| depth=9 | 696.18s | 710.25s | 703.25s | default |
| depth=12 | 1036.29s | 1062.12s | 1070.23s | default |
Bosch, 6 threads, xgboost loss guide at b4d97d3:
| Parameters | lg + default | lg + O3 | lg + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: |
| depth=3 | 1215.27s | 1194.47s | 1176.07s | O3-fmath |
| depth=6 | 871.79s | 860.68s | 855.88s | O3-fmath |
| depth=9 | 717.43s | 714.81s | 705.16s | O3-fmath |
| depth=12 | 1061.09s | 1077.32s | 1089.91s | default |
Bosch, 1 thread, xgboost depth-wise at b4d97d3:
| Parameters | dw + default | dw + O3 | dw + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: |
| depth=3 | 3122.58s | 2719.62s | 2885.43s | O3 |
| depth=6 | 2076.36s | 1909.22s | 1967.32s | O3 |
| depth=9 | 1296.96s | 1215.27s | 1260.41s | O3 |
| depth=12 | 1684.07s | 1520.32s | 1577.45s | O3 |
Bosch, 1 thread, xgboost loss guide at b4d97d3:
| Parameters | lg + default | lg + O3 | lg + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: |
| depth=3 | 3032.19s | 2771.35s | 2944.40s | O3 |
| depth=6 | 2049.57s | 1941.74s | 1934.76s | O3-fmath |
| depth=9 | 1304.50s | 1208.21s | 1265.47s | O3 |
| depth=12 | 1571.86s | 1503.40s | 1615.36s | O3 |
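As an alternative to charting in Excel, the single-thread numbers can be compared directly. The Python sketch below divides the xgboost depth-wise O3 timings by the LightGBM v2 O3 timings from the tables in this thread. The variable names are illustrative only, and the two libraries were built with slightly different O3 flag sets (xgboost adds `-funroll-loops`), so treat the ratios as a rough comparison.

```python
# Single-thread timings in seconds, copied from the tables in this thread.
lightgbm_v2_o3 = {3: 2342.69, 6: 1745.34, 9: 954.06, 12: 1159.22}
xgboost_dw_o3 = {3: 2719.62, 6: 1909.22, 9: 1215.27, 12: 1520.32}

# A ratio above 1.0 means xgboost depth-wise took longer than LightGBM v2.
ratios = {depth: xgboost_dw_o3[depth] / lightgbm_v2_o3[depth]
          for depth in lightgbm_v2_o3}

for depth, ratio in ratios.items():
    print(f"depth={depth}: xgboost/LightGBM time ratio = {ratio:.3f}")
```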
Excel table example:
Copy & paste:
- `=INDEX($A$1:$AA$5,F8,G8)` on E8, then double click the small box at the bottom right of the cell to paste down
- `=NUMBERVALUE(LEFT(E8, LEN(E8)-1))` on D8, then double click the small box at the bottom right of the cell to paste down

| Model | Flag | Depth | Speed | CellVal | Row | Column |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| LightGBM v1 | default | 3 | 2360.1 | | 2 | 2 |
| LightGBM v1 | default | 6 | 1757.84 | | 3 | 2 |
| LightGBM v1 | default | 9 | 968.05 | | 4 | 2 |
| LightGBM v1 | default | 12 | 1202.59 | | 5 | 2 |
| LightGBM v1 | Os | 3 | 3314.84 | | 2 | 3 |
| LightGBM v1 | Os | 6 | 2335.01 | | 3 | 3 |
| LightGBM v1 | Os | 9 | 1250.17 | | 4 | 3 |
| LightGBM v1 | Os | 12 | 1468.61 | | 5 | 3 |
| LightGBM v1 | O2 | 3 | 2389.2 | | 2 | 4 |
| LightGBM v1 | O2 | 6 | 1810.6 | | 3 | 4 |
| LightGBM v1 | O2 | 9 | 994.99 | | 4 | 4 |
| LightGBM v1 | O2 | 12 | 1238.31 | | 5 | 4 |
| LightGBM v1 | O3 | 3 | 2406.67 | | 2 | 5 |
| LightGBM v1 | O3 | 6 | 1816.25 | | 3 | 5 |
| LightGBM v1 | O3 | 9 | 1007.1 | | 4 | 5 |
| LightGBM v1 | O3 | 12 | 1246.01 | | 5 | 5 |
| LightGBM v1 | O3-fmath | 3 | 2337.28 | | 2 | 6 |
| LightGBM v1 | O3-fmath | 6 | 1769.16 | | 3 | 6 |
| LightGBM v1 | O3-fmath | 9 | 975.83 | | 4 | 6 |
| LightGBM v1 | O3-fmath | 12 | 1216.62 | | 5 | 6 |
| LightGBM v2 | default | 3 | 2477.49 | | 2 | 10 |
| LightGBM v2 | default | 6 | 1850.66 | | 3 | 10 |
| LightGBM v2 | default | 9 | 1003.35 | | 4 | 10 |
| LightGBM v2 | default | 12 | 1236.83 | | 5 | 10 |
| LightGBM v2 | Os | 3 | 3316.81 | | 2 | 11 |
| LightGBM v2 | Os | 6 | 2334.77 | | 3 | 11 |
| LightGBM v2 | Os | 9 | 1243.15 | | 4 | 11 |
| LightGBM v2 | Os | 12 | 1469.03 | | 5 | 11 |
| LightGBM v2 | O2 | 3 | 2437.84 | | 2 | 12 |
| LightGBM v2 | O2 | 6 | 1830.01 | | 3 | 12 |
| LightGBM v2 | O2 | 9 | 990.65 | | 4 | 12 |
| LightGBM v2 | O2 | 12 | 1216.49 | | 5 | 12 |
| LightGBM v2 | O3 | 3 | 2342.69 | | 2 | 13 |
| LightGBM v2 | O3 | 6 | 1745.34 | | 3 | 13 |
| LightGBM v2 | O3 | 9 | 954.06 | | 4 | 13 |
| LightGBM v2 | O3 | 12 | 1159.22 | | 5 | 13 |
| LightGBM v2 | O3-fmath | 3 | 2412.35 | | 2 | 14 |
| LightGBM v2 | O3-fmath | 6 | 1799.2 | | 3 | 14 |
| LightGBM v2 | O3-fmath | 9 | 970.39 | | 4 | 14 |
| LightGBM v2 | O3-fmath | 12 | 1191.33 | | 5 | 14 |
| xgboost-depthwise | default | 3 | 3122.58 | | 2 | 18 |
| xgboost-depthwise | default | 6 | 2076.36 | | 3 | 18 |
| xgboost-depthwise | default | 9 | 1296.96 | | 4 | 18 |
| xgboost-depthwise | default | 12 | 1684.07 | | 5 | 18 |
| xgboost-depthwise | O3 | 3 | 2719.62 | | 2 | 19 |
| xgboost-depthwise | O3 | 6 | 1909.22 | | 3 | 19 |
| xgboost-depthwise | O3 | 9 | 1215.27 | | 4 | 19 |
| xgboost-depthwise | O3 | 12 | 1520.32 | | 5 | 19 |
| xgboost-depthwise | O3-fmath | 3 | 2885.43 | | 2 | 20 |
| xgboost-depthwise | O3-fmath | 6 | 1967.32 | | 3 | 20 |
| xgboost-depthwise | O3-fmath | 9 | 1260.41 | | 4 | 20 |
| xgboost-depthwise | O3-fmath | 12 | 1577.45 | | 5 | 20 |
| xgboost-lossguide | default | 3 | 3032.19 | | 2 | 24 |
| xgboost-lossguide | default | 6 | 2049.57 | | 3 | 24 |
| xgboost-lossguide | default | 9 | 1304.5 | | 4 | 24 |
| xgboost-lossguide | default | 12 | 1571.86 | | 5 | 24 |
| xgboost-lossguide | O3 | 3 | 2771.35 | | 2 | 25 |
| xgboost-lossguide | O3 | 6 | 1941.74 | | 3 | 25 |
| xgboost-lossguide | O3 | 9 | 1208.21 | | 4 | 25 |
| xgboost-lossguide | O3 | 12 | 1503.4 | | 5 | 25 |
| xgboost-lossguide | O3-fmath | 3 | 2944.4 | | 2 | 26 |
| xgboost-lossguide | O3-fmath | 6 | 1934.76 | | 3 | 26 |
| xgboost-lossguide | O3-fmath | 9 | 1265.47 | | 4 | 26 |
| xgboost-lossguide | O3-fmath | 12 | 1615.36 | | 5 | 26 |
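For anyone who prefers a script to Excel, the same reshaping can be sketched in Python. The function `parse_table` below is a hypothetical helper, not part of any library. It assumes the pipe-table layout used in this thread (a Parameters column first, a Best column last, headers such as `v1 + O3`, and cells such as `2360.10s`), and it strips the trailing `s` just as `NUMBERVALUE(LEFT(...))` does.

```python
def parse_table(markdown, model):
    """Turn one benchmark pipe table into (model, flag, depth, seconds) rows."""
    lines = [line.strip() for line in markdown.strip().splitlines()]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    records = []
    for line in lines[2:]:  # skip the header row and the alignment row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        depth = int(cells[0].split("=")[1])
        # Drop the first (Parameters) and last (Best) columns.
        for name, cell in zip(header[1:-1], cells[1:-1]):
            flag = name.split("+")[1].strip()
            records.append((model, flag, depth, float(cell[:-1])))
    return records


table = """
| Parameters | v1 + default | v1 + Os | v1 + O2 | v1 + O3 | v1 + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| depth=3 | 2360.10s | 3314.84s | 2389.20s | 2406.67s | 2337.28s | O3-fmath |
| depth=6 | 1757.84s | 2335.01s | 1810.60s | 1816.25s | 1769.16s | default |
"""

rows = parse_table(table, "LightGBM v1")
for row in rows:
    print(row)
```

The rows come out grouped by depth rather than by flag, so sort them first if the flag-major order of the Excel table is needed.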