Lightgbm: [R-package] Provide recommendation for mnative?

Created on 15 Mar 2017 · 9 Comments · Source: microsoft/LightGBM

@guolinke I am wondering whether recommending -march=native would yield better performance for those installing directly with install_github (the default in R is -mtune=core2).

For instance, the installation log from install_github on Windows shows that the build is tuned for the Core 2 architecture:

c:/Rtools/mingw_64/bin/g++ -m64 -std=c++0x -I"C:/PROGRA~1/MIE74D~1/RCLIEN~1/R_SERVER/include" -DNDEBUG -I../..//include -DUSE_SOCKET      -I"C:/swarm/workspace/External-R-3.3.2/vendor/extsoft/include"  -fopenmp -pthread -std=c++11   -O2 -Wall  -mtune=core2 -c lightgbm-all.cpp -o lightgbm-all.o
c:/Rtools/mingw_64/bin/g++ -m64 -std=c++0x -I"C:/PROGRA~1/MIE74D~1/RCLIEN~1/R_SERVER/include" -DNDEBUG -I../..//include -DUSE_SOCKET      -I"C:/swarm/workspace/External-R-3.3.2/vendor/extsoft/include"  -fopenmp -pthread -std=c++11   -O2 -Wall  -mtune=core2 -c lightgbm_R.cpp -o lightgbm_R.o
c:/Rtools/mingw_64/bin/g++ -m64 -shared -s -static-libgcc -o lightgbm.dll tmp.def ./lightgbm-all.o ./lightgbm_R.o -fopenmp -pthread -lws2_32 -liphlpapi -LC:/swarm/workspace/External-R-3.3.2/vendor/extsoft/lib/x64 -LC:/swarm/workspace/External-R-3.3.2/vendor/extsoft/lib -LC:/PROGRA~1/MIE74D~1/RCLIEN~1/R_SERVER/bin/x64 -lR

This would mean adding a note to the README.md of the R-package saying that -march=native can be added to maximize performance, but that it might break packages.

Regarding -O3 (if we were to push for even more), I know it is refused by CRAN for compatibility reasons (some packages break with -O3).

r-package

Laurae2

Most helpful comment

@guolinke Some results here. Not posting the exact details for the benchmark because there will be more at a mini-conference I am doing next month.

Settings:

v1 is LightGBM v1
v2 is LightGBM v2 @1bf7bbd
default means compiled with -O2 -mtune=core2
O3 means compiled with -O3 -march=native
O3-fmath means compiled with -O3 -ffast-math -march=native
O2 means compiled with -O2 -march=native
Os means compiled with -Os

Best means the compilation flags giving maximum speed. The default settings override all the others when the difference is not significant (<~1%) and not consistent (i.e. similar flags give diverging results).
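As a rough illustration of the "Best" rule above, here is a minimal Python sketch. It only models the ~1% significance threshold; the consistency check is a judgment call and is not modeled, so this will not reproduce every label in the tables (borderline rows around 1% can differ). The function name `pick_best` and the exact 1% cutoff are assumptions made for this example.

```python
def pick_best(times, default="default", threshold=0.01):
    """Return the flag set to report as 'Best' for one benchmark row.

    times: dict mapping flag-set name -> wall-clock seconds.
    The fastest flag set wins only if it beats the default by more than
    `threshold` (relative gain); otherwise the default is kept.
    """
    fastest = min(times, key=times.get)
    gain = (times[default] - times[fastest]) / times[default]
    return fastest if gain > threshold else default


# Bosch, 12 threads, LightGBM v1, depth=3 (values from the tables in this comment)
row_v1 = {"default": 724.18, "Os": 903.17, "O2": 725.38,
          "O3": 729.89, "O3-fmath": 723.23}
# Bosch, 1 thread, LightGBM v2, depth=3
row_v2 = {"default": 2477.49, "Os": 3316.81, "O2": 2437.84,
          "O3": 2342.69, "O3-fmath": 2412.35}

print(pick_best(row_v1))  # default (O3-fmath is faster, but by well under 1%)
print(pick_best(row_v2))  # O3
```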

CPU: i7-3930K
R + gcc 4.9

Summary (tl;dr)

We notice that LightGBM v2 benefits from -O3 -march=native (specifically from -O3). LightGBM v1 shows no visible benefit from any flags other than the current defaults. Depending on the model parameters, different flags give different performance (for example, the LightGBM v2 + -march=native boost kicks in when building deeper trees, or when the overhead is low/large, as in 1-thread runs).

Therefore, the following recommendations could be made:

-O2 -mtune=core2 for LightGBM v1 for maximum performance.
-O3 -march=native for LightGBM v2 for maximum performance.
When doing cross-validation of models, it is always better to run several processes with a small number of threads each (like 4 processes x 1 thread) than to run folds sequentially in a single multithreaded process (like 1 process x 4 threads), even though your RAM usage might explode.
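The cross-validation advice above could be sketched in Python roughly as follows. This is only an outline: `train_fold` is a hypothetical stand-in that returns a placeholder value instead of training anything, and `make_folds` and `run_cv` are names invented for this example. A real version would fit a model with its thread count set to 1 (for instance via LightGBM's `num_threads` parameter) inside `train_fold`.

```python
from concurrent.futures import ProcessPoolExecutor


def make_folds(n_rows, n_folds):
    """Split row indices 0..n_rows-1 into n_folds interleaved validation sets."""
    return [list(range(k, n_rows, n_folds)) for k in range(n_folds)]


def train_fold(job):
    """Hypothetical stand-in for training one fold with a single thread.

    Returns (fold_id, placeholder_score); the placeholder is just the
    size of the validation set, not a real metric.
    """
    fold_id, valid_idx = job
    return fold_id, len(valid_idx)


def run_cv(n_rows, n_folds, n_processes):
    """Run all folds concurrently, one single-threaded job per process."""
    jobs = list(enumerate(make_folds(n_rows, n_folds)))
    with ProcessPoolExecutor(max_workers=n_processes) as pool:
        return dict(pool.map(train_fold, jobs))


if __name__ == "__main__":
    # 4 processes x 1 thread instead of 1 process x 4 threads
    print(run_cv(n_rows=10, n_folds=4, n_processes=4))
```

Each worker process would normally hold its own copy of the training data, which is presumably where the RAM cost mentioned above comes from.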

I will follow up with more in the next month.

Bosch, 12 threads, LightGBM v1:

| Parameters | v1 + default | v1 + Os | v1 + O2 | v1 + O3 | v1 + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| depth=3 | 724.18s | 903.17s | 725.38s | 729.89s | 723.23s | default |
| depth=6 | 579.29s | 685.88s | 584.64s | 584.59s | 583.89s | default |
| depth=9 | 395.23s | 454.56s | 398.25s | 400.50s | 398.93s | default |
| depth=12 | 596.55s | 654.80s | 596.90s | 608.39s | 604.25s | default |

Bosch, 12 threads, LightGBM v2:

| Parameters | v2 + default | v2 + Os | v2 + O2 | v2 + O3 | v2 + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| depth=3 | 873.08s | 1104.39s | 861.57s | 861.99s | 872.17s | O2 |
| depth=6 | 730.06s | 872.77s | 724.59s | 722.88s | 724.98s | O3 |
| depth=9 | 567.59s | 634.52s | 570.66s | 556.12s | 614.80s | O3 |
| depth=12 | 854.97s | 923.84s | 845.12s | 834.60s | 847.38s | O3 |

Bosch, 6 threads, LightGBM v1:

| Parameters | v1 + default | v1 + Os | v1 + O2 | v1 + O3 | v1 + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| depth=3 | 913.44s | 1208.02s | 903.01s | 921.13s | 915.41s | O2 |
| depth=6 | 718.29s | 885.44s | 722.16s | 723.94s | 726.72s | default |
| depth=9 | 449.03s | 533.58s | 451.60s | 455.08s | 452.59s | default |
| depth=12 | 622.24s | 704.10s | 623.36s | 618.28s | 619.96s | O3 |

Bosch, 6 threads, LightGBM v2:

| Parameters | v2 + default | v2 + Os | v2 + O2 | v2 + O3 | v2 + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| depth=3 | 956.25s | 1248.24s | 965.32s | 969.56s | 975.95s | default |
| depth=6 | 787.95s | 952.82s | 795.35s | 782.70s | 788.41s | ??? |
| depth=9 | 548.84s | 639.46s | 546.65s | 547.61s | 547.05s | ??? |
| depth=12 | 770.47s | 862.75s | 766.49s | 773.30s | 762.61s | ??? |

Bosch, 1 thread, LightGBM v1:

| Parameters | v1 + default | v1 + Os | v1 + O2 | v1 + O3 | v1 + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| depth=3 | 2360.10s | 3314.84s | 2389.20s | 2406.67s | 2337.28s | O3-fmath |
| depth=6 | 1757.84s | 2335.01s | 1810.60s | 1816.25s | 1769.16s | default |
| depth=9 | 968.05s | 1250.17s | 994.99s | 1007.10s | 975.83s | default |
| depth=12 | 1202.59s | 1468.61s | 1238.31s | 1246.01s | 1216.62s | default |

Bosch, 1 thread, LightGBM v2:

| Parameters | v2 + default | v2 + Os | v2 + O2 | v2 + O3 | v2 + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| depth=3 | 2477.49s | 3316.81s | 2437.84s | 2342.69s | 2412.35s | O3 |
| depth=6 | 1850.66s | 2334.77s | 1830.01s | 1745.34s | 1799.20s | O3 |
| depth=9 | 1003.35s | 1243.15s | 990.65s | 954.06s | 970.39s | O3 |
| depth=12 | 1236.83s | 1469.03s | 1216.49s | 1159.22s | 1191.33s | O3 |
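To put the 1-thread LightGBM v2 numbers in relative terms, a small Python sketch can compute the percentage of wall-clock time saved by O3 over the default flags. The helper name `speedup_pct` is made up for this example; the timings are copied from the table above.

```python
def speedup_pct(baseline_s, candidate_s):
    """Percent reduction in wall-clock time relative to the baseline."""
    return 100.0 * (baseline_s - candidate_s) / baseline_s


# depth -> (default, O3) seconds for LightGBM v2, 1 thread
v2_one_thread = {
    3: (2477.49, 2342.69),
    6: (1850.66, 1745.34),
    9: (1003.35, 954.06),
    12: (1236.83, 1159.22),
}

for depth, (default_s, o3_s) in v2_one_thread.items():
    pct = speedup_pct(default_s, o3_s)
    print(f"depth={depth}: O3 saves {pct:.1f}% of the default time")
```

On these four rows the saving works out to roughly 5 to 6%, which is well above the ~1% noise threshold used for the Best column.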

Laurae2 on 1 Apr 2017

👍2

All 9 comments

@Laurae2, does that mean we should alter the C++ build rather than just the R library? I think we can make that a suggestion rather than a compulsory step.

chivee on 16 Mar 2017

@Laurae2
I remember the difference between O2 and O3 in LightGBM is very small.
You can try some benchmarks on this.

guolinke on 17 Mar 2017

@chivee no, this would just be a suggestion to users who want to achieve better local training speed. I'm not sure whether it has a major impact though; I'll test all of that thoroughly before I make a PR. As @guolinke said, there are only very small differences from the O2 vs. O3 flag alone.

@guolinke when I get time on my server I'll try O3 and march=native to see what happens to the speed. I'm collecting a lot of (long) benchmarks since last month on xgboost and LightGBM to understand their performance (in ranking predictions (AUC), and speed) behavior depending on parameters.

I'll get back here once my new benchmarks are done.

Laurae2 on 18 Mar 2017

@Laurae2 Thanks for your benchmark 👍 .
If a change to O3 is needed, you can create a PR for it.

guolinke on 1 Apr 2017

@guolinke I'll open a PR to add a recommendation once I have some good charts ready and the mini-conference material is ready (early next month). I'll link to it in the PR.

I also have xgboost benchmarks for comparison, do you want to see them? (I also have results for nthread={1, 2, 3, 4, 5, 6, 12} and depth={3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, but that gets very large for GitHub, so I plan to make a blog post about it instead.)

Laurae2 on 3 Apr 2017

Sure. The comparison benchmarks are always welcome. It can help to find out which part we can further improve.

guolinke on 3 Apr 2017

@guolinke Here for xgboost:

xgboost is at commit b4d97d3
default means compiled with -O2 -mtune=core2
O3 means compiled with -O3 -march=native -funroll-loops
O3-fmath means compiled with -O3 -ffast-math -march=native -funroll-loops
-funroll-loops is added because it is xgboost's default (actually, not even seeing a difference with or without)

xgboost was "slow", I skipped -O2 -march=native and -Os (it took 2 days for each full benchmark per thread count, the singlethreaded run was very long to do).

To compare xgboost and LightGBM, best is copy&paste into Excel (or anything similar) and make charts. See the end of this comment for the Excel table example.

[Charts not preserved: default run; default flag; -O3 flag; -O3 -ffast-math flag]

Summary (tl;dr)

Configuration to choose, difference might be large depending on case:

Deep trees and multithreading: -O2 -mtune=core2
Small trees and multithreading: -O3 -ffast-math -march=native -funroll-loops
No multithreading: -O3 -march=native -funroll-loops

See https://github.com/dmlc/xgboost/issues/1950 for more on xgboost implementation details.

More to come soon next month (on 10 May).

Bosch, 12 threads, xgboost depth-wise at b4d97d3:

| Parameters | dw + default | dw + O3 | dw + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: |
| depth=3 | 1049.86s | 1037.48s | 1026.85s | O3-fmath |
| depth=6 | 832.13s | 843.74s | 789.30s | O3-fmath |
| depth=9 | 790.78s | 799.14s | 788.94s | default |
| depth=12 | 1288.12s | 1303.58s | 1323.37s | default |

Bosch, 12 threads, xgboost loss guide at b4d97d3:

| Parameters | lg + default | lg + O3 | lg + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: |
| depth=3 | 1047.75s | 1042.41s | 1030.32s | O3-fmath |
| depth=6 | 844.80s | 841.92s | 838.87s | O3-fmath |
| depth=9 | 799.60s | 802.58s | 797.94s | default |
| depth=12 | 1263.58s | 1292.64s | 1330.31s | default |

Bosch, 6 threads, xgboost depth-wise at b4d97d3:

| Parameters | dw + default | dw + O3 | dw + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: |
| depth=3 | 1222.31s | 1194.40s | 1171.52s | O3-fmath |
| depth=6 | 865.96s | 866.79s | 833.08s | O3-fmath |
| depth=9 | 696.18s | 710.25s | 703.25s | default |
| depth=12 | 1036.29s | 1062.12s | 1070.23s | default |

Bosch, 6 threads, xgboost loss guide at b4d97d3:

| Parameters | lg + default | lg + O3 | lg + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: |
| depth=3 | 1215.27s | 1194.47s | 1176.07s | O3-fmath |
| depth=6 | 871.79s | 860.68s | 855.88s | O3-fmath |
| depth=9 | 717.43s | 714.81s | 705.16s | O3-fmath |
| depth=12 | 1061.09s | 1077.32s | 1089.91s | default |

Bosch, 1 thread, xgboost depth-wise at b4d97d3:

| Parameters | dw + default | dw + O3 | dw + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: |
| depth=3 | 3122.58s | 2719.62s | 2885.43s | O3 |
| depth=6 | 2076.36s | 1909.22s | 1967.32s | O3 |
| depth=9 | 1296.96s | 1215.27s | 1260.41s | O3 |
| depth=12 | 1684.07s | 1520.32s | 1577.45s | O3 |

Bosch, 1 thread, xgboost loss guide at b4d97d3:

| Parameters | lg + default | lg + O3 | lg + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: |
| depth=3 | 3032.19s | 2771.35s | 2944.40s | O3 |
| depth=6 | 2049.57s | 1941.74s | 1934.76s | O3-fmath |
| depth=9 | 1304.50s | 1208.21s | 1265.47s | O3 |
| depth=12 | 1571.86s | 1503.40s | 1615.36s | O3 |

Excel table example:

Copy & paste:

LightGBM v1 table with header: on A1
LightGBM v2 table with header: on I1
xgboost-depthwise table with header: on Q1
xgboost-lossguide table with header: on W1
Paste all the table below on A7
Paste formula =INDEX($A$1:$AA$5,F8,G8) on E8, then double click the small box at bottom right on the cell to paste down
Paste formula =NUMBERVALUE(LEFT(E8, LEN(E8)-1)) on D8, then double click the small box at bottom right on the cell to paste down
Do the charts you want (even a pivot chart if you want)

| Model | Flag | Depth | Speed | CellVal | Row | Column |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| LightGBM v1 | default | 3 | 2360.1 | | 2 | 2 |
| LightGBM v1 | default | 6 | 1757.84 | | 3 | 2 |
| LightGBM v1 | default | 9 | 968.05 | | 4 | 2 |
| LightGBM v1 | default | 12 | 1202.59 | | 5 | 2 |
| LightGBM v1 | Os | 3 | 3314.84 | | 2 | 3 |
| LightGBM v1 | Os | 6 | 2335.01 | | 3 | 3 |
| LightGBM v1 | Os | 9 | 1250.17 | | 4 | 3 |
| LightGBM v1 | Os | 12 | 1468.61 | | 5 | 3 |
| LightGBM v1 | O2 | 3 | 2389.2 | | 2 | 4 |
| LightGBM v1 | O2 | 6 | 1810.6 | | 3 | 4 |
| LightGBM v1 | O2 | 9 | 994.99 | | 4 | 4 |
| LightGBM v1 | O2 | 12 | 1238.31 | | 5 | 4 |
| LightGBM v1 | O3 | 3 | 2406.67 | | 2 | 5 |
| LightGBM v1 | O3 | 6 | 1816.25 | | 3 | 5 |
| LightGBM v1 | O3 | 9 | 1007.1 | | 4 | 5 |
| LightGBM v1 | O3 | 12 | 1246.01 | | 5 | 5 |
| LightGBM v1 | O3-fmath | 3 | 2337.28 | | 2 | 6 |
| LightGBM v1 | O3-fmath | 6 | 1769.16 | | 3 | 6 |
| LightGBM v1 | O3-fmath | 9 | 975.83 | | 4 | 6 |
| LightGBM v1 | O3-fmath | 12 | 1216.62 | | 5 | 6 |
| LightGBM v2 | default | 3 | 2477.49 | | 2 | 10 |
| LightGBM v2 | default | 6 | 1850.66 | | 3 | 10 |
| LightGBM v2 | default | 9 | 1003.35 | | 4 | 10 |
| LightGBM v2 | default | 12 | 1236.83 | | 5 | 10 |
| LightGBM v2 | Os | 3 | 3316.81 | | 2 | 11 |
| LightGBM v2 | Os | 6 | 2334.77 | | 3 | 11 |
| LightGBM v2 | Os | 9 | 1243.15 | | 4 | 11 |
| LightGBM v2 | Os | 12 | 1469.03 | | 5 | 11 |
| LightGBM v2 | O2 | 3 | 2437.84 | | 2 | 12 |
| LightGBM v2 | O2 | 6 | 1830.01 | | 3 | 12 |
| LightGBM v2 | O2 | 9 | 990.65 | | 4 | 12 |
| LightGBM v2 | O2 | 12 | 1216.49 | | 5 | 12 |
| LightGBM v2 | O3 | 3 | 2342.69 | | 2 | 13 |
| LightGBM v2 | O3 | 6 | 1745.34 | | 3 | 13 |
| LightGBM v2 | O3 | 9 | 954.06 | | 4 | 13 |
| LightGBM v2 | O3 | 12 | 1159.22 | | 5 | 13 |
| LightGBM v2 | O3-fmath | 3 | 2412.35 | | 2 | 14 |
| LightGBM v2 | O3-fmath | 6 | 1799.2 | | 3 | 14 |
| LightGBM v2 | O3-fmath | 9 | 970.39 | | 4 | 14 |
| LightGBM v2 | O3-fmath | 12 | 1191.33 | | 5 | 14 |
| xgboost-depthwise | default | 3 | 3122.58 | | 2 | 18 |
| xgboost-depthwise | default | 6 | 2076.36 | | 3 | 18 |
| xgboost-depthwise | default | 9 | 1296.96 | | 4 | 18 |
| xgboost-depthwise | default | 12 | 1684.07 | | 5 | 18 |
| xgboost-depthwise | O3 | 3 | 2719.62 | | 2 | 19 |
| xgboost-depthwise | O3 | 6 | 1909.22 | | 3 | 19 |
| xgboost-depthwise | O3 | 9 | 1215.27 | | 4 | 19 |
| xgboost-depthwise | O3 | 12 | 1520.32 | | 5 | 19 |
| xgboost-depthwise | O3-fmath | 3 | 2885.43 | | 2 | 20 |
| xgboost-depthwise | O3-fmath | 6 | 1967.32 | | 3 | 20 |
| xgboost-depthwise | O3-fmath | 9 | 1260.41 | | 4 | 20 |
| xgboost-depthwise | O3-fmath | 12 | 1577.45 | | 5 | 20 |
| xgboost-lossguide | default | 3 | 3032.19 | | 2 | 24 |
| xgboost-lossguide | default | 6 | 2049.57 | | 3 | 24 |
| xgboost-lossguide | default | 9 | 1304.5 | | 4 | 24 |
| xgboost-lossguide | default | 12 | 1571.86 | | 5 | 24 |
| xgboost-lossguide | O3 | 3 | 2771.35 | | 2 | 25 |
| xgboost-lossguide | O3 | 6 | 1941.74 | | 3 | 25 |
| xgboost-lossguide | O3 | 9 | 1208.21 | | 4 | 25 |
| xgboost-lossguide | O3 | 12 | 1503.4 | | 5 | 25 |
| xgboost-lossguide | O3-fmath | 3 | 2944.4 | | 2 | 26 |
| xgboost-lossguide | O3-fmath | 6 | 1934.76 | | 3 | 26 |
| xgboost-lossguide | O3-fmath | 9 | 1265.47 | | 4 | 26 |
| xgboost-lossguide | O3-fmath | 12 | 1615.36 | | 5 | 26 |
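For those who prefer a script over the Excel steps above, here is a rough Python equivalent that reshapes one of the wide benchmark tables into long rows. It is only a sketch: `parse_seconds` and `table_to_long` are names invented for this example, it assumes the table layout used in this thread (first column `depth=N`, last column `Best`, timings suffixed with `s`), and it emits rows ordered by depth then flag rather than by flag then depth as in the table above.

```python
def parse_seconds(cell):
    """'2360.10s' -> 2360.1, same idea as NUMBERVALUE(LEFT(E8, LEN(E8)-1))."""
    return float(cell.strip().rstrip("s"))


def table_to_long(model, markdown_table):
    """Turn one wide benchmark table into (model, flag, depth, seconds) rows."""
    lines = [ln.strip() for ln in markdown_table.strip().splitlines()]
    rows = [[c.strip() for c in ln.strip("|").split("|")] for ln in lines]
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator
    out = []
    for row in body:
        depth = int(row[0].split("=")[1])
        # skip the first column (Parameters) and the last column (Best)
        for name, cell in zip(header[1:-1], row[1:-1]):
            flag = name.split("+")[-1].strip()  # 'v1 + Os' -> 'Os'
            out.append((model, flag, depth, parse_seconds(cell)))
    return out


# Two columns of the 1-thread LightGBM v1 table, as a small sample
sample = """
| Parameters | v1 + default | v1 + Os | Best |
| --- | ---: | ---: | ---: |
| depth=3 | 2360.10s | 3314.84s | default |
| depth=6 | 1757.84s | 2335.01s | default |
"""

for record in table_to_long("LightGBM v1", sample):
    print(record)
```

The resulting tuples can then be fed to any charting tool in place of the pivot chart.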