Lightgbm: [R-package] Provide recommendation for mnative?

Created on 15 Mar 2017 · 9 comments · Source: microsoft/LightGBM

@guolinke I am just wondering if recommending `-march=native` can yield better performance for those installing directly from install_github (the default is `-mtune=core2` in R).

Installation log when installing via install_github on Windows, for instance; we can see it is tuned for the Core 2 architecture:

c:/Rtools/mingw_64/bin/g++ -m64 -std=c++0x -I"C:/PROGRA~1/MIE74D~1/RCLIEN~1/R_SERVER/include" -DNDEBUG -I../..//include -DUSE_SOCKET      -I"C:/swarm/workspace/External-R-3.3.2/vendor/extsoft/include"  -fopenmp -pthread -std=c++11   -O2 -Wall  -mtune=core2 -c lightgbm-all.cpp -o lightgbm-all.o
c:/Rtools/mingw_64/bin/g++ -m64 -std=c++0x -I"C:/PROGRA~1/MIE74D~1/RCLIEN~1/R_SERVER/include" -DNDEBUG -I../..//include -DUSE_SOCKET      -I"C:/swarm/workspace/External-R-3.3.2/vendor/extsoft/include"  -fopenmp -pthread -std=c++11   -O2 -Wall  -mtune=core2 -c lightgbm_R.cpp -o lightgbm_R.o
c:/Rtools/mingw_64/bin/g++ -m64 -shared -s -static-libgcc -o lightgbm.dll tmp.def ./lightgbm-all.o ./lightgbm_R.o -fopenmp -pthread -lws2_32 -liphlpapi -LC:/swarm/workspace/External-R-3.3.2/vendor/extsoft/lib/x64 -LC:/swarm/workspace/External-R-3.3.2/vendor/extsoft/lib -LC:/PROGRA~1/MIE74D~1/RCLIEN~1/R_SERVER/bin/x64 -lR

This would require a note in the README.md of the R-package saying that, to maximize performance, `-march=native` can be added, though it might break packages.

Regarding -O3 (if we were to push further), I know it is refused by CRAN for compatibility reasons (some packages break with -O3).
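For users who want to try this locally without touching the package, a user-level `~/.R/Makevars` override is one option (a sketch, not an official recommendation; note that `-march=native` may break other packages compiled afterwards):

```make
# ~/.R/Makevars -- picked up by R CMD INSTALL / devtools::install_github
# CAUTION: these flags apply to every package you compile from now on;
# remove them after installing if other packages misbehave.
CXXFLAGS   = -O2 -Wall -march=native
CXX11FLAGS = -O2 -Wall -march=native
```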

r-package


All 9 comments

@Laurae2, does that mean we should alter the C++ build rather than just the R libraries? I think we can make that a suggestion rather than a compulsory step.

@Laurae2
I remember the difference between O2 and O3 in LightGBM is very small.
You can try some benchmarks on this.

@chivee no, this would just be a suggestion for users who want to achieve better local training speed. I'm not sure if it has a major impact, though; I'll test all that thoroughly before I make a PR. As @guolinke said, there are very small differences for the O2 vs. O3 flag alone.

@guolinke when I get time on my server I'll try -O3 and -march=native to see what happens to the speed. Since last month I have been collecting a lot of (long) benchmarks on xgboost and LightGBM to understand their performance behavior (ranking quality (AUC) and speed) depending on parameters.

I'll get back here once my new benchmarks are done.

@guolinke Some results here. I am not posting the exact details of the benchmark because there will be more at a mini-conference I am giving next month.

Settings:

  • v1 is LightGBM v1
  • v2 is LightGBM v2 @1bf7bbd
  • default means compiled with -O2 -mtune=core2
  • O3 means compiled with -O3 -march=native
  • O3-fmath means compiled with -O3 -ffast-math -march=native
  • O2 means compiled with -O2 -march=native
  • Os means compiled with -Os

"Best" means the fastest set of compilation flags, with the default winning whenever the difference is not significant (<~1%) or not consistent (similar flags giving divergent results).

  • CPU: i7-3930K
  • R + gcc 4.9
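The "Best" selection rule can be sketched as code (my own paraphrase, not part of the benchmark; it covers only the ~1% significance threshold, while the consistency check was done by eye):

```python
def best_flag(timings, default="default", threshold=0.01):
    """Pick the fastest flag set, but keep the default unless the winner
    beats it by more than the ~1% significance threshold."""
    winner = min(timings, key=timings.get)
    if winner == default:
        return default
    # relative improvement of the winner over the default run
    gain = (timings[default] - timings[winner]) / timings[default]
    return winner if gain > threshold else default

# e.g. the v2, 12 threads, depth=9 row from the tables below:
row = {"default": 567.59, "Os": 634.52, "O2": 570.66, "O3": 556.12, "O3-fmath": 614.80}
print(best_flag(row))  # O3, matching the table
```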

Summary (tl;dr)

We notice that LightGBM v2 benefits from -O3 -march=native (specifically, -O3 improves performance). LightGBM v1 currently shows no visible benefit from any flags other than the defaults. Depending on the model parameters, different flags yield different performance (for example, the LightGBM v2 -march=native boost kicks in when building deeper trees, or when the threading overhead is small, as in 1-thread runs).

Therefore, the following recommendations could be made:

  • -O2 -mtune=core2 for LightGBM v1 for maximum performance.
  • -O3 -march=native for LightGBM v2 for maximum performance.
  • When cross-validating models, it is always better to run several processes each with a small number of threads (e.g. 4 processes x 1 thread) than a single multithreaded process running folds sequentially (e.g. 1 process x 4 threads), even though your RAM usage might explode.
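The last point is the usual "processes over threads" pattern; here is a generic Python sketch of it (the `train_fold` body is a stand-in for one single-threaded LightGBM run, not real training code):

```python
import os
from multiprocessing import Pool

def train_fold(fold):
    # Stand-in for one 1-thread LightGBM run on one CV fold; with the CLI
    # you would pass num_threads=1 and pin OMP_NUM_THREADS=1 per process.
    os.environ["OMP_NUM_THREADS"] = "1"
    return "fold %d done" % fold

if __name__ == "__main__":
    # 4 folds as 4 concurrent 1-thread workers (4x process 1-thread),
    # instead of 1 process with 4 threads running the folds sequentially
    with Pool(processes=4) as pool:
        print(pool.map(train_fold, [1, 2, 3, 4]))
```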

I will follow up with more in the next month.


Bosch, 12 threads, LightGBM v1:

| Parameters | v1 + default | v1 + Os | v1 + O2 | v1 + O3 | v1 + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| depth=3 | 724.18s | 903.17s | 725.38s | 729.89s | 723.23s | default |
| depth=6 | 579.29s | 685.88s | 584.64s | 584.59s | 583.89s | default |
| depth=9 | 395.23s | 454.56s | 398.25s | 400.50s | 398.93s | default |
| depth=12 | 596.55s | 654.80s | 596.90s | 608.39s | 604.25s | default |


Bosch, 12 threads, LightGBM v2:

| Parameters | v2 + default | v2 + Os | v2 + O2 | v2 + O3 | v2 + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| depth=3 | 873.08s | 1104.39s | 861.57s | 861.99s | 872.17s | O2 |
| depth=6 | 730.06s | 872.77s | 724.59s | 722.88s | 724.98s | O3 |
| depth=9 | 567.59s | 634.52s | 570.66s | 556.12s | 614.80s | O3 |
| depth=12 | 854.97s | 923.84s | 845.12s | 834.60s | 847.38s | O3 |


Bosch, 6 threads, LightGBM v1:

| Parameters | v1 + default | v1 + Os | v1 + O2 | v1 + O3 | v1 + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| depth=3 | 913.44s | 1208.02s | 903.01s | 921.13s | 915.41s | O2 |
| depth=6 | 718.29s | 885.44s | 722.16s | 723.94s | 726.72s | default |
| depth=9 | 449.03s | 533.58s | 451.60s | 455.08s | 452.59s | default |
| depth=12 | 622.24s | 704.10s | 623.36s | 618.28s | 619.96s | O3 |


Bosch, 6 threads, LightGBM v2:

| Parameters | v2 + default | v2 + Os | v2 + O2 | v2 + O3 | v2 + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| depth=3 | 956.25s | 1248.24s | 965.32s | 969.56s | 975.95s | default |
| depth=6 | 787.95s | 952.82s | 795.35s | 782.70s | 788.41s | ??? |
| depth=9 | 548.84s | 639.46s | 546.65s | 547.61s | 547.05s | ??? |
| depth=12 | 770.47s | 862.75s | 766.49s | 773.30s | 762.61s | ??? |


Bosch, 1 thread, LightGBM v1:

| Parameters | v1 + default | v1 + Os | v1 + O2 | v1 + O3 | v1 + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| depth=3 | 2360.10s | 3314.84s | 2389.20s | 2406.67s | 2337.28s | O3-fmath |
| depth=6 | 1757.84s | 2335.01s | 1810.60s | 1816.25s | 1769.16s | default |
| depth=9 | 968.05s | 1250.17s | 994.99s | 1007.10s | 975.83s | default |
| depth=12 | 1202.59s | 1468.61s | 1238.31s | 1246.01s | 1216.62s | default |


Bosch, 1 thread, LightGBM v2:

| Parameters | v2 + default | v2 + Os | v2 + O2 | v2 + O3 | v2 + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| depth=3 | 2477.49s | 3316.81s | 2437.84s | 2342.69s | 2412.35s | O3 |
| depth=6 | 1850.66s | 2334.77s | 1830.01s | 1745.34s | 1799.20s | O3 |
| depth=9 | 1003.35s | 1243.15s | 990.65s | 954.06s | 970.39s | O3 |
| depth=12 | 1236.83s | 1469.03s | 1216.49s | 1159.22s | 1191.33s | O3 |

@Laurae2 Thanks for your benchmark 👍.
If a change to O3 is needed, you can create a PR for it.

@guolinke I'll open a PR to add a recommendation once I have some good charts ready and the mini-conference material is done (early next month); I'll link to it in the PR.

I also have xgboost benchmarks for comparison; do you want to see them? (I also have nthread={1, 2, 3, 4, 5, 6, 12} and depth={3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, but that gets very large for GitHub, so I plan to write a blog post on it instead.)

Sure. Comparison benchmarks are always welcome. They can help us find out which parts we can further improve.

@guolinke Here for xgboost:

  • xgboost is at commit b4d97d3
  • default means compiled with -O2 -mtune=core2
  • O3 means compiled with -O3 -march=native -funroll-loops
  • O3-fmath means compiled with -O3 -ffast-math -march=native -funroll-loops
  • -funroll-loops is added because it is xgboost's default (in practice I see no difference with or without it)

xgboost was "slow", so I skipped -O2 -march=native and -Os (each full benchmark took 2 days per thread count; the single-threaded run was very long).

To compare xgboost and LightGBM, the easiest way is to copy & paste into Excel (or anything similar) and make charts. See the end of this comment for the Excel table example.

Default run: [image]

Default flag: [image] [image]

-O3 flag: [image] [image]

-O3 -ffast-math flag: [image] [image]


Summary (tl;dr)

Configuration to choose, difference might be large depending on case:

  • Deep trees and multithreading: -O2 -mtune=core2
  • Small trees and multithreading: -O3 -ffast-math -march=native -funroll-loops
  • No multithreading: -O3 -march=native -funroll-loops

See https://github.com/dmlc/xgboost/issues/1950 for more on xgboost implementation details.

More to come next month (on 10 May).


Bosch, 12 threads, xgboost depth-wise at b4d97d3:

| Parameters | dw + default | dw + O3 | dw + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: |
| depth=3 | 1049.86s | 1037.48s | 1026.85s | O3-fmath |
| depth=6 | 832.13s | 843.74s | 789.30s | O3-fmath |
| depth=9 | 790.78s | 799.14s | 788.94s | default |
| depth=12 | 1288.12s | 1303.58s | 1323.37s | default |


Bosch, 12 threads, xgboost loss guide at b4d97d3:

| Parameters | lg + default | lg + O3 | lg + O3-fmath | Best |
| --- | ---: | ---: | ---: |
| depth=3 | 1047.75s | 1042.41s | 1030.32s | O3-fmath |
| depth=6 | 844.80s | 841.92s | 838.87s | O3-fmath |
| depth=9 | 799.60s | 802.58s | 797.94s | default |
| depth=12 | 1263.58s | 1292.64s | 1330.31s | default |


Bosch, 6 threads, xgboost depth-wise at b4d97d3:

| Parameters | dw + default | dw + O3 | dw + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: |
| depth=3 | 1222.31s | 1194.40s | 1171.52s | O3-fmath |
| depth=6 | 865.96s | 866.79s | 833.08s | O3-fmath |
| depth=9 | 696.18s | 710.25s | 703.25s | default |
| depth=12 | 1036.29s | 1062.12s | 1070.23s | default |


Bosch, 6 threads, xgboost loss guide at b4d97d3:

| Parameters | lg + default | lg + O3 | lg + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: |
| depth=3 | 1215.27s | 1194.47s | 1176.07s | O3-fmath |
| depth=6 | 871.79s | 860.68s | 855.88s | O3-fmath |
| depth=9 | 717.43s | 714.81s | 705.16s | O3-fmath |
| depth=12 | 1061.09s | 1077.32s | 1089.91s | default |


Bosch, 1 thread, xgboost depth-wise at b4d97d3:

| Parameters | dw + default | dw + O3 | dw + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: |
| depth=3 | 3122.58s | 2719.62s | 2885.43s | O3 |
| depth=6 | 2076.36s | 1909.22s | 1967.32s | O3 |
| depth=9 | 1296.96s | 1215.27s | 1260.41s | O3 |
| depth=12 | 1684.07s | 1520.32s | 1577.45s | O3 |


Bosch, 1 thread, xgboost loss guide at b4d97d3:

| Parameters | lg + default | lg + O3 | lg + O3-fmath | Best |
| --- | ---: | ---: | ---: | ---: |
| depth=3 | 3032.19s | 2771.35s | 2944.40s | O3 |
| depth=6 | 2049.57s | 1941.74s | 1934.76s | O3-fmath |
| depth=9 | 1304.50s | 1208.21s | 1265.47s | O3 |
| depth=12 | 1571.86s | 1503.40s | 1615.36s | O3 |


Excel table example:

Copy & paste:

  • LightGBM v1 table with header: on A1
  • LightGBM v2 table with header: on I1
  • xgboost-depthwise table with header: on Q1
  • xgboost-lossguide table with header: on W1
  • Paste the whole table below starting at A7
  • Paste the formula =INDEX($A$1:$AA$5,F8,G8) into E8, then double-click the small box at the bottom right of the cell to fill down
  • Paste the formula =NUMBERVALUE(LEFT(E8, LEN(E8)-1)) into D8, then double-click the small box at the bottom right of the cell to fill down
  • Build whatever charts you want (even a pivot chart)
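The same reshaping can be done in a few lines of Python instead of Excel (a minimal sketch on two rows of the v1 table; stripping the trailing "s" mirrors the `NUMBERVALUE(LEFT(...))` formula):

```python
import pandas as pd

# Two rows of "Bosch, 1 thread, LightGBM v1", exactly as posted (times end in "s")
wide = pd.DataFrame({
    "Parameters": ["depth=3", "depth=6"],
    "default": ["2360.10s", "1757.84s"],
    "O3": ["2406.67s", "1816.25s"],
})

# Wide -> long, then strip the trailing "s" and convert to a number
long_df = wide.melt(id_vars="Parameters", var_name="Flag", value_name="Speed")
long_df["Speed"] = long_df["Speed"].str.rstrip("s").astype(float)
print(long_df)
```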

| Model | Flag | Depth | Speed | CellVal | Row | Column |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| LightGBM v1 | default | 3 | 2360.1 | | 2 | 2 |
| LightGBM v1 | default | 6 | 1757.84 | | 3 | 2 |
| LightGBM v1 | default | 9 | 968.05 | | 4 | 2 |
| LightGBM v1 | default | 12 | 1202.59 | | 5 | 2 |
| LightGBM v1 | Os | 3 | 3314.84 | | 2 | 3 |
| LightGBM v1 | Os | 6 | 2335.01 | | 3 | 3 |
| LightGBM v1 | Os | 9 | 1250.17 | | 4 | 3 |
| LightGBM v1 | Os | 12 | 1468.61 | | 5 | 3 |
| LightGBM v1 | O2 | 3 | 2389.2 | | 2 | 4 |
| LightGBM v1 | O2 | 6 | 1810.6 | | 3 | 4 |
| LightGBM v1 | O2 | 9 | 994.99 | | 4 | 4 |
| LightGBM v1 | O2 | 12 | 1238.31 | | 5 | 4 |
| LightGBM v1 | O3 | 3 | 2406.67 | | 2 | 5 |
| LightGBM v1 | O3 | 6 | 1816.25 | | 3 | 5 |
| LightGBM v1 | O3 | 9 | 1007.1 | | 4 | 5 |
| LightGBM v1 | O3 | 12 | 1246.01 | | 5 | 5 |
| LightGBM v1 | O3-fmath | 3 | 2337.28 | | 2 | 6 |
| LightGBM v1 | O3-fmath | 6 | 1769.16 | | 3 | 6 |
| LightGBM v1 | O3-fmath | 9 | 975.83 | | 4 | 6 |
| LightGBM v1 | O3-fmath | 12 | 1216.62 | | 5 | 6 |
| LightGBM v2 | default | 3 | 2477.49 | | 2 | 10 |
| LightGBM v2 | default | 6 | 1850.66 | | 3 | 10 |
| LightGBM v2 | default | 9 | 1003.35 | | 4 | 10 |
| LightGBM v2 | default | 12 | 1236.83 | | 5 | 10 |
| LightGBM v2 | Os | 3 | 3316.81 | | 2 | 11 |
| LightGBM v2 | Os | 6 | 2334.77 | | 3 | 11 |
| LightGBM v2 | Os | 9 | 1243.15 | | 4 | 11 |
| LightGBM v2 | Os | 12 | 1469.03 | | 5 | 11 |
| LightGBM v2 | O2 | 3 | 2437.84 | | 2 | 12 |
| LightGBM v2 | O2 | 6 | 1830.01 | | 3 | 12 |
| LightGBM v2 | O2 | 9 | 990.65 | | 4 | 12 |
| LightGBM v2 | O2 | 12 | 1216.49 | | 5 | 12 |
| LightGBM v2 | O3 | 3 | 2342.69 | | 2 | 13 |
| LightGBM v2 | O3 | 6 | 1745.34 | | 3 | 13 |
| LightGBM v2 | O3 | 9 | 954.06 | | 4 | 13 |
| LightGBM v2 | O3 | 12 | 1159.22 | | 5 | 13 |
| LightGBM v2 | O3-fmath | 3 | 2412.35 | | 2 | 14 |
| LightGBM v2 | O3-fmath | 6 | 1799.2 | | 3 | 14 |
| LightGBM v2 | O3-fmath | 9 | 970.39 | | 4 | 14 |
| LightGBM v2 | O3-fmath | 12 | 1191.33 | | 5 | 14 |
| xgboost-depthwise | default | 3 | 3122.58 | | 2 | 18 |
| xgboost-depthwise | default | 6 | 2076.36 | | 3 | 18 |
| xgboost-depthwise | default | 9 | 1296.96 | | 4 | 18 |
| xgboost-depthwise | default | 12 | 1684.07 | | 5 | 18 |
| xgboost-depthwise | O3 | 3 | 2719.62 | | 2 | 19 |
| xgboost-depthwise | O3 | 6 | 1909.22 | | 3 | 19 |
| xgboost-depthwise | O3 | 9 | 1215.27 | | 4 | 19 |
| xgboost-depthwise | O3 | 12 | 1520.32 | | 5 | 19 |
| xgboost-depthwise | O3-fmath | 3 | 2885.43 | | 2 | 20 |
| xgboost-depthwise | O3-fmath | 6 | 1967.32 | | 3 | 20 |
| xgboost-depthwise | O3-fmath | 9 | 1260.41 | | 4 | 20 |
| xgboost-depthwise | O3-fmath | 12 | 1577.45 | | 5 | 20 |
| xgboost-lossguide | default | 3 | 3032.19 | | 2 | 24 |
| xgboost-lossguide | default | 6 | 2049.57 | | 3 | 24 |
| xgboost-lossguide | default | 9 | 1304.5 | | 4 | 24 |
| xgboost-lossguide | default | 12 | 1571.86 | | 5 | 24 |
| xgboost-lossguide | O3 | 3 | 2771.35 | | 2 | 25 |
| xgboost-lossguide | O3 | 6 | 1941.74 | | 3 | 25 |
| xgboost-lossguide | O3 | 9 | 1208.21 | | 4 | 25 |
| xgboost-lossguide | O3 | 12 | 1503.4 | | 5 | 25 |
| xgboost-lossguide | O3-fmath | 3 | 2944.4 | | 2 | 26 |
| xgboost-lossguide | O3-fmath | 6 | 1934.76 | | 3 | 26 |
| xgboost-lossguide | O3-fmath | 9 | 1265.47 | | 4 | 26 |
| xgboost-lossguide | O3-fmath | 12 | 1615.36 | | 5 | 26 |
