Lightgbm: GPU support on ppc64le

Created on 23 Mar 2020 · 18 Comments · Source: microsoft/LightGBM

Currently there is no way to execute LightGBM on ppc64le with GPU since it only has OpenCL support. What would be the correct approach to add CUDA code along with OpenCL on LightGBM? If I understand correctly, adding CUDA kernels to treelearner would be enough right?
There are other people interested in this; see #1905 and #1129, for example.

Most helpful comment

We are almost ready to push the initial candidate for the CUDA branch.

All 18 comments

cc @ceseo

Our plan is to contribute the CUDA kernels (histogram16/64/256) to treelearner and create a common API for OpenCL and CUDA (a common template class), so it minimizes the amount of changes needed in the current code base.

At first, CUDA would be conditioned to ppc64le/linux only.

Does that look reasonable?

Thanks @ceseo and @everton1984, the contribution is very welcome!
ping @huanzhang12 for the detailed discussion.

@huanzhang12 As @ceseo mentioned, we want to add a new set of histogram kernels written in CUDA, use the current gputreelearner as a base class, and add specialisations for the OpenCL- and CUDA-specific code. I think this would require minimal changes to code outside gputreelearner. Would that be acceptable? Are you aware of any GPU code outside that part that we might have missed?

@everton1984 It will be a great contribution if you can add CUDA equivalent kernels to LightGBM! Yes you can inherit the current gputreelearner as the base class, but currently many functions in GPU tree learner involve memory management in OpenCL (e.g., enqueue_write_buffer, enqueue_unmap_buffer, etc), so most of the methods cannot be directly reused. I suggest making an abstraction layer for memory management (e.g., use a wrapper to handle memory related function calls, and forward the call to OpenCL/CUDA backend accordingly), so that many parts of existing gputreelearner can be reused.

@huanzhang12 My idea was to implement something like

#include <memory>
#include <iostream>

class GPUTreeLearner
{
public:
    GPUTreeLearner() = default;
    // virtual destructor: instances are deleted through a base-class pointer
    virtual ~GPUTreeLearner() = default;

    /* common GPU code */
    virtual void doStuff() {
    }

    void A() {
        std::cout << "GPU common code\n";
    }
};

class OCLTreeLearner : public GPUTreeLearner
{
public:
    /* OpenCL specific code */
    void doStuff() override {
        std::cout << "OCL only code\n";
    }
};

class CUDATreeLearner : public GPUTreeLearner
{
public:
    /* CUDA specific code */
    void doStuff() override {
        std::cout << "CUDA only code\n";
    }
};

class GPUTreeLearnerFactory
{
public:
    static std::unique_ptr<GPUTreeLearner> GetGPU(bool is_cuda)
    {
        if (is_cuda) {
            return std::make_unique<CUDATreeLearner>();  // requires C++14
        } else {
            return std::make_unique<OCLTreeLearner>();
        }
    }
};

int main() {
    std::unique_ptr<GPUTreeLearner> pcl = GPUTreeLearnerFactory::GetGPU(false);
    std::unique_ptr<GPUTreeLearner> pcu = GPUTreeLearnerFactory::GetGPU(true);
    pcl->A();
    pcl->doStuff();
    pcu->A();
    pcu->doStuff();

    return 0;
}

so GPUTreeLearner would keep the common code. I think this approach is easily maintainable and changes very little code outside GPUTreeLearner.

One current limitation of our CUDA code is that it cannot coexist (i.e., be compiled at the same time) with the GPU / OCL code (similar to the constraint that the CPU and GPU / OCL paths cannot coexist). Do you see a need for both simultaneously in the same build? If so, hopefully we can overcome this.

We also assume that parity between GPU / OCL and CUDA is expected, which to us means single-precision 16-, 64- and 256-bin histogram support. Our initial CUDA version will probably also have double-precision histograms, so it would be nice for someone with GPU / OCL expertise to add that functionality to the project.

@everton1984 The implementation skeleton looks great! I suggested an abstraction layer on memory management as it may help reduce code duplication. But you can determine what is the best approach to do this.

@ChipKerchner The OpenCL implementation does support double precision. It can be enabled using the flag gpu_use_dp=true. Double precision can be very slow on NVIDIA GPUs though unless using the very expensive "datacenter" GPUs (P100, V100, etc).
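
For reference, the double-precision OpenCL path mentioned above can be enabled from a LightGBM config file along these lines (a minimal sketch; the file name and data path are placeholders):

```
# train.conf (illustrative)
task = train
data = train.txt
device = gpu
gpu_use_dp = true
```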

@huanzhang12 the original motivation of this port was exactly running on Tesla V100s on POWER9 systems. Do you have any objections to starting the port with double-precision support only at first? Most ppc64le systems support only server-grade GPUs anyway.

@huanzhang12 we have also encountered the need for histograms with bin size >256, and we have plans for supporting histograms with larger bins. I think the OpenCL implementation would benefit from this capability as well.

@bordaw This is not relevant to the current issue; we're discussing enabling GPU support on ppc64le. Please create another issue to discuss new features.

@ceseo @everton1984 Don't worry about single/double precision too much at this time; as long as we have the double-precision version working, it is not difficult to add single-precision support. If you look at the current GPU code, there are only a few places with different code paths for single and double precision, and the differences between the two are not that big. But it is better to keep single precision in mind when designing the code workflow to ease further extensions.

@bordaw It is possible to support large bins, but performance can be much worse: when the bin size is at most 256, the histograms fit in the GPU's local memory, which has much higher bandwidth. If you would like to work on this, you are very welcome to open a new issue.

@huanzhang12 Yes, I will create another issue. Also, I agree that the code changes required for single/double precision histogram are not significant.

We are almost ready to push the initial candidate for the CUDA branch.

The pull request has been active for almost 3 weeks and we're not getting much feedback.

I'd like some assistance with the continuous integration systems. They report problems, but my testing on an AMD64 system with an (OpenCL) GPU shows no issues in the unit tests. I am unable to reproduce an environment that allows me to run the automated testing. Please help!
@huanzhang12 @guolinke @StrikerRUS

sorry @ChipKerchner , I will look at this today.

@huanzhang12 Our resources are starting to be moved off this project. We would like to get this feature approved and integrated into master ASAP if possible. Please let us know any outstanding issues.

Implemented in #3160. Many thanks to all involved for the amazing work! We are looking forward to further CUDA improvements, e.g. Windows support.
