I'm trying to understand how the data-parallel algorithm works in LightGBM.
One main communication point is in the function FindBestSplits, where each worker populates a part of input_buffer_.
This input_buffer_ is later reduce-scattered among the workers.
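For anyone following along, my mental model of that step is something like the sketch below (toy in-process code with invented names, not LightGBM's actual network code): every worker contributes a full-length buffer, and after the reduce-scatter each worker owns the element-wise sum of one block.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy in-process model of a reduce-scatter over histogram buffers
// (invented names, not LightGBM's collective implementation): every worker
// contributes a full-length buffer; afterwards worker w owns only its block
// [block_start[w], block_start[w] + block_len[w]), which now holds the
// element-wise sum of that range over all workers.
std::vector<std::vector<double>> ReduceScatter(
    const std::vector<std::vector<double>>& per_worker_buffers,
    const std::vector<std::size_t>& block_start,
    const std::vector<std::size_t>& block_len) {
  const std::size_t num_workers = per_worker_buffers.size();
  std::vector<std::vector<double>> result(num_workers);
  for (std::size_t w = 0; w < num_workers; ++w) {
    result[w].assign(block_len[w], 0.0);
    for (std::size_t i = 0; i < block_len[w]; ++i) {
      // Sum the same index range from every worker's buffer. This only
      // aggregates matching features if all workers agree on the buffer
      // layout, which is exactly the question below.
      for (std::size_t src = 0; src < num_workers; ++src) {
        result[w][i] += per_worker_buffers[src][block_start[w] + i];
      }
    }
  }
  return result;
}
```

After the collective, each worker would then search for best splits only among the features whose aggregated histograms landed in its block.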
What I don't understand is the discrepancy between the number of bins for each feature, depending on which source we use.
The "canonical count", as far as I understand, is:
auto num_bin = this->train_data_->FeatureNumBin(fid);
as used to populate buffer_write_start_pos_ on each worker.
However, when copying data from smaller_leaf_histogram_array_ into input_buffer_, each worker copies this->smaller_leaf_histogram_array_[feature_index].SizeOfHistgram() bytes for a particular feature.
What I've observed for sparse datasets is that SizeOfHistgram() returns a different value on each worker, which also differs from this->train_data_->FeatureNumBin(fid).
This means that each worker copies a different amount of data to input_buffer_. What I don't understand then is how this doesn't cause issues when we reduce input_buffer_: the same feature corresponds to different byte indexes on different workers, and yet the buffers seem to be aggregated together based on block_len_.
Is block_len_ different on different workers? The feature bin mappers should be the same across workers. Did you check them using the same fid?
block_len_ is the same between workers.
What I've noticed is that this->smaller_leaf_histogram_array_[feature_index].SizeOfHistgram() returns different values for the same feature_id on different workers. Here's a reprex:
git clone --recursive https://github.com/microsoft/LightGBM.git
cd LightGBM
wget https://gist.githubusercontent.com/thvasilo/d19e8284077dc6fbe2810ff9f62ea410/raw/4e52caccd4d83279abbedb2dbb030cdf96dd5829/bin-printouts.patch
git apply bin-printouts.patch
mkdir build && cd build
cmake -DUSE_MPI=ON ..
make -j `nproc`
cd ../examples/parallel_learning
wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/real-sim.bz2
bunzip2 real-sim.bz2
mpiexec -np 2 ../../lightgbm config=train.conf data=real-sim num_trees=1 tree_learner=data min_data_in_leaf=1 num_threads=1 output_model=tree-real-sim.txt
Truncated output:
[LightGBM] [Info] Number of positive: 22238, number of negative: 50071
[LightGBM] [Info] Number of positive: 22238, number of negative: 50071
[LightGBM] [Info] Total Bins 507494
[LightGBM] [Info] Total Bins 507494
[LightGBM] [Info] Number of data: 35931, number of used features: 20958
[LightGBM] [Info] Number of data: 36378, number of used features: 20958
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.306671 -> initscore=-0.815729
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.308401 -> initscore=-0.807607
[LightGBM] [Info] Start training from score -0.811668
[LightGBM] [Info] Start training from score -0.811668
1: Inner feature of 15000: 16087
0: Inner feature of 15000: 6518
1: Examined feature number of bins: 3
1: Examined feature number of bins from data: 4
1: Examined feature buffer write position: 3400248
0: Examined feature number of bins: 9
0: Examined feature number of bins from data: 10
0: Examined feature buffer write position: 9270408
I guess what I'm missing is that on different workers feature_index in the for (int feature_index = 0; feature_index < this->num_features_; ++feature_index) loop corresponds to different "actual" features, as evidenced by the different inner feature values?
I thought that the contents of buffer_write_start_pos_ were the same on every worker, but I guess that's not the case? Is buffer_write_start_pos_ a mapping from "inner" feature id to a memory offset in input_buffer_?
So different workers write out the bins for, say, the feature with "inner" index 5 to different parts of input_buffer_? If that is the case, given that block_len_ and block_start_ are the same between workers, how does the reduction ensure that the bins for the same feature get aggregated together?
Thanks @thvasilo .
This actually is correct. The buffer is aligned with the real_feature_index, so instead of the inner_feature_index you should examine the real_feature_index.
+ int inner_feature_index = this->train_data_->InnerFeatureIndex(examined_feature);
+ printf("%d: Inner feature of %d: %d\n", rank_, examined_feature, inner_feature_index);
This is wrong: the index into smaller_leaf_histogram_array_ is the inner index, not the real index.
The following is the right way to get the real_feature_index:
+ if (train_data_->RealFeatureIndex(feature_index) == examined_feature) {
+ printf("%d: Examined feature number of bins: %d\n", rank_, num_bins_for_feature);
+ printf("%d: Examined feature number of bins from data: %d\n", rank_, num_bins_data);
+ printf("%d: Examined feature buffer write position: %d\n", rank_, buffer_write_start_pos_[feature_index]);
+ }
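To illustrate the distinction (a toy model with invented names, not LightGBM's classes): each worker numbers its used features 0..num_features_-1 as "inner" indices, and the mapping back to "real" dataset columns can differ from worker to worker.

```cpp
#include <cassert>
#include <unordered_map>
#include <utility>
#include <vector>

// Toy inner/real feature index mapping (hypothetical class, mirroring the
// idea only): each worker numbers the features it uses 0..n-1 ("inner"
// indices) and keeps a map back to the dataset's "real" column indices.
struct WorkerFeatureMap {
  std::vector<int> real_feature_index_;       // inner -> real
  std::unordered_map<int, int> inner_index_;  // real -> inner

  explicit WorkerFeatureMap(std::vector<int> inner_to_real)
      : real_feature_index_(std::move(inner_to_real)) {
    for (int inner = 0; inner < static_cast<int>(real_feature_index_.size());
         ++inner) {
      inner_index_[real_feature_index_[inner]] = inner;
    }
  }
  int RealFeatureIndex(int inner) const { return real_feature_index_[inner]; }
  // Returns -1 if this worker does not use the real feature at all.
  int InnerFeatureIndex(int real) const {
    auto it = inner_index_.find(real);
    return it == inner_index_.end() ? -1 : it->second;
  }
};
```

With this model, two workers can map the same real feature to different inner indices, which is exactly what the printouts above show for real feature 15000.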
Hello @guolinke, thank you for the explanation, and sorry for taking up more of your time.
I tried using this but I'm still getting different numbers from the two sources, so I must still be using a wrong index.
Given the following snippet:
for (int feature_index = 0; feature_index < this->num_features_; ++feature_index) {
  if (!this->is_feature_used_.empty() && !this->is_feature_used_[feature_index]) {
    continue;
  }
  const int examined_feature = 15000;
  FeatureHistogram& hist_for_feature = this->smaller_leaf_histogram_array_[feature_index];
  int num_bins_by_hist_size = hist_for_feature.SizeOfHistgram() / sizeof(HistogramBinEntry);
  if (this->train_data_->RealFeatureIndex(feature_index) == examined_feature) {
    int num_bins_data = this->train_data_->FeatureNumBin(examined_feature);
    printf("%d: For feature_index :%d, real feature is %d.\n", rank_, feature_index, this->train_data_->RealFeatureIndex(feature_index));
    printf("%d: Examined feature number of bins from hist size: %d\n", rank_, num_bins_by_hist_size);
    printf("%d: Examined feature number of bins from data: %d\n", rank_, num_bins_data);
    printf("%d: Examined feature buffer write position: %d\n", rank_, buffer_write_start_pos_[feature_index]);
  }
}
What changes should I make to ensure that num_bins_by_hist_size == num_bins_data?
What I'm seeing now is this:
0: For feature_index :6518, real feature is 15000.
0: Examined feature number of bins from hist size: 3
0: Examined feature number of bins from data: 10
0: Examined feature buffer write position: 3993408
1: For feature_index :16087, real feature is 15000.
1: Examined feature number of bins from hist size: 3
1: Examined feature number of bins from data: 4
1: Examined feature buffer write position: 3993408
What I'm trying to do is use a custom communication collective that performs the histogram aggregation, but right now I think I'm aggregating the histograms of different features. My current assumption is that buffer_write_start_pos_[feature_index] returns the same offset on both workers, which is true, but apparently it corresponds to different "real" features.
So on one worker we might start writing at position 5 in the buffer for fid=10, and the other worker will also write at the same offset, but fid=10 means different things on each worker. Hence one worker might write 10 bins' worth of data (corresponding to real_feature=7) starting at position 5, while another might write 3 bins' worth of data corresponding to real_feature=3.
buffer_write_start_pos_[i] for any i is the same on each worker, but it corresponds to different "real" features.
@thvasilo
For int num_bins_data = this->train_data_->FeatureNumBin(examined_feature); you should use the feature_index.
For most operations in the code, you should use the inner_feature_index.
> What I'm trying to do is use a custom communication collective that performs the histogram aggregation, but right now I think I'm aggregating the histograms of different features. That's because my current assumption is that buffer_write_start_pos_[feature_index] returns the same offset on both workers which is true, but it corresponds to different "real" features apparently.
> So on one worker we might start writing at position 5 in the buffer for fid=10, and the other worker will also write in the same offset, but fid=10 means different things for each worker. Hence one worker might write 10 bins worth of data (corresponding to real_feature=7) starting at position 5, while another might write 3 bins worth of data corresponding to real_feature=3.
> buffer_write_start_pos_[i] for any i is the same for each worker, but it corresponds to different "real" features.
The feature_index here refers to the inner_feature_index. Different workers can have a different real-feature-index to inner-feature-index mapping, but buffer_write_start_pos_ is aligned by the real-feature-index. This alignment is done in BeforeTrain.
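A sketch of that alignment idea (hypothetical helper, not the actual BeforeTrain code): if every worker computes its write offsets from the same per-real-feature bin counts, iterated in real-feature order, the offsets agree across workers regardless of how each worker's inner indices are ordered.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical sketch of offset alignment by real feature index (not
// LightGBM code): derive per-feature write offsets from a globally agreed
// table of bin counts keyed by real feature index. std::map iterates keys
// in sorted order, so every worker walks the features in the same real-index
// order and therefore computes identical offsets for a given real feature.
std::map<int, std::size_t> AlignedWriteStartPos(
    const std::map<int, std::size_t>& bins_per_real_feature,
    std::size_t bytes_per_bin) {
  std::map<int, std::size_t> start_pos;
  std::size_t offset = 0;
  for (const auto& kv : bins_per_real_feature) {
    start_pos[kv.first] = offset;           // where this real feature starts
    offset += kv.second * bytes_per_bin;    // advance past its histogram
  }
  return start_pos;
}
```

The key point is that the offsets depend only on shared global information, never on the worker-local inner ordering.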
OK I think I've almost got it now.
> for int num_bins_data = this->train_data_->FeatureNumBin(examined_feature);
> you should use the feature_index.
Using the above, I consistently see num_bins_data == num_bins_by_hist_size + 1, i.e. int num_bins_data = this->train_data_->FeatureNumBin(feature_index); always returns a number of bins equal to what I get from SizeOfHistgram() plus one (whether that's 3 vs. 4, 128 vs. 129, or 2 vs. 3).
Is the difference between the two related to #2539? Most likely these features are sparse and hence make use of the default bin, so is the difference of 1 between the two expected?
I tried verifying that by a simple conditional print statement:
if (this->train_data_->FeatureBinMapper(feature_index)->GetDefaultBin() == 0) {
  printf("%d: Real feature %d uses the default bin.\n", rank_, this->train_data_->RealFeatureIndex(feature_index));
}
This was indeed true for all the features I tried, and all of them had this off-by-one difference.
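My reading of that off-by-one, as a sketch (hypothetical helper, not the actual LightGBM code): when a sparse feature's default bin is 0, that bin's statistics can be reconstructed from the parent totals, so the stored histogram can drop it and keep one entry fewer than FeatureNumBin reports.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch of the off-by-one between the dataset's bin count and
// the stored histogram size (my reading, not LightGBM's exact logic): when
// the default bin is 0, the histogram omits that bin, since its sums can be
// derived by subtracting all other bins from the leaf totals.
std::size_t StoredHistogramBins(std::size_t num_bin_from_data,
                                bool default_bin_is_zero) {
  return default_bin_is_zero ? num_bin_from_data - 1 : num_bin_from_data;
}
```

Under this reading, a feature reporting 10 bins from the data would store 9 histogram entries, matching the consistent plus-one gap observed above.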
@thvasilo yes, each worker will optimize its own feature bin mappers based on its own data partition.
Can this be closed?
Hello Nikita,
I haven't had the chance to work on my side of the project yet; once I manage to replicate the results of the scatter gather using another collective I'll close this.
On Fri, Nov 15, 2019, 20:04 Nikita Titov wrote:
> Can this be closed?
Tried this out today and it seems like I've managed to properly align the buffers, so I'm able to replicate the results of the scatter gather.
I'm using a map-reduce approach to gather all the histograms, and the best solution I've found is to use buffer_write_start_pos_ as the reduction key, which lets me sidestep all the confusion with inner/real feature indices.
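The keying idea can be sketched like this (toy types and invented names, not the actual collective implementation): each worker emits (write offset, local histogram) pairs and the reducer sums the values that share a key. Since buffer_write_start_pos_ is aligned by real feature index, the offset alone identifies a feature consistently across workers.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Toy map-reduce histogram aggregation keyed by buffer write offset
// (invented types, not the real collective): values sharing a key are
// summed element-wise, so histograms for the same real feature are merged
// even though workers may use different inner feature indices.
using KeyedHistograms = std::map<std::size_t, std::vector<double>>;

KeyedHistograms ReduceByOffset(const std::vector<KeyedHistograms>& per_worker) {
  KeyedHistograms merged;
  for (const auto& worker : per_worker) {
    for (const auto& kv : worker) {
      auto& acc = merged[kv.first];
      if (acc.empty()) acc.assign(kv.second.size(), 0.0);
      // Element-wise sum of this worker's histogram into the accumulator.
      for (std::size_t i = 0; i < kv.second.size(); ++i) acc[i] += kv.second[i];
    }
  }
  return merged;
}
```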
Closing this; I'll probably have to come back with further questions later on, as I move on to replicating the split finding step.
Thanks for all the help @guolinke, really appreciate you taking the time to help out.