Hi,
I have just started learning about XGBoost models. I used SHAP as a feature selection tool for my XGBoost prediction model. After obtaining the feature importances, I noticed that the SHAP values of some features are equal to zero. After investigating in more detail, I found that the features with zero SHAP values are collinear. For example, if features A and B are highly correlated, the SHAP value of B is set to zero.
However, from my understanding, the Shapley value uses cooperative game theory to compute the contribution of each signal. So, in the case of multicollinearity, shouldn't A and B have the same Shapley value?
Also, if the Shapley value can handle multicollinearity, how does SHAP select one signal and ignore the other correlated features?
Have you tried computing the shap interaction values for your model?
Hi @mchirsa5, the question you ask about correlated features is a tricky one, and the phenomenon you observe (collinear features being assigned a SHAP importance score of zero) is actually quite common (and, arguably, problematic) in the field of explainable AI/interpretable ML: it is sometimes referred to by the name of correlation bias.
SHAP is a post-hoc model-agnostic interpretability method that uses Shapley values from game theory to estimate the predictive importance (i.e. SHAP score) of the features of a machine learning model. When using SHAP, remember that we are explaining the machine learning model (rather than the data!!!). Correlation bias occurs because of how the machine learning algorithm trains the model, not because of how SHAP estimates feature importance. For example, when presented with a group of highly correlated features {A,B,C}, the machine learning algorithm will assign a large weight to one arbitrary representative of the group, let's say A. Indeed, features B and C hardly provide any more information than A does (they are very similarly associated with the prediction target labels). Features B and C are redundant. The model's decision-making process will therefore rely heavily on feature A, but not on features B and C. Consequently, features B and C will score poorly when SHAP is used to explain the model. If SHAP were to assign high importance scores to features B and C, it would not be faithful to the model!
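To make the point about faithfulness concrete, here is a minimal sketch in plain Python (it does not use the `shap` library). It computes exact Shapley values by brute force for two hypothetical hand-written models on data where B is an exact copy of A. The models, the data, and the choice of a marginal value function (features outside the coalition are filled in from background rows) are all assumptions made for illustration; they are not how XGBoost or `shap` work internally.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, background):
    """Exact Shapley values of `model` at point `x`.

    Features outside a coalition are filled in from the rows of `background`
    (a marginal value function, chosen here for simplicity).
    """
    n = len(x)

    def value(coalition):
        # Average model output with coalition features fixed to x.
        total = 0.0
        for row in background:
            mixed = [x[j] if j in coalition else row[j] for j in range(n)]
            total += model(mixed)
        return total / len(background)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def model_a_only(x):
    # Hypothetical model that picked A as the representative and ignores B.
    return 3.0 * x[0]


def model_shared(x):
    # Hypothetical model that spreads the same signal evenly over A and B.
    return 1.5 * x[0] + 1.5 * x[1]


# Toy data: feature B (index 1) is an exact copy of feature A (index 0).
background = [[float(v), float(v)] for v in range(5)]
x = [4.0, 4.0]

phi_a_only = shapley_values(model_a_only, x, background)
phi_shared = shapley_values(model_shared, x, background)
print("model using A only:", phi_a_only)   # B receives no credit
print("model using A and B:", phi_shared)  # A and B receive equal credit
```

Both models make identical predictions on this data, yet the attributions differ: equal Shapley values for A and B only appear when the model itself treats them symmetrically.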
@alexcoca Sorry for my delayed response. I haven't tried plotting the SHAP interaction values yet. I will try plotting them. Thank you for the recommendation :)
@LEMTideman Thank you for your response. Please correct me if I misunderstood it. I do apologize for asking such a basic question. I'm very new to machine learning. So, the reason why the redundant variables are set to zero is that they are not used by the machine learning model.
@mchirsa5: the SHAP values of redundant variables are zero because the model does not depend on them (these variables have no influence on the model's output)
@LEMTideman I see. Thank you so much! I have one more question related to interpretability. I've tried to compare the results between SHAP and correlation analysis. Spearman's correlation coefficient was calculated between the features (input signals) and the target (predicted value).
What I have found is that some variables that were selected by SHAP have a very low correlation coefficient with the predicted value. Do you know why SHAP selects signals with very low correlation coefficients? As I understand it, SHAP measures the contribution of a signal to the output variable. However, if a signal has a very low correlation coefficient, shouldn't it contribute less?
Hi @mchirsa5, you are comparing two different approaches. By calculating the Spearman's correlation coefficient between the features and target, you are studying the data. By calculating the Shapley values using SHAP, you are studying the model.
SHAP will tell you which features were considered important by the machine learning algorithm used to train your XGBoost model. SHAP measures the influence that each feature has on the XGBoost model's prediction, which is not (necessarily) the same thing as measuring correlation. Spearman's correlation coefficient only takes monotonic relationships between variables into account, whereas SHAP can also account for non-linear non-monotonic relationships between variables. Depending on your XGBoost model, it is possible for a feature whose Spearman's correlation coefficient is small to have a large impact on the model's decision-making process. For example, a feature that is not highly correlated with the target may add useful information to the prediction task, assuming the presence of several other features (hence, in a specific subset of the feature space).
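A small sketch of the non-monotonic case, again in plain Python and under stated assumptions: the target is a hypothetical `y = x**2` on a symmetric range, and the "model" is assumed to have learned that relationship perfectly. With a single feature, the Shapley value of that feature is simply the prediction minus the average prediction, so no SHAP library is needed to see the effect.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(a, b):
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    # Spearman's coefficient is Pearson's coefficient computed on ranks.
    return pearson(average_ranks(a), average_ranks(b))


def model(x):
    # Hypothetical model assumed to have learned the target exactly.
    return x * x


xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [x * x for x in xs]  # non-monotonic relationship

rho = spearman(xs, ys)

# Single-feature case: Shapley value = prediction - average prediction.
preds = [model(x) for x in xs]
base_value = sum(preds) / len(preds)
shap_like = [p - base_value for p in preds]
mean_abs_shap = sum(abs(s) for s in shap_like) / len(shap_like)

print("Spearman correlation:", rho)
print("mean |Shapley value|:", mean_abs_shap)
```

The rank correlation is zero because the relationship is symmetric rather than monotonic, while the feature fully determines the model's output, so its attribution is far from zero.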
@LEMTideman Thanks for the clarification!!
@dushyant-007 I'm a little late to the party, but anyway. As @LEMTideman pointed out, SHAP values are a measure of the importance of a feature in relation to the model. This is different from the true importance of a feature.
Consider a dataset that contains an index column, and suppose that the model overfits and learns to use the index value to make a prediction. You can expect the SHAP values of the index to be through the roof, as that feature alone is responsible for the prediction. However, out-of-sample accuracy will be rock bottom because that is not a feature that is useful for actual prediction. If you remove the index feature from the training data, you will train a new model that will have a different SHAP profile. And in this case, test accuracy should improve.
Shap is not a measure of "how important a given feature is in the real world", it is simply "how important a feature is to the model". If your model is not trained properly (for instance, when data cleanup is not employed), then the shap values will be distant from what you would expect. On the other hand, if you guarantee that your model is correct, then shap values should be a good approximation of what reality is like.
@GZuin thanks for the comment. Actually, there was a problem with my code. So here's what happened -
I created the model, and it performs great. I standardized the training set and did the same for the test set, so accuracy was high. But when passing the data to SHAP, I passed the original (unstandardized) values. Since the friction factor had values from 0.01 to 45, it somehow still mattered, while the rest didn't (for the model), hence the result.
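For anyone hitting the same issue, here is a rough sketch of why the preprocessing mismatch distorts attributions. It uses a hypothetical linear stand-in model (not XGBoost) with equal weights on two standardized features, and relies on the fact that for a linear model with a marginal value function, the Shapley value of feature i is `w_i * (x_i - mean(x_i))`. The `roughness` column, the weights, and the data are invented for illustration; only the 0.01 to 45 range of the friction factor comes from the description above.

```python
from statistics import mean, pstdev

# Hypothetical raw training data: [friction, roughness].
train_raw = [
    [0.01, 0.001],
    [15.0, 0.002],
    [30.0, 0.003],
    [45.0, 0.004],
]

# Scaler statistics are fitted on the training set only.
columns = list(zip(*train_raw))
means = [mean(c) for c in columns]
stds = [pstdev(c) for c in columns]


def standardize(row):
    return [(v - m) / s for v, m, s in zip(row, means, stds)]


# Hypothetical linear model trained on standardized inputs.
WEIGHTS = [1.0, 1.0]


def linear_contributions(row, background):
    # Linear-model Shapley values: w_i * (x_i - mean of x_i in background).
    bg_means = [mean(c) for c in zip(*background)]
    return [w * (v - m) for w, v, m in zip(WEIGHTS, row, bg_means)]


def shares(contribs):
    total = sum(abs(c) for c in contribs)
    return [abs(c) / total for c in contribs]


train_scaled = [standardize(r) for r in train_raw]
row_raw = train_raw[3]

# Consistent: the explainer sees the same scaled data as the model.
consistent = shares(linear_contributions(standardize(row_raw), train_scaled))
# Mismatched: the explainer is given raw values instead.
mismatched = shares(linear_contributions(row_raw, train_raw))

print("share of attribution (scaled inputs):", consistent)
print("share of attribution (raw inputs):   ", mismatched)
```

With scaled inputs both features receive roughly equal credit, whereas with raw inputs the large-valued friction factor absorbs almost all of it. The practical takeaway is to pass the explainer data that went through exactly the same transformation as the data the model was trained on.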