If I had features that belonged to a particular group A and other features which belonged to a group B would taking the sum of their SHAP values indicate how much group A or group B influenced the decision of the model?
The short answer is yes! The longer answer is that there are some small ways that this makes slightly different assumptions than treating each group as its own feature in the first place, but those differences are typically not important.
Thank you!
Sorry, is there anywhere I can read about these assumptions, or could you elaborate a little more?
The assumption with Shapley values is that each feature should be treated equally. For example, consider a model with four features A, B, C and D. If there is an interaction effect between B, C and D of impact 6, then that value gets split evenly between all three features, leading to a value of 2 for each (A is not involved, so it gets 0). However, if we group A, B and C together into an ABC "super feature", then the interaction effect gets split two ways instead of three, leading to a value of 3 being given to both ABC and D. If instead of grouping the features we just summed up the credit given to the individual features, then we would get 0+2+2 = 4 for ABC and 2 for D.
In practice this does not make a big difference, but it is good to know about. Adding together the values of features to get the importance of a group is a very reasonable thing to do; it is just slightly different from the result you would get by treating the features as a group when computing the Shapley values.
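The four-feature example above can be checked by brute force. The following is a minimal sketch, not part of the shap library: `shapley_values`, `v_features` and `v_grouped` are hypothetical names, and the "model" is just the toy payout of 6 that fires when B, C and D are all present.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal
    contribution over every ordering of the players."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


# Toy game (assumption): payout of 6 only when B, C and D are all present.
def v_features(coalition):
    return 6.0 if {"B", "C", "D"} <= coalition else 0.0


# Same game after merging A, B and C into a single "ABC" player.
def v_grouped(coalition):
    return 6.0 if {"ABC", "D"} <= coalition else 0.0


individual = shapley_values(["A", "B", "C", "D"], v_features)
grouped = shapley_values(["ABC", "D"], v_grouped)

# Summing after the fact vs. grouping before computing.
summed_abc = individual["A"] + individual["B"] + individual["C"]
print(individual)   # per-feature credit
print(summed_abc)   # credit for ABC when summing afterwards
print(grouped)      # credit when ABC is treated as one feature
```

Summing afterwards gives ABC a credit of 4 and D a credit of 2, while grouping first gives 3 to each, matching the numbers in the explanation.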
It sounds interesting.
I wonder whether this approach also improves computation time, since the number of features is reduced, or whether the credit given to the group is calculated _after_ the standard computations (treating A, B and C separately)?
Grouping beforehand is faster because it reduces the number of features, and so is an option for KernelExplainer. Adding afterwards is very convenient, and is what I suggest for the other explainers.
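Adding afterwards only needs the array of per-feature SHAP values that an explainer returns. Here is a minimal sketch, assuming a 2-D array of shape `(n_samples, n_features)`; the helper name `sum_shap_by_group`, the group layout and the numbers are all made up for illustration.

```python
import numpy as np


def sum_shap_by_group(shap_values, groups):
    """Sum per-feature attributions into per-group attributions.

    shap_values: array of shape (n_samples, n_features)
    groups: dict mapping a group name to a list of column indices
    """
    shap_values = np.asarray(shap_values)
    return {name: shap_values[:, cols].sum(axis=1)
            for name, cols in groups.items()}


# Made-up attributions for 2 samples and 4 features (A, B, C, D).
shap_values = np.array([[0.0, 2.0, 2.0, 2.0],
                        [0.5, -1.0, 0.25, 1.0]])
groups = {"ABC": [0, 1, 2], "D": [3]}

group_values = sum_shap_by_group(shap_values, groups)
print(group_values)
```

As long as every feature belongs to exactly one group, the group values for a sample still add up to the same total as the individual values, so the usual additive reading of the attributions is preserved.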
So the concern is that if the group is formed beforehand, there might be a slightly skewed Shapley value, as in your example. If the Shapley values are calculated for the individual features and then summed up to reflect the groups, this wouldn't be a problem, as it would match the case with ABC equal to 4 and D equal to 2. Isn't that the desired result, if Shapley values work that way?
On a side note, does grouping beforehand give KernelExplainer any advantage other than faster results? How would you propose grouping such numerical features, or would that depend on the data more than anything else?
I don't think either way is wrong; they just enforce fair allocation at different levels. Consider the same idea in a more tangible example involving US states and people (where each state is a group of people), and assume that some payout only happens if everyone in both states agrees (a big AND function). If the two states are Texas and Wyoming, then computing the Shapley values at the person level will give most of the credit to Texas (since it has a much larger population). If we instead compute the Shapley values at the state level, the credit will be split evenly between the two states. This is an extreme example, but both cases could be reasonable in different scenarios (which is why in the US the Senate is allocated evenly among states, and the House is allocated evenly among people).
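The state example can be worked through the same way on a tiny scale. This is only a sketch: the populations (five people for Texas, one for Wyoming) and all names are hypothetical stand-ins, and the payout is a plain AND over everyone.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values by averaging marginal contributions
    over every ordering of the players."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


# Hypothetical tiny populations standing in for the two states.
residents = {"Texas": ["t1", "t2", "t3", "t4", "t5"], "Wyoming": ["w1"]}
everyone = frozenset(p for people in residents.values() for p in people)


def v_people(coalition):
    # Payout of 1 only if every single person agrees.
    return 1.0 if everyone <= coalition else 0.0


def v_states(coalition):
    # Same payout, but each state acts as one player.
    return 1.0 if frozenset(residents) <= coalition else 0.0


# Person level: compute per person, then sum within each state.
person_level = shapley_values(sorted(everyone), v_people)
credit_by_person = {state: sum(person_level[p] for p in people)
                    for state, people in residents.items()}

# State level: each state is a single player.
credit_by_state = shapley_values(list(residents), v_states)

print(credit_by_person)
print(credit_by_state)
```

At the person level, Texas ends up with five sixths of the credit because each person gets an equal share; at the state level, the two states split the credit in half.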
As for KernelExplainer one extra nice thing is that it can group things together that don't make sense to perturb independently (such as many computed features from the same time series).
I see your point with the state and people example. Well put!
How does KernelExplainer group things together? I read the documentation and could not find anything like what you described. Did I miss something?
Also, if two features belong to the same group but have Shapley values pushing the decision in opposite directions, can we say that the two features are actually different and should not be grouped together? (I am just trying to figure out how much we can infer from Shapley values.)
Grouping of features should be added as a keyword parameter, but right now you define them by wrapping your data matrix in a DenseData object (which needs docs). https://github.com/slundberg/shap/blob/4efc638fba5836ddf1a833de2590fb845376325d/shap/common.py#L121
I think there are valid reasons to group features even when they oppose each other (such as people voting in states); it just depends on what level of view you want.
That is exactly what I did in my code. I was going to suggest it if it didn't exist. That answers all my questions so far.
Good to see I was on the right track. Thanks for all your help.