You mention in your README that you're not using the same state vector as the one described in the DM paper:
[X_t, X_{t-1}, ...X_{t-7}, Y_{t}, Y_{t-1}, ...Y_{t-7}, B, W]
instead of
[X_t, Y_t, X_{t-1}, Y_{t-1}...X_{t-7}, Y_{t-7}, C]
First of all, your argument for the color planes is that the padding will make it easier for white. I'm not sure how it would advantage white. If anything, I feel that black would be advantaged: since 0 is used for white, the weights on that plane contribute nothing and only the bias can be used when white is to play.
Second, you don't address your choice to change the interleaving of colors. In the DM paper, the current player's stones alternate with the other player's stones, while in your version all the planes for the current player seem to come first, followed by all the planes for the other player.
Is there a rationale behind this choice?
The interleaving is completely irrelevant; it's just a reordering of the inputs and weights.
You are right about white vs black:
"has a constant value of either 1 if black is to play or 0 if white is to play"
The player who gets the plane with 1's has an advantage because the convolution can see the board edge. I'll fix this in the README.
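The edge effect can be illustrated with a minimal NumPy sketch. It assumes a 19x19 board, zero padding, and a 3x3 kernel of all ones; the helper `conv3x3_same` is hypothetical and only written for this illustration, it is not code from the repository.

```python
import numpy as np

def conv3x3_same(plane, kernel):
    """3x3 cross-correlation with zero padding; output has the input's shape."""
    h, w = plane.shape
    padded = np.pad(plane, 1)  # default mode pads with zeros
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

board = 19                      # assumed board size
kernel = np.ones((3, 3))        # illustrative filter, not a trained one
ones_plane = np.ones((board, board))    # plane seen by the side that gets 1's
zeros_plane = np.zeros((board, board))  # plane seen by the side that gets 0's

edge_from_ones = conv3x3_same(ones_plane, kernel)
edge_from_zeros = conv3x3_same(zeros_plane, kernel)

# With the all-ones plane the response differs between corner, edge and
# interior points, because the zero padding leaks into the filter window.
print(edge_from_ones[0, 0], edge_from_ones[0, 9], edge_from_ones[9, 9])
# With the all-zeros plane the response is identical everywhere.
print(edge_from_zeros.min(), edge_from_zeros.max())
```

With a constant plane of 1's the filter response changes near the border, so the network gets a cue for where the edge is. A constant plane of 0's gives no such cue.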
Any reason you went with B, W instead of 1, -1?
Also, I feel the interleaving could have an effect, as the convolution kernels have limited range (3x3). Seeing black and white stones in the same patch is impossible in your version, but it is possible in the DM version.
So my understanding is that the convolutions can match black and white stones only at the upper convolution layers, no?
I internally store them as bitsets, and an even number of planes worked a bit better with the OpenCL code I had. I would not be surprised if learning the edge is easier if it does not swap sign, but I doubt it matters much in the long run.
You misunderstand how 2D convolutions work in a DCNN. All input planes are convolved into every output plane with a variable filter, so as already said, the interleaving is completely irrelevant and it's a simple reordering of weights. Again, you can literally reorder the inputs, reorder the already calculated weights from a weights file, and get 100% the same output for the same input. The range of the kernel doesn't even factor in at all here!
The 3x3 filters are spatial 2D filters from a single input plane into a single output plane, but there is one such filter for every (output plane, input plane) combination. It's not a 3x3x3 filter that only combines 3 different input planes into a single output, or something like that.
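The reordering argument can be checked numerically. The sketch below assumes only the 16 history planes (the color planes are left out for brevity), random weights, 32 output planes, and a hypothetical `conv_layer` helper written for this illustration; none of it is taken from the actual implementation.

```python
import numpy as np

def conv_layer(x, w):
    """Zero-padded 3x3 convolution layer.

    x: (C_in, H, W) input planes.
    w: (C_out, C_in, 3, 3), one 3x3 filter per (output, input) plane pair.
    Every output plane sums the filtered responses of ALL input planes.
    """
    c_in, h, wd = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for dy in range(3):
        for dx in range(3):
            patch = padded[:, dy:dy + h, dx:dx + wd]  # (C_in, H, W)
            out += np.einsum('oi,ihw->ohw', w[:, :, dy, dx], patch)
    return out

rng = np.random.default_rng(0)

# Grouped layout: [X_t .. X_{t-7}, Y_t .. Y_{t-7}]
x_grouped = rng.integers(0, 2, size=(16, 19, 19)).astype(float)
weights = rng.standard_normal((32, 16, 3, 3))  # stand-in for trained weights

# perm[k] is the grouped-layout index of the plane that sits at position k
# in the interleaved layout [X_t, Y_t, X_{t-1}, Y_{t-1}, ...].
perm = np.array([i // 2 + 8 * (i % 2) for i in range(16)])

x_interleaved = x_grouped[perm]          # reorder the input planes
weights_interleaved = weights[:, perm]   # reorder the weights the same way

out_grouped = conv_layer(x_grouped, weights)
out_interleaved = conv_layer(x_interleaved, weights_interleaved)
same = np.allclose(out_grouped, out_interleaved)
print(same)
```

Because each output plane already sums over every input plane, black and white stones at the same board location are combined in the very first layer, whatever order the planes are stacked in.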
My bad on the kernel size!