Hi,
I have a question about the dnn architecture employed in the dnn_face_recognition_ex.cpp example. It looks like the output descriptors are normalized, as they contain values between -1 and 1.
However, I cannot see any kind of normalization applied, as the last layer is a fully connected layer. Is softmax or something applied under the hood? Because when I export the net to a caffe model, with the tool you provided, the last layer is a simple InnerProduct layer, which doesn't seem to do any kind of normalization on the final vector:
n.fc_no_bias1 = L.InnerProduct(n.avg_pool2, num_output=128, bias_term=False);
Sorry if this is kind of a noob question, but I am new to all the deep learning, caffe and tensorflow mess and I am trying to understand the architecture of your net.
The reason I am asking is, I am trying to port the architecture to tensorflow.js, in order to run face-recognition in the browser via WebGL, but I am trying to figure out how to build the output layer.
Thanks in advance for your help, I would highly appreciate your advice.
There aren't any hidden layers or any kind of normalization. The network
functions just as it seems and you appear to have the right idea about its
execution. The output numbers are just by chance in that range.
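One way to check this on a concrete descriptor is to compute its Euclidean length: an L2-normalized vector would have a norm of 1, whereas the raw output of a fully connected layer can have any length, even if every component happens to fall between -1 and 1. The following is only a minimal sketch; `l2_norm` and `is_unit_length` are hypothetical helper names, not part of dlib.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: Euclidean (L2) norm of a descriptor.
double l2_norm(const std::vector<float>& v) {
    assert(!v.empty());
    double sum = 0.0;
    for (float x : v) sum += static_cast<double>(x) * x;
    return std::sqrt(sum);
}

// Hypothetical helper: true if the descriptor has unit length within a tolerance.
bool is_unit_length(const std::vector<float>& v, double tol = 1e-3) {
    return std::fabs(l2_norm(v) - 1.0) <= tol;
}
```

If `is_unit_length` returns false for descriptors produced by the network, no L2 normalization is being applied to them.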
Thanks for the quick reply! Yes, you are right, something is still off in my tf implementation, which leads to different descriptors. Unfortunately, the caffe net exported from that architecture is also not entirely complete, since there are some warnings.
Firstly, the dlib pooling layers cannot be translated to caffe pooling layers (which apparently can be fixed by adjusting the padding values afterwards). Secondly, I have not yet figured out how to deal with this message:
The dlib network contained an add_prev layer (layer idx 31) that adds two
previous layers with different output tensor dimensions. Caffe's equivalent
layer, Eltwise, doesn't support adding layers together with different
dimensions. In the special case where the only difference is in the number of
channels, this converter program will add a dummy layer that outputs a tensor
full of zeros and concat it appropriately so this will work. However, this
network you are converting has tensor dimensions different in values other than
the number of channels. In particular, here are the two tensor shapes (batch
size, channels, rows, cols):
1 128 4 4
1 256 3 3
The first time this appears is in the first ares_down block of alevel1 (the first residual block with 256 filters). Apparently the first convolution changes the blob dimensions from [1 128 8 8] to [1 256 3 3]. However, the shortcut (the average pooling of the previous block's output) results in the dimensions [1 128 4 4], so the add_prev of [1 256 4 4] (after zero padding the channels) and [1 256 3 3] fails.
Is this just a mistake by the export script? If not, what are the correct dimensions of the blob in that layer? I guess [1 256 4 4] is correct here, which is what I get by using 'same' instead of 'valid' padding for the down sampling conv layer, but I want to make sure.
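The two shapes can be reproduced with the usual output size formulas. This is only a sketch under assumptions that are not stated in the thread: a 3x3 kernel with stride 2 and no padding for the down sampling convolution, and a 2x2 window with stride 2 for the average pooling. The function names are hypothetical.

```cpp
#include <cassert>

// Output length along one axis for a convolution or pooling window with
// explicit zero padding on both sides ("valid" when padding == 0).
int conv_out_size(int in, int kernel, int stride, int padding) {
    assert(in > 0 && kernel > 0 && stride > 0 && padding >= 0);
    assert(in + 2 * padding >= kernel);
    return (in + 2 * padding - kernel) / stride + 1;  // integer division floors
}

// Output length along one axis for TensorFlow-style 'same' padding: ceil(in / stride).
int same_out_size(int in, int stride) {
    assert(in > 0 && stride > 0);
    return (in + stride - 1) / stride;
}
```

Under these assumptions, `conv_out_size(8, 3, 2, 0)` gives 3 for the convolution branch and `conv_out_size(8, 2, 2, 0)` gives 4 for the pooling shortcut, matching the shapes in the warning, while `same_out_size(8, 2)` gives 4.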
Sorry for the wall of text, but I would highly appreciate any hint on that issue, as finding the culprit leading to wrong descriptor values turns out to be a very tedious business.
Thanks again, best wishes!
Edit:
The dimensions assume an input of 3-channel, 150 x 150 images, of course.
The returned descriptors from the fully connected layer seem to be way off, for example somewhere in the range -20 to +20 instead of -1 to 1. I assigned the weights of the conv layer filters and biases, as well as the scale layer weights and biases, the same way it's done in the python code for the caffe net. Is there any special layout for the weights in the output binary (.weights) file that I didn't consider? I assumed they are just a big numpy array, written as a flat array of float32s.
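For reference, reading such a file under the assumption described above (a headerless, flat sequence of float32 values in the machine's native byte order) could look like the sketch below. `read_float32_file` is a hypothetical helper, and the sketch says nothing about the order of the dimensions within each layer's weights, which still has to be confirmed separately.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: load a headerless file of raw float32 values.
// Assumes the file was written with the same endianness as this machine.
std::vector<float> read_float32_file(const std::string& path) {
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    if (!file) throw std::runtime_error("cannot open " + path);

    // Opened at the end (ios::ate), so tellg() reports the file size in bytes.
    const std::streamoff bytes = static_cast<std::streamoff>(file.tellg());
    assert(bytes >= 0 && static_cast<std::size_t>(bytes) % sizeof(float) == 0);

    std::vector<float> data(static_cast<std::size_t>(bytes) / sizeof(float));
    file.seekg(0, std::ios::beg);
    file.read(reinterpret_cast<char*>(data.data()), bytes);
    if (!file) throw std::runtime_error("short read on " + path);
    return data;
}
```

Comparing `data.size()` against the total number of parameters expected for the network is a cheap first check that the flat-array assumption holds.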
I don't think there is any mistake in the warning message. It just is what
it is. Caffe isn't going to be identical to dlib in the way that all
software is not identical.
There is a deeper question, though: why don't you just call dlib? Then you
wouldn't need to do all this stuff.
Okay, actually my question was whether the c++ implementation applies 'same' or 'valid' convolution, i.e. what the correct sizes of the blobs are, as this might make a difference for the output descriptor (or maybe it's not that significant and my question is just dumb; as I said, I am a newbie in that area). I guess I have to dig deeper into the c++ code and eventually compare the output of individual layers to find out where the issue resides.
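A layer-by-layer comparison of that kind can be reduced to a single number per layer, for example the largest absolute difference between the two outputs. A minimal sketch, assuming both outputs have already been loaded into float vectors with the same element order (`max_abs_diff` is a hypothetical helper):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: largest element-wise absolute difference between two
// layer outputs. Both vectors must have the same length and element order.
float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i)
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    return worst;
}
```

The first layer at which this value jumps from roughly floating-point noise to something large is a good candidate for where the two implementations diverge.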
Don't get me wrong, I highly appreciate your library, thanks for providing your models to the open source community! The face recognition solution provided in dlib is the best open source solution I have found so far, which is why I have been using it for quite some time now. I even wrote an npm package that wraps the dlib face detection, landmark and recognition API for node.js.
But now I also wanted to try to port this to the browser, as inference with WebGL (via tensorflow.js for example, which has been released recently) seems to be insanely fast.
I don't think there is an issue. The sizes are very likely what they say
they are. Maybe you found a bug and the sizes aren't what is being
reported, but that is unlikely.
Oh right, I didn't reread the thread and forgot you were the guy trying to
run it in javascript :)
Alright. Is it possible by any chance to retrieve the output blob for intermediate layers, or only for the output layer?
Yes, you can get the output by just going to the layer and calling get_output(). There is a discussion of this in the introductory example programs.
Thanks, this will help me a lot!
Just one last question :D. I am comparing the outputs of individual layers now by writing the data of the tensor obtained from get_output() to a file as follows:
// Fetch the layer output once instead of calling get_output() on every iteration.
const dlib::tensor& t = dlib::layer<130>(net).get_output();
// Copy the tensor's values into a plain float buffer.
std::vector<float> out(t.begin(), t.end());
// Write the raw float32 values without any header.
std::ofstream file("conv32_in.dat", std::ios::binary);
file.write(reinterpret_cast<const char*>(out.data()), sizeof(float) * out.size());
file.close();
The thing I am noticing, however, is that the output data from dlib::layer<130> (first conv) and dlib::layer<128> (first conv -> affine -> relu) is the same. Are they pointing to the same tensor or am I doing something wrong here?
Some layers are in-place, meaning they operate on the tensor that comes
into them rather than allocating a new one. Any layer that can operate
in-place does, like relu and affine.
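The effect can be illustrated without dlib. In the sketch below, `Buffer`, `relu_inplace` and `relu_copy` are hypothetical stand-ins, not dlib types or functions: an in-place layer overwrites the buffer it is given, so after the forward pass the "input" and the "output" are the same memory holding the same values.

```cpp
#include <cassert>
#include <vector>

// Stand-in for a layer's output buffer (not dlib's tensor type).
using Buffer = std::vector<float>;

// In-place ReLU: overwrites its input and returns a reference to the same buffer.
Buffer& relu_inplace(Buffer& x) {
    for (float& v : x)
        if (v < 0.0f) v = 0.0f;
    return x;
}

// Out-of-place ReLU: leaves the input untouched and allocates a new buffer.
Buffer relu_copy(const Buffer& x) {
    Buffer y(x);
    for (float& v : y)
        if (v < 0.0f) v = 0.0f;
    return y;
}
```

With `relu_inplace`, reading the convolution's buffer after the forward pass returns the post-ReLU values, which is consistent with the two dumps being identical. To capture the raw convolution output, the values would have to be copied before the in-place layers run.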
I see. Awesome, thanks for your help!