I'm trying to learn CNTK, and it would be fantastic if someone could answer these questions for me. I'm sure it would be for the good of the broader community also -- at least for relative beginners...
So, I'm running the version 1.5 binaries, downloaded yesterday, on Windows. Firstly, I am just trying to run the MNIST sample and make sure that I understand what the output means. I've run it both in CPU mode and in GPU mode. I am running the 02_Convolution.cntk config file without modification, except to switch between CPU and GPU modes. Here are some questions:
1) After completing the "train" and the "test" actions in CPU, the final result showing in the transcript is:
Final Results: Minibatch[1-625]: err = 0.00829 * 10000; ce = 0.03003748 * 10000; perplexity = 1.03049315
My questions are:
1A) What error rate am I achieving? 0.00829 * 10000 = 82.9 -- does that mean 82.9% errors?
1B) What is "ce"?
1C) Can someone give me a concise definition of "perplexity" and comment on the value 1.0305?
2) In the README that comes with MNIST, it says that with 02_Convolution I should achieve an error rate of 0.87% on GPU. Based on that, when I saw the result in item #1 above, I thought it must mean 0.829% error, but that is one reason for making sure I understand the output. Questions:
2A) Does it happen that error rate on CPU might differ from my error rate on GPU? Why would it be?
2B) Are the notes in README correct? Maybe they were for version 1.1, but not quite for version 1.5?
3) I ran the train and test actions in GPU mode, and the results at the end of the transcript are:
Final Results: Minibatch[1-625]: err = 0.0000000 * 10000; ce = 0.000000 * 10000; perplexity = 1.0000
So, I guess something is just wrong. I did check that I have env variable ACML_FMA equal to 0. Running on GPU was enormously faster, but did take a minute or so -- everything seemed reasonable although I don't really know how to look for log files, etc. Questions:
3A) Got any ideas what I should look for to figure out why GPU mode apparently did not work?
3B) Anything I should know about log files? Do they exist, if so, where? Or how to enable their generation?
Thanks!
John
Thanks!
1A) What error rate am I achieving? 0.00829 * 10000 = 82.9 -- does that mean 82.9% errors?
It means 0.829% error rate, averaged over 10,000 samples (83 errors in total).
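To make the arithmetic concrete, here is a small sketch of how the `err = 0.00829 * 10000` field can be read: the first number is the average error per sample and the second is the number of samples. The variable names are illustrative only and are not part of CNTK.

```python
# Values copied from the "Final Results" line above.
avg_err = 0.00829      # average error per sample
num_samples = 10000    # number of test samples

num_errors = round(avg_err * num_samples)   # 82.9 rounds to 83 misclassified samples
err_percent = avg_err * 100                 # error rate expressed as a percentage

print(f"{num_errors} errors, {err_percent:.3f}% error rate")
```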
1B) What is "ce"?
It stands for "cross entropy". To be precise, this is the name of the criterion node inside the network definition.
1C) Can someone give me a concise definition of "perplexity" and comment on the value 1.0305?
Perplexity is an alternative way of expressing the CE criterion specifically in text-processing, the formula is _perplexity = exp (ce)_. It is not normally used in image processing though.
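As a quick check of the formula against the numbers in the transcript, this sketch recomputes the perplexity from the reported `ce` value. Only the `ce` value comes from the log; the rest is plain Python.

```python
import math

ce = 0.03003748                # cross entropy from the "Final Results" line
perplexity = math.exp(ce)      # perplexity = exp(ce)

print(f"{perplexity:.8f}")
```

The same formula explains the GPU output in question 3: a `ce` of 0 gives a perplexity of exactly 1.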
2A) Does it happen that error rate on CPU might differ from my error rate on GPU? Why would it be?
Yes. Differences can arise from summation order of float values. Typically, reduction operations (summing over many values, e.g. computing the bias gradient) use some form of multi-level aggregation due to parallelization. That aggregation structure may even differ across different GPU types. The differences are normally tiny. But we have seen recurrent models amplify the difference in some cases.
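The effect of summation order can be seen even with three numbers. The sketch below is not CNTK code; it only illustrates that adding the same floating-point values in a different order can give a slightly different result, which is what happens when a parallel reduction groups its operands differently.

```python
from functools import reduce

values = [0.1, 0.2, 0.3]

# Left-to-right accumulation: (0.1 + 0.2) + 0.3
left_to_right = reduce(lambda acc, x: acc + x, values)

# Right-to-left accumulation: 0.1 + (0.2 + 0.3)
right_to_left = reduce(lambda acc, x: x + acc, reversed(values))

print(left_to_right, right_to_left, left_to_right == right_to_left)
```

The two results differ only in the last bit, which matches the "normally tiny" differences described above.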
2B) Are the notes in README correct? Maybe they were for version 1.1, but not quite for version 1.5?
This page links, at the bottom, to the release notes for 1.5: https://github.com/Microsoft/CNTK/releases/tag/v1.5. That link leads to the correct page. Let me check whether the ZIP file has a wrong version.
3A) Got any ideas what I should look for to figure out why GPU mode apparently did not work?
ACML_FMA only affects the CPU. I have seen the same problem with older GPUs, e.g. my Quadro 6000. What is your GPU? We don't create direct binaries for older GPUs, but NVIDIA's driver is supposed to mask that by cross-JITting the code. I suspect that the driver may have a bug. Please let me know your GPU type, and we'll take it from there.
3B) Anything I should know about log files? Do they exist, if so, where? Or how to enable their generation?
You can say this in your .cntk config file:
stderr = "$ExpDir$/log"
(or any pathname of your choosing). The actual log file will have this as a stem, and append the name of the train action plus ".log".
Frank --
Thanks for answering! As for my GPU, here is what I know:
NVIDIA Quadro 2000
John
That's compute capability 2.1. The makefile depends on an external variable, so let me verify with my colleagues whether the 1.5 downloadable binary includes CUDA binaries for device capability 2.1 or not.
Note that some operations require 3.x, for example convolution. It is not entirely clear internally to what degree we currently support 2.x.
Can you suggest a GPU or two that are known to be good with CNTK ?
Thanks for the fantastic support.
We are successfully using Titan-X, K20, K40, and M40.
I confirmed that the 1.5 binary compiles for compute capability 2.0, but not explicitly for 2.1. I still need to find out whether that could make a difference.
Would you be willing to try to compile CNTK yourself, with modified build options to target 2.1 directly, and see if that solves it?
Frank -- thanks for all your help. Probably, at this time, not worth it for me to try to get that GPU operating -- about to set up new computer for this work, and will target getting one of the GPUs you mention.
I just realized, I don't know whether you are running Windows or Linux. Our standard build process seems to include 2.0 in Linux but not on Windows, which only builds for 3.0 and higher. I am starting an internal thread to clean this up; either build for 2.0 or officially declare our version support for 2.0.
running Windows.
OK, so that is the problem. Not that you run Windows I mean, but that the Windows version does not build for 2.0. I have internally suggested to bring back 2.0 for the binary (interestingly we still build 2.0 in Debug, so it still seems to work).
But if you have access to a newer card, that would be the better solution for you. I remember that some NVidia convolution code requires 3.0 or above, so sooner or later you may run into this problem anyway.
OK, mystery solved! Now, I'd like to ask another question along the lines of understanding output from the MNIST example -- I added a "write" action to my config file so that I could run the code using the .exe as if I was doing an eval on a single input image. So, I added this:
write = [
    action = write
    minibatchSize = 16
    reader = [
        readerType = "CNTKTextFormatReader"
        file = "$DataDir$/3.txt"
        input = [
            features = [
                dim = 784
                format = "dense"
            ]
        ]
    ]
    format = [
        type = "category"
        labelMappingFile = "$DataDir$/labelsmap.txt"
    ]
    outputPath = "PL2.txt"
]
My data file called "3.txt" was a file formatted like this:
|features 0 0 0 0 0 0 0 0 .... and so on...
I extracted it out of the test.txt data file that I had downloaded, and it corresponds to an image of a hand-written "3".
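For anyone who wants to build such a file from their own image rather than extract it from test.txt, a minimal sketch of producing one dense `|features` line might look like this. The helper name `to_features_line` and the all-zero image are hypothetical; only the `|features` prefix and the 784 values come from the config and data file above.

```python
def to_features_line(pixels):
    """Format pixel values as one dense '|features ...' line."""
    if len(pixels) != 784:
        raise ValueError("expected 784 values (28 x 28 image)")
    return "|features " + " ".join(str(p) for p in pixels)

# Hypothetical all-black image, only to show the layout of the line.
blank_image = [0] * 784
line = to_features_line(blank_image)
print(line[:30] + " ...")
```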
So, I ran the "write" action in the config file, and I looked in the file I wrote ( PL2.txt ) and sitting in that file was a "3" ! Yahoo! It worked. A couple simple questions are:
1) why is the output file written actually called "PL2.txt.ol.z" ? It seems to be a plain text file, so why not just "PL2.txt"?
2) I thought I might get 10 values as output - each a 'probability' that the input was one of the 10 labels. If I wanted to get that kind of output, how would I?
1) why is the output file written actually called "PL2.txt.ol.z" ?
The "write" command lets you write more than one output at once, so we append the node name. I have sometimes wondered the same thing: would it be valuable to default to the original name if only one node is specified? Do you think that would be a more meaningful default?
2) I thought I might get 10 values as output
You do that by changing the format type. type="category" means pick the highest-scoring one. You can try type="real".
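To illustrate the relationship between the two format types, here is a sketch of post-processing one line of type="real" output. It assumes each output line holds 10 whitespace-separated numbers, one per digit; the sample values are made up, and whether the numbers are already normalized probabilities depends on which node is written, so the softmax step is only needed for raw scores.

```python
import math

def softmax(scores):
    """Turn raw scores into values that sum to 1 (numerically stable form)."""
    shifted = [s - max(scores) for s in scores]
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical output line for one image: one score per digit 0..9.
line = "-2.1 0.3 1.2 9.8 -0.5 2.0 -3.3 0.1 1.1 0.4"
scores = [float(tok) for tok in line.split()]

# Index of the highest score, i.e. the entry that type="category" would pick.
best = max(range(len(scores)), key=lambda i: scores[i])
probs = softmax(scores)
print(best, round(probs[best], 3))
```

With type="category", the winning index is additionally mapped to a label through the labelMappingFile.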
I'd like "PL2.ol.z.txt" better -- so Windows still knows it's a text file, and it opens easily. But no biggie - I'm sure you have bigger fish to fry!
Ah! type = "real" -- great, thanks!
Good feedback. Will think how best to do that.
Are you OK to close this Issue?
Yes, OK to close. I am next trying to figure out EvalDLL so I may post another query or two again soon! Thanks for all your help. Watched your recent talk on YouTube which was also very helpful. Thanks!
Cool! I will close this now. If you have a question about EvalDLL that is unrelated to this Issue, please open a new one. Thanks much!