Glow checks are for the most part bound to whether or not asserts are enabled in the build.
The rationale is we expect the inputs (network, weights) to be correctly shaped and sized when we invoke the compiler/loaders and thus this design eliminates unnecessary overhead.
Given most checks are removed in release mode, we end up in a situation where we cannot use this variant when bringing up new works.
For instance, silly mistakes like shaping incorrectly a transpose will produce random compiler errors or just invalid/inaccurate outputs with no way to tell what is going wrong.
Then, running the same network with a debug variant would trigger a verifier failure from the beginning, hinting directly at what the problem was. This process is far from being intuitive and is cumbersome.
In essence, we have the following problems:
To make a comparison, Glow behaves like a compiler that wouldn't issue a syntax error diagnostic if compiled with assert disabled. On top of that, when the assertions are enabled, errors like type mismatch would be reported bluntly: no location in the source file, no naming of the type that was causing the problem and so on.
In a nutshell, when working on a new model, the process may look like this:
Finally, for large networks, it is possible that the debug mode is too slow to be practical, making debugging very hard.
Ideally, we would like to be sure that only valid IR hit the compiler. Then, in the compiler itself, it seems reasonable to rely on asserts for things related to the IR, like the invariants after such and such optimizations and so on. Indeed, at this point, any problem occurring in the compiler is the result of a wrong doing of the compiler itself and thus not actionable by the user.
Of course, we would still need to have some kind of non-assert based error reporting for anything that is not in the control of the compiler. E.g., failing to write to the filesystem, out-of-memory allocation, network too big for the targeted backend, etc.
We basically have two approaches to make sure only valid IR hit the compiler:
The second approach would be the preferred way as it focused the burden of validating the IR in only one place. However, this potentially makes error reporting more difficult for the backends. Therefore, we likely would want both.
Regarding the general error reporting, I believe we would want to have the front-end to register a diagnostic handler in Glow, so that the reporting is properly plumbed there. This is less critical than ensuring that the IR is valid as most users's errors should be caught by the front-end itself.
With the proposed approach, we don't really have a way to get rid of the checks that valid the IR, so how do we avoid to waste resources when we are running for the same network over and over?
We see two possible directions here:
For #2, the idea would be that only trusted networks would be run. This could be checked for instance with a whitelist embedded in the compiler/loader.
What do people think?
cc: @nadavrot, @artemrakhov, @opti-mix, @stoklund, @jfix71, @rdzhabarov, @hegemanjwh2, @beicy, @bertmaher, @tlepley.
@qcolombet Thanks for writing this. The motivation is clear. I absolutely agree that we need to validate the input the compiler in something that's stronger than assert(). In the past we talked about the need to use GLOW_ASSERT (which remains in debug builds) to validate all inputs, and use assert() to validate interfaces inside the compiler.
Do you think that it would be sufficient to change the asserts inside the verifier to GLOW_ASSERT? Can we trust different loaders (like, the ONNX loader) to validate their own inputs and then run the verifier on the graph that they created?
GLOW_ASSERT would help, but @artemrakhov was concerned that it would slow things down for things we know are valid.
See #1502 and #1499 for examples of our conversions over this topic.
For the verifier itself, I think having one GLOW_ASSERT calling the verifier at the beginning of the pipeline would be valuable, but we probably need to do a bit of audit to make sure we don't call the verifier more than once, which I believe is not the case today.
Ideally, I would like to turn the verifier into return true/false + diagnostic and wrap that into GLOW_ASSERT/regular assert as appropriate.
To reply to the question
Can we trust different loaders (like, the ONNX loader) to validate their own inputs and then run the verifier on the graph that they created?
The answer is no, right now.
GLOW_ASSERT will crash the program, right? This might be an issue when we integrate glow in a inference service setup. There could be invalid inputs from an individual model, we should report error and not serve it instead of crashing the whole service.
agree with @yinghai, at the point of model loading (through ONNX loader and ONNXIFI loader specifically) and graph validation of the loaded model we should not use GLOW_ASSERT() but rather use a method which lets the caller know what the error is and potentially recover from it (through exceptions or error codes).
We already have F->verify(); which could be used right after model loading, but we need to make sure that validation is handled with proper mechanisms.
For the verifier itself, I think having one GLOW_ASSERT calling the verifier at the beginning of the pipeline would be valuable, but we probably need to do a bit of audit to make sure we don't call the verifier more than once, which I believe is not the case today.
At the Node level, I believe we verify twice: At the beginning of ExecutionEngine::generateIR() (once the graph has been built, prior to any optimizations), and then at the beginning of IRFunction::generateIR() (once all optimizations are complete and we are doing IRGen). We could use NDEBUG to pick which ones we want to run.
RE: Not crashing when loading the model, one of the biggest places this is going to be a pain is the node creators. They shouldn't have any assertion failures -- they should signal up the stack that there was an issue building a node. Currently all uses of them assume a valid pointer to the node created is returned, so unless we use exceptions it will be a pain to rework this.
so unless we use exceptions it will be a pain to rework this.
Do we allow throwing exception during import?
agree with @yinghai, at the point of model loading (through ONNX loader and ONNXIFI loader specifically) and graph validation of the loaded model we should not use GLOW_ASSERT() but rather use a method which lets the caller know what the error is and potentially recover from it (through exceptions or error codes).
Agree and I vote for error codes plus diagnostic handlers if need be. LLVM has a nice error handling mechanism that we could use (see http://llvm.org/docs/ProgrammersManual.html#error-handling, and in particular the recoverable error part.)
I think I'm just repeating what @qcolombet said, but I want to add that it is important to distinguish between two different types of errors:
Assertions are traditionally used to detect the second kind of error: bugs in Glow. This means that an assertion failure always indicates a bug in Glow, no matter what input it is presented with. It is not a bug in Glow to receive invalid input, and I think we should use something other than asserts for this.
When validating API arguments, I think it is okay to use assertions. Wrong API arguments represent a bug in the code calling into the API, and we kindly detect the bug.
The type of error detected is also a separate thing from the mechanism used to report the error. Possible mechanisms are:
abort().When detecting a bug in Glow, calling abort() is usually the right thing to do, and that is what the various assert macros do. For other kinds of errors like invalid inputs, I think it makes sense to use error codes like LLVM does it.
Hi @rdzhabarov,
Would you like to take a stab at it, at least the loader part, while making the ONNXIFI interface more robust?
Yes, this is on my list of things to do!
Working on moving the verifier from assert-based to return-boolean, as discussed in https://github.com/pytorch/glow/issues/1517#issuecomment-415549954.
First step in #1993.
Forgot to post an update here.
As of #2027, the module, function and node verifiers do not use any assert.
We can use that method in the loader to report error now.
It seems we aren't calling verify anywhere in Release mode. At this point is there any reason we can't/shouldn't add verify checks in the ONNX/Caffe2 model importers? E.g. before returning success here, add RETURN_ERR_IF_NOT(F.verify(), "Function verification failed.");:
Related to https://github.com/pytorch/glow/issues/2247
@jfix71 no I think there's no reason we shouldn't be calling verify() now
@jfix71 sounds good to me. That was the plan when I started the refactoring.
Happy to see that moving further along!
@jackm321 please reopen if this is ongoing.
Most helpful comment
Forgot to post an update here.
As of #2027, the module, function and node verifiers do not use any assert.
We can use that method in the loader to report error now.