Glow: [RFC] Error reporting, malformed graph, and general resilience

Created on 23 Aug 2018 · 18Comments · Source: pytorch/glow

Context

Glow checks are for the most part bound to whether or not asserts are enabled in the build.
The rationale is we expect the inputs (network, weights) to be correctly shaped and sized when we invoke the compiler/loaders and thus this design eliminates unnecessary overhead.

Problems

Given most checks are removed in release mode, we end up in a situation where we cannot use this variant when bringing up new works.

For instance, silly mistakes like shaping incorrectly a transpose will produce random compiler errors or just invalid/inaccurate outputs with no way to tell what is going wrong.
Then, running the same network with a debug variant would trigger a verifier failure from the beginning, hinting directly at what the problem was. This process is far from being intuitive and is cumbersome.

In essence, we have the following problems:

The debug mode gives us diagnostics but is slow to run.
The release mode is fast to run but does not gives any information on what is going on.
The error reporting through asserts has too little context to be actionable on its own.

To make a comparison, Glow behaves like a compiler that wouldn't issue a syntax error diagnostic if compiled with assert disabled. On top of that, when the assertions are enabled, errors like type mismatch would be reported bluntly: no location in the source file, no naming of the type that was causing the problem and so on.

In a nutshell, when working on a new model, the process may look like this:

Run the network with the release variant
=> compiler crashes
Run the network with the debug variant
=> assertion failure "invalid..."
Run the network with the debug variant in a debugger
=> assertion failure "invalid...", then print variables, examine the memory, etc. to see what is wrong.

Finally, for large networks, it is possible that the debug mode is too slow to be practical, making debugging very hard.

Ultimate Goal

Ideally, we would like to be sure that only valid IR hit the compiler. Then, in the compiler itself, it seems reasonable to rely on asserts for things related to the IR, like the invariants after such and such optimizations and so on. Indeed, at this point, any problem occurring in the compiler is the result of a wrong doing of the compiler itself and thus not actionable by the user.

Of course, we would still need to have some kind of non-assert based error reporting for anything that is not in the control of the compiler. E.g., failing to write to the filesystem, out-of-memory allocation, network too big for the targeted backend, etc.

How Do We Get There?

We basically have two approaches to make sure only valid IR hit the compiler:

Have every "frontend" of Glow performs the checks (e.g., like the protobuf loaders).
Have Glow performs the checks before doing anything.

The second approach would be the preferred way as it focused the burden of validating the IR in only one place. However, this potentially makes error reporting more difficult for the backends. Therefore, we likely would want both.

Regarding the general error reporting, I believe we would want to have the front-end to register a diagnostic handler in Glow, so that the reporting is properly plumbed there. This is less critical than ensuring that the IR is valid as most users's errors should be caught by the front-end itself.

What About Getting Rid of the Unnecessary Checks Then?

With the proposed approach, we don't really have a way to get rid of the checks that valid the IR, so how do we avoid to waste resources when we are running for the same network over and over?

We see two possible directions here:

Something higher level of Glow, could cache the result of the compilation and such, then it wouldn't invoke Glow or its front-end/loader again and again and instead would directly run what it already had.
We could feature a "reckless" mode that effectively disable every error reporting in both the front-end and Glow. This mode wouldn't be the default like it is right now.

For #2, the idea would be that only trusted networks would be run. This could be checked for instance with a whitelist embedded in the compiler/loader.

What do people think?

cc: @nadavrot, @artemrakhov, @opti-mix, @stoklund, @jfix71, @rdzhabarov, @hegemanjwh2, @beicy, @bertmaher, @tlepley.

Source

qcolombet

👍1

Most helpful comment

Forgot to post an update here.
As of #2027, the module, function and node verifiers do not use any assert.
We can use that method in the loader to report error now.

qcolombet on 21 Nov 2018

👍3

All 18 comments

@qcolombet Thanks for writing this. The motivation is clear. I absolutely agree that we need to validate the input the compiler in something that's stronger than assert(). In the past we talked about the need to use GLOW_ASSERT (which remains in debug builds) to validate all inputs, and use assert() to validate interfaces inside the compiler.

Do you think that it would be sufficient to change the asserts inside the verifier to GLOW_ASSERT? Can we trust different loaders (like, the ONNX loader) to validate their own inputs and then run the verifier on the graph that they created?

nadavrot on 23 Aug 2018

GLOW_ASSERT would help, but @artemrakhov was concerned that it would slow things down for things we know are valid.
See #1502 and #1499 for examples of our conversions over this topic.

For the verifier itself, I think having one GLOW_ASSERT calling the verifier at the beginning of the pipeline would be valuable, but we probably need to do a bit of audit to make sure we don't call the verifier more than once, which I believe is not the case today.

Ideally, I would like to turn the verifier into return true/false + diagnostic and wrap that into GLOW_ASSERT/regular assert as appropriate.

qcolombet on 23 Aug 2018

👍1

To reply to the question

Can we trust different loaders (like, the ONNX loader) to validate their own inputs and then run the verifier on the graph that they created?

The answer is no, right now.

qcolombet on 23 Aug 2018

👍1

GLOW_ASSERT will crash the program, right? This might be an issue when we integrate glow in a inference service setup. There could be invalid inputs from an individual model, we should report error and not serve it instead of crashing the whole service.

yinghai on 24 Aug 2018

👍1

agree with @yinghai, at the point of model loading (through ONNX loader and ONNXIFI loader specifically) and graph validation of the loaded model we should not use GLOW_ASSERT() but rather use a method which lets the caller know what the error is and potentially recover from it (through exceptions or error codes).
We already have F->verify(); which could be used right after model loading, but we need to make sure that validation is handled with proper mechanisms.

rdzhabarov on 24 Aug 2018

👍3

For the verifier itself, I think having one GLOW_ASSERT calling the verifier at the beginning of the pipeline would be valuable, but we probably need to do a bit of audit to make sure we don't call the verifier more than once, which I believe is not the case today.

At the Node level, I believe we verify twice: At the beginning of ExecutionEngine::generateIR() (once the graph has been built, prior to any optimizations), and then at the beginning of IRFunction::generateIR() (once all optimizations are complete and we are doing IRGen). We could use NDEBUG to pick which ones we want to run.

RE: Not crashing when loading the model, one of the biggest places this is going to be a pain is the node creators. They shouldn't have any assertion failures -- they should signal up the stack that there was an issue building a node. Currently all uses of them assume a valid pointer to the node created is returned, so unless we use exceptions it will be a pain to rework this.

jfix71 on 24 Aug 2018

so unless we use exceptions it will be a pain to rework this.

Do we allow throwing exception during import?

yinghai on 24 Aug 2018

agree with @yinghai, at the point of model loading (through ONNX loader and ONNXIFI loader specifically) and graph validation of the loaded model we should not use GLOW_ASSERT() but rather use a method which lets the caller know what the error is and potentially recover from it (through exceptions or error codes).

Agree and I vote for error codes plus diagnostic handlers if need be. LLVM has a nice error handling mechanism that we could use (see http://llvm.org/docs/ProgrammersManual.html#error-handling, and in particular the recoverable error part.)

qcolombet on 27 Aug 2018

👍2

I think I'm just repeating what @qcolombet said, but I want to add that it is important to distinguish between two different types of errors:

Glow has been presented with invalid input.
Glow has a bug that caused its data structures to become invalid.

Assertions are traditionally used to detect the second kind of error: bugs in Glow. This means that an assertion failure always indicates a bug in Glow, no matter what input it is presented with. It is not a bug in Glow to receive invalid input, and I think we should use something other than asserts for this.

When validating API arguments, I think it is okay to use assertions. Wrong API arguments represent a bug in the code calling into the API, and we kindly detect the bug.

The type of error detected is also a separate thing from the mechanism used to report the error. Possible mechanisms are:

Call abort().
Throw an exception.
Return an error code of some kind.

When detecting a bug in Glow, calling abort() is usually the right thing to do, and that is what the various assert macros do. For other kinds of errors like invalid inputs, I think it makes sense to use error codes like LLVM does it.

stoklund on 28 Aug 2018

👍2

Hi @rdzhabarov,

Would you like to take a stab at it, at least the loader part, while making the ONNXIFI interface more robust?

qcolombet on 7 Sep 2018

Yes, this is on my list of things to do!

rdzhabarov on 7 Sep 2018

👍3

Working on moving the verifier from assert-based to return-boolean, as discussed in https://github.com/pytorch/glow/issues/1517#issuecomment-415549954.

qcolombet on 7 Nov 2018

❤1 👍1

First step in #1993.

qcolombet on 9 Nov 2018

Forgot to post an update here.
As of #2027, the module, function and node verifiers do not use any assert.
We can use that method in the loader to report error now.

qcolombet on 21 Nov 2018

👍3

It seems we aren't calling verify anywhere in Release mode. At this point is there any reason we can't/shouldn't add verify checks in the ONNX/Caffe2 model importers? E.g. before returning success here, add RETURN_ERR_IF_NOT(F.verify(), "Function verification failed.");:

https://github.com/pytorch/glow/blob/884d40b336994082b47516db862f6f0b18b9ae6f/lib/Importer/ONNXModelLoader.cpp#L963-L979

jfix71 on 9 Jan 2019

👍2

@jfix71 no I think there's no reason we shouldn't be calling verify() now