Azure-functions-host: Increased Logging around gRPC server health and JobHost restarts

Created on 22 May 2020 · 7 Comments · Source: Azure/azure-functions-host

What problem would the feature you're requesting solve? Please describe.

If the gRPC server were to become unreachable, or if the JobHost were to be forced to restart, we do not have much telemetry to help us reason about the root cause behind either of these issues. It would be great if we could increase our logging coverage to address these two scenarios.

Describe the solution you'd like

Adding extra logs to the gRPC server and JobHost manager to report useful information around failures, restarts, and general health.

Describe alternatives you've considered

I don't think there are other meaningful alternatives, but I'd be happy to hear about them!

Additional context

I would be happy to volunteer to add these extra logs myself; I would just need some guidance on the relevant files and data sources. I've been told that the FunctionRPCService would be a good place to start, but are there other places where useful logging information may be found? Thanks!

All 7 comments

Tagging @mhoeger and @brettsam as they are familiar with the context

> If the gRPC server were to become unreachable,

The scope of the gRPC service is the application lifetime. Do you have an example Kusto query that points to an issue in the gRPC service? That would help us figure out which logs would be useful.

> if the JobHost were to be forced to restart

JobHost restarts can happen multiple times. I do believe we have enough logs to tell us why it is restarting. The gRPC service will not be affected by JobHost restarts. Same here: please share an example Kusto query that will help us identify what information you are looking for.

I think exceptions are swallowed here: https://github.com/Azure/azure-functions-host/blob/6c8e3444a041f7e5ba85f4ab71e7f5c7855ec625/src/WebJobs.Script/Workers/Rpc/FunctionRpcService.cs#L84

The worker that was restarted (on JobHost restart) was unable to connect back to the server

@pragnagopa Thanks for the info! This is actually related to a customer incident, so I don't think I should share any Kusto logs about it in a public forum. I'd be happy to follow up internally though!

@mhoeger Thanks so much, good catch! So I take it we can probably get away with just adding the following handler right before the finally:

```C#
catch (Exception rpcException)
{
    _logger.LogError(rpcException, "Exception encountered while listening to EventStream");
}
finally
{
    //....
```

Obviously, we'll never know if this is enough to catch that error, but hopefully it would give us more context. If this makes sense, I'll go ahead and make the PR. Please let me know your thoughts / give me a thumbs up so I know to proceed 🦖
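
For readers who want to see the shape of this change in isolation, here is a minimal, self-contained sketch of the pattern. It is not the actual FunctionRpcService code. `ListLogger`, `ListenAsync`, `FailingStream`, and the `cleanup` callback are hypothetical stand-ins; the real service uses an `ILogger` and gRPC stream types. The sketch only shows that a catch placed before the finally records the exception, and that the cleanup in the finally still runs afterwards.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for ILogger so the sketch runs without extra packages.
public sealed class ListLogger
{
    public List<string> Errors { get; } = new List<string>();

    public void LogError(Exception ex, string message) =>
        Errors.Add($"{message}: {ex.GetType().Name}: {ex.Message}");
}

public static class EventStreamSketch
{
    // Reads messages until the stream ends or throws. The catch block logs the
    // failure instead of letting it pass unrecorded; the finally block then
    // runs the cleanup callback in both the success and the failure case.
    public static async Task ListenAsync(
        IAsyncEnumerable<string> requestStream, ListLogger logger, Action cleanup)
    {
        try
        {
            await foreach (string message in requestStream)
            {
                // Message dispatch is elided in this sketch.
            }
        }
        catch (Exception rpcException)
        {
            logger.LogError(rpcException, "Exception encountered while listening to EventStream");
        }
        finally
        {
            cleanup();
        }
    }

    // Hypothetical stream that fails after one message, simulating a dropped connection.
    public static async IAsyncEnumerable<string> FailingStream()
    {
        yield return "StartStream";
        await Task.Yield();
        throw new InvalidOperationException("connection lost");
    }
}
```

Calling `ListenAsync(EventStreamSketch.FailingStream(), logger, cleanup)` completes without throwing, leaves one entry in `logger.Errors`, and invokes `cleanup` once. Note that the handler logs and does not rethrow, which matches the snippet proposed above; whether the real service should also rethrow is a separate design question.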

@davidmrdavid 👍 🙏 🎊

Moved to triage. Logging unhandled exceptions in the gRPC host will help us figure out the next steps for adding more specific logging, if needed.

@davidmrdavid - please go ahead and send a PR. Thanks!

Since the PR was merged, I'll close this issue. Thanks folks :)
