Botframework-sdk: Track key unhandled DialogEvents in telemetry

Created on 29 May 2020  Â·  7Comments  Â·  Source: microsoft/botframework-sdk

_For additional context see https://github.com/microsoft/botbuilder-dotnet/issues/3998 and its fix: https://github.com/microsoft/botbuilder-dotnet/pull/4000_

Issue

Unhandled key DialogEvents aren't currently tracked in telemetry. Two potential key DialogEvents for tracking are DialogEvents.Error and DialogEvents.VersionChanged (i.e. "error" and "versionChanged").

Per @stevenic we shouldn't log all unhandled DialogEvents which would add noise to the telemetry data. DialogEvents are optional by design.

DialogEvents:

  • "beginDialog"
  • "repromptDialog"
  • "cancelDialog"
  • activityReceived"
  • "versionChanged"
  • "error"

Proposed change

Call dc.Dialogs.TelemetryClient.TrackTrace() for key unhandled DialogEvents in Dialog.OnDialogEventAsync()

_Potential_ C# code:

        public virtual async Task<bool> OnDialogEventAsync(DialogContext dc, DialogEvent e, CancellationToken cancellationToken)
        {
            // Before bubble
            var handled = await this.OnPreBubbleEventAsync(dc, e, cancellationToken).ConfigureAwait(false);

            // Bubble as needed
            if (!handled && e.Bubble && dc.Parent != null)
            {
                handled = await dc.Parent.EmitEventAsync(e.Name, e.Value, true, false, cancellationToken).ConfigureAwait(false);
            }

            // Post bubble
            if (!handled)
            {
                handled = await this.OnPostBubbleEventAsync(dc, e, cancellationToken).ConfigureAwait(false);
            }

            // Log key unhandled DialogEvents
            // Changes to the Dialog or its children could result in bad dialog state for the user.
            // If the unhandled event's name is "versionChanged", emit a trace through the TelemetryClient. 
            if (!handled && e.Name == DialogEvents.VersionChanged)
            {
                await dc.Dialogs.TelemetryClient.TrackTrace(dc, e, cancellationToken).ConfigureAwait(false);
            }
            return handled;
        }

Implementation considerations

DialogEvents.VersionChanged

"versionChanged" events are only emitted from DialogContainer and its subclasses. As of 4.9.x, these subclasses are currently ComponentDialog and AdaptiveDialog.

An override of Dialog.OnDialogEventAsync in DialogContainer may be a cleaner option. It seems unlikely that a DialogContainer subclass isn't the parent dialog or the leaf of another DialogContainer...

    public abstract class DialogContainer : Dialog
    {
        // ...

        public override async Task<bool> OnDialogEventAsync(DialogContext dc, DialogEvent e, CancellationToken cancellationToken)
        {
            var handled = await base.OnDialogEventAsync(dc, e, cancellationToken).ConfigureAwait(false);

            // Trace unhandled "versionChanged" events.
            if (!handled && e.Name == DialogEvents.VersionChanged)
            {
                await dc.Dialogs.TelemetryClient.TrackTrace(dc, e, cancellationToken).ConfigureAwait(false);
            }
            return handled;
        }
    }

DialogEvents.Error

"error" DialogEvents are only emitted from DialogManager. If this event isn't handled, it's rethrown from DialogManager.OnTurnAsync() which triggers the adapter's error handler.

With the current Dialog, DialogContainer and DialogManager classes in mind, it doesn't _seem_ necessary to track unhandled errors at a lower level.

Component Impact

Dialog or DialogContainer. Perhaps DialogManager.

Customer Impact

Additional Traces sent to their telemetry for events that could forewarn Dialog errors.

Tracking Status

Dotnet SDK

  • [x] PR
  • [x] Merged

Javascript SDK

  • [ ] PR
  • [ ] Merged

Java SDK

  • N/A Dialog as of the latest preview doesn't have the eventing mechanism present in C# and JS.

Python SDK

  • N/A Dialog as of 4.9.x doesn't have the eventing mechanism present in C# and JS.

[dcr]

@tomlm, @stevenic, @darrenj, @vishwacsena, @garypretty

Adaptive backlog feature-request

All 7 comments

Makes sense - do we want to also consider logging as an Trace Activity so it will show up in Emulator meaning it's even easier to see this when developing?

@darrenj that sounds like a good idea to me. However, this suggestion reminded me that remote "dialogs" (aka Skills) exist...

Skills and DialogEvents.VersionChanged

I believe we have a new scenario to consider, Skill Hosts and their _Skills_ having a versionChange event.

    public abstract class DialogContainer : Dialog
    {
        // ...

        public override async Task<bool> OnDialogEventAsync(DialogContext dc, DialogEvent e, CancellationToken cancellationToken)
        {
            var handled = await base.OnDialogEventAsync(dc, e, cancellationToken).ConfigureAwait(false);

            // Trace unhandled "versionChanged" events.
            if (!handled && e.Name == DialogEvents.VersionChanged)
            {
                await dc.Dialogs.TelemetryClient.TrackTrace(dc, e, cancellationToken).ConfigureAwait(false);

                // Send a TraceActivity logic...

                // This code determines if the activity is from another bot.
                var isSkillCallerId = dc.Context.Activity.CallerId != null;
                isSkillCallerId &= dc.Context.Activity.CallerId.StartsWith(CallerIdConstants.BotToBotPrefix, StringComparison.InvariantCultureIgnoreCase);

                if (isSkillCallerId)
                {
                    // Do what exactly? Nothing?
                }
            }
            return handled;
        }
    }

During a versionChange with an activity sent from a skill host, we could do nothing in the SDK. In other words, the Skill doesn't inform the skill host that it had an unhandled "versionChanged" event.

When CheckForVersionChangeAsync stops throwing exceptions, via dotnet (master@36d0e41) AdaptiveDialogs and ComponentDialogs will continue running with an unhandled "versionChanged" event.

If the Skill's dialogs ends up throwing an error, the parent will know there was a problem, but not about the version change (which may have caused the problem).


Problem with above scenario

However, if you consider the Skill called via a SkillDialog to be another Dialog then the above behavior doesn't match other Dialog behavior around DialogEvents.

The Skill's root DialogContainer essentially did not "bubble-up" the event.

Dialog.OnDialogEventAsync()@36d0e41

        public virtual async Task<bool> OnDialogEventAsync(DialogContext dc, DialogEvent e, CancellationToken cancellationToken)
        {
            // Before bubble
            var handled = await this.OnPreBubbleEventAsync(dc, e, cancellationToken).ConfigureAwait(false);

            // <!-- NOTE: The below if-block with dc.Parent.EmitEventAsync() never occurred -->
            // Bubble as needed
            if (!handled && e.Bubble && dc.Parent != null)
            {
                handled = await dc.Parent.EmitEventAsync(e.Name, e.Value, true, false, cancellationToken).ConfigureAwait(false);
            }

            // Post bubble
            if (!handled)
            {
                handled = await this.OnPostBubbleEventAsync(dc, e, cancellationToken).ConfigureAwait(false);
            }

            return handled;
        }

If we treat the Skill as a regular Dialog and inform the parent that a "versionChanged" event occurred, this likely results in changes to Dialog.OnDialogEventAsync, SkillDialog, and any Skill hosts _not_ using Dialogs. This last item is probably _highly_ undesired as customers have to change their code.

The "bubble up" stage from the skill is likely to be an activity with expectReplies or an invoke, as the skill needs an immediate response for its own processing.

If we decide to bubble up DialogEvents.VersionChanged to a skill host...

The parent should be able to ignore the skill's "versionChanged" event and send a 100(?) HTTP Status code back to the in-proc skill which then continues running its dialogs.

If the skill host sends a 200 (400?, 405?), the host must inform the skill to stop processing and relinquish control send an appropriate status code back to the host that called the skill. In this case, is the skill sending an EoC?

For 4xx and 5xx status codes from the parent, the skill possibly continues processing? The skill is unaware if there is one or _n_ instances of the skill host, all it knows was that its attempted event bubble up failed.

However, by bubbling the event, we have back-to-back HTTP requests which doesn't guarantee affinity. So we/customers might end up with blanket statements, e.g. all skill host instances don't accept _any_ "versionChanged" events from its skills; any request of this type gets a 100 status code response.

Alternatively, skill hosts have code that can account for two different instances having the same DialogState for a given user. When a skill host tells a skill to stop processing, the instance that called the skill may gracefully exit the turn. The second instance will create a turn in its processing of the versionChanged event.


tl;dr Not bubbling up the "versionChanged" event from the skill to the parent differs from normal Dialog behavior if skills are just remote Dialogs.

Bubbling up the event causes back-to-back HTTP requests.

Actually, ending up with back-to-back HTTP requests is not acceptable due to channels/adapters that require affinity.

With this in mind, the skill must discard its state changes and notify the skill host of the versionChanged event via the initial Host-to-Skill HTTP Request (activity from host to skill). Below are two simple scenarios:

Scenario 1: Skill should “ignore” versionChanged events during bubble phase

The skill host should then re-invoke the skill with the original activity. The activity should probably have an added flag that signals to the skill to not re-bubble the event to the skill host and instead continue processing.

Scenario 2: Skill shouldn’t ignore versionChanged event for current conversation

The skill host doesn’t re-invoke the skill. If the skill host cancels the SkillDialog, this will send an EoC to the skill for state management.


Edit:

~The new problem is that the skill now doesn’t know if it should fire its “post Bubble” event handler for unhandled versionChanged events. The skill doesn’t get an in-turn response when it bubbles the event to its skill host, because it ended the current turn when it informed the parent of the versionChanged event.~

2nd Edit:

Edit #1 is fine/unavoidable... When the parent re-invokes the skill, the flag that indicates to the child to not re-bubble the event is equivalent to handled = true; right before the bubble phase.

The event should still be unhandled after the pre-bubble, but won’t trigger the bubble or post-bubble handlers.

The post-bubble handlers are never called.

Or the SkillDialog.OnDialogEventAsync override calls post-bubble because it can inspect the activity to see if it has been resent due to a versionChanged bubble.

3rd Edit:

Either the parent never knows about its skills’ versionChanged events, or the skill’s dialogs _possibly_ don’t trigger a post-bubble handler for its versionChanged.

@garypretty can you help?

Sure, I can help implement this.

Synced with @stevengum and agreed we will send a trace activity and TelemetryClient.TrackTrace - this will ensure that when using skills in production, developers can still see unhandled version changed events if they are using telemetry.

Looks like all PRs have been merged. Closing.

Was this page helpful?
0 / 5 - 0 ratings