When I signal (schedule) a message to an entity in the future and during that scheduled time frame no server instances are running (cold), is it expected the function app will wake and provision an instance?
i'm noticing the messages are not being delivered until our service bus trigger executes (hours later) and wanted to get some clarification.
thanks!
It should definitely be woken up by our scaling logic.
Can you share your application name, task hub name, region, and timestamp? If you can't provide the app name or task hub name, an orchestration instance id (assuming they are unique and you aren't reusing them) would also be sufficient.
Sure here you go:
app name: auction-etr-int-purchases-east-test
task hub name: PurchaseTaskHub
entity id: @purchase@site-id@202@sale-line-id@1418372813146095259@block-dtm@20200908T160810708
region: us east
According to the traces in app insights a signal was sent at 9/8/2020, 8:42:00.956 PM (UTC) and scheduled to be delivered at 9/8/2020, 8:47:00 PM (UTC).
Then again looking at app insights, the traces show the request coming in at 9/9/2020, 12:03 AM (UTC)
I'm using the following to get the scheduled datetime
var scheduleTimeUtc = DateTime.UtcNow.Add(TimeSpan.FromMinutes(5));
Let me know if you need anything else!
@mpaul31
So I took a look. What appears to be the problem is that for some reason, our scale controller thinks that all of your entity triggers are turned off via app settings. I did confirm that you do have the app settings present that can be used to disable functions, but you have them set to false. I took a look at the Scale Controller code and your settings should be fine for that.
I would try some of the following steps to try to force the scale controller to recognize your trigger.
If you could report back when you have done any of these steps, I am very curious as to how you got in this state.
hmm that is strange and the only place i have all my entity triggers disabled is in the stage slot which also uses a different task hub altogether.
i can try some of those things tomorrow but it's going to be difficult to reproduce. it also appears we had a deployment occurring around that same time (which is completely possible). Does this help at all?
now i'm wondering if the scaling i was seeing during a load test was due to the service bus trigger and not the durable runtime.
is there a way for me to tell via the logs or some other way whether or not scaling is working appropriately? or was it in a temporary bad state?
thanks!
@ConnorMcMahon i just removed the disabled app settings from the prod slot and restarted.
@mpaul31,
That fixed part of the issue (though I still don't know why our ScaleController had the wrong app settings for your app).
That being said, for some reason our scaling logic _still_ wasn't picking up your TaskHub traffic. It turns out that I have a typo in our sync triggers API on the function host (it should be "storageProvider" not "storageOptions"). Because of that, we are looking for the taskhub on your default storage account in AzureWebJobsStorage, not on the one specified under extensions.durableTask.storageProvider.connectionStringName. This will require a fix to the functions host, which we can track here.
In the meantime, you can duplicate your value of extensions.durableTask.storageProvider.connectionStringName under extensions.durableTask.storageOptions.connectionStringName. It should have no effect on your application runtime, but it will ensure that we send our scale controller the right connection string information.
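For reference, a rough sketch of what the host.json workaround could look like. The setting name `DurableStorageConnection` is a hypothetical placeholder (use whatever app setting name you already have under `storageProvider`), and `PurchaseTaskHub` is just the task hub name from this thread:

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "PurchaseTaskHub",
      "storageProvider": {
        "connectionStringName": "DurableStorageConnection"
      },
      "storageOptions": {
        "connectionStringName": "DurableStorageConnection"
      }
    }
  }
}
```

Both sections should carry the same value: the extension keeps reading `storageProvider` at runtime, while the duplicated `storageOptions` entry is only there so the sync triggers call passes the right storage account to the scale controller until the host fix ships.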
@ConnorMcMahon OK thanks I'll get my host.json updated and give it a shot! Once I kick off a new release those app settings will get added again so hopefully we can stay out of the bad state.
When I ping you back, can you check and see if everything looks OK?
Absolutely!
@ConnorMcMahon made my updates and ran a load test. please let me know if this looks as expected. thanks!
@ConnorMcMahon strange but now I am seeing these errors appear in app insights after the changes. these are operation failures and are concerning to me because i am unsure whether DF will execute the entity operation again (unable to tell from the logs) or whether the message is lost. i'm going to guess this is due to scaling (i hadn't seen this before the latest config changes)?
Container is disposed and should not be used: Container is disposed. You may include Dispose stack-trace into the message via: container.With(rules => rules.WithCaptureContainerDisposeStackTrace())
@mpaul31, I can confirm that the scale controller vote is behaving as expected finally. I am going to relabel this issue to accurately reflect the underlying bug.
As for the other issue:
Sometimes this happens when the functions host is shutting down and we have started the execution from the DTFx side, but we have not entered your code yet. It is often thrown because the Functions Host logger factory is disposed but we try to get a new logger for your function execution. We do make a best effort to abandon messages that fail before entering customer code, so that they get retried as opposed to failing, but I would need to take a closer look to verify this is the case for the exceptions you encountered.
Would you mind filing another issue, with timestamp details and ideally the entity id that encountered these exceptions (you may not have that, as these exceptions often happen before you can enter your own code)? A full stacktrace would also be nice. Having this as a separate issue will let us track them separately, as these are almost certainly different root causes.
@ConnorMcMahon do you have any ideas on when this would make it into a release? i'm going to bet there are a lot of people running on consumption with this going unnoticed wondering why things are not scaling appropriately.
I know you provided a configuration workaround but I always feel a little uneasy regarding undocumented changes like this.
I'll open a PR in the next day or two against the Functions Host, so that it can go into the next V2/V3 release.
awesome man thanks for the quick turnaround!
@mpaul31,
I have a PR tracking it against both Functions V3 and Functions V2. It looks like a release may have just been cut, so my guess is that it will be another ~4 weeks before this fix is widely deployed.