Cosmos DB change feed doesn't get triggered the first time, until I manually visit the Function App -> Change Feed Function -> function.json -> Logs. This change feed is being observed on a partitioned collection with 25k RUs. Function is hosted on consumption plan with a single lease (leases collection: 400 RUs). Once it gets triggered it works fine afterwards but the first time I always have to navigate to the Logs as mentioned above.
The behavior is not tied to a specific Collection or Cosmos DB account but always reproducible by deploying this ARM template and running these load tests. The trigger is configured in code here.
What is the volume of events (inserts/updates) that are happening the first time? Is it 1 event or are there multiple ones?
Are the events being persisted in the database some time after the Function creation or right at the time of creation?
There are multiple events. Its a load test which ramps up to 1500 upserts/sec in the collection. The events are triggered way after the function creation.
It sounds like the Scale Controller in the Consumption Plan is not detecting the changes and increasing the instances.
Could you try if this repros with a Function App in App Service Mode? So we can rule out an issue inside the Trigger.
@syedhassaanahmed how does the ARM template deploy the function code? Is it using webdeploy/msdeploy or some other means?
@ealsur Function App on App Service Plan doesn't trigger Changefeed at all, not even after visiting Logs in the portal. The only event I see in Application Insights live stream is
Host Status: {
"id": "consumer-egress-func",
"state": "Running",
"version": "1.0.11612.0"
}
@paulbatum ARM template deploys the function code using GitHub repo.
"properties": {
"RepoUrl": "[parameters('repoURL')]",
"branch": "master",
"IsManualIntegration": true
}
OK, that should work. Can you give me a UTC timeframe for both a deployment and a corresponding load test where this issue repro'd?
Deployment Time: 2018-03-21T21:07:45
Load Test: 2018-03-21T21:20:04
Thanks, I can see an error in our scale logs for this timeframe, but its very non-descript. Not sure what is wrong yet. We're going to dig further (might take some time as we may need to deploy a change with more logging before we can understand the underlying issue).
Nevermind, I found the logs with the specific error. We're investigating..
Hi @paulbatum ,
I've just retried this scenario and the problem still exists on Change Feed not getting triggered until the Function App portal console is manually visited. Do you have an ETA on this?
@syedhassaanahmed Could you confirm if the leases collection you are using already exists before the deploy? Are you reusing it in-between deploy, deleteting or creating a new one every time?
@ealsur The existence of leases collection doesn't matter. Tried both reusing it and re-creating a new one every time, but the end result is still that I have to navigate to the portal in order for Change Feed to "kick in".
@syedhassaanahmed Could you do a test?
Thanks @ealsur I just made another observation after going through the steps you've described above.
The way I've been inserting documents into the target collection is by using this load test script which prior to sending the load, recreates both the target collection as well as leases collection (recreating leases collection doesn't impact the bug though). Reason for this was to have a clean slate each time. If there is a simpler way to delete all records in a partitioned Cosmos DB collection, I'm happy to use that.
What makes the difference is, when I don't recreate the target collection each time in the load test script, changefeed does get triggered. When I delete the collection and create it again with the same name, somehow the changefeed link gets broken and the only way to trigger it back is to visit the portal.
In any case here is the Change Feed Azure Function Name consumer-egress-func deployed after above steps.
Thanks @syedhassaanahmed, one last question, does that test script gets run after or before you deploy the Function code?
Basically, does the lease collection exist when the deploy is made?
@ealsur Load test script runs after the deployment of Function code. Here is the ARM template which deploys the Function.
Leases collection doesn't exist when deployment is made. Leases is created as part of the Load Test. Also the function code uses attribute CreateLeaseCollectionIfNotExists = true but its only taken into account once the ChangeFeed starts getting triggered after visiting the portal.
If you'd like to reproduce the scenario in your own sub, just clone this repo and follow the instructions in readme.
What makes the difference is, when I don't recreate the target collection each time in the load test script, changefeed does get triggered. When I delete the collection and create it again with the same name, somehow the changefeed link gets broken and the only way to trigger it back is to visit the portal.
My initial thought is that when you deploy the Function, it tries to start, but fails due to the non-existent target collection (this is similar to when you create a Function and you point to an invalid connection string for example). When the target collection already exists, it starts without issue as you described. I'll see if I can find any logs to back this train of thought. Did you try adding an HTTP Triggered Function (empty, that actually does nothing, with anon auth) to the deployed Functions and calling it in the generate-load.sh process to force a wake up of the instances after the initial fail due to the non-existent collection?
@ealsur Well, collections are not part of Cosmos DB Connection Strings though, but I get your point. To add to that, the entire solution (function apps, service bus, cosmos db, app insights) is being deployed using ARM. ARM currently doesn't support creating a Cosmos DB database and collection, just the account. That makes it impossible to deploy the Function App via ARM, such that the collection exists beforehand.
Let me try the empty HTTP Triggered Function to see if it forces a wake up.
@syedhassaanahmed Yeah, I agree, neither databases nor collections can be created through ARM and for this scenario, it's an issue.
The Trigger, when starting, tries to connect to the collection's Change Feed, and if the collection does not exist, it will fail to initialize. I'm not sure how Consumption Plan works in that scenario (probably Functions team can comment), but it sounds like the instance gets shutdown. Pinging the HTTP Trigger should force an instance to wake up, and since the collection should be created by then, the Trigger should initialize correctly.
@ealsur I can confirm that using an HTTP Trigger correctly forces a wakeup and consequently triggers a changefeed.
It seems then that it's a race condition of sorts, where the Function tries to start initially and fails due to the non-existent collections. The HTTP Trigger ping forces a wake-up and, since the collections exist at that point, the Trigger will initialize correctly.
If the collections exist before the deploy as you indicated, it works from the start, because the initial Trigger initialization work.
@ealsur I'm facing some issues that could perhaps be related to this discussion (for which I may potentially need to open a separate isse), but for the purposes of education for anyone coming across this thread - what would happen in @syedhassaanahmed 's scenario above to any changes to the collection that are modified after the function initially fails, but before the HTTP trigger ping wakes it up? Would they get sent to the function upon it starting or are they lost?
@andrewdmoreno If the Trigger ran at least once, at some point in time, it will have the leases with the latest processed checkpoint. Any failures to start won't make it lose any changes since the trigger will start from the last successful checkpoint, whenever that was.
If the Trigger never ran (no leases) and changes happen before it first starts, then yes, those changes are lost in terms of processing, since the Trigger, when no leases exist, starts processing from that point in time onward.
Forgot to mark this issue as closed. The Host will now retry on an exponential interval to start the Trigger, ref: https://github.com/Azure/azure-webjobs-sdk-extensions/pull/433