Azure-functions-host: Cosmos DB Trigger: Consumption plan not creating instances

Created on 18 Mar 2018 · 24Comments · Source: Azure/azure-functions-host

Function App version: 1.0
Function App name: consumer-egress-func
Function name(s) (as appropriate): ChangeFeedFunc
Region: West Europe

Cosmos DB change feed doesn't get triggered the first time, until I manually visit the Function App -> Change Feed Function -> function.json -> Logs. This change feed is being observed on a partitioned collection with 25k RUs. Function is hosted on consumption plan with a single lease (leases collection: 400 RUs). Once it gets triggered it works fine afterwards but the first time I always have to navigate to the Logs as mentioned above.

The behavior is not tied to a specific Collection or Cosmos DB account but always reproducible by deploying this ARM template and running these load tests. The trigger is configured in code here.

Source

syedhassaanahmed

👍1

All 24 comments

What is the volume of events (inserts/updates) that are happening the first time? Is it 1 event or are there multiple ones?

Are the events being persisted in the database some time after the Function creation or right at the time of creation?

ealsur on 19 Mar 2018

There are multiple events. Its a load test which ramps up to 1500 upserts/sec in the collection. The events are triggered way after the function creation.

syedhassaanahmed on 19 Mar 2018

It sounds like the Scale Controller in the Consumption Plan is not detecting the changes and increasing the instances.

Could you try if this repros with a Function App in App Service Mode? So we can rule out an issue inside the Trigger.

ealsur on 21 Mar 2018

@syedhassaanahmed how does the ARM template deploy the function code? Is it using webdeploy/msdeploy or some other means?

paulbatum on 21 Mar 2018

@ealsur Function App on App Service Plan doesn't trigger Changefeed at all, not even after visiting Logs in the portal. The only event I see in Application Insights live stream is

Host Status: {
 "id": "consumer-egress-func",
 "state": "Running",
 "version": "1.0.11612.0"
 }

@paulbatum ARM template deploys the function code using GitHub repo.

"properties": {
                        "RepoUrl": "[parameters('repoURL')]",
                        "branch": "master",
                        "IsManualIntegration": true
                    }

syedhassaanahmed on 21 Mar 2018

OK, that should work. Can you give me a UTC timeframe for both a deployment and a corresponding load test where this issue repro'd?

paulbatum on 21 Mar 2018

Deployment Time: 2018-03-21T21:07:45
Load Test: 2018-03-21T21:20:04

syedhassaanahmed on 22 Mar 2018

Thanks, I can see an error in our scale logs for this timeframe, but its very non-descript. Not sure what is wrong yet. We're going to dig further (might take some time as we may need to deploy a change with more logging before we can understand the underlying issue).

paulbatum on 22 Mar 2018

Nevermind, I found the logs with the specific error. We're investigating..

paulbatum on 22 Mar 2018

👍1

Hi @paulbatum ,
I've just retried this scenario and the problem still exists on Change Feed not getting triggered until the Function App portal console is manually visited. Do you have an ETA on this?

syedhassaanahmed on 4 Jun 2018

@syedhassaanahmed Could you confirm if the leases collection you are using already exists before the deploy? Are you reusing it in-between deploy, deleteting or creating a new one every time?

ealsur on 6 Jun 2018

@ealsur The existence of leases collection doesn't matter. Tried both reusing it and re-creating a new one every time, but the end result is still that I have to navigate to the portal in order for Change Feed to "kick in".

syedhassaanahmed on 7 Jun 2018

👍1

@syedhassaanahmed Could you do a test?

Do a deploy and enter the Portal
Verify that the lease collection contains documents
Delete the Function
Do another deploy targeting the same lease collection
Do not enter the Portal and insert some documents to see if the Function does not fire
Share the name of the Function with us so we can take a look at the telemetry.

ealsur on 7 Jun 2018

Thanks @ealsur I just made another observation after going through the steps you've described above.
The way I've been inserting documents into the target collection is by using this load test script which prior to sending the load, recreates both the target collection as well as leases collection (recreating leases collection doesn't impact the bug though). Reason for this was to have a clean slate each time. If there is a simpler way to delete all records in a partitioned Cosmos DB collection, I'm happy to use that.

What makes the difference is, when I don't recreate the target collection each time in the load test script, changefeed does get triggered. When I delete the collection and create it again with the same name, somehow the changefeed link gets broken and the only way to trigger it back is to visit the portal.

In any case here is the Change Feed Azure Function Name consumer-egress-func deployed after above steps.

syedhassaanahmed on 8 Jun 2018

Thanks @syedhassaanahmed, one last question, does that test script gets run after or before you deploy the Function code?

Basically, does the lease collection exist when the deploy is made?

ealsur on 8 Jun 2018

@ealsur Load test script runs after the deployment of Function code. Here is the ARM template which deploys the Function.

Leases collection doesn't exist when deployment is made. Leases is created as part of the Load Test. Also the function code uses attribute CreateLeaseCollectionIfNotExists = true but its only taken into account once the ChangeFeed starts getting triggered after visiting the portal.

If you'd like to reproduce the scenario in your own sub, just clone this repo and follow the instructions in readme.

syedhassaanahmed on 8 Jun 2018

What makes the difference is, when I don't recreate the target collection each time in the load test script, changefeed does get triggered. When I delete the collection and create it again with the same name, somehow the changefeed link gets broken and the only way to trigger it back is to visit the portal.

My initial thought is that when you deploy the Function, it tries to start, but fails due to the non-existent target collection (this is similar to when you create a Function and you point to an invalid connection string for example). When the target collection already exists, it starts without issue as you described. I'll see if I can find any logs to back this train of thought. Did you try adding an HTTP Triggered Function (empty, that actually does nothing, with anon auth) to the deployed Functions and calling it in the generate-load.sh process to force a wake up of the instances after the initial fail due to the non-existent collection?

ealsur on 8 Jun 2018

@ealsur Well, collections are not part of Cosmos DB Connection Strings though, but I get your point. To add to that, the entire solution (function apps, service bus, cosmos db, app insights) is being deployed using ARM. ARM currently doesn't support creating a Cosmos DB database and collection, just the account. That makes it impossible to deploy the Function App via ARM, such that the collection exists beforehand.

Let me try the empty HTTP Triggered Function to see if it forces a wake up.

syedhassaanahmed on 8 Jun 2018

@syedhassaanahmed Yeah, I agree, neither databases nor collections can be created through ARM and for this scenario, it's an issue.
The Trigger, when starting, tries to connect to the collection's Change Feed, and if the collection does not exist, it will fail to initialize. I'm not sure how Consumption Plan works in that scenario (probably Functions team can comment), but it sounds like the instance gets shutdown. Pinging the HTTP Trigger should force an instance to wake up, and since the collection should be created by then, the Trigger should initialize correctly.

ealsur on 8 Jun 2018

@ealsur I can confirm that using an HTTP Trigger correctly forces a wakeup and consequently triggers a changefeed.

syedhassaanahmed on 8 Jun 2018

👍1

It seems then that it's a race condition of sorts, where the Function tries to start initially and fails due to the non-existent collections. The HTTP Trigger ping forces a wake-up and, since the collections exist at that point, the Trigger will initialize correctly.

If the collections exist before the deploy as you indicated, it works from the start, because the initial Trigger initialization work.

ealsur on 8 Jun 2018

@ealsur I'm facing some issues that could perhaps be related to this discussion (for which I may potentially need to open a separate isse), but for the purposes of education for anyone coming across this thread - what would happen in @syedhassaanahmed 's scenario above to any changes to the collection that are modified after the function initially fails, but before the HTTP trigger ping wakes it up? Would they get sent to the function upon it starting or are they lost?

andrewdmoreno on 9 Nov 2018

@andrewdmoreno If the Trigger ran at least once, at some point in time, it will have the leases with the latest processed checkpoint. Any failures to start won't make it lose any changes since the trigger will start from the last successful checkpoint, whenever that was.
If the Trigger never ran (no leases) and changes happen before it first starts, then yes, those changes are lost in terms of processing, since the Trigger, when no leases exist, starts processing from that point in time onward.

ealsur on 12 Nov 2018

Forgot to mark this issue as closed. The Host will now retry on an exponential interval to start the Trigger, ref: https://github.com/Azure/azure-webjobs-sdk-extensions/pull/433

ealsur on 12 Nov 2018

Was this page helpful?

0 / 5 - 0 ratings

Related issues

HTTP model binding can't handle DateTimes with seconds

JasonBSteele · 3Comments

Azure Function '_mysql' is not defined Python3.8

rati3l · 3Comments

[PHP] Using Blob as input fails with `mscorlib` exception

ElvenSpellmaker · 3Comments

Access to XMLHttpRequest at 'http://localhost:8080/register' from origin 'http://localhost:4200' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: It does not have HTTP ok status.

slim-souissi · 3Comments

How to detect host shutdown using CancellationToken (aka Graceful shutdown)

shibayan · 3Comments