Azure-functions-host: Queue part of host.json has no impact on how WebJobs perform

Created on 20 Jan 2017 · 11 comments · Source: Azure/azure-functions-host

WebJobs seem to ignore the queue settings in host.json regarding batchSize, newBatchThreshold and visibilityTimeout. The initial goal is to take only one message into execution and pick up the next one when the first is finished. Currently the WebJob seems to execute as much as possible at once, which leads to plenty of Never Finished jobs.

Repro steps

Given we have a valid host.json file in the WebJob folder that, among other things, describes queue behaviour:

"queues": {
    "visibilityTimeout": "00:00:10",
    "batchSize": 6,
    "maxDequeueCount": 5,
    "newBatchThreshold": 3

  }

I create e.g. 40 messages in a for loop, and I have a long-running WebJob (ca. 30 seconds on a cold start); a minimal sketch of such an enqueue loop is shown below.
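This sketch assumes the azure-storage npm package and a storage connection string in the AzureWebJobsStorage environment variable; the queue name 'repro-input' (the one that shows up in the logs later in the thread) is used purely for illustration:

var azure = require('azure-storage');
var queueService = azure.createQueueService(process.env.AzureWebJobsStorage);

// Create the trigger queue if needed, then enqueue 40 work items.
queueService.createQueueIfNotExists('repro-input', function (err) {
    if (err) throw err;
    for (var i = 0; i < 40; i++) {
        queueService.createMessage('repro-input', 'work item ' + i, function (err) {
            if (err) console.error('enqueue failed:', err);
        });
    }
});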

Expected behavior

  • With the file above I expect up to 6 messages to be taken from the queue in one fetch, and the next batch (next 6?) to be fetched when the count of currently executing jobs drops to 3.
  • Moreover, I expect any message to become visible again 10 s after the start of its execution, as the normal execution time is greater than that.
  • Lastly, a failed execution should leave the message in the main queue until its dequeue count reaches 5, and only then move it to the poison queue.

Actual behavior

  • The WebJob takes some large number of messages in one fetch that bears no relation to batchSize: 6, and repeats this during execution, giving the impression that it tries to run as much in parallel as possible.
  • Messages never become visible again. Even Failed and Never Finished messages remain invisible, although they do eventually get executed again.
  • I cannot observe dequeueCount, as messages never become visible again.
  • When placing ca. 40 messages of non-trivial tasks, the WebJob gets stuck and some of the messages end up Never Finished.
  • Running locally, host.json works well, especially with respect to batchSize and newBatchThreshold. I have not noticed visibilityTimeout working locally as expected.
  • I can observe host.json influencing the WebJob by setting a very low functionTimeout and then having all functions fail (see the sketch below).
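For reference, both the queues block and functionTimeout sit at the top level of host.json; a sketch of a complete file with an intentionally short timeout (the timeout value here is illustrative only) would look roughly like:

{
  "queues": {
    "visibilityTimeout": "00:00:10",
    "batchSize": 6,
    "maxDequeueCount": 5,
    "newBatchThreshold": 3
  },
  "functionTimeout": "00:00:05"
}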

Related information

The script uses the Node.js QueueTrigger from the samples and is deployed to Azure with option 2. The WebApp is S1 Standard with no dynamic scaling.

question

Most helpful comment

Indeed, it works as expected after fixing the issue, and the WebHost is now sensitive to knob changes.
@mathewc thank you for your commitment to the issue.

All 11 comments

To confirm the details of your deployment, you've used the steps here? I.e., you're deploying this as a Continuous WebJob?

How many instances is your WebApp scaled out to? The reason I ask is that all of those queue configuration knobs affect processing on a single instance. That's why things are working for you when you run locally I believe - there is only a single instance. When scaled out to multiple instances, each instance will be operating independently based on those settings - the settings aren't limiting things globally.

Thank you for your attention to this issue.

Exactly, I followed the steps described in the link you posted. Yes, it is a continuous WebJob that is polling an Azure queue.

As far as I know the WebApp is not scaling either up or out, as I have never configured either. Nevertheless I will check this and post the feedback here.

There is no scaling out set.

This morning, after a whole weekend of idling, the posted settings started to work. It behaved as expected: I posted 20 messages at once and could observe them being taken into execution in batches of 6. In MS Azure Storage Explorer I saw 20 messages before it all started, then 14, then 8, then 2, and in the end all were successfully executed - they disappeared.

I have just tried to experiment more and changed these settings to the following:

"queues": {
    "maxPollingInterval": 2000,  
    "visibilityTimeout": "00:00:10",
    "batchSize": 4,
    "maxDequeueCount": 5,
    "newBatchThreshold": 2

  }

Soon after this change I see the following lines in the output:

[01/23/2017 06:24:23 > 80c57d: SYS INFO] Job directory change detected: Job file 'QueueTrigger\index.js' timestamp differs between source and working directories.
[01/23/2017 06:26:32 > 80c57d: SYS WARN] Failed to delete temporary directory

For some reason the settings are again being ignored, as on Friday. When posting 20 messages at once they are all taken into execution, and in MS Azure Storage Explorer immediately after posting I can see:

[screenshot]

And similarly on the dashboard:
[screenshot]

After some time the experiment gives exactly the same results. The same after a WebApp restart.

The feeling I get is that host.json is not refreshed automatically; only the change is noticed, and thus host.json is not taken into account anymore - at least for some time, since after the weekend it somehow got refreshed. How do I force the WebJob to reload the host.json file?

Related Info

index.js

module.exports = function (context, workItem) {
    context.log('Node.js queue trigger function processed work item', new Date());

    // Simulate a long-running job: signal completion after 15 seconds.
    setTimeout(function () { context.done(null, workItem); }, 15000);
};

Update
I have noticed that each new batch of messages of the same size sent to the queue is executed with more parallelism.
At first only a few messages are executed at the same time. After a few similar consecutive trials the whole batch is taken into execution at once. Does this have something to do with the growing number of processes in the WebJob dashboard after each batch is processed?

The dashboard image you show above with all the "Running" lines - if all you have is a single continuous WebJob deployed, you should only see a single "Running" line in the dashboard for that WebJob. Can you expand the image you provided above to also show the name of the WebJob? Are they all the same, indicating that it isn't a single continuous job running, but multiple instances of the same job (i.e. a triggered or scheduled job)?

A normal continuous job starts once, reads the host.json file and stays running. There would only be a single "Running" line for that host.

Can you share a complete repro project (e.g. point me to a small repository with all the files you're deploying) with the deploy steps you followed exactly so I can try to reproduce this?

Also wondering if this isn't related to https://github.com/Azure/azure-webjobs-sdk-script/issues/1118. In that issue it is also reported that multiple host instances are being started. That bug is also using the same deployment model.

The image I posted previously might be misleading. The "Running" entries were function invocations, not WebJobs.

[screenshot]

This is what happens when I post a batch of 40 messages.
Namely, the host.json queue knobs have no effect. Maybe they even have an effect at the very beginning, on the first invocation of a batch of messages (after a long idle period or after a change of settings): the queue is then consumed in fixed batches and it looks like it follows the settings. But with every next batch it all happens more and more in parallel, as in the image above.

Btw, according to the instructions:

  1. To the same continuous WebJob folder above, add a bin directory containing the WebJobs.Script.Host binaries. These binaries are in the WebJobs.Script.Host.zip file.

This is the step I am least sure about, as I cannot find the mentioned zip; hence I always look for the exe file mentioned in step 4:

  1. A run.cmd file in the WebJob folder (peer to the bin dir) that invokes WebJobs.ScriptHost.exe (e.g. .\bin\Microsoft.Azure.WebJobs.Script.Host.exe). When the WebJob starts, this will be the entry point, the ScriptHost will start up, load all the function scripts and begin running your functions.

and upload all of it, which might be overhead or simply wrong. You can also see this in the repo I pointed you to.

I pulled your project and ran it locally. I changed the batch size to 3 and newBatchThreshold to 1, and enqueued 12 messages before starting the host. I modified your function's logging output to:

module.exports = function (context, workItem) {
    context.log('Beginning message processing...');

    setTimeout(function(){
        context.log('Processing complete');
        context.done();
    }, 30000);
}
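For reference, the corresponding queues section for that local test (a sketch showing only the changed settings) would presumably be:

"queues": {
    "batchSize": 3,
    "newBatchThreshold": 1
}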

Here is my output - the batching appears to be working fine for me, with messages being processed in batches of 3. You can see from the logs below that only 3 are being processed at a time:

Development settings applied
Found the following functions:
Host.Functions.QueueTrigger
Job host started
Executing 'Functions.QueueTrigger' (Reason='New queue message detected on 'repro-input'.', Id=160a2a4f-5e5c-465e-81a5-b6fe4ad52927)
Executing 'Functions.QueueTrigger' (Reason='New queue message detected on 'repro-input'.', Id=5475eb76-dbcf-406d-8325-36c843913c35)
Executing 'Functions.QueueTrigger' (Reason='New queue message detected on 'repro-input'.', Id=57a3e155-a4c3-4f3f-ae64-71bdf3ad7455)
Beginning message processing...
Beginning message processing...
Beginning message processing...
Processing complete
Processing complete
Processing complete
Executed 'Functions.QueueTrigger' (Succeeded, Id=57a3e155-a4c3-4f3f-ae64-71bdf3ad7455)
Executed 'Functions.QueueTrigger' (Succeeded, Id=160a2a4f-5e5c-465e-81a5-b6fe4ad52927)
Executed 'Functions.QueueTrigger' (Succeeded, Id=5475eb76-dbcf-406d-8325-36c843913c35)
Executing 'Functions.QueueTrigger' (Reason='New queue message detected on 'repro-input'.', Id=4e8cc482-609c-4b06-a973-55e21c32569a)
Beginning message processing...
Executing 'Functions.QueueTrigger' (Reason='New queue message detected on 'repro-input'.', Id=202ea0d5-c80b-4043-b07b-742d62adbbea)
Beginning message processing...
Executing 'Functions.QueueTrigger' (Reason='New queue message detected on 'repro-input'.', Id=5915dd4c-ecdb-4402-b011-1740bf2705b3)
Beginning message processing...
Processing complete
Processing complete
Processing complete
Executed 'Functions.QueueTrigger' (Succeeded, Id=202ea0d5-c80b-4043-b07b-742d62adbbea)
Executed 'Functions.QueueTrigger' (Succeeded, Id=4e8cc482-609c-4b06-a973-55e21c32569a)
Executed 'Functions.QueueTrigger' (Succeeded, Id=5915dd4c-ecdb-4402-b011-1740bf2705b3)
Executing 'Functions.QueueTrigger' (Reason='New queue message detected on 'repro-input'.', Id=9aba1708-c6fb-4aea-be87-aac61ea417fd)
Beginning message processing...
Executing 'Functions.QueueTrigger' (Reason='New queue message detected on 'repro-input'.', Id=0cd2d439-1f7f-4feb-a15a-60f7db521fb7)
Beginning message processing...
Executing 'Functions.QueueTrigger' (Reason='New queue message detected on 'repro-input'.', Id=a7e8c3ed-a5ba-4cd7-9a80-d193e3fabae4)
Beginning message processing...
Processing complete
Processing complete
Processing complete
Executed 'Functions.QueueTrigger' (Succeeded, Id=0cd2d439-1f7f-4feb-a15a-60f7db521fb7)
Executed 'Functions.QueueTrigger' (Succeeded, Id=9aba1708-c6fb-4aea-be87-aac61ea417fd)
Executed 'Functions.QueueTrigger' (Succeeded, Id=a7e8c3ed-a5ba-4cd7-9a80-d193e3fabae4)
Executing 'Functions.QueueTrigger' (Reason='New queue message detected on 'repro-input'.', Id=bd18c4ef-a938-427a-b222-a485caef13e2)
Beginning message processing...
Executing 'Functions.QueueTrigger' (Reason='New queue message detected on 'repro-input'.', Id=15f206e3-0c59-4465-b437-d5abe5a92415)
Beginning message processing...
Executing 'Functions.QueueTrigger' (Reason='New queue message detected on 'repro-input'.', Id=7cd97ccf-2a55-42be-80dd-5f4b681ecf6a)
Beginning message processing...
Processing complete
Processing complete
Processing complete
Executed 'Functions.QueueTrigger' (Succeeded, Id=bd18c4ef-a938-427a-b222-a485caef13e2)
Executed 'Functions.QueueTrigger' (Succeeded, Id=15f206e3-0c59-4465-b437-d5abe5a92415)
Executed 'Functions.QueueTrigger' (Succeeded, Id=7cd97ccf-2a55-42be-80dd-5f4b681ecf6a)

So I'm pretty sure there are no issues with these settings being used. It might be that the Dashboard is making it look like there is an issue? E.g. there will be a slight delay between when a function completes and when the Dashboard shows completed for it. However, I'd only expect that overlap to be small.

Try running this locally for yourself from the command line to convince yourself that things are working as expected. I was also able to set the batchSize to 1 and newBatchThreshold to 0 to get one at a time processing as you asked for initially.
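For the one-at-a-time behaviour, a sketch of that queues configuration (matching the values just mentioned) would be:

"queues": {
    "batchSize": 1,
    "newBatchThreshold": 0
}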

Closing this as no repro. If you still think there is an issue please reopen.

First of all, thank you @mathewc for your involvement in the issue.

To begin with, I need to say that I ran all the tests according to your advice and got exactly the same output as you. Moreover, from the beginning I observed that locally it is OK in terms of these two knobs. Maybe I was not clear enough about it (quote from the initial post):

Running locally, host.json works well, especially with respect to batchSize and newBatchThreshold. I have not noticed visibilityTimeout working locally as expected.

The console output you pasted above shows the local execution sequence perfectly. I would really like to paste similar output from the remote Azure WebJob, but as far as I know there is no such possibility - only the output of a single task, which gives no clue about the number of messages being processed in parallel. Nevertheless, the behaviour of the remote WebJob is completely different from local, with exactly the same settings and the same number of messages to process. It is easiest to observe with MS Azure Storage Explorer how messages _become invisible_ when running locally versus remotely.

The test scenario is the following: I use your batching settings (3/1), put 32 messages on the queue and observe their consumption in MS Azure Storage Explorer. Any time the ratio of visible to invisible messages changes, I take a screenshot (a code-based alternative is sketched below).
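As an alternative to screenshots, the visible/invisible split could also be sampled from a small script. A rough sketch, assuming the azure-storage npm package, the AzureWebJobsStorage connection string, and 'repro-input' as the queue name (all illustrative):

var azure = require('azure-storage');
var queueService = azure.createQueueService(process.env.AzureWebJobsStorage);

// Every 5 seconds, log the approximate total queue depth and how many
// messages are currently visible (peek returns at most 32 messages).
setInterval(function () {
    queueService.getQueueMetadata('repro-input', function (err, queueResult) {
        if (err) return console.error(err);
        queueService.peekMessages('repro-input', { numOfMessages: 32 }, function (err, visibleMessages) {
            if (err) return console.error(err);
            console.log(new Date().toISOString(),
                'approx total:', queueResult.approximateMessageCount,
                'visible:', visibleMessages.length);
        });
    });
}, 5000);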

Locally:
[screenshots 1-4]

and so on... - always _eating_ 3 messages until the queue is empty

Remotely:
[screenshots r1-r6]

and so on...

Clearly there is a difference. Clearly messages are being consumed in a different manner than stated in host.json. After these trials I noticed the following situation in Process Explorer:
[screenshot]

My assumption is that there are multiple _processes_, each (probably) behaving exactly as host.json describes, but because there is more than one such process, the overall outcome is different than expected.
Assuming that the job itself is something more sophisticated than console.log, that it requires a significant amount of resources, and that we have no control over the number of simultaneous jobs, it all results in Never Finished jobs or in exceeding 85% of available memory (I receive emails and dashboard alerts).

During operation and trials, the WebJob repeatedly produces the following output:

 Status changed to Success
 Process went down, waiting for 60 seconds
 Status changed to PendingRestart
 WebJob singleton lock is released
 WebJob singleton lock is acquired
 Run script 'run.cmd' with script host - 'WindowsScriptHost'
 Status changed to Running

I am not sure whether the analysis above is a reason to reopen this issue, create a new one, or merge it with #1118 (surely there is a correlation); nevertheless, the problem still exists, which leads to the conclusion that I have no control over the number of queue messages processed in parallel in Azure WebJobs.

Thanks for all the details - I'll look into this more today and get back to you.

I've found your issue! Your run.cmd file is incorrect. Rather than using a "start" command like you've done which will spawn a child process, you need to simply have .\bin\Microsoft.Azure.WebJobs.Script.Host.exe in the file as the wiki stated. I'll make that more clear in the wiki.

When you do this, you'll notice that you no longer have multiple child processes, and instead will have one special process labeled as a WebJob process:

[screenshot]

I was able to reproduce the broken behavior with your run.cmd contents, and after fixing that things work as expected.
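For anyone hitting the same problem, the corrected run.cmd boils down to the single line the wiki describes; a sketch (the comments are illustrative):

rem run.cmd - invoke the script host directly; do not wrap it in a "start"
rem command, since that spawns a separate child process.
.\bin\Microsoft.Azure.WebJobs.Script.Host.exe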

Indeed, it works as expected after fixing the issue, and the WebHost is now sensitive to knob changes.
@mathewc thank you for your commitment to the issue.
