Description:
Found a bug while using SAM's new SQS queue automatic generation with the SNS Event Type feature: If your Lambda function timeout is greater than 30 seconds and you're using the SqsSubscription: true property to create an SQS queue with default settings, your deployment will fail with an error from the Lambda service saying the Lambda function timeout cannot be greater than the SQS queue's visibility timeout. SAM should pass the Lambda function timeout through as the SQS queue visibility timeout.
Sample template:
```yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Resources:
  Topic:
    Type: AWS::SNS::Topic
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      InlineCode: |
        exports.handler = async (event) => {
          return {}
        }
      Handler: index.handler
      Runtime: nodejs12.x
      Timeout: 60
      Events:
        TopicEvent:
          Type: SNS
          Properties:
            Topic: !Ref Topic
            SqsSubscription: true
```
Steps to reproduce the issue:
Observed result:
Deployment fails with this error when trying to create the Lambda event source mapping:
Queue visibility timeout: 30 seconds is less than Function timeout: 60 seconds (Service: AWSLambda; Status Code: 400; Error Code: InvalidParameterValueException; Request ID: **) The following resource(s) failed to create: [MyFunctionTopicEventEventSourceMapping]. Rollback requested by user.
Expected result:
Successful deployment.
I got hit by this a couple of times today as well. It appears you are forced to set the visibility timeout >= the function's timeout. Although this should usually be the case (AWS recommends setting the queue visibility timeout to at least 6 times the Lambda timeout), there are cases where you need to set the visibility timeout to a low value, since it implicitly controls the retry interval of the Lambda function.
We have a use case where the Producer sends a message every 5 minutes, and it keeps sending that same message until it has been processed and confirmed by the (single) Consumer. If the Consumer fails to process the message the first time (for whatever reason), we want it to retry as fast as possible (up to X times) and before the deduplication interval expires.
But since the Lambda retry interval is implicitly controlled by the queue's visibility timeout, if this timeout is much greater than the Lambda timeout, the retry will happen well after the deduplication interval (5 min) has expired. This results in duplicates in the (FIFO) queue.
In our case, we set the queue visibility timeout lower than the Lambda timeout because of the 5-minute interval at which the Producer sends messages. It's a trade-off between giving the Consumer enough time and waiting as little as possible for the same message to become available again after a failure. Otherwise, with a higher visibility timeout, it takes longer for the same message to be reprocessed; by then the deduplication interval (5 min) might have expired, and we risk duplicates in the FIFO queue, since the Producer will send the same message again after 5 minutes because it won't have been processed/confirmed by the Consumer yet.
Is the above really a SAM bug, or is this CloudFormation's doing behind the scenes? And is there a solution to the issue described above?
Thanks
@piersf SAM needs to add a VisibilityTimeout property to the SQS queue resource it generates and set its value to the Timeout given for the Lambda function. The constraint that VisibilityTimeout must always be greater than or equal to the Lambda Timeout is not enforced by SAM.
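In CloudFormation terms, the fix being asked for is roughly the following: the transform would emit the generated queue with its VisibilityTimeout copied from the function's Timeout. A hand-written sketch of the desired generated resource (the logical ID is illustrative, not actual SAM output):

```yaml
MyFunctionTopicEventQueue:
  Type: AWS::SQS::Queue
  Properties:
    VisibilityTimeout: 60   # copied from the function's Timeout property
```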
IMO this constraint shouldn't exist. It should be a warning, not an error. If I want to set the queue visibility timeout to be less than the Lambda timeout because of some weird use case I have, then I should be able to do that. It isn't AWS's job to enforce this.
Edit: just as an example, here is a perfectly reasonable use case that in fact improves on AWS's default behaviour but is currently impossible to implement due to this bizarre, opinionated enforcement: set a very low visibility timeout on the queue, but upon a successful Lambda invocation immediately increase it, in code, to the Lambda timeout (or higher) for all messages in the batch. This avoids waiting for the visibility timeout to pass (which can take many minutes) in case of throttles, which are very common for SQS triggers due to how the connection scaling works, and significantly improves queue processing time. It makes no sense for AWS to disallow something like this.
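The pattern described above could be sketched like this in a Node.js handler: build one `{Id, ReceiptHandle, VisibilityTimeout}` entry per record and send them via the SQS `ChangeMessageVisibilityBatch` API. `buildVisibilityEntries` and the `QUEUE_URL` environment variable are illustrative names, not part of SAM or the Lambda runtime.

```javascript
// Build the entries that ChangeMessageVisibilityBatch expects from an
// SQS-triggered Lambda event: one entry per record in the batch.
function buildVisibilityEntries(event, visibilityTimeoutSeconds) {
  return event.Records.map((record, i) => ({
    Id: String(i),                       // batch entry id, unique within the call
    ReceiptHandle: record.receiptHandle, // identifies the in-flight message
    VisibilityTimeout: visibilityTimeoutSeconds,
  }));
}

exports.handler = async (event) => {
  // ... process the batch here ...

  // On success, raise the messages' visibility to the function timeout.
  const entries = buildVisibilityEntries(event, 900);
  // With aws-sdk v2 the call would look like (left commented so the sketch
  // runs without AWS credentials):
  // const sqs = new (require('aws-sdk')).SQS();
  // await sqs.changeMessageVisibilityBatch({
  //   QueueUrl: process.env.QUEUE_URL,
  //   Entries: entries,
  // }).promise();
  return { entriesUpdated: entries.length };
};
```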
@praneetap, @ShreyaGangishetty can you please confirm (or deny, if that's the case) the following statement?
_For the moment, we cannot set the default visibility timeout for a SQS queue to be smaller than the timeout of the Lambda which is consuming the SQS queue messages._
I'm asking this because my team ran today into a similar CloudFormation change set execution issue:
_"Queue visibility timeout: 30 seconds is less than Function timeout: 900 seconds (Service: AWSLambda; Status Code: 400; Error Code: InvalidParameterValueException; ... "_
If this is true, we are forced to have the SQS default visibility timeout >= the Lambda timeout.
Also, a quite misleading thing (if not an actual bug) was that this error was thrown by CloudFormation, yet when we changed the SQS default visibility timeout from the AWS console, it worked. This means the inequality constraint is apparently not checked when the update is applied from the AWS console, but is enforced in CloudFormation.
The AWS console update is not a valid option for us because we deploy our changes from infrastructure code; the console change was made only to test the update, which should come from infrastructure code -> CloudFormation.
The purpose of this restriction is to prevent duplicate processing of messages. If your Lambda takes up to 900 seconds to process a message and the VisibilityTimeout is 30 seconds, SQS will re-deliver the message to your Lambda function every 30 seconds until it gets a confirmed result (after 900 seconds). The restriction is meant as a convenience, to help you avoid poor architectural decisions. There are ways to work around it if you are determined.
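The back-of-the-envelope arithmetic behind that warning can be sketched as follows. This is an illustrative model of re-delivery, not an SQS guarantee; `estimateExtraDeliveries` is a made-up helper name.

```javascript
// If the function runs for `processingSeconds` but the message becomes
// visible again every `visibilityTimeoutSeconds`, roughly how many
// redundant deliveries happen before the first invocation finishes?
function estimateExtraDeliveries(processingSeconds, visibilityTimeoutSeconds) {
  if (visibilityTimeoutSeconds >= processingSeconds) {
    return 0; // the message stays hidden until processing completes
  }
  return Math.ceil(processingSeconds / visibilityTimeoutSeconds) - 1;
}

console.log(estimateExtraDeliveries(900, 30)); // 29 redundant deliveries
console.log(estimateExtraDeliveries(60, 90));  // 0: visibility outlives the function
```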
@michaelj-smith what ways are there to work around this?
Since the VisibilityTimeout implicitly controls the retry interval of the function, there are sometimes use cases where one wants the function re-executed as fast as possible after a failed execution, which would be done by lowering the visibility timeout value.
But currently, given that the queue timeout can't be lower than the function timeout, we can't achieve this.
@michaelj-smith is there a way to set VisibilityTimeout in SAM template when using SqsSubscription: true? Or does the SQS need to be created ahead of time in that case?
@piersf One workaround is to sam deploy with SqsSubscription: true and the function timeout set to 30. After the deploy succeeds, change the timeout to your preferred value and run sam deploy again. This keeps the SQS VisibilityTimeout at 30 seconds and changes the Lambda timeout to your preferred value. Of course, as @michaelj-smith explained, this is a poor architectural decision.
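Another option, if your SAM version supports it: later SAM releases also accept an object form of SqsSubscription that points at a queue you define yourself, which lets you set VisibilityTimeout explicitly. A sketch (check the SAM SNS event documentation for the exact property names supported by your version):

```yaml
Resources:
  MyQueue:
    Type: AWS::SQS::Queue
    Properties:
      VisibilityTimeout: 60          # set explicitly to whatever your use case needs
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      # ... Handler, Runtime, Timeout, etc. ...
      Events:
        TopicEvent:
          Type: SNS
          Properties:
            Topic: !Ref Topic
            SqsSubscription:
              QueueArn: !GetAtt MyQueue.Arn
              QueueUrl: !Ref MyQueue  # Ref on an SQS queue returns its URL
```

Note that the Lambda service still rejects an event source mapping whose queue visibility timeout is below the function timeout, so this gives you explicit control over the value rather than a way to bypass the check.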
@ajdnik thanks! We did something similar to that.
Yep, we know the visibility timeout should ideally be 5-6 times the function's timeout.
But we have a specific use case where the Producer sends a message every 5 minutes, and it keeps sending that same message until it has been processed and confirmed by the (single) Consumer. If the Consumer fails to process the message the first time (for whatever reason), we want it to retry as fast as possible (up to X times) and before the deduplication interval expires.
But since the Lambda retry interval is implicitly controlled by the queue's visibility timeout, if this timeout is much greater than the Lambda timeout, the retry will happen well after the deduplication interval (5 min) has expired. In our scenario this results in duplicates in the (FIFO) queue, since the Producer keeps sending the same message.