Hi!
We are using the latest Hangfire.SqlServer bits (1.6.0) in our Azure environment which consists of
Now the problem is, easily understandable viewing the following screenshot of the Azure management portal.
screenshot
As you can see, the following query consumes almost all dtus, as it takes ages to execute (9 to 22 MINUTES!!)
(@now datetime,@count int)delete top (@count) from [HangFire].[Job] with (readpast) where ExpireAt < @now
So either the SQL Azure Query Performance insight is not trustworthy at all, or there seems to be a serious problem in Hangfire, right?
Some more infos:
1) I checked the [Hangfire].[Job] table and there is an unclustered index on ExpireAt.
2) We currently have 22.000 completed jobs, no recurring jobs, and only one server (see here)
3) The issue was already present with the last version (Hangfire.SqlServer 1.5.0), but as I came across the following issue #429 I thought it would have been resolved with version 1.6.0.
Currently we are using the following packages:
<package id="Hangfire.Core" version="1.6.0" targetFramework="net452" />
<package id="Hangfire.Dashboard.Authorization" version="2.1.0" targetFramework="net452" />
<package id="Hangfire.SqlServer" version="1.6.0" targetFramework="net452" />
<package id="Hangfire.Windsor" version="1.0.5981.19913" targetFramework="net452" />
Is there anything that we can do to workaround this issue? Disabling the deletion of jobs, or specifying that this should only happen once a day? - currently it seems this is done every 5 minutes, which is why our DTUs are all gone very fast.
see this DTU graph of SQL Azure Management Portal
It's caused by the Index Scan operator, when number of records to delete is high. Looking for a way how to force Index Seek instead. As a temporary workaround, you can use your private build using NumberOfRecordsInSinglePass = 100 in ExpirationManager.

Since the FORCESEEK table hint can't be used in the DELETE statement, I don't know how to optimize this query without changes to the underlying schema. I'm considering to solve this problem in two steps:
NumberOfRecordsInSinglePass to a lower one for the Job table (experiments show that 200 value leads to Index Seek) and make it configurable through the SqlServerStorageOptions class. Will be shipped with 1.6.2.ExpireAt column and the corresponding indexes for JobParameters and State tables and include them into the ExpirationManager loop. Will be shipped with 1.7.0, since the migration will be changed.As a part of an investigation, I've also found that Counter and AggregatedCounter table lacks of index on ExpireAt column, this should be also changed.
Thank you so much for the fast response!
as you can see, directly executing the query of the ExpirationManager takes
over one minute in our system
Using your proposed approach,
if I use "top(200)", there is still some index_scan in the estimated execution plan
if I, however, use "top(100)", then all index_scans are replaced by index_seeks
Running the former query (top(200) takes about 69 seconds!!!
Running the latter query (top(100)) "only" takes 5 seconds, so this is way better.
Hm, so we'll use 100 as a default value for the NumberOfRecordsInSinglePass option, until ExpireAt columns are added to the corresponding tables, thanks for feedback.
It's a pity that I just released version 1.6.1 without fix for this issue :disappointed:.
Yes it is. I just saw it this morning and immediately hoped you'd have a fix for my problem included ;-)
Thanks for the prompt response though - you really helped us a lot!
I've changed how the ExpirationManager class works. Instead of issuing 100 queries to delete N rows, it now executes 1 query to delete all rows in a loop on SQL Server side using DELETE TOP(N) command. With this change, lower values of a batch are totally acceptable, because they don't lead to an increased query count. So step 2 isn't required.
The change will be shipped with 1.6.2.
Reopening the issue. I've found more elegant way how to avoid INDEX SCANs without guessing the number of rows in a batch. Moreover, as seen in #643, we can't guess it anyway. Today I've discovered the following things, that will resolve the issue with falling back to INDEX SCAN, when large amount of rows are deleted from dependent tables:
MERGE JOIN operation causes the engine to use INDEX SCAN operator.LOOP JOIN operation uses INDEX SCAN operator.OPTION clause may be used to tell the engine which join to use in a query.So, we can tune our query in the following way and resolve this issue in a correct way.
delete top (@count)
from [{schemaName}].[{table}] with (readpast)
where ExpireAt < @now
option (loop join);
Hi, we are having major issues with this at many installs. Errors like:
[Hangfire.SqlServer.ExpirationManager] ERROR Hangfire.SqlServer.ExpirationManager [(null)] - Error occurred during execution of 'Hangfire.SqlServer.ExpirationManager' process. Execution will be retried (attempt 17 of 2147483647) in 00:04:39 seconds. System.Data.SqlClient.SqlException (0x80131904): Cannot release the application lock (Database Principal: 'public', Resource: 'locks:expirationmanager') because it is not currently held.
Has been common probably due to timeouts on Azure.
So I started to manually clean up Succeeded Jobs using you queries with top 100. The first 100 took 44s to delete... This means that the built in clean up on you latest version with one db call with:
while (1 = 1)
begin
delete top (@count) from [{schemaName}].[{table}] with (readpast) where ExpireAt < @now;
if @@ROWCOUNT = 0 break;
end
Does not work. This will take hours to complete and only timeout again.
I also just tried with your latest solution (loop join), but that only reduced run time from 44s down to 43s.
I will continue to play with this today but I'm leaning towards a schema change. I would like to use cascading deletes, but have them hit a clustered key on the referenced tables.
@Surferdude-AF, what do the following queries show?
select max(t.Count) as MaxStateCount
from (select count(id) as Count from HangFire.State with (nolock) group by JobId) t
select max(t.Count) as MaxParameterCount
from (select count(id) as Count from HangFire.JobParameter with (nolock) group by JobId) t
Please also check https://github.com/HangfireIO/Hangfire/issues/643#issuecomment-244114578 regarding statistics.
43
3
Checking statistics now
No, no luck. I have tried with updated statistics and with using a clustered index JobId,Id instead. Still almost a minute to purge 100 rows out of 100.000 in total.
Terrible numbers. Have you tried to rebuild indexes in that table? Btw, how did you update the statistics?
_Edit_ Had to consider IDENTITY...
Yes, rebuild indexes did not help. And, I'm not sure the indexes help much anyway since data in these tables change so much over a day anyway. We have many Jobs per day and need to set ExpireAt to one hour to get rid of them fast.
New solution to get rid of old failed and succeeded jobs when there are MANY jobs in the tables. As far as I can see this one works very well for the Hangfire 1.6.2 schema, but it must be run when the system is shut down:
--Create new temp tables
CREATE TABLE [HangFire].[Job2](
[Id] [int] NOT NULL,
[StateId] [int] NULL,
[StateName] [nvarchar](20) NULL,
[InvocationData] [nvarchar](max) NOT NULL,
[Arguments] [nvarchar](max) NOT NULL,
[CreatedAt] [datetime] NOT NULL,
[ExpireAt] [datetime] NULL,
CONSTRAINT [PK_HangFire_Job2] PRIMARY KEY CLUSTERED
(
[Id] ASC
)
)
GO
CREATE NONCLUSTERED INDEX [IX_HangFire_Job2_StateName] ON [HangFire].[Job2]
(
[StateName] ASC
)
GO
CREATE TABLE [HangFire].[JobParameter2](
[Id] [int] NOT NULL,
[JobId] [int] NOT NULL,
[Name] [nvarchar](40) NOT NULL,
[Value] [nvarchar](max) NULL,
CONSTRAINT [PK_HangFire_JobParameter2] PRIMARY KEY CLUSTERED
(
[Id] ASC
)
)
GO
CREATE NONCLUSTERED INDEX [IX_HangFire_JobParameter2_JobIdAndName] ON [HangFire].[JobParameter2]
(
[JobId] ASC,
[Name] ASC
)
GO
CREATE TABLE [HangFire].[State2](
[Id] [int] NOT NULL,
[JobId] [int] NOT NULL,
[Name] [nvarchar](20) NOT NULL,
[Reason] [nvarchar](100) NULL,
[CreatedAt] [datetime] NOT NULL,
[Data] [nvarchar](max) NULL,
CONSTRAINT [PK_HangFire_State2] PRIMARY KEY CLUSTERED
(
[Id] ASC
)
)
GO
CREATE NONCLUSTERED INDEX [IX_HangFire_State2_JobId] ON [HangFire].[State2]
(
[JobId] ASC
)
GO
--Insert active into new tables
INSERT INTO [HangFire].[Job2]
(Id,
[StateId]
,[StateName]
,[InvocationData]
,[Arguments]
,[CreatedAt]
,[ExpireAt])
SELECT Id,
[StateId]
,[StateName]
,[InvocationData]
,[Arguments]
,[CreatedAt]
,[ExpireAt]
FROM Hangfire.Job
WHERE StateName not in ('Failed','Succeeded')
GO
INSERT INTO [HangFire].[JobParameter2]
SELECT jp.Id,
jp.[JobId]
,jp.[Name]
,jp.[Value]
FROM Hangfire.JobParameter jp
inner join Hangfire.Job2 j2 on j2.Id = jp.JobId
GO
INSERT INTO [HangFire].[State2]
SELECT s.Id,
s.[JobId]
,s.[Name]
,s.[Reason]
,s.[CreatedAt]
,s.[Data]
FROM Hangfire.State s
inner join Hangfire.Job2 j2 on j2.Id = s.JobId
GO
--Drop FKs
ALTER TABLE [HangFire].[State] DROP CONSTRAINT [FK_HangFire_State_Job]
GO
ALTER TABLE [HangFire].[JobParameter] DROP CONSTRAINT [FK_HangFire_JobParameter_Job]
GO
--Truncate tables
truncate table Hangfire.JobParameter
truncate table Hangfire.State
truncate table Hangfire.Job
GO
SET IDENTITY_INSERT [HangFire].[Job] ON
--insert back
INSERT INTO [HangFire].[Job]
([Id]
,[StateId]
,[StateName]
,[InvocationData]
,[Arguments]
,[CreatedAt]
,[ExpireAt])
SELECT [Id]
,[StateId]
,[StateName]
,[InvocationData]
,[Arguments]
,[CreatedAt]
,[ExpireAt]
FROM Hangfire.Job2
SET IDENTITY_INSERT [HangFire].[Job] OFF
GO
INSERT INTO [HangFire].[JobParameter]
([JobId]
,[Name]
,[Value])
SELECT [JobId]
,[Name]
,[Value]
FROM Hangfire.JobParameter2
GO
INSERT INTO [HangFire].[State]
([JobId]
,[Name]
,[Reason]
,[CreatedAt]
,[Data])
SELECT [JobId]
,[Name]
,[Reason]
,[CreatedAt]
,[Data]
FROM Hangfire.State2
GO
--Add FKs
ALTER TABLE [HangFire].[State] WITH CHECK ADD CONSTRAINT [FK_HangFire_State_Job] FOREIGN KEY([JobId])
REFERENCES [HangFire].[Job] ([Id])
ON UPDATE CASCADE
ON DELETE CASCADE
GO
ALTER TABLE [HangFire].[State] CHECK CONSTRAINT [FK_HangFire_State_Job]
GO
ALTER TABLE [HangFire].[JobParameter] WITH CHECK ADD CONSTRAINT [FK_HangFire_JobParameter_Job] FOREIGN KEY([JobId])
REFERENCES [HangFire].[Job] ([Id])
ON UPDATE CASCADE
ON DELETE CASCADE
GO
ALTER TABLE [HangFire].[JobParameter] CHECK CONSTRAINT [FK_HangFire_JobParameter_Job]
GO
--Drop temp tables
drop table hangfire.Job2
drop table hangfire.JobParameter2
drop table hangfire.State2
GO
For Azure I consider this issue to be open still. The above truncate script helped us get back to scratch when tables had grown too much, but it is still extremely vulnerable with the:
while (1 = 1)
begin
delete top (@count) from [{schemaName}].[{table}] with (readpast) where ExpireAt < @now;
if @@ROWCOUNT = 0 break;
end
This one will very easily time out on Azure and when it does tables just continues to grow. We need another solution where it will delete in small enough batches like this, but commit after every delete so that if it times out only the last delete gets rolled back.
In the past two weeks, I've done a lot of optimizations for Hangfire.SqlServer. I've set up the worst possible SQL Azure database (Basic plan with 5 DTUs) and gave a workload. After performing the optimizations it successfully process 500K of background jobs every day. Here are the optimizations:
LOOP JOIN and OPTIMIZE FOR (@count = 20000) hints for the whole query. The first one forbids to use MERGE JOIN on large number of rows. The second one leads to reduced number of logical reads, because db engine adds a sorting operator before getting records from State and JobParameter tables.WITH INDEX(0) hint for the Counter table to always use the clustered index when aggregating the first N rows instead of a non-clustered one. Resulted in lower number of logical reads also.NOLOCK hint for JobQueue table to prevent blocking rows when determining available queue names.ROW_NUMBER. Please see this article for details.You can try all of this optimizations by getting the pre-release build from the CI feed, or download an archive from the artifacts page.
Install-Package Hangfire.Pro -Version 1.6.5-build-02247 -Source https://ci.appveyor.com/nuget/hangfire
Great! I can notice the difference now when I test. It is more predictable. I did however run into one db with 15.000 jobs in it and did:
declare @count int
set @count = 100
delete top (@count) from Hangfire.Job with (readpast)
where ExpireAt < getdate()
option (loop join, optimize for (@count = 2000));
Ouch, it took 3 min first and second time. So I:
update statistics hangfire.job
update statistics hangfire.state
Ouch it's still at 10s, meaning that when this is run as one query looping while 1=1 it will time out and rollback.
So, it's better, but still not robust according to me. We need a solution where it will delete in small enough batches like this, but commit after every delete so that if it times out only the last delete gets rolled back. What do you think?
We need a solution where it will delete in small enough batches like this, but commit after every delete so that if it times out only the last delete gets rolled back.
It is already committed after each DELETE, there is no any transaction here. Setting transaction isolation level is just a defensive technique, for cases when isolation level is leaked.
set transaction isolation level read committed;
set nocount on;
while (1 = 1)
begin
delete top (@count) from [{schemaName}].[{table}] with (readpast)
where ExpireAt < @now
option (loop join, optimize for (@count = 20000));
if @@ROWCOUNT = 0 break;
end
Ouch it's still at 10s, meaning that when this is run as one query looping while 1=1 it will time out and rollback.
What plan are you using for SQL Azure database?
Aha, no implicit transaction, then this could work fine.
Plan is P0.
I've also found that the following simple queries has very long duration:
(@id int,@expireAt datetime)update [HangFire].Job set ExpireAt = @expireAt where Id = @id
(@id int)update [HangFire].Job set ExpireAt = NULL where Id = @id
All of them are using non-clustered index IX_HangFire_Job_ExpireAt, which was created exactly for these queries. Now I'm trying to add a filter for that index ExpireAt IS NOT NULL to prevent index modifications when it is unnecessary. So let's wait and see the results.
@Surferdude-AF, P0 (Premium, I don't see any P0 available) or S0 (Standard)? The difference is huge :smile:
Yes, S0, this is a small install. We use P0 or higher for the bug ones, you get much better storage with P.
With all of these improvements coming in version 1.6.5, Hangfire now performs smoothly even on SQL Azure with Basic pricing tier, giving 500K jobs/day. So I'm closing this issue.
@Surferdude-AF and @gentledepp, thank you a lot for the feedback and investigation!
AWESOME!!!!!
Most helpful comment
With all of these improvements coming in version 1.6.5, Hangfire now performs smoothly even on SQL Azure with Basic pricing tier, giving 500K jobs/day. So I'm closing this issue.
@Surferdude-AF and @gentledepp, thank you a lot for the feedback and investigation!