Hangfire: Deadlock (??) caused by Hangfire Query consumes all SQL Azure DTUs!

Created on 27 Jul 2016 · 23Comments · Source: HangfireIO/Hangfire

Hi!
We are using the latest Hangfire.SqlServer bits (1.6.0) in our Azure environment which consists of

a virtual machine hosting an ASP.NET Web Application using Hangfire on IIS
a SQL Azure Database with 20 DTUs (test environment)

Now the problem is, easily understandable viewing the following screenshot of the Azure management portal.
screenshot

As you can see, the following query consumes almost all dtus, as it takes ages to execute (9 to 22 MINUTES!!)

(@now datetime,@count int)delete top (@count) from [HangFire].[Job] with (readpast) where ExpireAt < @now

So either the SQL Azure Query Performance insight is not trustworthy at all, or there seems to be a serious problem in Hangfire, right?

Some more infos:
1) I checked the [Hangfire].[Job] table and there is an unclustered index on ExpireAt.
2) We currently have 22.000 completed jobs, no recurring jobs, and only one server (see here)
3) The issue was already present with the last version (Hangfire.SqlServer 1.5.0), but as I came across the following issue #429 I thought it would have been resolved with version 1.6.0.

Currently we are using the following packages:

  <package id="Hangfire.Core" version="1.6.0" targetFramework="net452" />
  <package id="Hangfire.Dashboard.Authorization" version="2.1.0" targetFramework="net452" />
  <package id="Hangfire.SqlServer" version="1.6.0" targetFramework="net452" />
  <package id="Hangfire.Windsor" version="1.0.5981.19913" targetFramework="net452" />

Is there anything that we can do to workaround this issue? Disabling the deletion of jobs, or specifying that this should only happen once a day? - currently it seems this is done every 5 minutes, which is why our DTUs are all gone very fast.
see this DTU graph of SQL Azure Management Portal

sql-server bug

Source

gentledepp

Most helpful comment

With all of these improvements coming in version 1.6.5, Hangfire now performs smoothly even on SQL Azure with Basic pricing tier, giving 500K jobs/day. So I'm closing this issue.

@Surferdude-AF and @gentledepp, thank you a lot for the feedback and investigation!

odinserj on 23 Sep 2016

🎉2 😄2 👍2 ❤1

All 23 comments

It's caused by the Index Scan operator, when number of records to delete is high. Looking for a way how to force Index Seek instead. As a temporary workaround, you can use your private build using NumberOfRecordsInSinglePass = 100 in ExpirationManager.

odinserj on 27 Jul 2016

Since the FORCESEEK table hint can't be used in the DELETE statement, I don't know how to optimize this query without changes to the underlying schema. I'm considering to solve this problem in two steps:

Change the default value of the NumberOfRecordsInSinglePass to a lower one for the Job table (experiments show that 200 value leads to Index Seek) and make it configurable through the SqlServerStorageOptions class. Will be shipped with 1.6.2.
Remove the foreign keys and thus cascading deletes, add ExpireAt column and the corresponding indexes for JobParameters and State tables and include them into the ExpirationManager loop. Will be shipped with 1.7.0, since the migration will be changed.

As a part of an investigation, I've also found that Counter and AggregatedCounter table lacks of index on ExpireAt column, this should be also changed.

odinserj on 27 Jul 2016

Thank you so much for the fast response!
as you can see, directly executing the query of the ExpirationManager takes
over one minute in our system

Using your proposed approach,
if I use "top(200)", there is still some index_scan in the estimated execution plan
if I, however, use "top(100)", then all index_scans are replaced by index_seeks

Running the former query (top(200) takes about 69 seconds!!!
Running the latter query (top(100)) "only" takes 5 seconds, so this is way better.

gentledepp on 27 Jul 2016

👍1

Hm, so we'll use 100 as a default value for the NumberOfRecordsInSinglePass option, until ExpireAt columns are added to the corresponding tables, thanks for feedback.

It's a pity that I just released version 1.6.1 without fix for this issue :disappointed:.

odinserj on 27 Jul 2016

🎉1

Yes it is. I just saw it this morning and immediately hoped you'd have a fix for my problem included ;-)
Thanks for the prompt response though - you really helped us a lot!

gentledepp on 27 Jul 2016

I've changed how the ExpirationManager class works. Instead of issuing 100 queries to delete N rows, it now executes 1 query to delete all rows in a loop on SQL Server side using DELETE TOP(N) command. With this change, lower values of a batch are totally acceptable, because they don't lead to an increased query count. So step 2 isn't required.

The change will be shipped with 1.6.2.

odinserj on 4 Aug 2016

Reopening the issue. I've found more elegant way how to avoid INDEX SCANs without guessing the number of rows in a batch. Moreover, as seen in #643, we can't guess it anyway. Today I've discovered the following things, that will resolve the issue with falling back to INDEX SCAN, when large amount of rows are deleted from dependent tables:

MERGE JOIN operation causes the engine to use INDEX SCAN operator.
LOOP JOIN operation uses INDEX SCAN operator.
The OPTION clause may be used to tell the engine which join to use in a query.

So, we can tune our query in the following way and resolve this issue in a correct way.

delete top (@count) 
from [{schemaName}].[{table}] with (readpast) 
where ExpireAt < @now
option (loop join);

odinserj on 1 Sep 2016

Hi, we are having major issues with this at many installs. Errors like:

[Hangfire.SqlServer.ExpirationManager] ERROR Hangfire.SqlServer.ExpirationManager [(null)] - Error occurred during execution of 'Hangfire.SqlServer.ExpirationManager' process. Execution will be retried (attempt 17 of 2147483647) in 00:04:39 seconds. System.Data.SqlClient.SqlException (0x80131904): Cannot release the application lock (Database Principal: 'public', Resource: 'locks:expirationmanager') because it is not currently held.

Has been common probably due to timeouts on Azure.

So I started to manually clean up Succeeded Jobs using you queries with top 100. The first 100 took 44s to delete... This means that the built in clean up on you latest version with one db call with:

while (1 = 1)
begin
    delete top (@count) from [{schemaName}].[{table}] with (readpast) where ExpireAt < @now;
    if @@ROWCOUNT = 0 break;
end

Does not work. This will take hours to complete and only timeout again.

I also just tried with your latest solution (loop join), but that only reduced run time from 44s down to 43s.

I will continue to play with this today but I'm leaning towards a schema change. I would like to use cascading deletes, but have them hit a clustered key on the referenced tables.

Surferdude-AF on 2 Sep 2016

@Surferdude-AF, what do the following queries show?

select max(t.Count) as MaxStateCount 
from (select count(id) as Count from HangFire.State with (nolock) group by JobId) t

select max(t.Count) as MaxParameterCount 
from (select count(id) as Count from HangFire.JobParameter with (nolock) group by JobId) t

Please also check https://github.com/HangfireIO/Hangfire/issues/643#issuecomment-244114578 regarding statistics.

odinserj on 2 Sep 2016

Checking statistics now

Surferdude-AF on 2 Sep 2016

No, no luck. I have tried with updated statistics and with using a clustered index JobId,Id instead. Still almost a minute to purge 100 rows out of 100.000 in total.

Surferdude-AF on 2 Sep 2016

Terrible numbers. Have you tried to rebuild indexes in that table? Btw, how did you update the statistics?

odinserj on 2 Sep 2016

_Edit_ Had to consider IDENTITY...

Yes, rebuild indexes did not help. And, I'm not sure the indexes help much anyway since data in these tables change so much over a day anyway. We have many Jobs per day and need to set ExpireAt to one hour to get rid of them fast.

New solution to get rid of old failed and succeeded jobs when there are MANY jobs in the tables. As far as I can see this one works very well for the Hangfire 1.6.2 schema, but it must be run when the system is shut down:

--Create new temp tables
CREATE TABLE [HangFire].[Job2](
    [Id] [int] NOT NULL,
    [StateId] [int] NULL,
    [StateName] [nvarchar](20) NULL,
    [InvocationData] [nvarchar](max) NOT NULL,
    [Arguments] [nvarchar](max) NOT NULL,
    [CreatedAt] [datetime] NOT NULL,
    [ExpireAt] [datetime] NULL,
 CONSTRAINT [PK_HangFire_Job2] PRIMARY KEY CLUSTERED
(
    [Id] ASC
)
)
GO
CREATE NONCLUSTERED INDEX [IX_HangFire_Job2_StateName] ON [HangFire].[Job2]
(
    [StateName] ASC
)
GO

CREATE TABLE [HangFire].[JobParameter2](
    [Id] [int] NOT NULL,
    [JobId] [int] NOT NULL,
    [Name] [nvarchar](40) NOT NULL,
    [Value] [nvarchar](max) NULL,
 CONSTRAINT [PK_HangFire_JobParameter2] PRIMARY KEY CLUSTERED 
(
    [Id] ASC
)
)
GO
CREATE NONCLUSTERED INDEX [IX_HangFire_JobParameter2_JobIdAndName] ON [HangFire].[JobParameter2]
(
    [JobId] ASC,
    [Name] ASC
)
GO
CREATE TABLE [HangFire].[State2](
    [Id] [int] NOT NULL,
    [JobId] [int] NOT NULL,
    [Name] [nvarchar](20) NOT NULL,
    [Reason] [nvarchar](100) NULL,
    [CreatedAt] [datetime] NOT NULL,
    [Data] [nvarchar](max) NULL,
 CONSTRAINT [PK_HangFire_State2] PRIMARY KEY CLUSTERED 
(
    [Id] ASC
)
)
GO
CREATE NONCLUSTERED INDEX [IX_HangFire_State2_JobId] ON [HangFire].[State2]
(
    [JobId] ASC
)
GO

--Insert active into new tables
INSERT INTO [HangFire].[Job2]
           (Id,
            [StateId]
           ,[StateName]
           ,[InvocationData]
           ,[Arguments]
           ,[CreatedAt]
           ,[ExpireAt])
SELECT      Id,
            [StateId]
           ,[StateName]
           ,[InvocationData]
           ,[Arguments]
           ,[CreatedAt]
           ,[ExpireAt]
FROM Hangfire.Job
WHERE StateName not in ('Failed','Succeeded')
GO

INSERT INTO [HangFire].[JobParameter2]
SELECT      jp.Id,
            jp.[JobId]
           ,jp.[Name]
           ,jp.[Value]
FROM Hangfire.JobParameter jp
inner join Hangfire.Job2 j2 on j2.Id = jp.JobId
GO

INSERT INTO [HangFire].[State2]
SELECT      s.Id,
            s.[JobId]
           ,s.[Name]
           ,s.[Reason]
           ,s.[CreatedAt]
           ,s.[Data]
FROM Hangfire.State s
inner join Hangfire.Job2 j2 on j2.Id = s.JobId
GO

--Drop FKs
ALTER TABLE [HangFire].[State] DROP CONSTRAINT [FK_HangFire_State_Job]
GO
ALTER TABLE [HangFire].[JobParameter] DROP CONSTRAINT [FK_HangFire_JobParameter_Job]
GO

--Truncate tables
truncate table Hangfire.JobParameter
truncate table Hangfire.State
truncate table Hangfire.Job
GO

SET IDENTITY_INSERT [HangFire].[Job] ON
--insert back
INSERT INTO [HangFire].[Job]
           ([Id]
           ,[StateId]
           ,[StateName]
           ,[InvocationData]
           ,[Arguments]
           ,[CreatedAt]
           ,[ExpireAt])
SELECT      [Id]
            ,[StateId]
           ,[StateName]
           ,[InvocationData]
           ,[Arguments]
           ,[CreatedAt]
           ,[ExpireAt]
FROM Hangfire.Job2
SET IDENTITY_INSERT [HangFire].[Job] OFF
GO

INSERT INTO [HangFire].[JobParameter]
           ([JobId]
           ,[Name]
           ,[Value])
SELECT      [JobId]
           ,[Name]
           ,[Value]
FROM Hangfire.JobParameter2
GO

INSERT INTO [HangFire].[State]
           ([JobId]
           ,[Name]
           ,[Reason]
           ,[CreatedAt]
           ,[Data])
SELECT      [JobId]
           ,[Name]
           ,[Reason]
           ,[CreatedAt]
           ,[Data]
FROM Hangfire.State2
GO

--Add FKs
ALTER TABLE [HangFire].[State]  WITH CHECK ADD  CONSTRAINT [FK_HangFire_State_Job] FOREIGN KEY([JobId])
REFERENCES [HangFire].[Job] ([Id])
ON UPDATE CASCADE
ON DELETE CASCADE
GO
ALTER TABLE [HangFire].[State] CHECK CONSTRAINT [FK_HangFire_State_Job]
GO

ALTER TABLE [HangFire].[JobParameter]  WITH CHECK ADD  CONSTRAINT [FK_HangFire_JobParameter_Job] FOREIGN KEY([JobId])
REFERENCES [HangFire].[Job] ([Id])
ON UPDATE CASCADE
ON DELETE CASCADE
GO

ALTER TABLE [HangFire].[JobParameter] CHECK CONSTRAINT [FK_HangFire_JobParameter_Job]
GO

--Drop temp tables
drop table hangfire.Job2
drop table hangfire.JobParameter2
drop table hangfire.State2
GO

Surferdude-AF on 2 Sep 2016

For Azure I consider this issue to be open still. The above truncate script helped us get back to scratch when tables had grown too much, but it is still extremely vulnerable with the:
while (1 = 1)
begin
delete top (@count) from [{schemaName}].[{table}] with (readpast) where ExpireAt < @now;
if @@ROWCOUNT = 0 break;
end

This one will very easily time out on Azure and when it does tables just continues to grow. We need another solution where it will delete in small enough batches like this, but commit after every delete so that if it times out only the last delete gets rolled back.

Surferdude-AF on 16 Sep 2016

In the past two weeks, I've done a lot of optimizations for Hangfire.SqlServer. I've set up the worst possible SQL Azure database (Basic plan with 5 DTUs) and gave a workload. After performing the optimizations it successfully process 500K of background jobs every day. Here are the optimizations:

ExpirationManager – added LOOP JOIN and OPTIMIZE FOR (@count = 20000) hints for the whole query. The first one forbids to use MERGE JOIN on large number of rows. The second one leads to reduced number of logical reads, because db engine adds a sorting operator before getting records from State and JobParameter tables.
CountersAggregator – added WITH INDEX(0) hint for the Counter table to always use the clustered index when aggregating the first N rows instead of a non-clustered one. Resulted in lower number of logical reads also.
SqlServerJobQueueMonitoringApi – added missing NOLOCK hint for JobQueue table to prevent blocking rows when determining available queue names.
SqlServerMonitoringApi – using CTE when performing paging. This optimization dramatically decreased the number of logical reads required to fetch background jobs, when using ROW_NUMBER. Please see this article for details.

You can try all of this optimizations by getting the pre-release build from the CI feed, or download an archive from the artifacts page.

Install-Package Hangfire.Pro -Version 1.6.5-build-02247 -Source https://ci.appveyor.com/nuget/hangfire

odinserj on 16 Sep 2016

Great! I can notice the difference now when I test. It is more predictable. I did however run into one db with 15.000 jobs in it and did:

declare @count int
set @count = 100

delete top (@count) from Hangfire.Job with (readpast)
where ExpireAt < getdate()
option (loop join, optimize for (@count = 2000));

Ouch, it took 3 min first and second time. So I:
update statistics hangfire.job
update statistics hangfire.state

Ouch it's still at 10s, meaning that when this is run as one query looping while 1=1 it will time out and rollback.

So, it's better, but still not robust according to me. We need a solution where it will delete in small enough batches like this, but commit after every delete so that if it times out only the last delete gets rolled back. What do you think?

Surferdude-AF on 16 Sep 2016

We need a solution where it will delete in small enough batches like this, but commit after every delete so that if it times out only the last delete gets rolled back.

It is already committed after each DELETE, there is no any transaction here. Setting transaction isolation level is just a defensive technique, for cases when isolation level is leaked.

set transaction isolation level read committed;
set nocount on;
while (1 = 1)
begin
    delete top (@count) from [{schemaName}].[{table}] with (readpast) 
    where ExpireAt < @now
    option (loop join, optimize for (@count = 20000));

    if @@ROWCOUNT = 0 break;
end

Ouch it's still at 10s, meaning that when this is run as one query looping while 1=1 it will time out and rollback.

What plan are you using for SQL Azure database?

odinserj on 16 Sep 2016

Aha, no implicit transaction, then this could work fine.

Plan is P0.

Surferdude-AF on 16 Sep 2016

I've also found that the following simple queries has very long duration:

(@id int,@expireAt datetime)update [HangFire].Job set ExpireAt = @expireAt where Id = @id
(@id int)update [HangFire].Job set ExpireAt = NULL where Id = @id

All of them are using non-clustered index IX_HangFire_Job_ExpireAt, which was created exactly for these queries. Now I'm trying to add a filter for that index ExpireAt IS NOT NULL to prevent index modifications when it is unnecessary. So let's wait and see the results.

odinserj on 16 Sep 2016

@Surferdude-AF, P0 (Premium, I don't see any P0 available) or S0 (Standard)? The difference is huge :smile:

odinserj on 16 Sep 2016

Yes, S0, this is a small install. We use P0 or higher for the bug ones, you get much better storage with P.

Surferdude-AF on 16 Sep 2016

With all of these improvements coming in version 1.6.5, Hangfire now performs smoothly even on SQL Azure with Basic pricing tier, giving 500K jobs/day. So I'm closing this issue.

@Surferdude-AF and @gentledepp, thank you a lot for the feedback and investigation!

odinserj on 23 Sep 2016

🎉2 😄2 👍2 ❤1

AWESOME!!!!!

gentledepp on 23 Sep 2016

🎉2 😄2

Was this page helpful?

0 / 5 - 0 ratings