Hangfire: Enqueued jobs not running, 10 days after server restart

Created on 16 Nov 2020 · 12 Comments · Source: HangfireIO/Hangfire

Hi, we are experiencing an odd issue with the running of jobs.

Problem

After 10 days of leaving our web server running (no restarts), enqueued jobs are no longer processed. They simply sit in the queued jobs tab. After stopping the server and then starting it up again, newly queued jobs process fine. A restart does not work; we must do a stop followed by a start.

It is worth noting that the server we stop/start after 10 days is not the server that actually calls BackgroundJob.Enqueue. Please see the details below, along with a simple diagram of what is going on. Everything works perfectly at all other times.

We received the exception below on the ninth day on the API server, not the web server. We have not restarted the API server at all, and when viewing the Hangfire dashboard a heartbeat is shown as expected.

I don't, however, see how this exception could be relevant, seeing as:

new jobs are placed on the queue but not processed after 9/10 days (2 days of no activity)
web server start/stop fixes the issue
no start/stop of the api server required

The more I think about it, the more I think this might be an issue with Postgres and Npgsql as opposed to Hangfire.

Any help would be greatly appreciated @odinserj

2020-11-15 21:41:06.392 +00:00 [Error] Hangfire.Processing.BackgroundExecution: Execution Worker is in the Failed state now due to an exception, execution will be retried no more than in 00:00:04
Npgsql.NpgsqlException (0x80004005): Exception while writing to stream ---> System.IO.IOException: Unable to write data to the transport connection: An existing connection was forcibly closed by the remote host. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
   at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)
   --- End of inner exception stack trace ---
   at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)
   at System.Net.Security.SslStreamInternal.WriteSingleChunk[TWriteAdapter](TWriteAdapter writeAdapter, ReadOnlyMemory`1 buffer)
   at System.Net.Security.SslStreamInternal.WriteAsyncInternal[TWriteAdapter](TWriteAdapter writeAdapter, ReadOnlyMemory`1 buffer)
   at System.Net.Security.SslStreamInternal.Write(Byte[] buffer, Int32 offset, Int32 count)
   at System.Net.Security.SslStream.Write(Byte[] buffer, Int32 offset, Int32 count)
   at Npgsql.NpgsqlWriteBuffer.Flush(Boolean async)
   at Npgsql.NpgsqlWriteBuffer.Flush(Boolean async)
   at Npgsql.NpgsqlCommand.SendExecute(NpgsqlConnector connector, Boolean async)
   at Npgsql.NpgsqlCommand.ExecuteReaderAsync(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior)
   at Npgsql.NpgsqlCommand.ExecuteDbDataReader(CommandBehavior behavior)
   at System.Data.Common.DbCommand.System.Data.IDbCommand.ExecuteReader(CommandBehavior behavior)
   at Dapper.SqlMapper.ExecuteReaderWithFlagsFallback(IDbCommand cmd, Boolean wasClosed, CommandBehavior behavior) in C:\projects\dapper\Dapper\SqlMapper.cs:line 1051
   at Dapper.SqlMapper.QueryImpl[T](IDbConnection cnn, CommandDefinition command, Type effectiveType)+MoveNext() in C:\projects\dapper\Dapper\SqlMapper.cs:line 1079
   at System.Collections.Generic.List`1.AddEnumerable(IEnumerable`1 enumerable)
   at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
   at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
   at Dapper.SqlMapper.Query[T](IDbConnection cnn, String sql, Object param, IDbTransaction transaction, Boolean buffered, Nullable`1 commandTimeout, Nullable`1 commandType) in C:\projects\dapper\Dapper\SqlMapper.cs:line 721
   at Hangfire.PostgreSql.PostgreSqlJobQueue.<>c__DisplayClass5_1.<Dequeue_Transaction>b__1()
   at Hangfire.PostgreSql.Utils.Utils.TryExecute[T](Func`1 func, T& result, Func`2 smoothExValidator, Nullable`1 tryCount)
   at Hangfire.PostgreSql.PostgreSqlJobQueue.Dequeue_Transaction(String[] queues, CancellationToken cancellationToken)
   at Hangfire.PostgreSql.PostgreSqlJobQueue.Dequeue(String[] queues, CancellationToken cancellationToken)
   at Hangfire.PostgreSql.PostgreSqlConnection.FetchNextJob(String[] queues, CancellationToken cancellationToken)
   at Hangfire.Server.Worker.Execute(BackgroundProcessContext context)
   at Hangfire.Server.BackgroundProcessDispatcherBuilder.ExecuteProcess(Guid executionId, Object state)
   at Hangfire.Processing.BackgroundExecution.Run(Action`2 callback, Object state)

Current setup

Web server

1 app service hosted in Azure that exposes a front end to the client
.NET Core 2.2
PostgreSQL 11 back end
Shows the Hangfire dashboard
Running the latest version of Hangfire

Api server

1 app service hosted in Azure that exposes an API which, when triggered, queues a background job
.NET Core 2.2
PostgreSQL 11 back end
Registers Hangfire as a server
This server does the enqueuing of the jobs
Running the latest version of Hangfire

Workflow

See simple diagram https://app.lucidchart.com/invitations/accept/0d6b7469-6243-4fa5-9174-f573cb6ae3e9

Example of serialized request of a failed job

{"Type":"Domain.Interfaces.Services.IReportProcessService, Domain",
"Method":"BeginProcessAsync",
"ParameterTypes":"[\"System.Guid, mscorlib\",\"System.String\",\"System.Threading.CancellationToken, mscorlib\"]",
"Arguments":"[\"\\\"7ae5abb4-1232-4adf-a44f-4845ed8c0187\\\"\",\"\\\"0e3d1f31-0195-4116-b58c-681c5940e6af\\\"\",null]"}

Example of serialized request of a failed job

{"Type":"Domain.Interfaces.Services.IReportProcessService, Domain",
"Method":"BeginProcessAsync",
"ParameterTypes":"[\"System.Guid, mscorlib\",\"System.String\",\"System.Threading.CancellationToken, mscorlib\"]",
"Arguments":"[\"\\\"c015f5c6-89f8-465f-ad7c-a5a05d579a21\\\"\",\"\\\"0e3d1f31-0195-4116-b58c-681c5940e6af\\\"\",null]"}

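For anyone reading the payloads above: the ParameterTypes and Arguments fields are themselves JSON strings, and each entry inside Arguments appears to be JSON-encoded once more. Below is a minimal Python sketch of unwrapping the first example. The helper name decode_invocation is made up for illustration and is not part of Hangfire.

```python
import json

# Payload copied verbatim from the first serialized request above.
RAW = r'''{"Type":"Domain.Interfaces.Services.IReportProcessService, Domain",
"Method":"BeginProcessAsync",
"ParameterTypes":"[\"System.Guid, mscorlib\",\"System.String\",\"System.Threading.CancellationToken, mscorlib\"]",
"Arguments":"[\"\\\"7ae5abb4-1232-4adf-a44f-4845ed8c0187\\\"\",\"\\\"0e3d1f31-0195-4116-b58c-681c5940e6af\\\"\",null]"}'''


def decode_invocation(raw_json):
    """Unwrap the nested JSON layers of a serialized job (hypothetical helper)."""
    outer = json.loads(raw_json)                      # layer 1: the object itself
    parameter_types = json.loads(outer["ParameterTypes"])
    encoded_args = json.loads(outer["Arguments"])     # layer 2: list of JSON strings
    # Layer 3: each non-null argument is itself a JSON document.
    arguments = [None if a is None else json.loads(a) for a in encoded_args]
    return {
        "type": outer["Type"],
        "method": outer["Method"],
        "parameter_types": parameter_types,
        "arguments": arguments,
    }


if __name__ == "__main__":
    job = decode_invocation(RAW)
    for type_name, value in zip(job["parameter_types"], job["arguments"]):
        print(f"{type_name}: {value!r}")
```

The trailing null is the CancellationToken slot; the payload stores no value for it.
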
Source

tonykaralis

All 12 comments

Same here after updating to Hangfire version 1.7.17. The server stops executing jobs after 3 days, and restarting the server fixes the problem. We are actually on memory storage. No errors in the logs at all!

Update: I have reverted the version to 1.7.12 to see if that solves it.

meriturva on 19 Nov 2020

Please create an issue in the repository that provides the PostgreSQL job storage extension, as it's written and maintained by different people. Job storage is the central piece in Hangfire, and bugs in it will cause bugs everywhere in Hangfire.

odinserj on 19 Nov 2020

👍1

Actually, I use Memory Storage, and I guess it is not related to storage but to something introduced with the latest versions.

meriturva on 19 Nov 2020

👍1

@meriturva which package are you using for memory storage? The stack trace from the original message clearly shows the problem relates to Npgsql and a closed connection.

odinserj on 19 Nov 2020

Hangfire.MemoryStorage -> 1.7.0

No errors in the logs; it just stopped executing enqueued jobs (and recurring jobs too) twice in 10 days.
I definitely need to investigate more, but I have no time now, so I reverted to the old version of Hangfire (1.7.12) to see if that solves the issue.

PS: while Hangfire was blocked, the dashboard worked correctly but showed all jobs as queued. The server count was also correct, and there were no running jobs.

meriturva on 19 Nov 2020

Probably the same issue with SQL Server here. Jobs get stuck in the "Enqueued" state after some days of server uptime. Restarting the server helps, but after some time jobs get stuck again. Using Hangfire.AspNetCore 1.7.10 and Hangfire.SqlServer 1.7.10.

minajevs on 19 Nov 2020

@meriturva there are a lot of problems with the package you are using too. Instead of downgrading, try switching to the new Hangfire.InMemory package; it's already on NuGet.

odinserj on 19 Nov 2020

👍1

@minajevs this can happen due to background jobs themselves. Try running https://github.com/odinserj/stdump to obtain stack traces when you see the blocking problem, and create a new issue with all the stack traces.

There are a lot of reasons for blocking, and it's very important to avoid using a single GitHub issue for them.

odinserj on 19 Nov 2020

@odinserj Cool, thanks, will do

minajevs on 19 Nov 2020

@odinserj I had a brief chat with the lads over on the Hangfire Postgres repo, and they suggested the issue may lie with Hangfire itself misbehaving when a website hosted in Azure receives no traffic for a set period of time. I have read the docs, but it's really unclear what to do to combat this. The app has been upgraded to .NET Core 3.1. I am thinking that over the weekend traffic almost comes to a standstill, and perhaps the pool recycles and Hangfire can't recover? Any suggestions on what to look for or how to deal with this?

Would setting up some monitoring that polls the website sort this? Perhaps if we polled the dashboard every 30 minutes or so.

I am hoping I don't need a background processor to stop and start Hangfire.

tonykaralis on 24 Nov 2020

👀1
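
The polling idea raised in the comment above could look roughly like the Python sketch below. It assumes a plain HTTP GET against the dashboard is enough to count as traffic. The URL, the function names ping and keep_warm, and the 30-minute interval taken from the comment are all illustrative. Nothing in this thread confirms that polling would have prevented the stall.

```python
import time
import urllib.request


def ping(url, timeout=10.0):
    """Return the HTTP status code of a GET on url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return None


def keep_warm(url, interval_seconds, rounds, fetch=ping, sleep=time.sleep):
    """Poll url `rounds` times, waiting interval_seconds between polls.

    fetch and sleep are injectable so the loop can be exercised without
    touching the network or actually waiting.
    """
    results = []
    for i in range(rounds):
        results.append(fetch(url))
        if i < rounds - 1:
            sleep(interval_seconds)
    return results


if __name__ == "__main__":
    # Hypothetical dashboard URL; replace with the real site before running.
    statuses = keep_warm("https://example.invalid/hangfire", 30 * 60, rounds=2)
    print(statuses)
```

A None entry in the result means that poll failed, which an external monitor could use as an alert signal.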

Just an update: we migrated our code base, as well as the app service runtime, over to .NET Core 3.1, and this issue has not occurred since.

tonykaralis on 28 Dec 2020

❤1 👍1

Thanks for the update @tonykaralis!

odinserj on 28 Dec 2020
