Hi, we are experiencing an odd issue with jobs not running.
After our web server has been running for 10 days with no restarts, enqueued jobs are no longer processed. They simply sit in the queued jobs tab. After stopping the server and then starting it again, newly queued jobs process fine. A restart does not work; we must do a stop and then a start.
It is worth noting that the server we stop and start after 10 days is not the server that actually calls BackgroundProcess.Enqueue. Please see the details below, along with a simple diagram of what is going on. Everything works perfectly at all other times.
We received the exception below on the ninth day, on the API server rather than the web server. We have not restarted the API server at all, and the Hangfire dashboard shows a heartbeat as expected.
I don't however see how this exception could be relevant seeing as:
The more I think about it, the more I think this might be an issue with Postgres and Npgsql as opposed to Hangfire.
Any help would be greatly appreciated @odinserj
2020-11-15 21:41:06.392 +00:00 [Error] Hangfire.Processing.BackgroundExecution: Execution Worker is in the Failed state now due to an exception, execution will be retried no more than in 00:00:04
Npgsql.NpgsqlException (0x80004005): Exception while writing to stream ---> System.IO.IOException: Unable to write data to the transport connection: An existing connection was forcibly closed by the remote host. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)
--- End of inner exception stack trace ---
at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)
at System.Net.Security.SslStreamInternal.WriteSingleChunk[TWriteAdapter](TWriteAdapter writeAdapter, ReadOnlyMemory`1 buffer)
at System.Net.Security.SslStreamInternal.WriteAsyncInternal[TWriteAdapter](TWriteAdapter writeAdapter, ReadOnlyMemory`1 buffer)
at System.Net.Security.SslStreamInternal.Write(Byte[] buffer, Int32 offset, Int32 count)
at System.Net.Security.SslStream.Write(Byte[] buffer, Int32 offset, Int32 count)
at Npgsql.NpgsqlWriteBuffer.Flush(Boolean async)
at Npgsql.NpgsqlWriteBuffer.Flush(Boolean async)
at Npgsql.NpgsqlCommand.SendExecute(NpgsqlConnector connector, Boolean async)
at Npgsql.NpgsqlCommand.ExecuteReaderAsync(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken)
at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior)
at Npgsql.NpgsqlCommand.ExecuteDbDataReader(CommandBehavior behavior)
at System.Data.Common.DbCommand.System.Data.IDbCommand.ExecuteReader(CommandBehavior behavior)
at Dapper.SqlMapper.ExecuteReaderWithFlagsFallback(IDbCommand cmd, Boolean wasClosed, CommandBehavior behavior) in C:\projects\dapper\Dapper\SqlMapper.cs:line 1051
at Dapper.SqlMapper.QueryImpl[T](IDbConnection cnn, CommandDefinition command, Type effectiveType)+MoveNext() in C:\projects\dapper\Dapper\SqlMapper.cs:line 1079
at System.Collections.Generic.List`1.AddEnumerable(IEnumerable`1 enumerable)
at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
at Dapper.SqlMapper.Query[T](IDbConnection cnn, String sql, Object param, IDbTransaction transaction, Boolean buffered, Nullable`1 commandTimeout, Nullable`1 commandType) in C:\projects\dapper\Dapper\SqlMapper.cs:line 721
at Hangfire.PostgreSql.PostgreSqlJobQueue.<>c__DisplayClass5_1.<Dequeue_Transaction>b__1()
at Hangfire.PostgreSql.Utils.Utils.TryExecute[T](Func`1 func, T& result, Func`2 smoothExValidator, Nullable`1 tryCount)
at Hangfire.PostgreSql.PostgreSqlJobQueue.Dequeue_Transaction(String[] queues, CancellationToken cancellationToken)
at Hangfire.PostgreSql.PostgreSqlJobQueue.Dequeue(String[] queues, CancellationToken cancellationToken)
at Hangfire.PostgreSql.PostgreSqlConnection.FetchNextJob(String[] queues, CancellationToken cancellationToken)
at Hangfire.Server.Worker.Execute(BackgroundProcessContext context)
at Hangfire.Server.BackgroundProcessDispatcherBuilder.ExecuteProcess(Guid executionId, Object state)
at Hangfire.Processing.BackgroundExecution.Run(Action`2 callback, Object state)
See simple diagram https://app.lucidchart.com/invitations/accept/0d6b7469-6243-4fa5-9174-f573cb6ae3e9
{"Type":"Domain.Interfaces.Services.IReportProcessService, Domain",
"Method":"BeginProcessAsync",
"ParameterTypes":"[\"System.Guid, mscorlib\",\"System.String\",\"System.Threading.CancellationToken, mscorlib\"]",
"Arguments":"[\"\\\"7ae5abb4-1232-4adf-a44f-4845ed8c0187\\\"\",\"\\\"0e3d1f31-0195-4116-b58c-681c5940e6af\\\"\",null]"}
{"Type":"Domain.Interfaces.Services.IReportProcessService, Domain",
"Method":"BeginProcessAsync",
"ParameterTypes":"[\"System.Guid, mscorlib\",\"System.String\",\"System.Threading.CancellationToken, mscorlib\"]",
"Arguments":"[\"\\\"c015f5c6-89f8-465f-ad7c-a5a05d579a21\\\"\",\"\\\"0e3d1f31-0195-4116-b58c-681c5940e6af\\\"\",null]"}
Same here after updating to Hangfire version 1.7.17. The server stops executing jobs after 3 days, and restarting the server fixes the problem. We are actually on memory storage. No errors in the logs at all!
Update: I have reverted to version 1.7.12 to see if that solves it.
Please create an issue in the repository of the PostgreSQL job storage extension, as it is written and maintained by different people. Job storage is the central piece of Hangfire, and bugs in it will cause bugs everywhere in Hangfire.
Actually, I use Memory Storage, and I guess the problem is not related to storage but to something introduced in the latest versions.
@meriturva which package are you using for memory storage? The stack trace in the original message clearly shows that the problem relates to Npgsql and a closed connection.
Hangfire.MemoryStorage -> 1.7.0
No errors in the logs; it just stopped executing enqueued jobs (recurring jobs too), twice in 10 days.
I definitely need to investigate more, but I have no time right now, so I am reverting to the old version of Hangfire (1.7.12) to see if that solves the issue.
PS: while Hangfire was blocked, the dashboard worked correctly but showed all jobs as queued. The server count was also correct, and there were no running jobs.
Probably the same issue with SQL Server here. Jobs get stuck in the "Enqueued" state after some days of server uptime. Restarting the server helps, but after some time jobs get stuck again. Using Hangfire.AspNetCore 1.7.10 and Hangfire.SqlServer 1.7.10.
@meriturva there are a lot of problems with the package you are using, too. Instead of downgrading, try switching to the new Hangfire.InMemory package; it's already on NuGet.
@minajevs this can happen due to background jobs themselves. Try running https://github.com/odinserj/stdump to obtain stack traces when you see the blocking problem, and create a new issue with all the stack traces.
There are a lot of reasons for blocking, and it's very important to avoid using a single GitHub issue for them.
@odinserj Cool, thanks, will do
@odinserj I had a brief chat with the lads over on the Hangfire Postgres repo, and they suggested the issue may lie with Hangfire itself misbehaving when a website hosted in Azure receives no traffic for a set period of time. I have read the docs, but it's really unclear what to do to combat this. The app has been upgraded to .NET Core 3.1. I am thinking that over the weekend traffic almost comes to a standstill, and perhaps the pool recycles and Hangfire can't recover? Any suggestions on what to look for or how to deal with this?
Would setting up some monitoring that polls the website sort this? Perhaps if we polled the dashboard every 30 minutes or so.
I am hoping I don't need a background processor to stop and start Hangfire.
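For reference, the polling idea could be sketched roughly like this. It is only a minimal sketch of an external poller, not something from the Hangfire docs: the names `check_once` and `poll` and the URL in the usage comment are hypothetical, and it assumes the endpoint answers unauthenticated requests with a 2xx status (the dashboard may reject remote requests, depending on how its authorization is configured, so a plain health page may be a better target).

```python
import time
import urllib.request


def check_once(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True if `url` answers with a 2xx status, False on any error."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts all derive from OSError.
        return False


def poll(url, interval_seconds=30 * 60, max_checks=None,
         sleep=time.sleep, check=check_once):
    """Request `url` every `interval_seconds`; return the number of failed checks.

    With max_checks=None the loop runs until the process is stopped.
    """
    failures = 0
    done = 0
    while max_checks is None or done < max_checks:
        if not check(url):
            failures += 1
            print(f"check {done + 1} failed for {url}")
        done += 1
        # Sleep only if another check is still to come.
        if max_checks is None or done < max_checks:
            sleep(interval_seconds)
    return failures


# Hypothetical usage: hit the site every 30 minutes, forever.
# poll("https://example.invalid/health")
```

Whether keeping the site warm like this actually prevents the stall is untested. On Azure App Service, the "Always On" setting is meant to serve a similar purpose and may be worth checking first.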
Just an update: we migrated our code base, as well as the App Service runtime, over to .NET Core 3.1, and this issue has not occurred since.
Thanks for the update @tonykaralis!