Hangfire next execution in the past

Created on 1 Jul 2019  Â·  75Comments  Â·  Source: HangfireIO/Hangfire

Hello,

for a while we've been using Hangfire with several recurring jobs in a .net core api. Ive configured IIS to make sure it is always running. Ive set the appPool to AlwaysRunning with timeout of 0, and enabled pre-load for the site.

image

image

image

This seemed to work fine at first.

However sometimes the jobs still stop working, when i open the dashboard the next execution date is a time in the past, when i trigger the job again manually it works fine again.

Any idea what i could be doing wrong here?

Regards
Johan

core bug

Most helpful comment

Version 1.7.8 was released yesterday that uses retries instead of immediately disabling a recurring job if there are any exceptions. So the behavior is almost the same as was in version 1.6.X and it can be used for multiple applications, although I don't recommend doing so until there's a proper handling of such a feature.

@sseyalioglu could you tell me where that ghost application was running and whether it was running in IIS' Application Pool? I wrongly told you it's not possible to have an unlisted server that's still running on version 1.7.0, but that's true only for SQL Server storage, where 'BackgroundServerGoneException' is thrown when server was already removed on a heartbeat. MySQL implementation should also throw this exception, and a PR should be added.

@mikanyg I could add support for multiple applications in one day, but that will break a lot of applications. I was attempted to solve this without introducing breaking changes for several times, but failed. I will perform yet another try in the near future to see whether it's possible in 1.8.0, because it's indeed a really great feature.

Thank you for your assistance on this issue, the result is much better than it was in the original version! I'm closing this issue, please open new ones if you catch a similar problem.

All 75 comments

We experienced it after upgrading to v.1.7.4 in a distributed environment, where some servers ran 1.6.24 and a single one ran 1.7.4. Tried to upgrade all to v1.7.4, which caused all of our scheduled jobs to be stuck. Ended up rolling back to v1.6.2x versions and had to update the Hangfire.Hash table and remove all rows that had field = 'V' and value ='2' and everything started working again.

I've made some changes for recurring jobs in the latest 1.7.5 release, so may be they solve these issues. If not, please post here log messages as they should contain exceptions to understand what's going wrong.

We had the same issue as @JBastiaan with 1.7.4, after upgrading to 1.7.5 we are seeing all of our recurring jobs in the DISABLED state. When I mouse over it says the following: "Cron expression is invalid or don't have any occurrences over the next 100 years". The Cron expression works fine as far as I can tell. The expression is "20 1 * * *", I checked Cron Tab it says its fine. Below is a screen shot of what I am seeing.

image

From what I can tell this error message isn't correct and can be misleading.

I've modified recurring jobs page and scheduler to show actual errors in dashboard when there's a scheduling error. Fix will be available in version 1.7.6 shortly, the build is already queued on AppVeyor CI.

@odinserj thanks for the quick turn around! I'll pull it down today!

I'm also getting this error. I deleted all recurring jobs, then upgraded to 1.7.6 but after a while they start showing "next execution: xxx hours ago" and they stop firing according to the cron job
image

@Tieno could you show me the output of the following query (should be run against your Hangfire database)? I need to see whether identifiers of those jobs are sitting in the schedule first.

SELECT TOP (20) *
  FROM [HangFire].[Set]
  WHERE [Key] = N'recurring-jobs'
  ORDER BY [Score]

@odinserj , here's a screenshot: image

@Tieno, thank you for update. Now I see that the Score value is inconsistent with the NextExecution one, and that's a bug. I've spent almost entire day thinking how can this happen, but failed – everywhere they are updated simultaneously.

Could you also tell me more about your recurring job by running the queries below, and post here your configuration logic related to Hangfire?

SELECT TOP (20) *
  FROM [HangFire].[Set]
  WHERE [Key] = N'recurring-jobs'
  ORDER BY [Score]

SELECT TOP (1000) *
  FROM [HangFire].[Hash]
  WHERE [Key] LIKE N'recurring-job:%'

@odinserj What do you mean by configuration logic?
The recurring jobs are configured in an Owin startup class by calling "RecurringJob.AddOrUpdate"

Below you can find the results of the query
I routinely delete the recurring jobs so they get recreated. At the moment only
ShutDownMonitor and MyCareNetMonitor are showing the problem of "next execution in the past". After a while all recurring jobs have a next execution in the past.

SELECT TOP (1000) *
  FROM [HangFire].[Hash]
  WHERE [Key] LIKE N'recurring-job:%'

Id  Key Field   Value   ExpireAt
253500  recurring-job:RebuildDefragmentedIndexesArchive Queue   defaultsequentialqueue  NULL
253501  recurring-job:RebuildDefragmentedIndexesArchive Cron    10 4 * * *  NULL
253502  recurring-job:RebuildDefragmentedIndexesArchive TimeZoneId  Romance Standard Time   NULL
253503  recurring-job:RebuildDefragmentedIndexesArchive Job {"Type":".Invoicing.Batch.Jobs.Maintenance.RebuildIndexesArchiveWorkerJob, Invoicing.Batch, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null","Method":"Run","ParameterTypes":"[]","Arguments":"[]"}   NULL
253504  recurring-job:RebuildDefragmentedIndexesArchive CreatedAt   2019-08-07T05:32:40.3669804Z    NULL
253505  recurring-job:RebuildDefragmentedIndexesArchive NextExecution   2019-08-16T02:10:00.0000000Z    NULL
253506  recurring-job:RebuildDefragmentedIndexesArchive V   2   NULL
254026  recurring-job:RebuildDefragmentedIndexesArchive LastExecution   2019-08-15T02:10:07.9100911Z    NULL
254027  recurring-job:RebuildDefragmentedIndexesArchive LastJobId   1439604 NULL
254835  recurring-job:MyCareNetMonitor  Queue   default NULL
254836  recurring-job:MyCareNetMonitor  Cron    */15 * * * *    NULL
254837  recurring-job:MyCareNetMonitor  TimeZoneId  Romance Standard Time   NULL
254838  recurring-job:MyCareNetMonitor  Job {"Type":"Invoicing.Batch.Jobs.MyCareNet.MyCareNetQueueDownloadOrSendJobsWorkerJob, Invoicing.Batch, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null","Method":"Run","ParameterTypes":"[\"Invoicing.Common.Orchestration.Messages.MCNKeyHandle, Invoicing.Common, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null\",\"System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089\"]","Arguments":"[\"{\\\"KeyHandle\\\":\\\"REMOVED\\\",\\\"MinutesLeft\\\":718,\\\"RequestAt\\\":\\\"2019-08-14T08:19:24.5139004+02:00\\\",\\\"user\\\":\\\"591978667963\\\"}\",\"15\"]"}    NULL
254839  recurring-job:MyCareNetMonitor  CreatedAt   2019-08-09T06:26:37.3305973Z    NULL
254840  recurring-job:MyCareNetMonitor  NextExecution   2019-08-14T17:45:00.0000000Z    NULL
254841  recurring-job:MyCareNetMonitor  V   2   NULL
254842  recurring-job:ShutDownMonitor   Queue   default NULL
254843  recurring-job:ShutDownMonitor   Cron    */10 * * * *    NULL
254844  recurring-job:ShutDownMonitor   TimeZoneId  Romance Standard Time   NULL
254845  recurring-job:ShutDownMonitor   Job {"Type":"Invoicing.Batch.Jobs.MyCareNet.ShutDownMyCareNetMonitorWorkerJob, Invoicing.Batch, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null","Method":"Run","ParameterTypes":"[\"System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089\",\"System.Int64, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089\"]","Arguments":"[\"\\\"20:02:24 PM +02 - 14-08-2019\\\"\",\"637014097445139004\"]"}    NULL
254846  recurring-job:ShutDownMonitor   CreatedAt   2019-08-09T06:26:37.4712312Z    NULL
254847  recurring-job:ShutDownMonitor   NextExecution   2019-08-14T06:40:00.0000000Z    NULL
254848  recurring-job:ShutDownMonitor   V   2   NULL
255258  recurring-job:MyCareNetMonitor  LastExecution   2019-08-14T17:30:04.3771158Z    NULL
255259  recurring-job:MyCareNetMonitor  LastJobId   1439587 NULL
255260  recurring-job:ShutDownMonitor   LastExecution   2019-08-14T06:30:10.5051333Z    NULL
255261  recurring-job:ShutDownMonitor   LastJobId   1438470 NULL
256230  recurring-job:RebuildDefragmentedIndexes    Queue   defaultsequentialqueue  NULL
256231  recurring-job:RebuildDefragmentedIndexes    Cron    0 4 * * *   NULL
256232  recurring-job:RebuildDefragmentedIndexes    TimeZoneId  Romance Standard Time   NULL
256233  recurring-job:RebuildDefragmentedIndexes    Job {"Type":"Invoicing.Batch.Jobs.Maintenance.RebuildIndexesWorkerJob, Invoicing.Batch, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null","Method":"Run","ParameterTypes":"[]","Arguments":"[]"}   NULL
256234  recurring-job:RebuildDefragmentedIndexes    CreatedAt   2019-08-13T16:06:15.0311007Z    NULL
256235  recurring-job:RebuildDefragmentedIndexes    NextExecution   2019-08-16T02:00:00.0000000Z    NULL
256236  recurring-job:RebuildDefragmentedIndexes    V   2   NULL
256237  recurring-job:UpdateFacturenIndexVoorDoorstorting   Queue   defaultsequentialqueue  NULL
256238  recurring-job:UpdateFacturenIndexVoorDoorstorting   Cron    0 4 * * *   NULL
256239  recurring-job:UpdateFacturenIndexVoorDoorstorting   TimeZoneId  Romance Standard Time   NULL
256240  recurring-job:UpdateFacturenIndexVoorDoorstorting   Job {"Type":"Invoicing.Batch.Jobs.Maintenance.UpdateFacturenIndexVoorDoorstortingWorkerJob, Invoicing.Batch, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null","Method":"Run","ParameterTypes":"[]","Arguments":"[]"}  NULL
256241  recurring-job:UpdateFacturenIndexVoorDoorstorting   CreatedAt   2019-08-13T16:06:18.9611080Z    NULL
256242  recurring-job:UpdateFacturenIndexVoorDoorstorting   NextExecution   2019-08-16T02:00:00.0000000Z    NULL
256243  recurring-job:UpdateFacturenIndexVoorDoorstorting   V   2   NULL
256244  recurring-job:Update GefactureerdeZorgdagen dagelijks   Queue   defaultsequentialqueue  NULL
256245  recurring-job:Update GefactureerdeZorgdagen dagelijks   Cron    0 4 * * *   NULL
256246  recurring-job:Update GefactureerdeZorgdagen dagelijks   TimeZoneId  Romance Standard Time   NULL
256247  recurring-job:Update GefactureerdeZorgdagen dagelijks   Job {"Type":"Invoicing.Batch.Jobs.Maintenance.QueueUpdateHistoriekFacturatieTabelWorkerJob, Invoicing.Batch, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null","Method":"Run","ParameterTypes":"[]","Arguments":"[]"}  NULL
256248  recurring-job:Update GefactureerdeZorgdagen dagelijks   CreatedAt   2019-08-13T16:06:21.9157922Z    NULL
256249  recurring-job:Update GefactureerdeZorgdagen dagelijks   NextExecution   2019-08-16T02:00:00.0000000Z    NULL
256250  recurring-job:Update GefactureerdeZorgdagen dagelijks   V   2   NULL
256251  recurring-job:RebuildDefragmentedIndexes    LastExecution   2019-08-15T02:00:00.1752636Z    NULL
256252  recurring-job:RebuildDefragmentedIndexes    LastJobId   1439589 NULL
256253  recurring-job:Update GefactureerdeZorgdagen dagelijks   LastExecution   2019-08-15T02:00:00.1752636Z    NULL
256254  recurring-job:Update GefactureerdeZorgdagen dagelijks   LastJobId   1439590 NULL
256255  recurring-job:UpdateFacturenIndexVoorDoorstorting   LastExecution   2019-08-15T02:00:00.1752636Z    NULL
256256  recurring-job:UpdateFacturenIndexVoorDoorstorting   LastJobId   1439591 NULL


SELECT TOP (20) *
  FROM [HangFire].[Set]
  WHERE [Key] = N'recurring-jobs'
  ORDER BY [Score]


Id  Key Score   Value   ExpireAt
3962913 recurring-jobs  -1  ShutDownMonitor NULL
3962912 recurring-jobs  -1  MyCareNetMonitor    NULL
3996549 recurring-jobs  1565920800  Update GefactureerdeZorgdagen dagelijks NULL
3996548 recurring-jobs  1565920800  UpdateFacturenIndexVoorDoorstorting NULL
3996547 recurring-jobs  1565920800  RebuildDefragmentedIndexes  NULL
3898507 recurring-jobs  1565921400  RebuildDefragmentedIndexesArchive   NULL

@odinserj have you got any ideas where the problem might be? We're still experiencing these issues.

as a workaround I created a background process in hangfire that manually corrects the recurring jobs with a score of -1

`

  public class RecurringJobFixupBackgroundProcess : IBackgroundProcessAsync, 
  IBackgroundProcess
  {
    private readonly string _connectionString;
    private readonly string _schemaName;

    public RecurringJobFixupBackgroundProcess(string connectionString, string schemaName)
    {
        _connectionString = connectionString;
        _schemaName = schemaName;
    }
    public async Task ExecuteAsync(BackgroundProcessContext context)
    {

        using (var conn = new SqlConnection(_connectionString))
        {
            if (conn.State != ConnectionState.Open)
            {
                await conn.OpenAsync(context.StoppingToken);
            }
            using (var cmd = new SqlCommand())
            {
                cmd.Connection = conn;
                cmd.CommandText = $@"update {_schemaName}.[Set]  set Score = 0  WHERE [Key] = N'recurring-jobs' and Score = -1";
                await cmd.ExecuteNonQueryAsync(context.StoppingToken);
            }
        }
        context.Wait(TimeSpan.FromMinutes(10));
    }

    public void Execute(BackgroundProcessContext context)
    {
        ExecuteAsync(context).GetAwaiter().GetResult();
    }
}

`

would be lovely to have a fix for this.
I have 1.7.6 also and they work couple days and at an unknown moment it stops. The only way for jobs to start working again is to select them from dashboard and run them again which will keep it working for couple more days (if lucky)

To add context, when there is a problem, seeing this in DB

image

When there is no problem, looks like the following
image

all based on what @odinserj asked earlier

And more requested data from the last query: (problematic one when workaround not applied)
Id Key Field Value ExpireAt
29 recurring-job:everyDay_UTC_1900 Queue default
30 recurring-job:everyDay_UTC_1900 Cron 0 19 * * *
31 recurring-job:everyDay_UTC_1900 TimeZoneId UTC
32 recurring-job:everyDay_UTC_1900 Job {"Type":"Domain.Jobs.ReOccurringJobCoordinator, Domain, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null","Method":"DailyBookingContactComplete","ParameterTypes":"[\"Hangfire.Server.PerformContext, Hangfire.Core, Version=1.7.6.0, Culture=neutral, PublicKeyToken=null\"]","Arguments":"[null]"}
33 recurring-job:everyDay_UTC_1900 CreatedAt 2019-08-22T17:46:05.3326198Z
34 recurring-job:everyDay_UTC_1900 NextExecution 2019-09-06T19:00:00.0000000Z
35 recurring-job:everyDay_UTC_1900 V 2
36 recurring-job:everyDay_UTC_1915 Queue default
37 recurring-job:everyDay_UTC_1915 Cron 15 19 * * *
38 recurring-job:everyDay_UTC_1915 TimeZoneId UTC
39 recurring-job:everyDay_UTC_1915 Job {"Type":"Domain.Jobs.ReOccurringJobCoordinator, Domain, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null","Method":"DailyBuildMonthsInvoices","ParameterTypes":"[\"Hangfire.Server.PerformContext, Hangfire.Core, Version=1.7.6.0, Culture=neutral, PublicKeyToken=null\"]","Arguments":"[null]"}
40 recurring-job:everyDay_UTC_1915 CreatedAt 2019-08-22T17:46:05.5287898Z
41 recurring-job:everyDay_UTC_1915 NextExecution 2019-09-06T19:15:00.0000000Z
42 recurring-job:everyDay_UTC_1915 V 2
43 recurring-job:everyDay_UTC_1920 Queue default
44 recurring-job:everyDay_UTC_1920 Cron 20 19 * * *
45 recurring-job:everyDay_UTC_1920 TimeZoneId UTC
46 recurring-job:everyDay_UTC_1920 Job {"Type":"Domain.Jobs.ReOccurringJobCoordinator, Domain, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null","Method":"DailyAddConversionToInvoices","ParameterTypes":"[\"Hangfire.Server.PerformContext, Hangfire.Core, Version=1.7.6.0, Culture=neutral, PublicKeyToken=null\"]","Arguments":"[null]"}
47 recurring-job:everyDay_UTC_1920 CreatedAt 2019-08-22T17:46:05.5459194Z
48 recurring-job:everyDay_UTC_1920 NextExecution 2019-09-08T19:20:00.0000000Z
49 recurring-job:everyDay_UTC_1920 V 2
50 recurring-job:everyDay_UTC_1930 Queue default
51 recurring-job:everyDay_UTC_1930 Cron 30 19 * * *
52 recurring-job:everyDay_UTC_1930 TimeZoneId UTC
53 recurring-job:everyDay_UTC_1930 Job {"Type":"Domain.Jobs.ReOccurringJobCoordinator, Domain, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null","Method":"DailyPeriodicDeliveryScheduling","ParameterTypes":"[\"Hangfire.Server.PerformContext, Hangfire.Core, Version=1.7.6.0, Culture=neutral, PublicKeyToken=null\"]","Arguments":"[null]"}
54 recurring-job:everyDay_UTC_1930 CreatedAt 2019-08-22T17:46:05.5611524Z
55 recurring-job:everyDay_UTC_1930 NextExecution 2019-09-06T19:30:00.0000000Z
56 recurring-job:everyDay_UTC_1930 V 2
57 recurring-job:everyDay_UTC_1900 LastExecution 2019-09-05T23:40:39.1052462Z
58 recurring-job:everyDay_UTC_1900 LastJobId 8754
59 recurring-job:everyDay_UTC_1915 LastExecution 2019-09-05T23:40:39.1436596Z
60 recurring-job:everyDay_UTC_1915 LastJobId 8755
61 recurring-job:everyDay_UTC_1920 LastExecution 2019-09-07T19:20:34.4544013Z
62 recurring-job:everyDay_UTC_1920 LastJobId 8845
63 recurring-job:everyDay_UTC_1930 LastExecution 2019-09-05T23:40:39.1960020Z
64 recurring-job:everyDay_UTC_1930 LastJobId 8757

image

One final comment. I have another one running under version 1.7.4 and works perfectly fine since Aug 15th and the previous versions earlier for months.

Any progress on this? I am experiencing the same issues as described above.

If it might help anything, I think our jobs started showing next execution at in the past immediately after one of the deployments. It was a regular deployment and the hangfire version was not increased as a part of it.

@kresimirlesic are your jobs start on fixed times or every X min/hour etc?
Ours is fixed times daily.

@sseyalioglu my job is configured to run every 5 minutes

I would like to share one more observation.
The only way to have jobs status corrected is to manually triggering it.
Once I manually trigger it shows that next run would be running properly (at least how you fee it)
However after some time (today's experience), 1 of 4 jobs seems like going to work next day but all other 3 started to be in the past again. (within 2-3 hours as seen below)

image

This is super annoying, I can't be doing this manually every day and keeping an eye on it.
Please fix.

@odinserj is there any timeline for this fix? This is occuring for me on multiple projects and it is really annoying. Not to mention that there is a lot of critical stuff in those jobs.

Thanks!

@odinserj @kresimirlesic @Tieno I have another insight to this.

  1. We've downgraded to 1.7.4 but this did not change the behavior.
  2. We have 2 separate applications using 2 different databases where one of them is experiencing the issue
  3. As a result, I think it is related to something in database possibly. In the configuration for the problematic one we have the following (in startup.cs)

In ConfigureService
services.AddHangfire(cfg => cfg.UseStorage(new MySqlStorage(hangfireConnection, new MySqlStorageOptions { QueuePollInterval = TimeSpan.FromSeconds(60), PrepareSchemaIfNecessary = false, JobExpirationCheckInterval = TimeSpan.FromMinutes(2) })));

and in Configure() it is initiated in the following way:

app.UseHangfireServer(new BackgroundJobServerOptions {WorkerCount = 2}); app.UseHangfireDashboard();

In the one that works fine without an issue we have the following in ConfigureService

services.AddHangfire(cfg => cfg.UseStorage(new MySqlStorage(hangfireConnection, new MySqlStorageOptions { QueuePollInterval = TimeSpan.FromSeconds(60), PrepareSchemaIfNecessary = true, DashboardJobListLimit = 1000 })));

@kresimirlesic what is the underlying data source you are using?

@kresimirlesic @odinserj I am almost sure that, the issue is somehow related to database itself. Seems like on a clean fresh database, there is no issue.

@sseyalioglu we are using postgre

@kresimirlesic @odinserj I tried so much to make it work but does not.
I've made complete new database, didn't work. Changed from hangfire to hangfire.core didn't work. Run it on single server only, didn't work.
Read everything on 1.7 upgrade guide and followed, and it did not work either.

All it comes to is that, when it is time for the job to really work, everything goes upside down.
I am not sure how this still cannot be fixed!
I am downgrading as well to 1.6.27 (I hope this will solve the issue)

Is there any ETA or anything on this issue?

@sseyalioglu, any chance you are using multiple applications with different code bases that share the same storage? Please tell me the output of these queries again when nothing is working, and I will start looking into this deeply tomorrow.

@odinserj only a single application and same exact version.

Results from first query:
image

Results from second query:
image

I really would like this problem be solved because everyday I have to manually trigger the jobs myself.

Also using the following strorage package
image

dotnet core 2.2, running on ubuntu 16.04 to provide a bit more context.

@sseyalioglu, any chance you are using multiple applications with different code bases that share the same storage? Please tell me the output of these queries again when nothing is working, and I will start looking into this deeply tomorrow.

I do this, but I made sure to update all libraries to the same version of hangfire.

@sseyalioglu, I've found at least one problem with Hangfire.MySqlStorage – the precision of the Set.Score column is not high enough, and the type of that column should be changed to double (currently it's float).

This add delays to scheduled jobs and will cause issues for recurring jobs. Please change the column type (it's better to also submit a PR to the https://github.com/arnoldasgudas/Hangfire.MySqlStorage repository) and let me know about the results.

But lower precision doesn't explain why it's set to -1 in the Set table, and has the correct value in the Hash table.

@Tieno, @sseyalioglu please use the modified version of the Hangfire.Core.dll assembly to diagnose the issue – https://www.dropbox.com/s/l3yo6zwsvtaa8cq/Hangfire-1.7.6-track-recurring.zip?dl=0. You'll need to configure the logging subsystem first to record logs starting from the warning level.

It will dump recurring job, changed fields and new next execution date when attempting to record the '-1' value to the database. Please post those messages here for further analysis.

@Tieno same code base and same application (same version) only used here. @odinserj I will make the changes and let you know the results and if all works, will send PR as well.

@odinserj tomorrow will be two years since I filed Set.Score precision issue there :)
It doesn’t have schema update capability though, so existing databases would still require manual fix.

Great @sseyalioglu, thanks! Debug information will point what's going on, because I was unable to reproduce the issue.

@pieceofsummer sometimes it takes time to merge PRs ;-(

Hey @odinserj, did you get a chance to find a fix for postgre as well? The same issue as described above happens on postgre.

@odinserj from the logs, what kind of log am I looking for?
I see there are few loggers there which are
Hangfire.Processing.BackgroundExecution
Hangfire.Server.ServerHeartbeatProcess
Hangfire.MySql.ExpirationManager
Hangfire.MySql.CountersAggregator
Hangfire.BackgroundJobServer
Hangfire.Server.BackgroundServerProcess

I honestly don't see anything related to recurring jobs. Is there any keyword I can focus on to get data?
I made a search with "recurring" keyword and was able to get the following:
image

And the following is the current condition:
image

Honestly don't know when we went into "ago" state.
The ones in "ago" state were never executed (or never started) but the one that needs to happen 19:30 is executed normally. And I don't really see any interesting log about it.

Finally log level is set to "warning" however I can only see "Debug" and "Info" logs (mainly debug).

The following line is added when persisting a recurring job with -1 value in the Score table to catch what original and changed values cause the behavior. So if the problem is caused by the RecurringJobScheduler component and your application does use the Hangfire.Core.dll assembly from that Dropbox link, it should emit a warning message.

logger.Warn($"Recurring job '{recurringJobId}' is being disabled. Job: ({String.Join(";", recurringJob._recurringJob.Select(x => $"{x.Key}:{x.Value}"))}), Changes: ({String.Join(";", changedFields.Select(x => $"{x.Key}:{x.Value}"))}), NextExecution: ({nextExecution})");

However previously update package version was triggering this logging message only from the RecurringJobScheduler class, and the following link will add logging also for the RecurringJobManager class. I was sure that it can't be the source of the problem, because it's too simple, but everything is possible.

Please download the new package and try again, sorry for yet another required step.

https://www.dropbox.com/s/l3yo6zwsvtaa8cq/Hangfire-1.7.6-track-recurring.zip?dl=0

Ok, I was not able to see any log having "is being disabled". I will download the new one and try again. Will let you know the outcome possibly by tomorrow.

@odinserj just happened again however I cannot see any log like this, all I see is the following:
image

As you can see here, it actually triggered the job once and then it got lost as you can see here:
image

I am suspecting, same will happen to the jobs once their time will hit.

And here is a quick context of how I run those jobs (not sure if it matters)
image

@odinserj as suspected, just after execution, second job fall into "ago" state again without any log matching. (I have warnings working in the system)

@odinserj sorry, need to correct something here. This time it seems like it did not happen. If I don't refresh the page, it says "ago" but after refreshing says "in a day" so I will wait and update when it actually happens. Sorry about so many messages here as I got confused. But clearly in the morning, all jobs were in "ago" state.

No worries, thank you for your assistance on this issue. Data on the page
itself isn’t refreshed automatically, but counters and relative dates do.
But as we’ve seen earlier, -1 value is being stored in your database for an
unknown reason, so will wait for your response!

пн, 14 окт. 2019 г. в 22:25, sseyalioglu notifications@github.com:

@odinserj https://github.com/odinserj sorry, need to correct something
here. This time it seems like it did not happen. If I don't refresh the
page, it says "ago" but after refreshing says "in a day" so I will wait and
update when it actually happens. Sorry about so many messages here as I got
confused. But clearly in the morning, all jobs were in "ago" state.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/HangfireIO/Hangfire/issues/1459?email_source=notifications&email_token=AAIHLPT4XUQILVAYXMDOQ63QOTBRJA5CNFSM4H4PTDE2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEBGE56A#issuecomment-541871864,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AAIHLPVRW32KMKTBEGQAQE3QOTBRJANCNFSM4H4PTDEQ
.

>

Kind regards, Sergey Odinokov,
Founder/developer @ http://hangfire.io
Contacts:
Email: [email protected]
Skype: odinserj

@sseyalioglu any news, do your logs contain any interesting messages?

@odinserj funny enough that this didn't happen in the past 2 days (first time ever). Did you change anything else? I will wait a little more and try using the nuget versions once again if it does not occur.

Very strange, there are some changes between 1.7.6 and the first updated package (but I doubt they can affect recurring jobs), but there are no changes between modified versions except some more logging.

@odinserj I am not sure, if I say "good" but it happened again. I will dig in the logs now.

@odinserj unfortunately I was not able to get any logs in that fashion. But for better understanding, I've done some screen recording (with voice), please take a look. https://d.pr/v/XK1Gxv
I will dig a little more to see if I can get anything

That's great anyway, because at least we can get to know the source of the problem.

unfortunately, the expected log "is being disabled" or any other warning from hangfire does not appear during the expected timeframe.

Here are the only things I can see after analyzing deeply. https://d.pr/v/azZkpA

Thank you for the screen-casts. Wow the problem is completely insane, I see your logs are produced, and even heartbeats are here. Are you sure your production application uses the updated version? Because sometimes during deploys something happens and assembly isn't updated. That's the simplest reason, but I will dig more into this – someone is setting the -1 value. I hope to be able to find this without using WireShark.

I am running this on my dev env (cannot do it on live). I don't think I can use wireshark, all ubuntu servers with no ui. And I can verify that deployed app is using the updated assembly. (In a few hours, I will log in and check the version of the file from the running version in dev)

@odinserj it is more difficult than I've imagined in linux. Seems like I don't have file version info just like in windows.However I can clearly confirm that, all files got updated together without exception as seen here:
image

@sseyalioglu you can use exiftool Hangfire.Core.dll on Linux to get various metadata, including file and assembly version.
Another crude way is something like strings Hangfire.Core.dll | grep 'Hangfire.Core, Version='

It is "Hangfire.Core, Version=1.7.6.0, Culture=neutral, PublicKeyToken=null" but I don't think this tells much. I've verified the dropbox file and deployed file are identical using beyond compare binary comparison / CRC comparison etc.

@sseyalioglu I don't see any other places that can set the score value to -1. I suspect you have another Hangfire Server running on version 1.7.5 or below that can't find a target method (due to its absence or different signature). Nevertheless, I've updated that logging to always (not only when score is -1) emit a trace-level message (not the warning). So if you don't see any message when value is changed to "ago", then it's another server that mess things up.

Updated logging is available in version 1.7.7.

@odinserj thanks for the explanations. I will try 1.7.7 obviously.
Just a simple question. Assuming that you are correct about another server with earlier version, wouldn't I see that other server in this list?
image

How can we see/verify if anything else is causing any trouble? (I know, I don't have any other)

Yes, it should be in that list. I suspect, but I can be wrong, and the best way to ensure there's no intervention by another server is to update to 1.7.7 and check yet another trace log message "Recurring job 'XXX' is being updated" that must be there in any case, because there are no conditions anymore.

I will let you know.

Thank you for your patience @sseyalioglu

@odinserj i am facing the same issue. My setup is:
2 applications are targetting the same Hangfire tables but they are using different queue. In the 'Recurring jobs' overview some jobs show the text 'Could not load file.....', but when visiting the same page for the other application it shows the method correctly.

In general the jobs are being executed correctly, but sometimes (within a day) the 'next execution' is in the past.

I'am using version 1.7.2. Is there a fix already for this problem?

Thank you

@odinserj i have upgraded to 1.7.7 and also ran the DB scheme script. So i am running Scheme version 7. When checking the hangfire dashboards the recurring-job which is known in one application but not in the other is running perfectly a single time. But then in the 'Next execution' column the text 'ERROR' is shown and the recurring job is never executed again.

@odinserj the issue still happens to our systems and there is absolutely no related log unfortunately.

I'm really sorry for the issues, the changes should have been implemented behind a feature toggle with the default behaviour equivalent to the older one. After a short vacation I've realised there's a better and much simpler solution than just disabling recurring jobs when there are troubles.

I've just committed a change to add retry attempts to recurring jobs instead of disabling them. So now even in unsupported scenario of having the same storage for multiple applications, other instances will get a chance to run recurring job to behave the same way as in 1.6.X versions.

I will release the changes in 1.7.8 hopefully next week, but you can update to version 1.7.7-build-03046 from the continuous integration feed – https://ci.appveyor.com/nuget/hangfire.

@sseyalioglu but I'm still not sure whether this will help you, because this change alters the branch where log messages should appear. Do you have trace level enabled for your logs?

@odinserj Why not support for multiple applications using same storage/ui? It’s clearly something people use in the micro service era. I have done it and mostly it worked with the 1.6.x releases, though retries could end up in wrong applications queue.

If hangfire supported multiple apps it would get even more projects to use it.

@odinserj I think I finally found the problem with the actual reason.
After adding an hourly job, and seeing nothing got executed with the problem appearing, I've digged deeper into other logs such as syslogs as well.

What I figured out was, there was a ghost app (very old, and deleted) running against the same database. No files exist, but it was still running in memory.

Here is how it looks in the logs for anyone else having similar issues.
image

I think that ghost app was acting faster than the regular one and always picking it up and disabling it so we were never able to see any logs generated by the newer version.

I've killed that already and hopefully in 1 hour will know the effect.

@odinserj I confirm that, my problem is over. It really was this ghost running app. Sorry for all the trouble.

I had same problem on MSSQL, lots of time deleted tables and started hangfire server but nothing changed. When create a new database problem has solved.

Version 1.7.8 was released yesterday that uses retries instead of immediately disabling a recurring job if there are any exceptions. So the behavior is almost the same as was in version 1.6.X and it can be used for multiple applications, although I don't recommend doing so until there's a proper handling of such a feature.

@sseyalioglu could you tell me where that ghost application was running and whether it was running in IIS' Application Pool? I wrongly told you it's not possible to have an unlisted server that's still running on version 1.7.0, but that's true only for SQL Server storage, where 'BackgroundServerGoneException' is thrown when server was already removed on a heartbeat. MySQL implementation should also throw this exception, and a PR should be added.

@mikanyg I could add support for multiple applications in one day, but that will break a lot of applications. I was attempted to solve this without introducing breaking changes for several times, but failed. I will perform yet another try in the near future to see whether it's possible in 1.8.0, because it's indeed a really great feature.

Thank you for your assistance on this issue, the result is much better than it was in the original version! I'm closing this issue, please open new ones if you catch a similar problem.

Version 1.7.8 was released yesterday that uses retries instead of immediately disabling a recurring job if there are any exceptions. So the behavior is almost the same as was in version 1.6.X and it can be used for multiple applications, although I don't recommend doing so until there's a proper handling of such a feature.

@sseyalioglu could you tell me where that ghost application was running and whether it was running in IIS' Application Pool? I wrongly told you it's not possible to have an unlisted server that's still running on version 1.7.0, but that's true only for SQL Server storage, where 'BackgroundServerGoneException' is thrown when server was already removed on a heartbeat. MySQL implementation should also throw this exception, and a PR should be added.

@mikanyg I could add support for multiple applications in one day, but that will break a lot of applications. I was attempted to solve this without introducing breaking changes for several times, but failed. I will perform yet another try in the near future to see whether it's possible in 1.8.0, because it's indeed a really great feature.

Thank you for your assistance on this issue, the result is much better than it was in the original version! I'm closing this issue, please open new ones if you catch a similar problem.

@odinserj in our case, we are running on ubuntu 16.04 over kestrel (core 2.2) with MySQL. Ghost app was on kestrel (its service was disabled, its files were deleted, but service was never stopped and it kept on running for months). Because I've got into db and regenerate the DB from scratch, ghost app (hf v1.6.something was there using the DB connection where DB was not listing anything about it)

I hope this info helps.

Hi,

I am still facing this issue. Can you please confirm if this is fixed? I have got two recurring jobs and the next execution is coming up with an error for one of the jobs. This is running as an IIS application on windows OS. Can you please help me on this?

Thanks
Karthik

@kartthikn Not sure if this is the same root cause as yours, I had this problem when running two job queues on two servers that do not share code. This is not a supported use case by Hangfire. I've developed a workaround though

Was this page helpful?
0 / 5 - 0 ratings

Related issues

tompazourek picture tompazourek  Â·  3Comments

cindro picture cindro  Â·  3Comments

shorbachuk picture shorbachuk  Â·  4Comments

plmwong picture plmwong  Â·  3Comments

vikramjb picture vikramjb  Â·  3Comments