NuGetGallery: If CollectAzureCdnLogs fails to collect raw log files, it will report success

Created on 15 Aug 2017 · 2 Comments · Source: NuGet/NuGetGallery

https://github.com/NuGet/NuGet.Jobs/blob/master/src/Stats.CollectAzureCdnLogs/Ftp/FtpRawLogClient.cs#L134

Because this method returns an empty enumerable here, the calling code cannot tell whether the FTP server had no files or the login failed.
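To illustrate the ambiguity, here is a minimal sketch in Python rather than the job's actual C# code. All names (`list_raw_logs_swallowing`, `list_raw_logs_strict`, `FtpListingError`, and the two stand-in fetch functions) are hypothetical and do not come from NuGet.Jobs. It only shows that a swallowed failure and an empty server look identical to the caller, while a raised error does not.

```python
from typing import Callable, List


class FtpListingError(Exception):
    """Hypothetical error: the file listing could not be retrieved at all."""


def list_raw_logs_swallowing(fetch: Callable[[], List[str]]) -> List[str]:
    # Mirrors the behaviour described above: a failure becomes an empty result.
    try:
        return fetch()
    except OSError:
        return []


def list_raw_logs_strict(fetch: Callable[[], List[str]]) -> List[str]:
    # Alternative: surface the failure so the caller can tell it apart
    # from a server that genuinely has no files.
    try:
        return fetch()
    except OSError as exc:
        raise FtpListingError("could not list raw log files") from exc


def empty_server() -> List[str]:
    # Stand-in for a successful login on a server with no files.
    return []


def failing_login() -> List[str]:
    # Stand-in for a failed login (ConnectionError is a subclass of OSError).
    raise ConnectionError("login failed")


if __name__ == "__main__":
    # Both calls print [], so the caller cannot distinguish the two cases.
    print(list_raw_logs_swallowing(empty_server))
    print(list_raw_logs_swallowing(failing_login))
```

With the strict variant, the caller can decide for itself whether a listing failure should mark the run as failed.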

Statistics Priority - 3 Bug ops grabs

All 2 comments

I'm not sure we need to cause the job to actually fail.

Since this particular failure results in the job exiting gracefully and retrying on the next run, I'm wondering whether there's anything actionable for us if we make the job fail on exit and alert on this particular exception.
We may get lots of false positives from temporary glitches, which heal themselves (and currently may indeed go unnoticed). Could measuring lag on the importer job be sufficient (e.g. no files imported in the last 30 or 60 minutes)? If the logs container does not contain any files, and no imports happened, that indicates a failure earlier in the pipeline, which is this FTP file collection stage.
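A rough sketch of such a lag check, in Python for brevity. The function name `importer_is_lagging` and the 60-minute default are assumptions for illustration, not part of the existing importer job.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def importer_is_lagging(
    last_import: Optional[datetime],
    now: datetime,
    max_lag: timedelta = timedelta(minutes=60),
) -> bool:
    # Hypothetical check: treat "never imported" as lagging, otherwise
    # compare the time since the last import against the threshold.
    if last_import is None:
        return True
    return now - last_import > max_lag


if __name__ == "__main__":
    now = datetime(2017, 8, 15, 12, 0, tzinfo=timezone.utc)
    recent = now - timedelta(minutes=10)
    stale = now - timedelta(minutes=90)
    print(importer_is_lagging(recent, now))
    print(importer_is_lagging(stale, now))
```

A single failed FTP run would not trip this check; only a sustained gap in imports would.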

FTP is not the most reliable or best performing protocol, and I'd love to actually get rid of the FTP dependency here (which would make this entire job obsolete, as well as the App Service deployment acting as our FTP... )
If we could simply have the CDN provider point to a blob container as drop location, the importer job could fetch directly from that container.
However, I don't think our current CDN provider supports anything else as a drop location for the W3C logs, which is a pity.

I agree with you completely; however, what I mean specifically is not that the job should fail and exit or do anything that prevents it from continuing the next loop, but instead that it should just not report success.

If you're familiar with the NuGet.Jobs job runner, you'll know that, in terms of logging, every job follows this output pattern:

Starting job.
... (logging from job) ...
Job succeeded/failed!
Looping again in X seconds.

Currently, when the job fails to access the FTP, it logs "Job succeeded!" even though, for our purposes, it failed. The job should emit "Job failed!" and then try again on the next loop, so that nobody is misled into thinking the job actually succeeded.
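The behaviour I have in mind can be sketched roughly like this, in Python rather than the runner's actual C#. `run_job_loop` and its parameters are hypothetical and only mimic the output pattern above, with an extra "Error:" line added for illustration. A failed run is logged as failed, and the loop carries on regardless.

```python
import time
from typing import Callable


def run_job_loop(
    job: Callable[[], None],
    iterations: int,
    sleep_seconds: float = 0,
    log: Callable[[str], None] = print,
) -> None:
    # Hypothetical runner: a failing run is reported as failed,
    # but it never prevents the next loop from starting.
    for _ in range(iterations):
        log("Starting job.")
        try:
            job()
            log("Job succeeded!")
        except Exception as exc:
            log(f"Error: {exc}")
            log("Job failed!")
        log(f"Looping again in {sleep_seconds} seconds.")
        time.sleep(sleep_seconds)


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_collect() -> None:
        # Fails on the first run only, like a temporary FTP glitch.
        attempts["count"] += 1
        if attempts["count"] == 1:
            raise ConnectionError("FTP login failed")

    run_job_loop(flaky_collect, iterations=2)
```

For this to work, the FTP client would have to raise, or otherwise signal, the login failure instead of returning an empty result.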
