Dnn.platform: RFC: Event Log, Exceptions table

Created on 17 Oct 2018 · 36Comments · Source: dnnsoftware/Dnn.Platform

Request For Comments.

Please review the problem and suggested solution and provide your comments to help drive this feature implementation into a future platform release.

Description of problem

We have an Exceptions table which computes a hash of each exception. The idea behind this table is we could display which exceptions are occurring frequently compared to the exceptions which only happen a small handful of times.

Description of possible solution 1

When expanding an exception, we currently list the exception hash. We should also display and offer the option to sort by the number of occurrences of that exception hash. Since this is only valid for exceptions, we need to make sure it doesn't cause any issues with the other event log types.

Description of possible solution 2

We remove this feature/table if people do not find it valuable.

Source

ohine

👍1

Most helpful comment

I definitely think we need to keep the exceptions abstracted into a new table to allow us to use the data for the original intent. This way, folks like us can do better analysis of what is affecting a DNN instance. We just need to fix the data and have a UI for it.

hismightiness on 30 Oct 2018

👍2

All 36 comments

Hi @ohine
Not strictly related to this issue, but when I do

SELECT [ExceptionHash], COUNT(1) [Count]
FROM [dbo].[Exceptions]
GROUP BY [ExceptionHash]
HAVING COUNT(1) > 1
ORDER BY COUNT(1) DESC

I get empty results:

ExceptionHash Count
------------- -----------

(0 rows affected)

Yet, there are indeed duplicate records in the Exceptions table, just the hash is different. I'm now grouping by all columns excluding the hash:

SELECT [Count], [Message]
FROM (
  SELECT [Message], [StackTrace], [InnerMessage], [InnerStackTrace], [Source], COUNT(1) [Count]
  FROM [dbo].[Exceptions]
  GROUP BY [Message], [StackTrace], [InnerMessage], [InnerStackTrace], [Source]
  HAVING COUNT(1) > 1
) query
ORDER BY [Count] DESC

and getting:

Count Message                                                                                              
----- -----------------------------------------------------------------------------------------------------
  169 The Controls collection cannot be modified because the control contains code blocks (i.e. <% ... %>).
  168 Object reference not set to an instance of an object.
   48 Exception has been thrown by the target of an invocation.
   14 Error occurred while obtaining a token for portalId 0 and application Analytics
    4 Object reference not set to an instance of an object.
    2 Error occurred while obtaining a token for portalId 0 and application Structured Content
    2 Thread was being aborted.

(7 rows affected)

Maybe this hash is including more metadata than it should?

daguiler on 17 Oct 2018

@daguiler I think it's actually an issue that the function being used to calculate the hash doesn't use a stable key/salt, so the same content will generate different hashes. I discovered this recently, but haven't had a chance to verify and submit an issue.

Assuming that's true, it would certainly seem like no one has found this data useful since it started being collected (other than, perhaps, granting some false assurance against consistent exceptions). On the other hand, if we fix this issue and expose some more exception details in the UI, perhaps it would end up being good information to have.

bdukes on 17 Oct 2018

👍1

Yes, makes sense.

daguiler on 18 Oct 2018

Thanks @daguiler and @bdukes for looking into this RFC and providing feedback! I'm hoping @donker will chime in with more details behind that table and how he intended it to function.

I do recall we've had quite a few regression issues with the feature over the last few years and hopefully we can nail down what it should be doing and make it happen. I think it would be valuable to have that information, right now it appears it's just causing more issues then it's solving.

ohine on 18 Oct 2018

please be aware that by default, all Log types for exceptions (SCHEDULER_EXCEPTION, GENERAL_EXCEPTION, MODULE_LOAD_EXCEPTION, PAGE_LOAD_EXCEPTION) are configured to keep only the last 100 entries. When the eventlog entry is deleted, the exceptionevent is deleted as well and only the exeption with the stack trace is left, i.e. with the current structure, you won't get any data for statistics, unless you configure all these types to keep ALL entries. (In my scripts, I recently limited those entries to max age of - by default - two years, als either the DNN version is too old or the item would no longer match the code.

sleupold on 18 Oct 2018

I don't see much benefit of having two 1:1 tables EventLog and ExceptionEvents, merging both into one would result in a slightly larger single table with a lot of empty columns (for non-exceptions), but the duplicate columns UserID and PortalID would be eliminated and a single table would be easier to handle.
diagr_eventlog

If there is no use for Exceptions being stored separately any more, we should consider moving the messages and stacktrace back to EventLog table as well (they don't take significant space, if the columns are Null) and we would avoid the costly join when retrieving the data.

sleupold on 18 Oct 2018

Referrer and UserAgent might be interesting on Login or first visit as well.

We also need to be aware that this tables may contain personal data, affected by GDPR.

sleupold on 18 Oct 2018

@daguiler I think it's actually an issue that the function being used to calculate the hash doesn't use a stable key/salt, so the same content will generate different hashes. I discovered this recently, but haven't had a chance to verify and submit an issue.

Assuming that's true, it would certainly seem like no one has found this data useful since it started being collected (other than, perhaps, granting some false assurance against consistent exceptions). On the other hand, if we fix this issue and expose some more exception details in the UI, perhaps it would end up being good information to have.

I wrote the original change, so I'd better jump on this discussion. Before my intervention there was no Exceptions table. And to find an answer to "Which bit of malfunctioning code is wrecking my DNN's performance?" was near impossible to answer. So the thought was as Oliver mentioned to be able to get to grips with bugs in your installation. The UI, however, was never done because ... well, lots of reasons.

In all their wisdom at some point someone at Corp injected the LogGUID back into the hash so that defeated the purpose of the intervention. I believe (I'd have to trace this) that I took that out again.

What I do see, however, is this bit of code in the Hash function:

            // DNN-8845: using a FIPS compliant HashAlgorithm
            using (var hasher = new HMACSHA1())
            {
                var byteArray = hasher.ComputeHash(Encoding.Unicode.GetBytes(sb.ToString().ToLowerInvariant()));
                return Convert.ToBase64String(byteArray);
            }

I'm not familiar with HMACSHA. I implemented MD5 first. But it looks like they changed this to comply with FIPS: https://dnntracker.atlassian.net/browse/DNN-8845. But from what I can see this now defeats our purpose because HMACSHA1 is a keyed algorithm so when the key changes the hash changes (I think). They probably should have chosen SHA1CryptoServiceProvider instead which is not keyed. Can anyone second that observation?

Also I was hoping to offload the EventLog table from some data as well. The exceptions are stored in there as well. That should be (IMHO) deprecated.

Finally the UI: we should have a screen that shows the top 10 exceptions. Then you should be able to drill down by clicking on one to show the individual events. I.e. the list we now see but filtered by exception hash.

donker on 23 Oct 2018

Yeah, it looks like that FIPS-compliant issue was the cause of the issue I was seeing. (Some government computers have a FIPS-compliant setting which causes programs to crash when they try to use the MD5 function or other non-compliant algorithms, so even though it's not being used in a security context, it's important not to use it).

I found this out when I used this code as the basis for a function I was using and realized it didn't work. Here's what I ended up with:

using (var hasher = SHA1.Create())
{
    var byteArray = hasher.ComputeHash(Encoding.Unicode.GetBytes(data));
    var hashedToken = Convert.ToBase64String(byteArray);
    return hashedToken;
}

Personally, I don't know that I've ever used the data in the table to look at exceptions in the aggregate, but if there were a view of aggregate exceptions, I would think it was useful.

bdukes on 23 Oct 2018

👍2

So for the worklist for this RFC we can add changing this algorithm at the very least.

donker on 23 Oct 2018

I fear, we are mixing logging and analysis.
The key requirements for a log table:

ver fast write with minimal performance affect on the transaction
performant table maintenance.
optional analysis as a specific process (background, nightly or dedicated machine)
If working as intended, the current solution tries to perform analysis during write and relies on all base data to be present upon querying the data. A more suitable solution would
write EventLog in a single table as originally designed.
periodically extract Exceptions and ExceptionHistory (count per day)
clear EventLog items according to rules
clear outdated ExceptionHistory (minimum occurrences not reached)

sleupold on 23 Oct 2018

Yeah, I'm not a proponent of the single table. The idea was to offload the largest block of text from the main table, namely the stacktrace. If you have a bit of malfunctioning code that runs on every page hit (and this is not an uncommon scenario) your EventLog grows like mad.

donker on 23 Oct 2018

So for the worklist for this RFC we can add changing this algorithm at the very least.

Yup, I'll get it added into the description and suggest to use the code @bdukes provided.

I assumed it still was working after the last time I fixed it. I didn't verify the hash was still valid before creating the RFC. It's been something I've wanted to fix for awhile now but just haven't had the chance to create the UI for the table.

ohine on 23 Oct 2018

@donker, offloading rows would only make sense, if these columns take much space. If we consider most of the rows having empty fields for stack trace, inner stack trace, etc. each Null value just takes one bit (not byte). If there is a value, duplicate values occupies space for the text size of all messages, but by default limiting each exception type to 100 items kept, the size is neglectable. And by deleting old items, occurrence info will be deleted permanently as well.
However, any analysis (e.g. checking for Hash already existing) takes a lot of extra effort when recording a log entry.

sleupold on 24 Oct 2018

@donker, offloading rows would only make sense, if these columns take much space. If we consider most of the rows having empty fields for stack trace, inner stack trace, etc. each Null value just takes one bit (not byte). If there is a value, duplicate values occupies space for the text size of all messages, but by default limiting each exception type to 100 items kept, the size is neglectable. And by deleting old items, occurrence info will be deleted permanently as well.
However, any analysis (e.g. checking for Hash already existing) takes a lot of extra effort when recording a log entry.

I see your point. Would you add columns then for the error message, stacktrace and/or inner exception? They are currently folded in an XML blurb which makes SQL manipulation next to impossible. The XML blurb could just as well be offloaded to disk as far as I'm concerned. But the exception is crucial to determine big from small errors in the system.

donker on 24 Oct 2018

@donker, I would use individual columns, which ate taking less space and are easier to retrieve.

sleupold on 24 Oct 2018

I wouldn't offload the XML blurb to disk, that would make analysis unnecessarily more difficult.

sleupold on 24 Oct 2018

I wouldn't offload the XML blurb to disk, that would make analysis unnecessarily more difficult.

There is a lot of "junk" in there as well which looks more like something for a regular (disk based) log. I'd keep the best bits in SQL.

donker on 24 Oct 2018

👍1

hismightiness on 30 Oct 2018

👍2

Yup, I'm all for keeping it, fixing the hash, and creating the UI.

If that's the consensus of the group, I can get this created as a feature and close out this RFC.

Any objections? @donker @sleupold @bdukes @daguiler @hismightiness

ohine on 31 Oct 2018

I'm all for it.

donker on 1 Nov 2018

We need to get back to the requirements for logging:

it should gather all necessary data and store it as quick as possible, not to interfere with online transactions
it should support analysis of this log
Ideally, 1 and 2 are decoupled - the logging is triggered in-process and writes a single log entry and a (scheduled) process afterwards analyze the log entries and take appropriate actions (update statistics, notify person in charge or disable a component).

Following this pattern, any event should be logged using a single database call, which just adds a single record to the log table. We should add a frequently running scheduler job, which analyses the log, separates exceptions and exception events, maintains a statistics table for it (e.g. events per day) and delete all analyzed items from the log keeping it small and fast.

sleupold on 1 Nov 2018

@sleupold I think completely rewriting the logging engine is out of scope for this RFC. We are trying to put things back together and make small improvements to the event logging UI.

I think we can start another RFC for gathering requirements for a completely rewrite of how we're logging in DNN today. After we get the async flag added, we should be able to at least make logging a non-blocking call.

ohine on 2 Nov 2018

Any objections?

Sounds good to me.

daguiler on 2 Nov 2018

No, I don't request a complete rewrite, but a restructure of processing the data. the current structure might be generated in the background process, but we need to free up logging from unnecessary burden to regain original performance.

sleupold on 2 Nov 2018

PS: just adding a UI for no data is not useful imo.

sleupold on 2 Nov 2018

I'd like to elaborate on the performance effect, as we were recently discussing the problem of incrementing CRM version, which may cause a multi hour downtime on high perf sites. Now consider a situation, when such a site gets into a loop of constantly logging errors (e.g. when the database size is limited and this limit is reached). Either the number of errors force the site to be shut down or the error logging continuously slows down the site, as each logging is analyzing the event generating hash, searching for the hash in the database and if not found, adding it. Writing EventLog and ExceptionEvent entries, which includes statements to delete outdated records per log type.
The original AddEventlog method just took the parameters and added a single log entry to the log table (and deleted outdated entries from this table, which often timed out, ending up in log tables increasing significantly and affecting performance). The split into multiple tables might reduce database size, but doesn't fix the general problem.
The optimal solution would be a log table, where just a single record gets added in process while a background job is executed frequently to analyze the log entries, taking appropriate actions, which includes preparing data for UI, and deleting analyzed data from the log.

sleupold on 3 Nov 2018

The more I think and investigate about logging events and exceptions as well as auditing data changes in DNN, I feel a need to review the current practice, to make sure, it suits the needs of our users and to eliminate redundancies.

EventLog contains exceptions (with inner message and stack trace) and events (like UserCreated, ModuleUpdated, …), both including portal and user reference, exceptions include page reference as well.
Exceptions are also might get logged into a log file using Log4Net.
CreatedByUser, CreatedAtDate, UpdatedByUser and UpdatedAtDate (so called "audit columns") are added to most of the tables for each record.
Except for exceptions, EventLog does not contain much additional data (which module setting was updated, previous and new values, …)
By default, only a small number of items per "Log Type" is kept (100) , all previous items are deleted, when a new item is logged

Issues and requirements:

Logging should have minimal effect on performance
We need improved, more intelligent options to identify problems and investigate the log
we need to make sure, we get all necessary information to identify and fix issues (e.g. if there is suspect of the site being hacked, which might be result of thousands of login attempts, …
admins would like to get an option to provide a "Change Log" for a site
admins sometimes need to restore old data, e.g. the Email address of a violating user from registration, after he defaced his profile.

Are there other needs from other users?

sleupold on 5 Nov 2018

@sleupold I think you have a great start here. I would say that if we are taking a step back I'd want to add a few more goals to the process.

Logging of actions should be separated from actions that are exceptions or at least a quick filter to limit the view so we can see what is going on. As really we are looking at different things when we look for exceptions or audit logs
Ideally = Log4Net logs would be integrated, or visible in a usable manner with the other logging
Logging needs cleanup, including the log4net, which by default DOES NOT clean itself up

mitchelsellers on 5 Nov 2018

@mitchelsellers,
I agree - and like to add:

cleanup of EventLog does currently not include orphaned items in Exceptions table (except for users of my Turbo Scripts)
"in-process" cleanup of EventLog items might affect performance
We need a logging and auditing concept, which includes support for most common needs for small, medium and large sites (with little up to high usage), taking into account typically assigned ressources (file space, database size and memory)

sleupold on 5 Nov 2018

we should review current cleanup strategy - IMO count per log type is not useful in many cases.
if there are situations with e.g. a few hundred system exceptions, one caused by the other, they might overwrite each other including the originally causing issue.
I see a need to reduce redundancies, while the same time we need an integrated view of all activities, e.g. the "update module" option causing a page load exception.
it would be helpful, if we could generate an "item history" per page, module or item from the log (even if the item doesn't support versioning)

sleupold on 5 Nov 2018

I don't run huge sites that have a severe performance hit due to logging, but a UI for viewing log4net logs, changing the log4net settings and that they do cleanup would be super useful!

valadas on 5 Nov 2018

So I think to resume this thread:

We repair the current situation
We plan for an overhaul of the logging system

To come to the performance issue: exceptions should be exceptional. Ideally you have none. But if they occur frequently then there is an issue with your site and the logging performance hit will be the least of your worries. It's like improving insurance payout speed after accidents. Sure it's nice to have but ultimately you just want to not have accidents.

donker on 5 Nov 2018

@donker
with 1. you mean fixing the hash algorithm and the cleanup issues?
regarding performance, I am having a different POV. If we want to improve DNN in this area (preventing it from being called the "old webforms turtle amongst CMS") we need to change the developers' mindset, to consider performance implications and possible improvements on every change of the platform. There are situations, where thousands of errors are logged (sometimes one error causing the other) and we should consider strategies to handle it. Ideally, those checks and logs and cleanups should not significantly affect DNN performance.

sleupold on 6 Nov 2018