Sentry: Data scrubbing question

Created on 28 Jan 2016 · 16 Comments · Source: getsentry/sentry

Hi there,

I'm having an issue where a field in my extra context is filtered out because its value contains the string 'password'.

The structure can be summarized as:

"extra": {
  "field": "something_containing_the_string_password"
}

The value of the field is filtered out.

Is there a reason for this?

Cheers,

Math

math2k

Most helpful comment

Scrubbing happens before we store any data on disk.

mattrobenolt on 30 Jul 2017

All 16 comments

Our data filtering is covered in the docs:

https://docs.getsentry.com/hosted/learn/sensitive-data/

dcramer on 28 Jan 2016

Hi,

We're having the issue as well.

The code here: https://github.com/getsentry/sentry/blob/master/src/sentry/utils/data_scrubber.py#L104

means that adding our (standard) Django session cookie, named "s", to the "sensitive fields" makes Sentry filter out almost all the data.

Is there a good reason not to simply filter on dictionary key names (matching exactly)?

vperron on 4 Feb 2016

The reason we do a contains match is this: "my_secret_password"

This is something we'll need to expand support for in the future, perhaps by allowing you to explicitly whitelist an exact match.

dcramer on 4 Feb 2016
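To make the trade-off in the comments above concrete, here is a minimal Python sketch of contains-style matching. It is not Sentry's actual implementation: `scrub_contains`, `FILTER_MASK`, and the sample data are hypothetical names chosen for illustration, and the sketch only handles a flat dictionary.

```python
FILTER_MASK = "[Filtered]"  # placeholder mask, assumed for illustration


def scrub_contains(data, sensitive_fields):
    """Mask any entry whose key or string value contains a sensitive term."""
    terms = [term.lower() for term in sensitive_fields]
    scrubbed = {}
    for key, value in data.items():
        # Check the key, and the value too when it is a string.
        haystacks = [key.lower()]
        if isinstance(value, str):
            haystacks.append(value.lower())
        if any(term in haystack for term in terms for haystack in haystacks):
            scrubbed[key] = FILTER_MASK
        else:
            scrubbed[key] = value
    return scrubbed


# A key such as "my_secret_password" is caught by the term "password".
print(scrub_contains({"my_secret_password": "hunter2"}, ["password"]))

# A one-letter term such as "s" matches nearly every key or value.
print(scrub_contains({"host": "example.com", "user": "alice"}, ["s"]))
```

Under these assumptions, the same substring test that catches "my_secret_password" also explains why a one-letter sensitive field such as "s" ends up masking almost everything.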

I see, but I'm still pretty sure this is an anti-pattern, isn't it? We want to hide data based on the keys we define software-side for our systems, not on the assumption that some value might contain the string "password".

It may hide valuable information (what if my server resides on passwords.mycompany.com?)
Also, one may say that hiding a secret as secure as "my_magic_password" is, ahem... :)

I really think we should limit Sentry to filtering only on exact or fuzzy key matches for the sensitive fields, possibly with regexes for the values (like the card information, as you already do, even if I'd say it should simply be protected by field name instead of recursively matched in every field).

I still don't see a very good reason to do it the way it is now :)

vperron on 4 Feb 2016

We are aggressive here because we absolutely do not want to leak sensitive data. It's important to understand that Sentry works cross-platform on just about every device you can imagine. In many cases data is not presented as key/value pairs. For example, "foo=bar&password=baz" could be the value of a field, and we'd absolutely not want to capture that in case password is sensitive here. If your situation is safe, you can disable the data filtering.

dcramer on 4 Feb 2016
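The "foo=bar&password=baz" case from the comment above can be illustrated with a short Python sketch. The function names, the mask string, and the event payload are all made up for this example; it only contrasts exact key matching with a contains check on values and does not reflect how Sentry is implemented.

```python
MASK = "[Filtered]"  # hypothetical placeholder mask


def scrub_exact_key(data, sensitive_key):
    """Mask a value only when its key equals the sensitive key exactly."""
    return {k: (MASK if k == sensitive_key else v) for k, v in data.items()}


def scrub_value_contains(data, sensitive_term):
    """Mask a value when the string value itself contains the sensitive term."""
    return {
        k: (MASK if isinstance(v, str) and sensitive_term in v else v)
        for k, v in data.items()
    }


event = {"body": "foo=bar&password=baz"}  # made-up event payload

# No key is literally "password", so exact key matching leaves the value as-is.
leaked = scrub_exact_key(event, "password")

# The contains check on the value masks the whole field.
masked = scrub_value_contains(event, "password")

print(leaked)
print(masked)
```

In this sketch the exact-key variant lets the embedded password through, which is the leak the aggressive default is meant to prevent.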

OK, that's a better use case. I'll see if I can come up with something to enable finer control over what is filtered or not in a subsequent PR or discussion.

We do need filtering :) Just not that aggressive.

vperron on 4 Feb 2016

This also might be something where we can just make an additional setting that is for excluding by exact match. Definitely open to improving this as it's obviously problematic, but we don't want to change the defaults to be less restrictive.

dcramer on 4 Feb 2016

Glad to see that I'm not the only one running into this issue.

After being told to RTFM, I've patched my own install to suit my needs.
I understand Sentry's approach, but it would be nice to be able to make it fit our needs without having to patch it.

Would be nice to have a way to whitelist keys for example.

math2k on 4 Feb 2016

Would having a "whitelist these key names" option be sufficient? Do we need CONTAINS on the key names?

dcramer on 5 Feb 2016

Hi,

I personally feel it's not the option I'd choose. Whitelisting every possible field name for every form and HTTP request across our ~100 different services seems tedious :)

@dcramer: if you wish, I think your proposed approach of choosing either fuzzy or exact match for the fields is a better one, but my personal favourite would be adding another whitelist.

So there would be the "contains" search on keys & values for all the sensitive words, including the default ones, AND an "exact match" blacklist based on key names (which may include default ones or not)

I'd gladly code that if need be.

vperron on 5 Feb 2016
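The proposal in the comment above (a "contains" search on keys and values, plus an exact-match list of key names) could look roughly like the following Python sketch. `scrub_combined` and its parameters are hypothetical; matching is case-sensitive and limited to a flat dictionary to keep the example short.

```python
MASK = "[Filtered]"  # hypothetical placeholder mask


def scrub_combined(data, contains_terms, exact_keys):
    """Mask on a substring hit in key or value, or on an exact key hit."""
    scrubbed = {}
    for key, value in data.items():
        text = value if isinstance(value, str) else ""
        # Fuzzy list: the term may appear anywhere in the key or the value.
        fuzzy_hit = any(term in key or term in text for term in contains_terms)
        # Exact list: the key must equal one of the listed names.
        exact_hit = key in exact_keys
        scrubbed[key] = MASK if (fuzzy_hit or exact_hit) else value
    return scrubbed


sample = {"s": "abc123", "user": "alice", "note": "my_secret_password"}
print(scrub_combined(sample, ["password"], {"s"}))
```

With "s" on the exact-match list instead of the contains list, only the cookie itself is masked and fields such as "user" survive.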

My main concern about having multiple styles of matching is that the user experience suffers.

There are already three options:

Enable data filtering
Enable default filters
Additional filters to apply

Now we'd need two fields for additional filters, one which does CONTAINS and one which does EXACT. Now what if we do a whitelist? We'd need two more fields for the same thing.

I'm greatly opposed to simply having an option to change all matching, as that's another poor user experience.

This might be something where we need to do something equivalent to rules and let you create a set of rules. "scrub data when key CONTAINS x".

dcramer on 6 Feb 2016
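The rule idea from the comment above ("scrub data when key CONTAINS x") might be modelled as in the Python sketch below. The `Rule` class, its fields, and `apply_rules` are purely illustrative assumptions, not an existing or planned Sentry API.

```python
from dataclasses import dataclass

MASK = "[Filtered]"  # hypothetical placeholder mask


@dataclass(frozen=True)
class Rule:
    target: str    # "key" or "value"
    operator: str  # "exact"; anything else is treated as "contains"
    term: str

    def matches(self, key, value):
        subject = key if self.target == "key" else value
        if not isinstance(subject, str):
            return False
        if self.operator == "exact":
            return subject == self.term
        return self.term in subject


def apply_rules(data, rules):
    """Mask every entry that at least one rule matches."""
    return {
        k: (MASK if any(rule.matches(k, v) for rule in rules) else v)
        for k, v in data.items()
    }


rules = [
    Rule(target="key", operator="contains", term="password"),
    Rule(target="key", operator="exact", term="s"),
]
sample = {
    "s": "abc",
    "session": "xyz",
    "db_password": "pw",
    "host": "passwords.mycompany.com",
}
print(apply_rules(sample, rules))
```

Because each rule names its own target and operator, a single list of rules would replace the separate CONTAINS and EXACT settings discussed above.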

@dcramer When you say 'Scrubbing' removes filtered data, what exactly does that mean?