Elasticsearch: Fingerprinting Ingest Processor

Created on 3 Mar 2016 · 28 Comments · Source: elastic/elasticsearch

A potentially useful processor is one that can generate one or more "fingerprints" from an incoming document. This could aid in finding duplicates, detecting plagiarism, or clustering similar documents together.

I think there are two realms of fingerprinting: content fingerprinting and structural fingerprinting.

Content Fingerprinting

Hashes the content of fields to generate a fingerprint-per-field, and optionally, a fingerprint that represents all the fields. Could use simple hashing, or perhaps something more sophisticated like MinHash, SimHash, Winnowing or Probabilistic Fingerprinting.

The API could look something like:

{
  "fingerprint": {
    "type": "content",
    "fields": ["foo", "bar"],
    "hash": "minhash",
    "hash_all": true
  }
}

E.g. specify the type of fingerprinting we want to do (content), a list of fields to hash, the style of hashing and if we should also hash all the hashes together. The output would then be the document + new fingerprint fields:

{
  "foo": "...",
  "bar": "...",
  "fingerprint_foo": 1283891,
  "fingerprint_bar": 8372038,
  "fingerprint_all": 3817273
}
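
For illustration, here is a minimal Python sketch of what the content mode might do internally. The fingerprint_<field> naming follows the example above; the helper name and the use of plain SHA-1 (rather than MinHash/SimHash/Winnowing) are my own assumptions, not a proposed implementation.

import hashlib

def content_fingerprint(doc, fields, hash_all=True):
    # Hash each configured field's value, then optionally hash the
    # concatenation of the per-field digests into fingerprint_all.
    out = dict(doc)
    digests = []
    for field in fields:
        digest = hashlib.sha1(str(doc[field]).encode("utf-8")).hexdigest()
        out["fingerprint_" + field] = digest
        digests.append(digest)
    if hash_all:
        out["fingerprint_all"] = hashlib.sha1("".join(digests).encode("utf-8")).hexdigest()
    return out

print(content_fingerprint({"foo": "some text", "bar": "other text"}, ["foo", "bar"]))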

Structural Fingerprinting

The other mode of fingerprinting could be structural in nature (this is the one I'm more interested in, tbh). Instead of fingerprinting the content of fields, we are actually fingerprinting the structure of the document itself. Essentially, we would recursively parse the JSON and hash the keys at each level in the JSON tree. These hashes then become a fingerprint for the structure of the document.

Importantly, this type of fingerprinting ignores the leaf values...we just want to fingerprint the JSON keys themselves.

{
  "fingerprint": {
    "type": "structure",
    "root": ["foo"],
    "recursive": true,
    "hash": "murmur3",
    "hash_all": true
  }
}
  • root: defines where to start recursing, in case you only care about a portion of the document. Could be omitted or set to "*" to process the entire document.
  • recursive: true if you want the processor to fingerprint all the layers; false if you just want the top level of keys hashed.
  • hash: murmur3, minhash, etc.
  • hash_all: whether all the hashes should be hashed together to build a final fingerprint.

And the new document:

{
  "foo": {
    "bar": {
      "baz": "buzz"
    },
    "beep": {
      "boop": "bop"
    }
  },
  "fingerprint_level1": 001734,
  "fingerprint_level2": 992727,
  "fingerprint_level3": 110293,
  "fingerprint_all": 235240
}

Instead of a fingerprint-per-field, we now have one per "level".
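
A minimal Python sketch of that per-level behavior, mirroring the example document above; the breadth-first walk, the MD5 choice, and the key separator are assumptions made purely for illustration.

import hashlib

def structural_fingerprint(doc, recursive=True):
    # Walk the JSON tree breadth-first and hash only the keys seen at each
    # depth; leaf values are deliberately ignored.
    out = dict(doc)
    digests = []
    frontier, depth = [doc], 1
    while frontier:
        keys = sorted(k for node in frontier for k in node.keys())
        digest = hashlib.md5("|".join(keys).encode("utf-8")).hexdigest()
        out["fingerprint_level%d" % depth] = digest
        digests.append(digest)
        if not recursive:
            break
        frontier = [v for node in frontier for v in node.values()
                    if isinstance(v, dict)]
        depth += 1
    out["fingerprint_all"] = hashlib.md5("".join(digests).encode("utf-8")).hexdigest()
    return out

doc = {"foo": {"bar": {"baz": "buzz"}, "beep": {"boop": "bop"}}}
print(structural_fingerprint(doc))  # fingerprint_level1..3 + fingerprint_all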

I can think of a number of objections to both of these, but this should at least kick off the discussion :)

/cc @martijnvg



All 28 comments

I love this!

:+1: great idea

@polyfractal btw have you seen this? https://github.com/elastic/elasticsearch/issues/13325

just waiting to be exposed /cc @markharwood

Interesting! I had not seen that. That could potentially cover some of the features needed for content fingerprinting (or could be used as one of the "hashes").

The major downside I see to the FingerprintFilter is that large blocks of text will still generate large numbers of tokens, and for some use-cases you really just want a single fingerprint token to represent the whole thing (like minhash, winnowing, etc)

The major downside I see to the FingerprintFilter is that large blocks of text will still generate large numbers of tokens,

I was thinking a downstream filter could take care of hashing the stemmed, sorted, deduped, etc. set of tokens that FingerprintFilter and friends produce.
I didn't want to build the hashing into FingerprintFilter because sometimes you might want the readability of the raw tokens.
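
To make that concrete, a rough Python sketch of the downstream step being described: collapse the sorted, deduplicated token set (as a FingerprintFilter-style stage would) and hash the single resulting value. SHA-1 here is just a placeholder for whichever hash you would actually pick.

import hashlib

def fingerprint_token_stream(tokens, separator=" "):
    # Collapse the token stream (dedupe, sort, join with a separator),
    # then hash the single collapsed value in a downstream step.
    collapsed = separator.join(sorted(set(tokens)))
    return hashlib.sha1(collapsed.encode("utf-8")).hexdigest()

print(fingerprint_token_stream(["the", "quick", "brown", "fox", "the", "fox"]))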

Makes sense. I suppose this goes back to the debate of analyzers vs. ingest pipelines ... e.g. how much work should be done by analyzers vs. deferring more complex/expensive computations to dedicated ingest pipelines. Dunno :)

analyzers, vs deferring more complex/expensive computations to dedicated ingest pipeline

If fingerprinting can be done on a per field basis then this should be done via analyzers, because that will perform much better. However if fingerprinting requires several fields or the entire document then this should be done via a pipeline, since there all fields are accessible.

Ok, so in that case we can probably scratch the entire "content fingerprinting" section. That could be done with FingerprintFilter + various hashes, and if you want a total hash, you can copy_to a new field that is also hashed.

The "structural fingerprinting" stuff would have to be a pipeline, since it requires multiple fields in the doc

Still, I think there's a lot of value. There's a pretty common use case for multiple-field "document fingerprinting," including in records management and e-mail search use cases. Usually something like the concatenation of to+from+cc+timestamp+subject and then md5/sha1/md5+sha1 hashes used for deduplication. It would be most interesting to make sure some of those hash algorithms were available as they're industry standards for those use cases. I'm a +1 for this overall.
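
As a rough sketch of that records-management deduplication pattern (the field names, the "|" separator, and the helper name are illustrative, not a spec):

import hashlib

def message_fingerprint(msg):
    # Concatenate the usual deduplication fields and hash the result with
    # the industry-standard algorithms mentioned above.
    canonical = "|".join(str(msg.get(f, "")) for f in
                         ("to", "from", "cc", "timestamp", "subject"))
    data = canonical.encode("utf-8")
    return {"md5": hashlib.md5(data).hexdigest(),
            "sha1": hashlib.sha1(data).hexdigest()}

print(message_fingerprint({"to": "a@example.com", "from": "b@example.com",
                           "timestamp": "2016-03-03T00:00:00Z", "subject": "hello"}))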

@polyfractal out of curiosity - which you always manage to trigger in me :) - what use cases do you see for structural fingerprinting? I'm also asking because I presume it will be most useful where people have many, many optional fields, in which case they might want to model their data differently and put the field name as a value to avoid mapping explosion. In that case we're back to content fingerprinting?

@bleskes For me, the use-case is parsing JSON logs (slow query logs in particular) and fingerprinting that structure. And the situation is a bit unique, since I actually want _only_ the fingerprint and original _source, but skip the actual doc fields.

The root problem (for me) is that fingerprinting the structure of hierarchical JSON is much more reliable than treating the JSON as textual content and n-gramming it. I was hoping this could be massaged into a general-purpose feature that is used in other contexts, but I'll be 100% honest and say I dunno what else it can be used for. It may be too special-purpose :)

Since this is essentially locality-sensitive hashing for tree structures, I suppose it could be used to cluster taxonomies, call stacks, lineage trees, etc?

I absolutely see the problem with many fields + mapping explosion. Don't have a good answer to that :/

I suppose this could be re-imagined as a content fingerprint that expects JSON, and loads/hashes that from a single string field?

For me, the use-case is parsing JSON logs (slow query logs in particular) and fingerprinting that structure.

@polyfractal - I see. So the "document" you are interested in is the query type/combination irrespective of the values. I.e., the tree structure, including node names, is the document. Indeed the question becomes - do we see more use cases for that? Otherwise we can preprocess the tree into a value (by keeping all node names and {}) and fingerprint that.

Yep, and I'm not sure. It definitely could be too niche. Processing the tree into a text value is doable, but introduces a variety of problems. For example, take this tree:

query:
  match:
    baz: abc
  bool:
    must:
      match:
        bar: xyz
      match:
        baz: abc

How do you transform that into a textual value? Assuming there is still some kind of processor that understands JSON and can emit node names (each scheme is sketched in code after this list):

  • Go depth-first, which maintains subtree structure: match baz bool must match bar match baz.
  • Go breadth-first, which maintains per-level co-occurrence: match bool baz must match match bar baz.
  • Go depth-first, but generate a value for each "branch" in the tree: ["match baz", "bool must match bar", "bool must match baz"], and then use a bunch of phrase magic to accomplish per-level co-occurrence.
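
A small Python sketch of the three traversal schemes, using the example tree above rendered as a query-DSL-like dict; the helper names are made up for illustration, and the outputs differ from the bullets only by including the "query" root.

query = {"query": {
    "match": {"baz": "abc"},
    "bool": {"must": [{"match": {"bar": "xyz"}},
                      {"match": {"baz": "abc"}}]},
}}

def child_dicts(value):
    # Treat JSON arrays as transparent so only objects count as tree nodes.
    if isinstance(value, dict):
        return [value]
    if isinstance(value, list):
        return [d for item in value for d in child_dicts(item)]
    return []

def depth_first_keys(node):
    # Scheme 1: depth-first, preserving subtree structure.
    keys = []
    for d in child_dicts(node):
        for key, value in d.items():
            keys.append(key)
            keys.extend(depth_first_keys(value))
    return keys

def breadth_first_keys(node):
    # Scheme 2: breadth-first, preserving per-level co-occurrence.
    keys, frontier = [], child_dicts(node)
    while frontier:
        keys.extend(k for d in frontier for k in d.keys())
        frontier = [child for d in frontier for v in d.values()
                    for child in child_dicts(v)]
    return keys

def branch_strings(node, prefix=()):
    # Scheme 3: one string of keys per root-to-leaf branch.
    dicts = child_dicts(node)
    if not dicts:
        return [" ".join(prefix)] if prefix else []
    branches = []
    for d in dicts:
        for key, value in d.items():
            branches.extend(branch_strings(value, prefix + (key,)))
    return branches

print(depth_first_keys(query))    # ['query', 'match', 'baz', 'bool', 'must', 'match', 'bar', 'match', 'baz']
print(breadth_first_keys(query))  # ['query', 'match', 'bool', 'baz', 'must', 'match', 'match', 'bar', 'baz']
print(branch_strings(query))      # ['query match baz', 'query bool must match bar', 'query bool must match baz']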

I dunno, I'll keep playing. Maybe one of those above schemes will work nicely when combined with MinHash/SimHash/Winnowing. I definitely agree it doesn't make sense to have functionality that is special-purpose and not generally useful

Ok, so I've been mulling this over. I think we can simplify this proposal down to just "content fingerprinting" with a variety of algorithms.

In the place of "structural fingerprinting", I think we should have a second ingest processor that is essentially a "recursive JSON flattener" processor. Given a JSON _string_ (rather than a complete JSON document), it flattens it into one or several fields, according to your configurations (all keys in one field, keys-per-level fields, prepending level to key name, etc). This also sidesteps the mapping explosion issue, since it requires the JSON to be in string form.

To accomplish structural fingerprinting, you just run it through the "flattener" first to get a set of regular fields, then use content fingerprinting. More generic, flexible, can be used with entirely different processor chains. I'll think on what that kind of processor would look like and open another separate proposal ticket for it.
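
A rough sketch of what such a "recursive JSON flattener" could do, assuming it takes a JSON string and emits either keys-per-level fields or a single all-keys field; the keys_level*/keys_all field names are placeholders, not a proposed API.

import json

def flatten_keys(json_string, per_level=True):
    # Parse the JSON string and collect its keys, either per nesting level
    # or all together; leaf values are dropped entirely.
    levels = {}
    stack = [(json.loads(json_string), 1)]
    while stack:
        node, depth = stack.pop()
        if isinstance(node, dict):
            levels.setdefault(depth, []).extend(node.keys())
            stack.extend((v, depth + 1) for v in node.values())
        elif isinstance(node, list):
            stack.extend((item, depth) for item in node)
    if per_level:
        return {"keys_level%d" % d: sorted(ks) for d, ks in sorted(levels.items())}
    return {"keys_all": sorted(k for ks in levels.values() for k in ks)}

raw = '{"foo": {"bar": {"baz": "buzz"}, "beep": {"boop": "bop"}}}'
print(flatten_keys(raw))
# {'keys_level1': ['foo'], 'keys_level2': ['bar', 'beep'], 'keys_level3': ['baz', 'boop']}

The fingerprinting processor would then hash those flattened fields like any other content.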

This also sidesteps the mapping explosion issue, since it requires the JSON to be in string form.

Since all of this happens at ingest time, I don't think there is any risk of running into too many fields - if we remove the source after fingerprinting/collapsing it. This could be an option of that flattener (maybe default?) or we can use another processor.

Closing for now since there is no longer demand for this feature. Feel free to re-open

I'd like to propose we reopen this with the emergence of GDPR. This encourages data owners to pseudonymize fields using hashing algorithms - typically users will want to hash a field (overwriting the value) with a salt - to minimise the possibility of using rainbow tables for reversal in the event the data is lost. @MikePaquette

Initially, I don't think we need to support hashing objects. The ability to support primitives, i.e. strings and numerics, is sufficient. Adding a salt to a value adds resiliency against brute-force reversal techniques - this would be consistent with the Logstash fingerprint filter as well. Finally, I'd add that documents have multiple fields which will need fingerprinting. Each of these will need to be output to a unique field, e.g.

definition:

{
  "fingerprint": {
    "fields": ["username", "ip"],
    "hash": "sha256",
    "key": "a_random_salt"
  }
}

output document:

{
   "username":"joe_blogs",
   "ip"::192.168.2.1",
   "fingerprint_username":"blah",
   "fingerprint_ip":"blah"
}
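
As a sketch of the hashing step being proposed, here is a keyed SHA-256 (HMAC) over the selected fields, similar in spirit to how the Logstash fingerprint filter applies its key; the helper and output field names are illustrative only.

import hashlib
import hmac

def pseudonymize(doc, fields, key):
    # Keyed SHA-256 of each configured field value, written to
    # fingerprint_<field>; overwrite the original field instead if the goal
    # is to keep the raw value out of _source.
    out = dict(doc)
    for field in fields:
        digest = hmac.new(key.encode("utf-8"),
                          str(doc[field]).encode("utf-8"),
                          hashlib.sha256).hexdigest()
        out["fingerprint_" + field] = digest
    return out

print(pseudonymize({"username": "joe_blogs", "ip": "192.168.2.1"},
                   ["username", "ip"], key="a_random_salt"))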

If just a simple hash is needed, we have the murmur3 plugin. You can copy_to the various fields you need hashed, either as a combined field or individually.

Admittedly, murmur3 is not cryptographically secure, and we don't allow it to be salted. But if GDPR just requires simple hashing functionality, perhaps we could extend the murmur3 plugin to a more generic "hashing" plugin with a number of common algos?

@polyfractal

I have multiple questions:

  1. what if someone doesn't want to index the original value?
  2. aren't we semi-deprecating mapper plugins?
  3. Ingest feels like the right place to do this; and, an implementation here is in line with our [dream] goal of keeping IngestNode and Logstash feature-compatible.

disclaimer: I am implementing logstash-filter-fingerprint right now as an Ingest Node Processor

disclaimer 2: (3) isn't a question

@talevy why did you close this issue then? Is that unrelated ?

@polyfractal we are trying to conceal the original value in _source. I appreciate your original intent was for these hashes to provide additional value within the index, e.g. for clustering. Here we would actually mask the value in the source and use the hashed value in visualisations.

@dadoonet new information came to light 😄 and I am not going to implement the full description of this issue with regards to the structured-fingerprinting approach.

@talevy @gingerwizard

  1. Good point, I wasn't thinking about it in the context of security where you actually want to obscure the original value, instead of just adding hash functionality for search/filtering
  2. I have no idea tbh :)
  3. Agreed! Since changing the actual source is required for the security context, it'd make most sense to do it in Ingest. Esp. since there is a logstash analogue as you mentioned.

I'll leave it up to @talevy if he wants to reuse this issue, or open a new ticket, for his implementation. :)

postscriptum re: Disclaimer 2: lol

@talevy do we have another opened issue or is it just a matter of days/hours to have a PR on it so we can wait for the description in the PR?

I'll leave it up to @talevy if he wants to reuse this issue, or open a new ticket, for his implementation. :)

@talevy do we have another opened issue or is it just a matter of days/hours to have a PR on it so we can wait for the description in the PR?

I'll open a PR with a description of its features and link to this issue to preserve a breadcrumb to this discussion

Is this feature available for the end users?

@sachinaraballi - It is not available yet. It has stalled a bit by a requirement to ensure consistent hashing keys across the cluster. You can follow https://github.com/elastic/elasticsearch/issues/34085 for updates.
