A few users have used nested terms aggregations to try to visualise each level in a tree, such as a file system, e.g.:
```
{
  "aggs": {
    "first_level": {
      "terms": {
        "field": "first_level"
      },
      "aggs": {
        "second_level": {
          "terms": {
            "field": "second_level"
          },
          "aggs": {
            "third_level": {
              "terms": {
                "field": "third_level"
              }
            }
          }
        }
      }
    }
  }
}
```
This is very costly as it results in combinatorial explosion. However, because this is a tree, it would be more efficient to store first_level+second_level+third_level in a single field, and to do a single pass over these "leaf buckets". Once we have the most popular leaves, we can backfill the branches (ie first_level+second_level and first_level).
The results would obviously be different to the nested terms agg: instead of having the most popular first_levels, then the most popular second_levels in the most popular first_levels (etc), you'd have the most popular leaves, plus the first_level and second_level that those leaves belong to.
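To make the leaf-then-backfill idea concrete, here is a minimal sketch in Python (not the proposed Elasticsearch implementation); the function name and toy paths are made up for illustration:

```python
from collections import Counter

def top_leaves_with_branches(paths, size=10, separator="/"):
    # One pass over the "leaf bucket" values, e.g. "first/second/third".
    leaf_counts = Counter(paths)
    top = leaf_counts.most_common(size)

    # Backfill the branches (e.g. "first/second" and "first") for the
    # leaves we kept, summing the leaf counts into each ancestor.
    branch_counts = Counter()
    for leaf, count in top:
        parts = leaf.strip(separator).split(separator)
        for depth in range(1, len(parts)):
            branch_counts[separator.join(parts[:depth])] += count
    return top, branch_counts

# Toy example: file paths collapsed into a single combined-path field.
paths = [
    "My documents/Spreadsheets/Budget_2013.xls",
    "My documents/Spreadsheets/Budget_2014.xls",
    "My documents/Test.txt",
]
print(top_leaves_with_branches(paths, size=2))
```

The single pass only visits the combined-path values; the branch counts are derived afterwards from the leaves that were kept.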
A complete example could look like this:
```
PUT /filesystem
{
  "mappings": {
    "file": {
      "properties": {
        "path": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true
        }
      }
    }
  }
}
```
```
PUT /filesystem/file/1
{
  "path": "/My documents/Spreadsheets/Budget_2013.xls",
  "views": 10
}

PUT /filesystem/file/2
{
  "path": "/My documents/Spreadsheets/Budget_2014.xls",
  "views": 7
}

PUT /filesystem/file/3
{
  "path": "/My documents/Test.txt",
  "views": 1
}
```
```
GET /filesystem/file/_search?search_type=count
{
  "aggs": {
    "tree": {
      "path_hierarchy": {
        "field": "path",
        "separator": "/",
        "order": "total_views"
      },
      "aggs": {
        "total_views": {
          "sum": {
            "field": "views"
          }
        }
      }
    }
  }
}
```
And the result would look like this:
```
{
  "aggregations": {
    "tree": {
      "buckets": [
        {
          "key": "My documents",
          "doc_count": 3,
          "total_views": {
            "value": 18
          },
          "tree": {
            "buckets": [
              {
                "key": "Spreadsheets",
                "doc_count": 2,
                "total_views": {
                  "value": 17
                },
                "tree": {
                  "buckets": [
                    {
                      "key": "Budget_2013.xls",
                      "doc_count": 1,
                      "total_views": {
                        "value": 10
                      }
                    },
                    {
                      "key": "Budget_2014.xls",
                      "doc_count": 1,
                      "total_views": {
                        "value": 7
                      }
                    }
                  ]
                }
              },
              {
                "key": "Test.txt",
                "doc_count": 1,
                "total_views": {
                  "value": 1
                }
              }
            ]
          }
        }
      ]
    }
  }
}
```
I came across a use case that might be related. I want a tool to help me write IP address blocking rules by examining log records. A rule might be _"ban everything from 121.205."_ or, if I need to be more selective, _"ban everything from 121.205.247."_
This is a decision every webmaster takes on the basis of how much good vs bad traffic comes from any one level. If you think about how a hierarchical aggregation might ideally help with this, it could progressively inflate sections of the full agg tree, pursuing only those branches where the "bad vs good" mix is high, i.e. where there are more logged 404s than 200s in that address range. The existing breadth_first logic is a way of delaying computation to only the best branches, but in this case we may need to introduce a new aggregation to help determine what "best" is, because here "risk" is not a scriptable property found in docs; it is derived from the contents of two buckets (404s vs 200s), neither of which knows about the other.
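For illustration only, here is a rough Python sketch (not Elasticsearch code) of the kind of derived "risk" score described above, assuming we already had per-prefix 404 and 200 counts from two sub-aggregation buckets; the prefixes and numbers are invented:

```python
def risk_score(count_404, count_200):
    # Ratio of "bad" to total traffic for one address-range bucket.
    # A real implementation would probably want smoothing and a
    # minimum-traffic threshold before trusting the ratio.
    total = count_404 + count_200
    return count_404 / total if total else 0.0

# Invented per-prefix counts, as if they came from 404/200 sub-aggregations.
buckets = {
    "121.205":     {"404": 950, "200": 50},
    "121.205.247": {"404": 900, "200": 10},
    "66.249":      {"404": 5,   "200": 4000},
}

# Rank prefixes by derived risk; only the risky ones are worth expanding further.
ranked = sorted(buckets,
                key=lambda p: risk_score(buckets[p]["404"], buckets[p]["200"]),
                reverse=True)
print(ranked)  # most "risky" address ranges first
```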
Of course, with things like risky IP address ranges or directories on your hard drive that are using all your disk space, the real culprits that need to be chased down do not necessarily all sit at the same level in the hierarchy. Progressively expanding all branches of the tree in lock-step, a level at a time, is perhaps not the only approach required here. In some respects I like the idea of an energy-dissipation-based model for exploring large information spaces with finite resources. Using the 'pulsing' model I originally outlined for tackling combinatorial explosions, we could direct pulses of doc streams down whichever branches of the agg tree could do with further inflation, using a prioritisation system. When the time is up we can cut the exploration short, but we have at least directed our efforts down the most promising channels, which could be at different depths in the tree.
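Again only as a hedged sketch rather than a concrete proposal: the prioritised, budgeted exploration could look something like a best-first expansion, where we repeatedly expand the most promising branch (at whatever depth it sits) until the budget runs out. All names and scores below are made up:

```python
import heapq

def best_first_expand(root, children_of, promise, budget):
    # Max-heap via negated scores (heapq is a min-heap).
    frontier = [(-promise(root), root)]
    expanded = []
    while frontier and budget > 0:
        _, node = heapq.heappop(frontier)
        expanded.append(node)
        budget -= 1
        for child in children_of(node):
            heapq.heappush(frontier, (-promise(child), child))
    return expanded  # branches actually explored, possibly at mixed depths

# Toy tree of IP prefixes with invented risk scores as the "promise".
tree = {"": ["121", "66"], "121": ["121.205", "121.9"], "121.205": ["121.205.247"]}
risk = {"": 0.5, "121": 0.9, "66": 0.01, "121.205": 0.95, "121.9": 0.1, "121.205.247": 0.99}

print(best_first_expand("", lambda n: tree.get(n, []), lambda n: risk[n], budget=4))
# -> ['', '121', '121.205', '121.205.247']: effort goes down the risky branch,
#    while '66' and '121.9' are never expanded.
```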
I have made an implementation of this aggregation as a plugin.
You can test it here: https://github.com/opendatasoft/elasticsearch-aggregation-pathhierarchy
@jpountz this was your idea originally. Do you still think it is worth doing?
I haven't seen this come up as much as I initially expected, but I think it can still be interesting indeed. Closing for now; we can re-open in the future if needed.
Something from the forums - https://discuss.elastic.co/t/aggregation-on-a-materialized-path/36519
I'm the author of the forum post, would be neat if you guys can have a look at my problem as it's related to this issue.
@clement-tourriere Does your plugin support ES 2.x?
Just found out that the plugin does not support ES 2.x: https://github.com/opendatasoft/elasticsearch-aggregation-pathhierarchy/issues/3#issuecomment-142402724
Just wondering, any plans to add this type of aggregation?
Hi all,
Thought I'd give you a heads-up on searchkit, which uses nested aggregations to build a hierarchical tree to filter results from. Check it out here: http://demo.searchkit.co/taxonomy
More details can be found here: http://docs.searchkit.co/stable/docs/components/navigation/hierarchical-refinement-filter.html
I'm late to the party, but some months ago, with the help of @clement-tourriere at https://github.com/jprante/elasticsearch-aggregations/pull/1, it was possible to port the path hierarchy approach to the ES 2.x aggregation framework. With ES 5 now released, I plan to move forward. Please comment at https://github.com/jprante/elasticsearch-aggregations/issues if you have questions about my project or want to contribute.
Hello, any solutions to get this functionality in 5.x?