Elasticsearch: Support IP aggregation by range

Created on 11 Jun 2020 · 13 Comments · Source: elastic/elasticsearch

As per a request in https://github.com/elastic/kibana/issues/68424#issuecomment-641982887, I'm logging this issue in elasticsearch too.

I would like to perform aggregations of documents containing an ip field into CIDR subnets of a specific size.

If I collect the IPs of services talking to each other from a network point of view (e.g. within a datacenter), I can produce top talkers per subnet, top subnets, etc.

Ideally, I would be able to configure the subnet mask and refine it gradually during an investigation (group per /16 first, then /24, etc.).
With raw IP addresses indexed, I could ultimately produce views such as:
Subnet (based on ip_address field) | Avg performance (ms)
------------ | -------------
10.2.0.0/16 | 1.2
10.3.0.0/16 | 1.5
10.4.0.0/16 | 5.8 < any issue?

Thanks,

:AnalyticAggregations >enhancement Analytics

pierrecdn
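
The drill-down described in the issue (group per /16 first, then /24) can be approximated client-side. The following is a minimal sketch using Python's standard `ipaddress` module, not an Elasticsearch feature; the helper names `subnet_of` and `avg_by_subnet` and the sample measurements are invented for illustration.

```python
import ipaddress
from collections import defaultdict
from statistics import mean


def subnet_of(ip: str, prefix: int) -> str:
    """Return the CIDR block of length `prefix` that contains `ip`."""
    # strict=False lets us pass a host address; host bits are masked off.
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


def avg_by_subnet(samples, prefix):
    """Group (ip, latency_ms) pairs by subnet and average the latency."""
    groups = defaultdict(list)
    for ip, latency_ms in samples:
        groups[subnet_of(ip, prefix)].append(latency_ms)
    return {subnet: mean(values) for subnet, values in sorted(groups.items())}


# Hypothetical measurements, invented for illustration only.
samples = [
    ("10.2.1.5", 1.0),
    ("10.2.7.9", 2.0),
    ("10.4.3.1", 6.0),
    ("10.4.3.200", 4.0),
]

print(avg_by_subnet(samples, 16))  # coarse view, one row per /16
print(avg_by_subnet(samples, 24))  # refined view, one row per /24
```

Changing only the `prefix` argument is what makes the refinement cheap here; the request is for the same knob on the server side.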

👍1

Most helpful comment

Discussed this today in our meeting: we generally agreed that this would be good functionality to support. There's not really a good way to do it today, and the ip_range agg is a poor substitute for the behavior. We'd probably want to introduce a dedicated "ip-histogram" aggregator, since it would be sufficiently different from the regular histogram (different bucketing semantics based on IP blocks, IPv4/IPv6 support, etc.).

Not quite clear how/when we'll be able to get to this agg, but leaving open as a valid enhancement request because we'd like to support it! :+1:

polyfractal on 2 Jul 2020

👍4

All 13 comments

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

elasticmachine on 16 Jun 2020

@pierrecdn, it'd help if you could post json for what you are looking for. Best would be a complete recreation including index creation, adding a couple of docs, the search with the aggregation that you want, and the result you'd like to see. That'd tell us specifically what you want and we can talk about it from there.

nik9000 on 16 Jun 2020

I'm not ultra familiar with all the process since a few years, more consuming the service as a Kibana user nowadays, but I'll try to do so:

Index & mappings

$ curl -XPUT -H 'Content-type: application/json' 'http://localhost:9200/my_index/' -d'
{
  "mappings": {
    "properties": {
      "timestamp": { "type": "date" },
      "src_ip": { "type": "ip" },
      "rtt": { "type": "integer" }
    }
  }
}'

Some documents

{
  "timestamp": "2020-06-16T17:00:00Z",
  "src_ip": "2001:db8:1::2",
  "rtt": 11
}

{
  "timestamp": "2020-06-16T17:00:00Z",
  "src_ip": "2001:db8:1::18",
  "rtt": 42
}



{
  "timestamp": "2020-06-16T17:00:00Z",
  "src_ip": "2001:db8:2::10",
  "rtt": 48
}



{
  "aggs": {
    "1": {
      "histogram": {
        "field": "rtt",
        "interval": 10,
        "min_doc_count": 1
      }
    }
  },
  "size": 0
}



$ curl -s -H 'Content-type: application/json' http://localhost:9200/my_index/_search -d @request | jq '.aggregations'
{
  "1": {
    "buckets": [
      {
        "key": 10,
        "doc_count": 1
      },
      {
        "key": 40,
        "doc_count": 2
      }
    ]
  }
}



{
  "aggs": {
    "1": {
      "histogram": {
        "field": "src_ip",
        "interval": 64,
        "min_doc_count": 1
      }
    }
  },
  "size": 0
}



{
  "1": {
    "buckets": [
      {
        "key": "2001:db8:2::/64",
        "doc_count": 1
      },
      {
        "key": "2001:db8:1::/64",
        "doc_count": 2
      }
    ]
  }
}

By chaining aggregations, I can deduce the average RTT per subnet for example. It would unlock some quite nice analytics use-cases.

pierrecdn on 16 Jun 2020
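
To make the requested bucketing concrete, here is a rough client-side emulation over the three sample documents from the comment above. It is only a sketch of the desired semantics, not how Elasticsearch would implement it; the function name `cidr_histogram` and the `avg_rtt` output key are hypothetical.

```python
import ipaddress
from collections import defaultdict

# The three sample documents posted in the thread (timestamp omitted).
docs = [
    {"src_ip": "2001:db8:1::2", "rtt": 11},
    {"src_ip": "2001:db8:1::18", "rtt": 42},
    {"src_ip": "2001:db8:2::10", "rtt": 48},
]


def cidr_histogram(docs, prefix_len):
    """Bucket docs by the CIDR block of `src_ip`, with count and mean RTT."""
    rtts = defaultdict(list)
    for doc in docs:
        # strict=False masks the host bits so any address maps to its block.
        net = ipaddress.ip_network(f"{doc['src_ip']}/{prefix_len}", strict=False)
        rtts[str(net)].append(doc["rtt"])
    return [
        {"key": key, "doc_count": len(values), "avg_rtt": sum(values) / len(values)}
        for key, values in rtts.items()
    ]


for bucket in cidr_histogram(docs, 64):
    print(bucket)
```

The per-bucket average plays the role of the chained sub-aggregation mentioned in the comment.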

> I'm not ultra familiar with all the process since a few years, more consuming the service as a Kibana user nowadays, but I'll try to do so

Like those folks on reddit who say "please excuse my English, It isn't my first language" and then proceed to write more eloquently than I'd ever manage. You even used jq to filter out the important bits!

I think I understand now: you want to make a histogram keyed by cidr blocks.

Thanks!

nik9000 on 16 Jun 2020

> Like those folks on reddit who say "please excuse my English, It isn't my first language"

Actually it is not my first language 😁

> you want to make a histogram keyed by cidr blocks.

See? It is very concise this way 😄.

Thanks!

pierrecdn on 17 Jun 2020

It's worth noting that you can kind of do this with ip_range. It's not the same because you have to specify all the buckets and we'd return the empty ones, but it is the closest thing that we have now.

I'm asking around with folks to see if this is something that we want. Personally I think it'd be useful but I don't know where it falls in our overall list of work to do.

nik9000 on 17 Jun 2020
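
A sketch of what the ip_range workaround involves: every bucket has to be listed up front. The snippet below generates such a request body, assuming the `mask` form of `ranges` that the ip_range aggregation accepts for CIDR blocks. The aggregation name `subnets`, the helper `ip_range_request`, and the parent block `10.2.0.0/22` are placeholders chosen for illustration.

```python
import ipaddress
import json


def ip_range_request(field, parent_cidr, new_prefix):
    """Build an ip_range aggregation body with one range per child subnet."""
    parent = ipaddress.ip_network(parent_cidr)
    ranges = [
        {"mask": str(subnet)}
        for subnet in parent.subnets(new_prefix=new_prefix)
    ]
    return {
        "size": 0,
        "aggs": {"subnets": {"ip_range": {"field": field, "ranges": ranges}}},
    }


# Splitting a /22 into /24 blocks already needs four explicit ranges;
# a /16 split into /24 blocks would need 256.
body = ip_range_request("src_ip", "10.2.0.0/22", 24)
print(json.dumps(body, indent=2))
```

The list grows with the size of the address space being explored, which is why this only works when the area of interest is known in advance.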

Yes, I have actually used this several times already. But the approach is quite different. While it works for investigating specific events or patterns, it cannot help with general-purpose analytics or a drill-down strategy where you don't know in advance where in the IP space the focus should be.

My point, generally speaking, is that we shouldn't lose features but instead only gain more when using ip fields.
Because if I do an ip2int before indexing, I can get the feature (a bit less readable though 😁).

pierrecdn on 18 Jun 2020
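
The ip2int workaround mentioned above relies on simple arithmetic: for IPv4, a /p block spans 2^(32 - p) consecutive integers, so a numeric histogram with that interval buckets by subnet. A minimal sketch of the conversion follows; the helper names are hypothetical and this covers IPv4 only.

```python
import ipaddress


def ip2int(ip: str) -> int:
    """Convert a dotted-quad IPv4 address to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(ip))


def histogram_interval(prefix_len: int) -> int:
    """Number of addresses in an IPv4 block with the given prefix length."""
    return 2 ** (32 - prefix_len)


def bucket_key(ip_int: int, interval: int) -> int:
    """Lower bound of the histogram bucket that contains `ip_int`."""
    return (ip_int // interval) * interval


def key_to_cidr(key: int, prefix_len: int) -> str:
    """Turn a numeric bucket key back into readable CIDR notation."""
    return f"{ipaddress.IPv4Address(key)}/{prefix_len}"


interval = histogram_interval(16)
key = bucket_key(ip2int("10.2.3.4"), interval)
print(interval, key, key_to_cidr(key, 16))
```

The last step, mapping the numeric key back to CIDR notation, is the readability cost the comment refers to.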

For reference, we have a few histograms within Elastic Security (demo) where we allow bucketing by IP, so this enhancement would be great for these use-cases! 🙂

spong on 7 Jul 2020

👍1

I wonder about ipv4 vs ipv6. One way to do it is to make folks specify the subnet mask in ipv6 cidr:

{
  "aggs": {
    "subnet": {
      "ip_histogram": {
        "field": "src_ip",
        "subnet_mask": "/120"
      }
    }
  },
  "size": 0
}

Then the response looks like:

{
  "subnet": {
    "buckets": [
      {
        "key": "::ffff:192.0.2.128/120",
        "doc_count": 1
      },
      {
        "key": "2001:db8:2:...:/120",
        "doc_count": 1
      },
      {
        "key": "2001:db8:1:...:/120",
        "doc_count": 2
      }
    ]
  }
}

But I think lots of folk still think in terms of /24 getting you 192.0.2.128/24.

nik9000 on 7 Jul 2020

Indeed, maybe an ip_family field has to be supported to disambiguate, because a mask alone doesn't make sense.

> But I think lots of folk still think in terms of /24 getting you 192.0.2.128/24

You probably mean 192.0.2.0/24, right?

pierrecdn on 8 Jul 2020
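
The relationship between the two notations can be checked with Python's `ipaddress` module: an IPv4 address embedded as an IPv4-mapped IPv6 address occupies the low 32 bits, so an IPv4 prefix length p corresponds to 96 + p. This is a sketch of that arithmetic only; `to_v6_prefix` is a hypothetical helper and says nothing about how an eventual aggregation would accept masks.

```python
import ipaddress


def to_v6_prefix(v4_prefix: int) -> int:
    """Prefix length of the IPv4-mapped IPv6 block equivalent to an IPv4 /p."""
    return 96 + v4_prefix


# strict=False masks the host bits, so .128 collapses to the block start.
v4_block = ipaddress.ip_network("192.0.2.128/24", strict=False)
mapped_block = ipaddress.ip_network(
    f"::ffff:192.0.2.128/{to_v6_prefix(24)}", strict=False
)

print(v4_block)                                  # the /24 containing 192.0.2.128
print(mapped_block.prefixlen)                    # IPv6 prefix length for that block
print(mapped_block.network_address.ipv4_mapped)  # embedded IPv4 network address
```

Both blocks start at 192.0.2.0, which matches the correction made in the comment above.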

> You probably mean 192.0.2.0/24, right?

Yeah. Sorry.

> Indeed, maybe an ip_family field has to be supported to disambiguate, because a mask alone doesn't make sense.

If only IPv6's mask had a different format or something to make it obvious.

nik9000 on 9 Jul 2020

Would also really like this feature. It would be awesome if I could provide two subnet masks, one for v4 and one for v6, so that I don't have to run two aggregations. Something like:

{
  "aggs": {
    "subnet": {
      "ip_histogram": {
        "field": "src_ip",
        "subnet_mask_v6": "/48",
        "subnet_mask_v4": "/24"
      }
    }
  },
  "size": 0
}
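
The semantics of the two-mask proposal can be sketched client-side: pick the prefix length by address family, then bucket as usual. The parameter names below mirror the proposed `subnet_mask_v4` and `subnet_mask_v6` options, which do not exist in Elasticsearch; `mixed_bucket_key` is a hypothetical helper.

```python
import ipaddress
from collections import Counter


def mixed_bucket_key(ip: str, v4_prefix: int = 24, v6_prefix: int = 48) -> str:
    """Return the CIDR bucket for `ip`, using a per-family prefix length."""
    addr = ipaddress.ip_address(ip)
    prefix = v4_prefix if addr.version == 4 else v6_prefix
    # strict=False masks the host bits so the address maps to its block.
    return str(ipaddress.ip_network(f"{addr}/{prefix}", strict=False))


# Addresses taken from the examples earlier in the thread.
ips = ["192.0.2.128", "2001:db8:1::2", "2001:db8:1::18", "2001:db8:2::10"]
counts = Counter(mixed_bucket_key(ip) for ip in ips)

for key, doc_count in counts.items():
    print(key, doc_count)
```

With one prefix per family, a single pass covers a mixed IPv4/IPv6 field without the ambiguity of a single shared mask.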