Elasticsearch: Support for IPv6 mapping type

Created on 17 Sep 2013 · 42Comments · Source: elastic/elasticsearch

Currently I can't use the ip mapping type because I have fields that can hold either IPv4 or IPv6 addresses. Range queries would be really useful, but I can't make use of them because I have to treat the field as a string to handle the case where it contains an IPv6 value.

Obviously supporting IPv6 causes extra hassle: storage would then require 128 bits, and when searching, range queries using IPv6 addresses shouldn't match IPv4 addresses (unless you're using the ::ffff:d.d.d.d notation), and IPv6 addresses shouldn't match IPv4 range queries at all.

(I found this thread where this has been raised previously)

:Search/Mapping >feature high hanging fruit stalled

bodgit

👍3

Most helpful comment

You're in luck. Thx to @rmuir this is getting closer. https://issues.apache.org/jira/browse/LUCENE-7043

nknize on 24 Feb 2016

👍4

All 42 comments

I'd like to convert a postgresql based application to use ES but got hung up on missing this feature, too. The queries are using netmasks/cidrs so just having the IPv6 address as a string won't be "good enough".

abh on 26 Nov 2013

For IP V6, just mark your field as not_analyzed in mapping.

dadoonet on 26 Nov 2013

@dadoonet That doesn't make any sense.

abh on 26 Nov 2013

Do you mean that you don't understand my answer, or that my answer does not answer your question?

dadoonet on 26 Nov 2013

@dadoonet What's the point of the "ip type" if a reasonable answer to supporting IPv6 is "just make it a not analyzed string"? They're not the same thing, I'd hope.

abh on 26 Nov 2013

IP type is only for IP v4. Type name should be ipv4 instead of ip.
For ipv6 I don't think a special type is needed. Keeping ipv6 as non tokenized string should do the job.

What do you expect ipv6 content to be converted to?

dadoonet on 26 Nov 2013

It could be converted to a number, for instance, and then allow range searches etc similar to the "ipv4" type.

Better yet the "ip type" should "just work" for both (similar to what postgresql does, for example).

abh on 26 Nov 2013

There are ways of expressing IPv6 addresses that would likely fail a simple string-based match, the whole '::' expansion for one.

bodgit on 26 Nov 2013

@bodgit Very good point! I'm going to think about it a bit more.

dadoonet on 27 Nov 2013

I have an app that stores iPv4/6 addresses as DECIMAL(39,0) in a mysql database which allows for very easy range searching. I wait for the day when ES will support something similar for IPv6 so I can finally use ES for indexing my database.

lifo101 on 17 Dec 2013
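
The numeric approach described in the comments above (convert the address to a number, then do range searches) can be sketched with Python's standard `ipaddress` module. This is only an illustration of the idea, not anything Elasticsearch does; the helper names `ip_to_int` and `cidr_bounds` are made up for this sketch, and the addresses come from the documentation prefix 2001:db8::/32.

```python
import ipaddress


def ip_to_int(text):
    """Return the address as a plain integer (32-bit for IPv4, 128-bit for IPv6)."""
    return int(ipaddress.ip_address(text))


def cidr_bounds(cidr):
    """Return the (lowest, highest) integer covered by a CIDR block."""
    net = ipaddress.ip_network(cidr, strict=False)
    return int(net.network_address), int(net.broadcast_address)


# A netmask/CIDR query then becomes an ordinary numeric range check.
low, high = cidr_bounds("2001:db8::/32")
print(low <= ip_to_int("2001:db8::1") <= high)   # True
print(low <= ip_to_int("2001:db9::1") <= high)   # False

# The largest IPv6 value has 39 decimal digits, which is why DECIMAL(39,0) fits.
print(len(str(2**128 - 1)))                      # 39
```

Note that in this naive form an IPv4 address and a low-valued IPv6 address can map to the same integer, so a mixed field would still need the address family stored separately or IPv4 mapped into the IPv6 space.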

When storing IPv6 addresses, I store it as a "fully formatted" IPv6-string, i.e. XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX, regardless of zeros (so, never shortening a segment to less than four digits, and never shortcutting segments with ::).
This way, all IPv6 addresses are fully sortable and searchable, so this should work with ES (using not-analyzed mapping). However, it is very space-consuming when comparing to what an IPv6 address really is, which is 16 bytes (while this becomes 40 bytes...)
Also, if putting IPv4-addresses into this mix, sorting/filtering on a range will lead to problems mixing IPv4 and IPv6.
(This could be solved by using the IPv4-mapped format of IPv6, that is, all IPv4 addresses are stored as IPv6 as ::FFFF:XXXX:XXXX, the last four bytes being the IPv4 address.)

The other approach is to store as a binary field, using 16 bytes, (still storing IPv4-addresses in IPv4-mapped IPv6-format). My approach to this in mysql is actually a BINARY(16) column.
However, this is inconvenient as manually browsing/inspecting the data becomes cumbersome.

So; I am also eagerly awaiting ES support for IPv6, storing IPv6-addresses in numeric format, but with support for properly displaying them and accepting query parameters in IP-format.

jvbrandis on 17 Jan 2014
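
The two storage forms jvbrandis describes (a fully expanded string and a 16-byte binary value, with IPv4 stored in IPv4-mapped form) might look like this in Python. It is a sketch under the assumption that the standard `ipaddress` module is acceptable for normalisation; `to_ipv6`, `string_key` and `binary_key` are hypothetical names, not part of any ES or MySQL API.

```python
import ipaddress


def to_ipv6(text):
    """Parse an IPv4 or IPv6 address; IPv4 is mapped into ::ffff:0:0/96."""
    addr = ipaddress.ip_address(text)
    if addr.version == 4:
        addr = ipaddress.IPv6Address((0xFFFF << 32) | int(addr))
    return addr


def string_key(text):
    """Fully expanded form: eight groups of four lowercase hex digits, no '::'."""
    return to_ipv6(text).exploded


def binary_key(text):
    """16-byte big-endian form, suitable for a BINARY(16)-style column."""
    return to_ipv6(text).packed


# Different spellings of the same address collapse to one key.
print(string_key("2001:db8::1") == string_key("2001:0DB8:0:0:0:0:0:1"))
print(string_key("192.0.2.1"))
print(len(binary_key("192.0.2.1")))
```

Because every key has the same width, plain string or byte ordering matches numeric ordering, which is what makes the expanded form sortable.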

+1. would like this feature

cpdean on 28 Mar 2014

As @abh pointed out, the fact that ES does not currently support both protocols equally is a show stopper for many applications - to be ported or to be implemented from scratch.

ES is pretty much becoming a de facto standard when it comes to scalable event storage, search and analysis. In my particular use case, and I do not think I am the only one here, I deal with IPv4 just as much as with IPv6. Having both address families under a single, coherent data type is to be desired. Mappings, queries, indexing... would become unified and consequently easier to use for everyone.

@dadoonet I wonder what the reason is for ES supporting IPv4-only data types in the first place. Was it a technical decision due to implementation difficulties, or was it a matter of priorities? Or, on the other hand, was it a consequence of you guys perceiving that ES users did not care about IPv6? Is it at least on your roadmap?

ioc32 on 29 Mar 2014

@ioc32 It is on my TODO list for sure! I need to find some quiet time to work on it.

dadoonet on 29 Mar 2014

@dadoonet great! Thank you for updating us!

ioc32 on 29 Mar 2014

the reason is simple: ipv4 can easily be translated to a 64bit long, which supports range constructs; ipv6 is more complex.

kimchy on 29 Mar 2014

definitely looking forward to this. It'll really round out the ELK stack for feature complete network analysis. thanks!

cpdean on 31 Mar 2014

understood, though for now, if you can get around with prefix checks, you can map the IP as string.

kimchy on 31 Mar 2014

+1 to defending @dadoonet's quiet time. I'd love to see this happen.

xaque208 on 1 Apr 2014

Wouldn't it be possible to use fixed length lucene binary field types for ips and use binary sorting (I read about binary utf8 sorting in lucene, but I lack some skills on the subject)?

Dunaeth on 3 Apr 2014

It is indeed possible to encode ipv6 ips as binary fields, Lucene doesn't require index terms to be UTF-8 sequences, it can be anything. The challenge here is more that for IPs, we need to support efficient ranges because that's typically how these fields are filtered. Lucene provides support for efficient ranges with numeric fields (see NumericRangeQuery): basically every field gets indexed with different precision levels, and this allows range queries to visit few terms no matter how large the range is (the fewer terms are visited the more efficient queries are). So we would need a similar mechanism for storing ipv6 addresses.

jpountz on 3 Apr 2014

seti123 on 3 Apr 2014

Depends on https://issues.apache.org/jira/browse/LUCENE-5596

clintongormley on 11 Jul 2014

I see we're still blocked on Lucene's support for BigInt for this. But that ticket hasn't seen any action in a while either.
Any updates for this @clintongormley? IPv6 is becoming a real thing, so this would be really handy :-)

avleen on 14 Jul 2015

It's a 14 year old protocol. We're well beyond 'real thing' :)

xaque208 on 15 Jul 2015

The Lucene issue is stalled indeed, as it proved very hard to integrate... The feature is currently exposed as an experimental postings format which is not supported in terms of backward compatibility.

With small numbers (up to 64 bits) today we have static pre-computed ranges, which is probably fine. For instance for ints (32 bits) we have a default precision step of 8 bits which means that we pre-compute ranges for all numbers that have the same 24, 16 or 8 upper bits (0-256, 256-512, 512-768, ..., 0-65536, 65536-131072, 131072-196608, ..., 0-16777216, 16777216-33554432, 33554432-50331648, ...). Any arbitrary range can be translated to a union of these pre-computed ranges, and this is the way we manage to have fast ranges on numerics.

With high numbers of bits, like 128 here, the space-time trade-off becomes tricky I think. For instance with a precision step of 16, we would have to index 8 tokens per value while range queries would still visit hundreds of thousands of terms in the worst-case.

Given that ipv6 addresses tend to use the lower bytes less, maybe that would be fine, but I'm a bit reluctant to expose a new field type for ipv6 addresses that would not perform well for range queries. An option could be to have a new type for ipv6 addresses that would only support sorting and aggs but not queries, however I'm not sure how useful it would be?

jpountz on 15 Jul 2015
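
The pre-computed-range scheme jpountz describes above can be sketched in a few lines of Python. This is an illustrative reimplementation of the idea only, not Lucene's code; the names `trie_terms`, `split_range` and `visited_terms` are invented for this sketch.

```python
def trie_terms(value, bits=32, step=8):
    """Terms indexed for one value: one (shift, prefix) pair per precision level."""
    return [(shift, value >> shift) for shift in range(0, bits, step)]


def split_range(lo, hi, bits=32, step=8):
    """Cover the inclusive range [lo, hi] with pre-computed prefix ranges.

    Returns (shift, first, last) triples: the terms at precision level
    `shift` whose values span first..last inclusive.
    """
    pieces = []
    shift = 0
    while True:
        low_bits = (1 << shift) - 1            # bits already dropped at this level
        diff = 1 << (shift + step)
        mask = ((1 << step) - 1) << shift
        has_lower = (lo & mask) != 0           # lo is not aligned to the next level
        has_upper = (hi & mask) != mask        # hi does not end on a boundary
        next_lo = ((lo + diff) if has_lower else lo) & ~mask
        next_hi = ((hi - diff) if has_upper else hi) & ~mask
        if shift + step >= bits or next_lo > next_hi:
            # Nothing left for a coarser level: finish at this precision.
            pieces.append((shift, lo, hi | low_bits))
            break
        if has_lower:
            pieces.append((shift, lo, lo | mask | low_bits))
        if has_upper:
            pieces.append((shift, hi & ~mask, hi | low_bits))
        lo, hi = next_lo, next_hi
        shift += step
    return pieces


def visited_terms(pieces):
    """Number of index terms a query over these pieces would have to visit."""
    return sum(((last - first) >> shift) + 1 for shift, first, last in pieces)


# 32-bit ints with an 8-bit step: 4 terms per value, few terms per query.
print(len(trie_terms(3232235777)))
print(visited_terms(split_range(1000, 5_000_000)))

# 128-bit values with a 16-bit step: 8 terms per value, and a worst-case
# range that touches nearly every term at every level.
print(len(trie_terms(1, bits=128, step=16)))
print(visited_terms(split_range(1, 2**128 - 2, bits=128, step=16)))
```

For the 128-bit worst case the last count lands in the hundreds of thousands, in line with the estimate above, while the 32-bit query stays far smaller.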

Agreed.
/64's are the smallest allocations that are generally given out, so
searching for a range may not (initially) need more precision than that.
If we see an IPv6 address, we can store the range of the /64 it is in, and
then work up from there?
/64, /32, /16, /8, /4, /2, /1, /0
That's 8 bits there, and from a practical perspective it might be
sufficient. Most end users get a /64, which makes searching in that easy.
ISPs get at least /32 sized blocks.

Most IPv6 addresses allocated today, when converted to decimal, are about 38
bytes. That means 76 bytes (upper and lower bounds) to store each range. So
about 600 bytes of storage required for the precision, per address, in
addition to the ~38 bytes for the address itself. That's quite a lot, but
that's really just the way it is - we can't make these numbers smaller ;-)

If we restrict range searches to at least /64, could this then work out?

avleen on 16 Jul 2015
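
The fixed list of prefix lengths avleen proposes above (/64, /32, /16, ... /0) could be derived for an address as follows. This is a sketch of that proposal using Python's standard `ipaddress` module; `enclosing_networks` and `PREFIX_LENGTHS` are hypothetical names, and nothing here reflects how ES or Lucene actually index values.

```python
import ipaddress

# Precision levels taken from the comment above.
PREFIX_LENGTHS = (64, 32, 16, 8, 4, 2, 1, 0)


def enclosing_networks(text, prefix_lengths=PREFIX_LENGTHS):
    """Return the network containing the address at each prefix length."""
    addr = ipaddress.IPv6Address(text)
    return [
        ipaddress.ip_network(f"{addr}/{length}", strict=False)
        for length in prefix_lengths
    ]


for net in enclosing_networks("2001:db8:1:2:3:4:5:6"):
    # Lower and upper bound of each pre-computed range, as integers.
    print(net, int(net.network_address), int(net.broadcast_address))
```

A range query restricted to these prefix lengths would then only need to match one stored network per level rather than walk individual addresses.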

How would this work with a type that handles both IPv4 and IPv6? As I originally stated in my use case I don't know the address family ahead of time, only that it is "an IP address" so I would prefer a type that can handle both. If that meant storing IPv4 addresses as IPv6-mapped it means that for such addresses, you _do_ care about the lesser significant bits more as the address is ::ffff:d.d.d.d and so the first 96 bits are always going to be the same.

bodgit on 16 Jul 2015

FWIW, ARIN announced depletion of their free IP pool today:
http://teamarin.net/category/ipv4-depletion/

avleen on 25 Sep 2015

Our access logs use a combination of IPv6 and IPv4 in the same field so we're in the same situation as @bodgit

hanej on 28 Sep 2015

The Lucene ticket mentioned above isn't being worked on.
Instead they implemented a different way of doing things, which could enable an ipv6 type:
https://issues.apache.org/jira/browse/LUCENE-5879
But I think it might be up to Elasticsearch to implement that on top of the work they did on the auto-prefix terms?

avleen on 30 Sep 2015

That's not really true. @mikemccand and @nknize are hard at work, and have been for a long time, adding all kinds of experimental data structures to lucene: to better solve the issues of numeric-like fields, spatial data structures, etc.

Another one that is promising for cases like this is https://issues.apache.org/jira/browse/LUCENE-6697

But there is still work to do, to graduate them from the sandbox: for example (this is not criticism, these guys are iterating and that is how it goes), some of these formats create large files in /tmp during merge. This kind of "sandy" stuff has to be cleaned up before they are production-strength.

Furthermore integrating them is a little tricky, in the past everyone has jumped to build numerics/spatial on top of what lucene already had (things like inverted index structures), and currently I see them still "wedging" the new stuff behind those apis.

I think in order to fix it properly, we have to expand the index format (Codec apis) with abstractions for these kinds of data structures, simple ones we can live with, improve for users over minor releases, and support backwards compatibility for. We can't just shove this stuff out there quickly: exposing these kinds of features means we are committing ourselves to long-term backwards compatibility of the format, that is one reason it takes longer.

I am not really following all that closely, nobody can keep up with those guys, so I might be wrong, but this is just my high level view on the thing. It's not that we are lazy and don't care about IPv6 or anything like that.

rmuir on 30 Sep 2015

Robert, I don't for a moment think you, or anyone working on ES or Lucene
is lazy.
You folks all do incredible work and give it to us for free. We're very
grateful for your efforts.

I think ipv6 is just a big deal to a lot of people, which is why we see so
much interest in this issue, and we're just waiting for the technology to
catch up to our needs :)
