Elasticsearch: Support for IPv6 mapping type

Created on 17 Sep 2013  ·  42 Comments  ·  Source: elastic/elasticsearch

Currently I can't use the ip mapping type because I have fields that can contain either IPv4 or IPv6 addresses. Range queries would be really useful, but I can't use them because I have to treat the field as a string to handle the case when it contains an IPv6 value.

Obviously supporting this brings extra hassle: storage would then require 128 bits, and when searching, range queries using IPv6 addresses shouldn't match IPv4 addresses (unless you're using the ::ffff:d.d.d.d notation), and IPv6 addresses shouldn't match IPv4 range queries at all.

(I found this thread where this has been raised previously.)

:Search/Mapping  ·  >feature  ·  high hanging fruit  ·  stalled

All 42 comments

I'd like to convert a postgresql based application to use ES but got hung up on missing this feature, too. The queries are using netmasks/cidrs so just having the IPv6 address as a string won't be "good enough".
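
To illustrate why netmask/CIDR queries need more than string matching: a CIDR block is really an inclusive range over the numeric address values. The following is a minimal sketch using Python's standard `ipaddress` module; the helper name `cidr_bounds` is made up for this example and is not part of Elasticsearch or PostgreSQL.

```python
import ipaddress

def cidr_bounds(cidr):
    """Return the (lowest, highest) address of a CIDR block as integers."""
    net = ipaddress.ip_network(cidr, strict=False)
    return int(net.network_address), int(net.broadcast_address)

# A CIDR membership test becomes a plain numeric range check.
lo, hi = cidr_bounds("2001:db8::/32")
addr = int(ipaddress.ip_address("2001:0db8:0:1::42"))
print(lo <= addr <= hi)
```

Note that the address in the example is written with a leading zero (`0db8`), so a naive string prefix match against `2001:db8` would miss it even though it is inside the block.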

For IP V6, just mark your field as not_analyzed in mapping.

@dadoonet That doesn't make any sense.

Do you mean that you don't understand my answer, or that my answer does not answer your question?

@dadoonet What's the point of the "ip type" if a reasonable answer to supporting IPv6 is "just make it a not analyzed string"? They're not the same thing, I'd hope.

IP type is only for IP v4. Type name should be ipv4 instead of ip.
For ipv6 I don't think a special type is needed. Keeping ipv6 as non tokenized string should do the job.

What do you expect ipv6 content to be converted to?

It could be converted to a number, for instance, and then allow range searches etc similar to the "ipv4" type.

Better yet the "ip type" should "just work" for both (similar to what postgresql does, for example).

There are ways of expressing IPv6 addresses that would likely fail a simple string-based match, the whole '::' expansion for one.
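
A small sketch of the problem, using Python's standard `ipaddress` module (the sample addresses are chosen only for illustration): three spellings of one address are different strings, but compare equal once parsed.

```python
import ipaddress

# Three spellings of the same IPv6 address.
forms = [
    "2001:db8::1",
    "2001:0db8:0000:0000:0000:0000:0000:0001",
    "2001:DB8:0:0:0:0:0:1",
]

as_strings = set(forms)
as_addresses = {ipaddress.ip_address(f) for f in forms}

print(len(as_strings))    # 3: an exact string match treats them as different
print(len(as_addresses))  # 1: parsed, they are the same address
```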

@bodgit Very good point! I'm going to think about it a bit more.

I have an app that stores iPv4/6 addresses as DECIMAL(39,0) in a mysql database which allows for very easy range searching. I wait for the day when ES will support something similar for IPv6 so I can finally use ES for indexing my database.
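
As a rough sketch of the integer approach (not the commenter's actual code, which isn't shown), Python's standard `ipaddress` module can turn either address family into a plain integer. The helper name `ip_to_int` is hypothetical.

```python
import ipaddress

def ip_to_int(text):
    """Parse an IPv4 or IPv6 address and return it as a plain integer."""
    return int(ipaddress.ip_address(text))

# The largest IPv6 address needs 39 decimal digits, hence DECIMAL(39,0).
largest = ip_to_int("ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff")
print(len(str(largest)))

# Range searches are then ordinary numeric comparisons.
print(ip_to_int("2001:db8::1") < ip_to_int("2001:db8::ffff"))
```

One caveat of this sketch: it puts IPv4 and IPv6 values in the same number space without distinguishing the families, so a real schema would need some convention for that (how the app above handles it is not stated).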

When storing IPv6 addresses, I store it as a "fully formatted" IPv6-string, i.e. XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX, regardless of zeros (so, never shortening a segment to less than four digits, and never shortcutting segments with ::).
This way, all IPv6 addresses are fully sortable and searchable, so this should work with ES (using not-analyzed mapping). However, it is very space-consuming when comparing to what an IPv6 address really is, which is 16 bytes (while this becomes 40 bytes...)
Also, if putting IPv4-addresses into this mix, sorting/filtering on a range will lead to problems mixing IPv4 and IPv6.
(This could be solved by using the IPv4-mapped format of IPv6, that is, all IPv4 addresses are stored as IPv6 in the form ::FFFF:XXXX:XXXX, the last four bytes being the IPv4 address.)
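
The fully expanded form described above could be produced along these lines. This is only a sketch: the function name `sortable_form` is invented for the example, and it assumes IPv4 input should be stored IPv4-mapped as suggested.

```python
import ipaddress

def sortable_form(text):
    """Return a fixed-width, lower-case IPv6 string; IPv4 is stored IPv4-mapped."""
    ip = ipaddress.ip_address(text)
    value = int(ip)
    if ip.version == 4:
        value |= 0xFFFF << 32  # ::ffff:a.b.c.d
    hex_digits = format(value, "032x")
    # Eight groups of four hex digits, never shortened and never using "::".
    return ":".join(hex_digits[i:i + 4] for i in range(0, 32, 4))

print(sortable_form("2001:db8::1"))  # 2001:0db8:0000:0000:0000:0000:0000:0001
print(sortable_form("192.0.2.1"))    # 0000:0000:0000:0000:0000:ffff:c000:0201
```

Because every string has the same width and case, lexicographic order matches numeric order, which is what makes a not-analyzed string field usable for sorting and ranges.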

The other approach is to store as a binary field, using 16 bytes, (still storing IPv4-addresses in IPv4-mapped IPv6-format). My approach to this in mysql is actually a BINARY(16) column.
However, this is inconvenient as manually browsing/inspecting the data becomes cumbersome.
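
A sketch of the binary variant, again in Python with the standard `ipaddress` module. The helper names are hypothetical; the second one addresses the inconvenience mentioned above by turning stored bytes back into a readable address.

```python
import ipaddress

def to_binary16(text):
    """Pack an address into 16 bytes, storing IPv4 as IPv4-mapped IPv6."""
    ip = ipaddress.ip_address(text)
    value = int(ip)
    if ip.version == 4:
        value |= 0xFFFF << 32
    return value.to_bytes(16, "big")

def from_binary16(raw):
    """Turn 16 stored bytes back into a readable address for display."""
    ip = ipaddress.IPv6Address(raw)
    # ipv4_mapped is None unless the address is in ::ffff:0:0/96.
    return str(ip.ipv4_mapped or ip)

stored = to_binary16("192.0.2.1")
print(len(stored), from_binary16(stored))
```

Big-endian packing keeps byte-wise comparison consistent with numeric order, so a BINARY(16) column can still be range-searched.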

So; I am also eagerly awaiting ES support for IPv6, storing IPv6-addresses in numeric format, but with support for properly displaying them and accepting query parameters in IP-format.

+1. would like this feature

As @abh pointed out the fact ES is not currently supporting both protocols equally is a show stopper for many applications - to be ported or to be implemented from scratch.

ES is pretty much becoming a de facto standard when it comes to scalable event storage, search and analysis. In my particular use case, and I do not think I am the only one here, I deal with IPv4 just as much as with IPv6. Having both address families under a single, coherent data type is to be desired. Mappings, queries, indexing... would become unified and consequently easier to use for everyone.

@dadoonet I wonder what the reason is for ES supporting IPv4-only data types in the first place. Was it a technical decision due to implementation difficulties, or was it a matter of priorities? Or, on the other hand, was it a consequence of you guys perceiving that ES users did not care about IPv6? Is it at least on your roadmap?

@ioc32 It is on my TODO list for sure! I need to find some quiet time to work on it.

@dadoonet great! Thank you for updating us!

the reason is simple: an ipv4 address can easily be stored as a 64-bit long, which supports range constructs; ipv6 is more complex.

definitely looking forward to this. It'll really round out the ELK stack for feature complete network analysis. thanks!

understood, though for now, if you can get around with prefix checks, you can map the IP as string.

+1 to defending @dadoonet's quiet time. I'd love to see this happen.

Wouldn't it be possible to use fixed-length Lucene binary field types for IPs and use binary sorting? (I read about binary UTF-8 sorting in Lucene, but I lack some skills on the subject.)

It is indeed possible to encode ipv6 ips as binary fields, Lucene doesn't require index terms to be UTF-8 sequences, it can be anything. The challenge here is more that for IPs, we need to support efficient ranges because that's typically how these fields are filtered. Lucene provides support for efficient ranges with numeric fields (see NumericRangeQuery): basically every field gets indexed with different precision levels, and this allows range queries to visit few terms no matter how large the range is (the fewer terms are visited the more efficient queries are). So we would need a similar mechanism for storing ipv6 addresses.
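
The "different precision levels" idea can be pictured with a simplified model. This is an illustrative Python sketch, not Lucene's actual term encoding; `prefix_terms` is a made-up helper that lists one (shift, prefix) pair per level for a value.

```python
def prefix_terms(value, bits=32, step=8):
    """Return the (shift, prefix) pairs indexed for one value, one per precision level."""
    return [(shift, value >> shift) for shift in range(0, bits, step)]

# 10.0.0.1 as a 32-bit integer, indexed at four precision levels.
ipv4_terms = prefix_terms(0x0A000001)
print(ipv4_terms)  # [(0, 167772161), (8, 655360), (16, 2560), (24, 10)]

# A 128-bit value with a 16-bit step would produce eight such terms.
ipv6_terms = prefix_terms(0x20010DB8 << 96, bits=128, step=16)
print(len(ipv6_terms))  # 8
```

Each coarser term stands for a whole block of values, which is what lets a range query match large spans with few terms.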

+1

I see we're still blocked on Lucene's support for BigInt for this. But that ticket hasn't seen any action in a while either.
Any updates for this @clintongormley? IPv6 is becoming a real thing, so this would be really handy :-)

It's a 14 year old protocol. We're well beyond 'real thing' :)

The Lucene issue is stalled indeed, as it proved very hard to integrate... The feature is currently exposed as an experimental postings format which is not supported in terms of backward compatibility.

With small numbers (up to 64 bits) today we have static pre-computed ranges, which is probably fine. For instance for ints (32 bits) we have a default precision step of 8 bits which means that we pre-compute ranges for all numbers that have the same 24, 16 or 8 upper bits (0-256, 256-512, 512-768, ..., 0-65536, 65536-131072, 131072-196608, ..., 0-16777216, 16777216-33554432, 33554432-50331648, ...). Any arbitrary range can be translated to a union of these pre-computed ranges, and this is the way we manage to have fast ranges on numerics.
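
To make the "union of pre-computed ranges" concrete, here is a simplified greedy decomposition in Python. It is a sketch of the idea only and not Lucene's implementation; the function name `split_range` and its block representation are assumptions made for the example.

```python
def split_range(lo, hi, bits=32, step=8):
    """Cover the inclusive range [lo, hi] with aligned blocks.

    Each block is (shift, prefix) and stands for every value v with
    v >> shift == prefix, where shift is a multiple of `step`.
    """
    blocks = []
    while lo <= hi:
        shift = 0
        # Grow the block while it stays aligned at `lo` and fits below `hi`.
        while shift + step < bits:
            size = 1 << (shift + step)
            if lo % size == 0 and lo + size - 1 <= hi:
                shift += step
            else:
                break
        blocks.append((shift, lo >> shift))
        lo += 1 << shift
    return blocks

blocks = split_range(250, 520)
print(len(blocks))  # 16 terms cover all 271 values
```

Here the values 250 to 255 and 512 to 520 are matched individually, while 256 to 511 is matched by the single pre-computed block (8, 1).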

With high numbers of bits, like 128 here, the space-time trade-off becomes tricky I think. For instance with a precision step of 16, we would have to index 8 tokens per value while range queries would still visit hundreds of thousands of terms in the worst-case.
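
The numbers above can be reproduced with a little arithmetic. The sketch below assumes the worst-case bound given in the NumericRangeQuery documentation as I recall it, namely (levels - 1) * (2^step - 1) * 2 + (2^step - 1); treat that formula as an assumption rather than a quote.

```python
def tokens_per_value(bits, step):
    """Number of terms indexed for each value."""
    return bits // step

def max_query_terms(bits, step):
    """Assumed worst-case number of terms a range query has to visit."""
    levels = bits // step
    per_level = (1 << step) - 1
    return (levels - 1) * per_level * 2 + per_level

print(tokens_per_value(128, 16))  # 8
print(max_query_terms(128, 16))   # 983025
print(max_query_terms(32, 8))     # 1785
```

Under that assumption, a 128-bit field with a precision step of 16 could visit close to a million terms in the worst case, compared with fewer than two thousand for 32-bit ints at the default step of 8.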

Given that ipv6 addresses tend to use the lower bytes less, maybe that would be fine, but I'm a bit reluctant to expose a new field type for ipv6 addresses that would not perform well for range queries. An option could be to have a new type for ipv6 addresses that would only support sorting and aggs but not queries, however I'm not sure how useful it would be?

Agreed.
/64's are the smallest allocations that are generally given out, so
searching for a range may not (initially) need more precision than that.
If we see an IPv6 address, we can store the range of the /64 it is in, and
then work up from there?
/64, /32, /16, /8, /4, /2, /1, /0
That's 8 bits there, and from a practical perspective it might be
sufficient. Most end users get a /64, which makes searching in that easy.
ISPs get at least /32 sized blocks.

Most IPv6 addresses allocated today, when converted to decimal, are about 38
bytes. That means 76 bytes (upper and lower bounds) to store each range. So
about 600 bytes of storage required for the precision, per address, in
addition to the ~38 bytes for the address itself. That's quite a lot, but
that's really just the way it is - we can't make these numbers smaller ;-)

If we restrict range searches to at least /64, could this then work out?
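
A sketch of what that proposal would store per address, using Python's standard `ipaddress` module. The prefix list is the one given above; the helper name `enclosing_networks` is invented for illustration.

```python
import ipaddress

PREFIXES = [64, 32, 16, 8, 4, 2, 1, 0]

def enclosing_networks(text):
    """Return the network containing `text` at each of the coarse prefix lengths."""
    return [
        str(ipaddress.ip_network(f"{text}/{prefix}", strict=False))
        for prefix in PREFIXES
    ]

nets = enclosing_networks("2001:db8:1:2::abcd")
print(nets[:3])  # ['2001:db8:1:2::/64', '2001:db8::/32', '2001::/16']
```

A query for a /32 would then be an exact match on the second entry, at the cost of not being able to search at finer granularity than /64.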

(Sent by email in reply to Adrien Grand's comment above, dated Wed, Jul 15, 2015.)

How would this work with a type that handles both IPv4 and IPv6? As I originally stated in my use case, I don't know the address family ahead of time, only that it is "an IP address", so I would prefer a type that can handle both. If that meant storing IPv4 addresses as IPv4-mapped IPv6, then for such addresses you _do_ care more about the less significant bits, as the address is ::ffff:d.d.d.d and so the first 96 bits are always going to be the same.
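
To illustrate the point about the first 96 bits: under the IPv4-mapped convention, an IPv4 block of prefix length n becomes an IPv6 block of prefix length 96 + n. The sketch below uses Python's standard `ipaddress` module; `to_mapped_network` is a hypothetical helper, not an Elasticsearch API.

```python
import ipaddress

def to_mapped_network(cidr):
    """Express an IPv4 or IPv6 CIDR block as an IPv6 network (IPv4 becomes IPv4-mapped)."""
    net = ipaddress.ip_network(cidr, strict=False)
    if net.version == 6:
        return net
    base = int(net.network_address) | (0xFFFF << 32)  # ::ffff:a.b.c.d
    return ipaddress.IPv6Network((base, net.prefixlen + 96))

mapped = to_mapped_network("10.0.0.0/8")
print(mapped.prefixlen)  # 104

# 10.1.2.3 stored as ::ffff:10.1.2.3 falls inside the mapped block,
# while a native IPv6 address does not.
inside = ipaddress.IPv6Address((0xFFFF << 32) | int(ipaddress.IPv4Address("10.1.2.3")))
print(inside in mapped, ipaddress.IPv6Address("2001:db8::1") in mapped)
```

So every IPv4 range query only distinguishes values in the lowest 32 bits, which is exactly the region a /64-only scheme would ignore.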

FWIW, ARIN announced depletion of their free IP pool today:
http://teamarin.net/category/ipv4-depletion/

Our access logs use a combination of IPv6 and IPv4 in the same field so we're in the same situation as @bodgit

The Lucene ticket mentioned above isn't being worked on.
Instead they implemented a different way of doing things, which could enable an ipv6 type:
https://issues.apache.org/jira/browse/LUCENE-5879
But I think it might be up to Elasticsearch to implement that on top of the work they did on the auto-prefix terms?

That's not really true. @mikemccand and @nknize are hard at work, and have been for a long time, adding all kinds of experimental data structures to lucene: to better solve the issues of numeric-like fields, spatial data structures, etc.

Another one that is promising for cases like this is https://issues.apache.org/jira/browse/LUCENE-6697

But there is still work to do, to graduate them from the sandbox: for example (this is not criticism, these guys are iterating and that is how it goes), some of these formats create large files in /tmp during merge. This kind of "sandy" stuff has to be cleaned up before they are production-strength.

Furthermore integrating them is a little tricky, in the past everyone has jumped to build numerics/spatial on top of what lucene already had (things like inverted index structures), and currently I see them still "wedging" the new stuff behind those apis.

I think in order to fix it properly, we have to expand the index format (Codec apis) with abstractions for these kinds of data structures, simple ones we can live with, improve for users over minor releases, and support backwards compatibility for. We can't just shove this stuff out there quickly: exposing these kinds of features means we are committing ourselves to long-term backwards compatibility of the format, that is one reason it takes longer.

I am not really following all that closely, nobody can keep up with those guys, so I might be wrong, but this is just my high level view on the thing. It's not that we are lazy and don't care about IPv6 or anything like that.

Robert, I don't for a moment think you, or anyone working on ES or Lucene
is lazy.
You folks all do incredible work and give it to us for free. We're very
grateful for your efforts.

I think ipv6 is just a big deal to a lot of people, which is why we see so
much interest in this issue, and we're just waiting for the technology to
catch up to our needs :)

(Sent by email in reply to Robert Muir's comment above, dated Tue, Sep 29, 2015.)

+1

+1 This would be extremely helpful

You're in luck. Thx to @rmuir this is getting closer. https://issues.apache.org/jira/browse/LUCENE-7043

Yes it should be closer; I hope ES 5? :)

Fixed via #17746

Thank you for the effort(s).

Thank you for the work on this!

:cake: :tada: 👍

+1
