Elasticsearch: Support uppercase in index names

Created on 7 Apr 2018 · 6 Comments · Source: elastic/elasticsearch

For use cases that have an index name format like:
"user-{base64_user_id}"

Discussion on the forum:
https://discuss.elastic.co/t/why-cant-an-index-start-with-an-uppercase-letter/72435/6

Christian Dahlqvist [Elastic Team Member]

I suspect the reason is due to the fact that the index name historically used to be present as directory names under the data directory. As Elasticsearch can be deployed on file systems that are case-insensitive, this means that case sensitive index names would not work.

David Pilato

Yeah. I agree that now that things have changed, we could potentially support it again.
Not sure if it's really worth it TBH.

:Core/Features/Indices APIs · discuss

All 6 comments

Pinging @elastic/es-core-infra

Historically this arose because we used to use the index name as the name of the directory on disk. Since Elasticsearch can run on case-insensitive filesystems, that would mean a and A would map to the same directory on disk and that's horrible if these are meant to be different indices.

The historical reason is indeed a relic of the past now that we no longer use the index name as the name of the directory on disk; instead, we use an index UUID.

That said, we think that it is trappy that we could have two indices in the cluster whose names differ only by case: a and A. It means the user could be a typo or application bug away from searching the wrong index. That is bad, we don't like that. As such, we are going to reject this feature request. Additionally, we are going to revisit how we handle index names in general (the café problem: the code point sequences 0x63 0x61 0x66 0xe9 and 0x63 0x61 0x66 0x65 0x0301 display identically (small e with acute versus e followed by combining acute accent)) and consider only allowing lowercase alpha, numbers, and a few special characters such as _ and -: #29503
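
To make the two concerns above concrete (names that differ only by case, and names that differ only in composed versus decomposed accents), here is a minimal Python sketch. It is not how Elasticsearch validates index names; `comparison_key` and `find_lookalikes` are hypothetical helpers introduced only for illustration, and the choice of NFC normalization plus case folding is an assumption.

```python
import unicodedata
from collections import defaultdict

# The two spellings of "café" described in the comment above.
precomposed = "caf\u00e9"   # U+0063 U+0061 U+0066 U+00E9
decomposed = "cafe\u0301"   # U+0063 U+0061 U+0066 U+0065 U+0301


def comparison_key(name: str) -> str:
    """Hypothetical helper: NFC-normalize, then case-fold, so that names
    differing only by case or by composed/decomposed form share one key."""
    return unicodedata.normalize("NFC", name).casefold()


def find_lookalikes(names):
    """Return groups of names that collapse to the same comparison key."""
    groups = defaultdict(list)
    for name in names:
        groups[comparison_key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    # The raw strings differ even though they render identically.
    print(precomposed == decomposed)                                   # False
    # After normalization and case folding they compare equal.
    print(comparison_key(precomposed) == comparison_key(decomposed))   # True
    print(find_lookalikes(["a", "A", precomposed, decomposed, "logs"]))
```

A check like this could run on the client side before index creation to flag names that a person would read as identical.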

For your specific case:

For use cases that have index name format like: "user-{base64_user_id}"

We tend to frown upon an index per tenant. It leads to many small indices and we think that is a bad practice.

@xfumihiro For that question, please use the forum.

@jasontedor - There don't appear to be any similar naming restrictions for aliases, so that means that I can have a collection with the name "cars_v1" with the alias "Cars". Then if you write to "cars" it will just create a new collection. Is there any reason not to have the same restriction on alias names, or conversely not allow the same freedom as alias names when naming collections?

@jasontedor Apologies upfront, since I'll disagree with this line of reasoning severely...
It's presumptuous and dictatorial... Trying to guess upfront the types of mistakes people might make is a complex undertaking, and finding a simple answer to that just creates problems.
For example, it is correct that users might make a typo between 'a' and 'A' if they create these indexes by hand.
It is also true that another group of users might not own the index name generation and might be using "business significant" IDs from other systems as index names. These users will be forced to do a toLower() while creating indexes via the API. Additionally, they'll be forced to maintain a mapping between the original and ES IDs.
And I suspect that, over time, the second group of users will be larger than the first.
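
One possible way to avoid both the toLower() step and the separate mapping table is to encode the external ID reversibly using only lowercase characters. The sketch below uses Base32 from Python's standard library in place of Base64. This is an assumption for illustration, not something proposed in the thread: the function names and the "user-" prefix default are hypothetical, and the ID is assumed to be a text string.

```python
import base64


def to_index_suffix(external_id: str) -> str:
    """Encode an ID using only lowercase letters and the digits 2-7."""
    encoded = base64.b32encode(external_id.encode("utf-8")).decode("ascii")
    # Drop the '=' padding and lowercase the Base32 alphabet.
    return encoded.rstrip("=").lower()


def from_index_suffix(suffix: str) -> str:
    """Invert to_index_suffix and recover the original, case-sensitive ID."""
    # Base32 input must be a multiple of 8 characters; restore the padding.
    padding = "=" * (-len(suffix) % 8)
    return base64.b32decode(suffix.upper() + padding).decode("utf-8")


def index_name_for(external_id: str, prefix: str = "user-") -> str:
    """Hypothetical helper: build a lowercase-only index name for an ID."""
    return prefix + to_index_suffix(external_id)


if __name__ == "__main__":
    for original in ("AbC123", "abc123"):
        name = index_name_for(original)
        recovered = from_index_suffix(name[len("user-"):])
        print(original, name, recovered)
```

IDs that differ only by case map to different lowercase names, and the original ID can be recovered from the name alone. The trade-off is length: Base32 output is about 1.6 times the input size, which matters because Elasticsearch limits index names to 255 bytes.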
