Orientdb: [PROPOSAL] - make all the identifiers case sensitive in v 3.0

Created on 27 Apr 2016  ·  28Comments  ·  Source: orientechnologies/orientdb

In v 2.x there are some incoherences in how OrientDB manages case sensitivity of identifiers.
Two examples:

  • class names are case insensitive, but property names are case sensitive
  • class names are case insensitive everywhere, except as out(), in()... arguments

I propose, in v 3.0, to make all the identifiers case sensitive.
Identifiers include:

  • class names
  • property names
  • index names
  • cluster names
  • function names
  • method names
  • aliases
  • named parameters
  • variable names (LET)
enhancement

Most helpful comment

@lvca different RDBMS implementations have different behaviors on this.
Let me add some reasons to have case sensitive implementation:

  • no need for toLowerCase() toUpperCase() in the code, that means less CPU cycles, less memory consumption, less GC
  • no problems with locales (see https://en.wikipedia.org/wiki/Dotted_and_dotless_I)
  • no problems with non-ascii characters
  • compatibility with TinkerPop standard (property names are case sensitive)

All 28 comments

:+1: Without consistency, intuitiveness is unreachable. :smile:

Scott

I vote for the opposite: let's make fields case insensitive, so everything is coherent and is back compatibile. The only things it wouldn't work if somebody created properties 'name' and 'Name' on the same record, but this is not common.

Also identifiers are case insensitive in many SQL grammars too. The RDBMS user is expecting this.

@lvca different RDBMS implementations have different behaviors on this.
Let me add some reasons to have case sensitive implementation:

  • no need for toLowerCase() toUpperCase() in the code, that means less CPU cycles, less memory consumption, less GC
  • no problems with locales (see https://en.wikipedia.org/wiki/Dotted_and_dotless_I)
  • no problems with non-ascii characters
  • compatibility with TinkerPop standard (property names are case sensitive)

btw, case insensitive field names would not be backward compatible

In Mysql the table names and aliases are case sensitive, because the Linux file system is case sensitive. On Windows and OSX (with HFS), they aren't. Column/ Field names are case insensitive. http://dev.mysql.com/doc/refman/5.7/en/identifier-case-sensitivity.html

In PostgreSQL, the identifiers are automatically reduced to lower case, unless they are in double quotes. So, a semi-forced case sensitivity. https://le-gall.net/pierrick/blog/index.php/2006/12/01/90-postgresql-and-case-sensitivity

MSSQL is partially case sensitive. You can't mix cases when creating, but can for the RUD commands. I think I explained that right. https://msdn.microsoft.com/en-us/library/aa290089%28v=vs.71%29.aspx

So, summa summarum, the programmer won't be expecting case insensitivity necessarily. But for sure, he or she would very much expect consistency, which case sensitivity offers. Also, it means we devs don't have to add quotes to queries to be case sensitive. It also means user land must hold up (and can use) a particular case type for their identifier naming convention https://danielmiessler.com/blog/a-list-of-different-case-types/ which I really like. :smile:

Scott

Hi,

I'm for the case sensitive, also because the other protocol we use appart sql are case sensitive (Json,our binary protocol).

In general case sensitive make the engine simpler and faster, on the other side the IDE (studio) should give suggestion in case is defined a property with differnet case of the one in the sql.

+1 for case sensitive

Just to add even more fuel to the flame:

Also identifiers are case insensitive in many SQL grammars too. The RDBMS user is expecting this.

I'm for Luca's approach here :)
And Postgres way (see @smolinari's comment) feels as most common sensetive.

:D It's a hard decision to take, so all the fuel is welcome

Thanks

Luigi

+1 for case-sensitive, for all the reasons outlined already.

Mario

On Wed, Apr 27, 2016 at 10:47 AM, luigidellaquila [email protected]
wrote:

:D It's a hard decision to take, so all the fuel is welcome

Thanks

Luigi


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub
https://github.com/orientechnologies/orientdb/issues/6052#issuecomment-215107479


Mario Cormier

@a-unite - How do you see the semi-forced case sensitivity, as PostreSQL does it, as the best way? What are the advantages and disadvantages?

The only thing I can think of is the ability to use spaces in identifiers with double quotes, which is still possible to do with ODB too. So, ODB could also be completely case sensitive without quotes, and only need double quotes for the spaces in identifiers.

What I personally don't like about the PostgreSQL method is the forcing of all identifiers (without double quotes) to lower case. You send in aClassWithALongNameWhichYouCanRead and get back aclasswithalongnamewhichyoucantread (yeah, PostgreSQL won't add the "t" :smile: ). That's a problem and limits us as developers. It "dumbs down" the API.

Scott

@smolinari,
My main reason is:
If ODB is going to be popular - then we have to provide most convenient way for ordinal users, not only for advanced and highly professional ;) geeks as we are.
This is especially important since we don't have and will not have any possibility to signal user (in schema-less mode), that he has empty result set not because there is no data for his query, but because he wrote one or whatever letters (i.e. AClassWithALongNameWhichYouCanRead.Property) in wrong case.

About identifiers returned in low-case when select query executed - I agree, this is not so good. Sorry, I must misunderstood your comment and didn't check actual Postgres behavior (already forgot how it worked). But I guess, we could store two versions of identifiers - in low case for system use (if it is chipper than conversion and to use them as file names, so there are no problems in transition between operating systems) and actual (as is) version to return for humans to read.

There is still the problem with case conversion if non Latin letters used for identifiers, but I believe, again - it is better to strict them with Latin only symbols, to prevent any possible problems with OS file systems support.

Regards,
Ata

Thanks for the explanation @a-unite.

Scott

+1 for case sensitive for all the reasons reported above and because it allows a better management of unstructured data.

Hi, :+1: for case sensitive . this is orientdb in the new age of databases absolutely :)

👍 for case sensitive everywhere.

Looks like Case Sensitivity is winning! :smile:

Scott

Guys, I'd rather prefer looking at the big picture here. Changing this means that all the current users could have to rewrite most of their queries if they want to upgrade to v3.0 and in some application this could be really expensive. Maybe users won't upgrade just for this reason.

Take a look at the biggest DBMS players:

  • Oracle is case insensitive
  • MySQL change behaviour on Linux and Windows...
  • Postgres is case insensitive
  • Cassandra is case insensitive by default

This change is not about resolving a critical bug or increasing the performance by 2x.

I'm against introducing this big break in compatibility, just for the sake of being more "coherent" (coherent with what?). I can be coherent with the syntax if everything is declared and documented. There is no need to be "uniform" (this is a better word to represent the intent) across all the names in every section of OrientDB and plugins.

Hi Luca,

I understand your point and it makes sense, but let's define an expected behavior for class name equality.

  1. How is a class name stored in the db? I'd say the original name that is defined at creation time.
  2. How is class name equality calculated? The String.equalsIgnoreCase() method does not accept a locale, the toUpperCase() and toLowerCase() can be used with a locale. Which locale? The db locale or the machine locale? What does it happen when you move your DB to another machine or when you manually change the locale?

Thanks

Luigi

  1. Agreed (so like now)

About (2), this is the implementation of Java String:

    public boolean equalsIgnoreCase(String anotherString) {
        return (this == anotherString) ? true
                : (anotherString != null)
                && (anotherString.value.length == value.length)
                && regionMatches(true, 0, anotherString, 0, value.length);
    }

So we could use regionMatches() that supports case insensitivity:

public boolean regionMatches(boolean ignoreCase, int toffset, String other, int ooffset, int len)

But this doesn't support Locale that it should be always what you defined in the database (everywhere), or if not defined, the default that is the system Locale. Behind Locale there are a lot of edge cases. Look at This and The "turkish case".

Hi guys,
The only problem is performance, equalsIgnoreCase is very slow.
So if we keep case-insensitive behavior, I think we should normalize names
of classes, clusters and so on before they will go inside of database
engine.
The problem now that we do not know whether all those names normalized or
not and as a result we have tons of toUppperCase toLowCase, equalsIgnorCase
over the system.
We should be sure that if I take clasName inside of storage it always in
lower case for example.

Also, I support Luca because the implementation of the case sensitive
approach will be a disaster for our existing users.

On Thu, Jul 7, 2016 at 4:55 PM Luca Garulli [email protected]
wrote:

  1. Agreed (so like now)

About (2), this is the implementation of Java String:

public boolean equalsIgnoreCase(String anotherString) {
    return (this == anotherString) ? true
            : (anotherString != null)
            && (anotherString.value.length == value.length)
            && regionMatches(true, 0, anotherString, 0, value.length);
}

So we could use regionMatches() that supports case insensitivity:

public boolean regionMatches(boolean ignoreCase, int toffset, String other, int ooffset, int len)

But this doesn't support Locale that it should be always what you defined
in the database (everywhere), or if not defined, the default that is the
system Locale. Behind Locale there are a lot of edge cases. Look at This
http://blog.eyallupu.com/2010/08/four-things-to-remember-about.html and The
"turkish case"
http://mattryall.net/blog/2009/02/the-infamous-turkish-locale-bug.


You are receiving this because you are subscribed to this thread.

Reply to this email directly, view it on GitHub
https://github.com/orientechnologies/orientdb/issues/6052#issuecomment-231084474,
or mute the thread
https://github.com/notifications/unsubscribe/AAGaahyK4L4VQv5jI4bV19ZeuyXSEj35ks5qTQU3gaJpZM4IQtZB
.

Best regards,
Andrey Lomakin, R&D lead.
OrientDB Ltd

twitter: @Andrey_Lomakin
linkedin: https://ua.linkedin.com/in/andreylomakin
blogger: http://andreylomakin.blogspot.com/

@laa, exactly!
My proposal was to store low case _class_ names in some new structure (this means new database format though), but progressively.
That means, that every create class or alter class touching class name or superClass name should have store low case name _in addition_ to its original version. Original case version could be used in all query results, though, as it is now.

When read - we should check, if there is such a "lower case" data in the class (means database was imported, or this class was altered/created with recent API version), then we could use this value for quick comparison (still have to make provided query value lowercased, of course).
By version 3.0 we could switch database version completely, making sure, that DB was re-imported and classes|clusters all have the needed (system) lowcase field.

@luigidellaquila about the speed, I'm still wonder about the next case:
let's suppose our file system is case insensitive (it is easy - try to make folder "name" and folder "Name" in the same place in Windows) - then we will have to store clusters and classes in some special way to distinguish them. Lets say name.cpm and Name_.cpm respectively. Then you either will need to store that mapping somewhere (and lookup for it every time), or resolve that problem with some other quite complicated (if not expensive) method.

well i think we could separe a bit the sections, i know that have only one behaviour would be better, but not sure if is doable.

The most important case for performance and consistency overall is the "properties" also called "field names"
As today we have this beaviour:
Schema properties are case insensitive
document fields are case sensitive
sql fields are case sensitive

to unform the not consistent current behaviour we have two way, make everything case sensitive or case insesitive.

case insensitive:
Advantages:

  • maybe more user friendly in case of misstyping ecc.
    Disadvantges:
  • incompatibility for old schemaless case sensitive
  • slow down of filed access in odocument
  • impossible to implement binary query
  • issue on locale corner cases

case sensitive:
Advantages:

  • locale independecy
  • ability to do binary query ecc
  • compatible with current ODocument api
    Disadvantges:
  • maybe less user friendly
  • incompatibility with current schema properties api

for the properties/fields it seems more convinient to do case sensitive.

the other cases as far as i got does not make much difference for case sensistive and insensitive, you have equal (opinion based)vantages on usability on case insensitive, no huge performance differences, and some small issues based on the locale.

so from what i listed here and from my point of view is better have case sensitive properties, and maybe does really matter the rest.
It may only have meaning keep just one way to work to be more user friendly.

my 2 cents bye

hi @a-unite,

I understand your point, you said the engine should work lower case (or case sensitive) and should show the result to the user as the case was configured with.

well i'm not sure we want to keep both version of the field name in the database, and the mapping later on may be expensive, due to all "SchemaLess" cases.

Hi guys,
If it is related only to fields and because most of the areas use case
sensitive fields IMHO better to make them case sensitive.

On Thu, Jul 7, 2016 at 5:56 PM tglman [email protected] wrote:

hi @a-unite https://github.com/a-unite,

I understand your point, you said the engine should work lower case (or
case sensitive) and should show the result to the user as the case was
configured with.

well i'm not sure we want to keep both version of the field name in the
database, and the mapping later on may be expensive, due to all
"SchemaLess" cases.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/orientechnologies/orientdb/issues/6052#issuecomment-231103232,
or mute the thread
https://github.com/notifications/unsubscribe/AAGaak9KpHV1f9SvbcVAFau6JBUedIf5ks5qTROkgaJpZM4IQtZB
.

Best regards,
Andrey Lomakin, R&D lead.
OrientDB Ltd

twitter: @Andrey_Lomakin
linkedin: https://ua.linkedin.com/in/andreylomakin
blogger: http://andreylomakin.blogspot.com/

@tglman, thanks. You are right - I'd prefer to have case insensitive case for fields too (I see too often issues here, which are related to mistyping or misunderstanding of case sensitivity for field names).

But as I understand, we are discussing now classcluster names only and I agree, that for field names we have to count on back compatibility anyway.

Edit: sorry, we are discussing "coherency" actually. So this (sensitive or insensitive in in()/out()) should be addressed anyway:

class names are case insensitive everywhere, except as out(), in()... arguments

Changing this means that all the current users could have to rewrite most of their queries if they want to upgrade to v3.0 and in some application this could be really expensive.

This is assuming that all of the current users change the case of their class name identifiers, when they write their queries. Why would they do that? I can't say it is completely false, but I can say, it probably isn't completely true that they ALL do that. So, I'd say going to case sensitivity for classes won't mean all the users have to rewrite all of their queries. It will be more like some of the users will have to correct their misspellings on some of their queries. That doesn't sound as painful as you put it. :wink:

This might get a grumble from you guys, but couldn't we do it like MySQL and have a global setting for what is wanted by the user? That way, all bases are covered.

From the Mysql docs:
Value Meaning
0 Table and database names are stored on disk using the lettercase specified in the CREATE TABLE or CREATE DATABASE statement. Name comparisons are case sensitive. You should not set this variable to 0 if you are running MySQL on a system that has case-insensitive file names (such as Windows or OS X). If you force this variable to 0 with --lower-case-table-names=0 on a case-insensitive file system and access MyISAM tablenames using different lettercases, index corruption may result.
1 Table names are stored in lowercase on disk and name comparisons are not case sensitive. MySQL converts all table names to lowercase on storage and lookup. This behavior also applies to database names and table aliases.
2 Table and database names are stored on disk using the lettercase specified in the CREATE TABLE or CREATE DATABASE statement, but MySQL converts them to lowercase on lookup. Name comparisons are not case sensitive. This works only on file systems that are not case sensitive! InnoDB table names are stored in lowercase, as for lower_case_table_names=1.

Scott

I was going through the top most reacted posts for ODB and this was one of them and I'd like to pull back my suggestion above. Don't do it the MySQL way, as it only adds way too much complexity.

I am still for case sensitivity and I feel backwards compatibility won't or rather shouldn't be a concern. We are now talking 3.0, which is supposed to be breaking old ways to do them better. I feel going full case sensitive is a move forward, as can be seen by the majority of sentiment shown in this thread.

I'll throw in one other reason, case sensitivity is necessary to be able to support full NoSQL. As someone else mentioned, JSON is case sensitive. I really feel ODB needs to break with the RDBMS world as much as possible. Doing too many things like RDBMSes in an effort to attract people stuck in that paradigm is bending over backwards for all the wrong reasons. It holds ODB back from doing smarter things, if you ask me. ODB needs to be a leader in the NoSQL fold, but doing relationships much better the rest of them. That is the real trick to winning hearts like mine. 😄

Scott

Was this page helpful?
0 / 5 - 0 ratings

Related issues

MartinBrugnara picture MartinBrugnara  ·  35Comments

janjilek picture janjilek  ·  33Comments

xavier66 picture xavier66  ·  40Comments

dmitrytokarev picture dmitrytokarev  ·  25Comments

saeedtabrizi picture saeedtabrizi  ·  24Comments