Nim: strutils/isUpperAscii and unicode/isUpper consider space, punctuations, numbers as "lowercase"

Created on 5 Jun 2018 · 36Comments · Source: nim-lang/Nim

Test:

import strutils except isUpper
import unicode
echo "A B".isUpperAscii()  # Returns false
echo "A B".isUpper()  # From unicode, Returns false
echo "AB?!".isUpper()  # From unicode, Returns false
echo "1, 2, 3 GO!".isUpper() # From unicode, Returns false

Just for precedent, Python, Emacs Lisp return a "true" for upper-case checking of "A B", "AB?!" and "1, 2, 3 GO!".

Python isupper vs Nim isUpper

Unicode

Source

kaushalmodi

Most helpful comment

Well since I got tagged here, I will also provide my two cents. I don't like the idea to copy Python semantics just for the sake of copying python semantics. If Python has reasonable design decisions behind it's library and those dosign decisions do not stand in conflict with priorities that you would want in a statically typed compiled language, then yes I think it is a good idea to design the nim library around an established system.

So please provide different reasoning than "this is how it is done in python". For me isUpper(' ') == falseis totally fine, because a space is neither upper case nor lower case. So also isUpper("A B") is false, because not every character in that string is strictly upper case. What you want to check semantically is "does not contain lower case letters". so maybe something like this is what you are looking for: {'a'..'z'} notin "A B". Maybe you want to have some predefined sets, too. So lowercaseASCII notin "A B".

krux02 on 10 Jun 2018

👍4

All 36 comments

Just to add, the isUpper documentation says:

Returns true iff s contains all upper case unicode characters.

I believe that all non-alpha characters should be skipped from the check for "upper case unicode characters".

.. because with the current logic.. both of the below return false:

echo "A B".isUpper()
echo "a b".isLower()

.. which seems counterintuitive.

It feels like the isUpper, isLower are conflating their behavior with that of isAlpha.

kaushalmodi on 5 Jun 2018

Seems right to me, only alphabetic characters can be either uppoer or lower

andreaferretti on 6 Jun 2018

all non-alpha characters should be skipped from the check for "upper case unicode characters".

Why?

echo "A B".isUpper()
echo "a b".isLower()

Your tests contains space, it's non alpha. And this the right logic.

data-man on 6 Jun 2018

@data-man That's what I said.. the isUpper/isLower checks are conflating with isAlpha.

Bear with me.. here is the reasoning..

Consider the current isUpper be named isUpperX
Consider the behavior I am suggesting be named isUpperY

Then currently str.isUpperX is behaving like str.isAlpha and str.isUpperY. We need to remove the isAlpha check out of isUpper. All of this applies to isLower too.

When we are dealing with real world text, parsing docs, web pages, we rarely face all alphabetical characters strings. We normally deal with sentences that usually contain at least a space. In that case both isAlpha and isUpper will always be false which is not very useful.

If the implementation is changed to isUpperY, getting the current behavior is as simple as anding with isAlpha. But deriving isUpperY from isUpperX as I need to do today is very hacky.

kaushalmodi on 6 Jun 2018

Another way to think about this.. you would do an uppercase/lowercase check only on alphabets (or characters that can have upper and lower cases). You wouldn't do that check on spaces (and numbers, full stops, etc) as they don't have cases. So such characters should not affect the isUpper, isLower, etc.

Crude analogy follows..

It's like asking a 3-year old if he likes Guinness (beer), and if he says no, you don't order that for the whole table of 20 adults.

That space is being that 3-year old in that "A B".isUpper ().. it's giving vote for something it's not eligible for.

kaushalmodi on 6 Jun 2018

We normally deal with sentences that usually contain at least a space.

No. Passwords and usernames almost always can't contain spaces.
You can implement your MyRightIsUpper for your needs.

data-man on 6 Jun 2018

No. Passwords and usernames almost always can't contain spaces.

I agree and also almost is the key there. And consider how few cases of parsing such space-less, digit-less, punctuation-less are out there in the huge text parsing world. As I said, you can still get the current behavior by simply anding the proposed isUpperY with isAlpha.

kaushalmodi on 6 Jun 2018

In the upcoming unicode module (RFC #7902) will be possible this:

let myRunes = alphas + spaces + digits + '.' + ','
if check(myString, myRunes):
...

data-man on 6 Jun 2018

@data-man why did you close this issue? New unicode module wasn't implemented yet

Yardanico on 6 Jun 2018

Well since these have been designed with Python compat in mind, they should follow Python's behaviour (as weird as that is to me).

Araq on 6 Jun 2018

@Yardanico
Do you want to do double work? Well, go ahead!

data-man on 6 Jun 2018

@data-man what double-work? You would just make a PR with "fixes #7963", and as I can tell we didn't fully decide for the Unicode module RFC yet.

Yardanico on 6 Jun 2018

@data-man Just from language semantics, wouldn't you agree that the following statement is all uppercase: "I AM USING ALL CAPS, BUT JUST AS AN EXAMPLE, NOT BEING RUDE."

If someone wants to parse space-less passwords, relying on isUpper for "space-less check " sounds wrong.. they should first use isAlpha, isAlphaNumeric or whatever that meets the valid password chars check.

kaushalmodi on 6 Jun 2018

👍2

@kaushalmodi

wouldn't you agree that the following statement is all uppercase: "I AM USING ALL CAPS, BUT JUST AS AN EXAMPLE, NOT BEING RUDE."

Of course, I disagree. Spaces aren't letters.

If someone wants to parse space-less passwords

If someone wants to check whole sentences, he can use split and isUpper.

data-man on 6 Jun 2018

The current behavior shouldn't change, this is a breaking change with unpleasant consequences for someone.

data-man on 6 Jun 2018

The current behavior shouldn't change, this is a breaking change with unpleasant consequences for someone.

I have some data sample to disprove that! :D

I can of course see only the Nim code that in public on GitHub..

Github search of public repos containing isUpper in Nim files

This returned just 50 hits in all, many of which are false hits like proc g_unichar_isupper*(c: gunichar): gboolean{.cdecl, dynlib: gliblib,, or just hits in Nim forks.

I started documenting and understanding the use of isUpper in all those, and it got boring quite quickly.. below 13 rows cover the use of isUpper in public Nim code on GitHub in the last 2 years.

As you see my proposal won't break any of those.. in fact it's fixing the py2nim behavior.

| | Line | Update year | User | Will break? | Reason |
|----|---------------------------------------------------------------------------------------------------------------------------------------------------------|-------------|-------------------|-------------------------|-------------------------------------------------------------------------|
| 1 | regex.nim | 2018 | @nitely | No | isUpper used on Rune |
| 2 | Nim/pegs.nim | 2018 | @Araq | No | isUpper used on Rune |
| 3 | Nim/unicode.nim | 2018 | @Araq | No/Yes? | This is what we are fixing here! |
| 4 | bob/example.nim | 2018 | @amscotti | No | isUpper used on Rune |
| 5 | unicodeplus.nim | 2018 | @nitely | No | N/A - That library implements its own isUpper, again for Rune |
| 6 | string_methods.nim | 2018 | @metacraft-labs | It will actually fix it | Because that library translates Python isupper to Nim isUpperAscii. |
| 7 | builtin_methods.nim | 2018 | _same as above_ | _same as above_ | _same as above_ |
| 8 | NimMirror/tpegs.nim | 2017 | @georgiy-pruss | N/A -- Nim Mirror | |
| 9 | NimMirror/unicode.nim | 2017 | _same as above_ | _same as above_ | |
| 10 | NimMirror/strutils.nim | 2017 | _same as above_ | _same as above_ | |
| 11 | NimMirror/pegs.nim | 2017 | _same as above_ | _same as above_ | |
| 12 | uetypes.nim | 2017 | @pragmagic | No | isUpper (deprecated version from strutils) used on char |
| 13 | another Nim mirror/tpegs.nim | 2016 | @FedericoCeratto | N/A -- Nim Mirror | |

NOTE: My proposal is to fix isUpper, isUpperAscii, isLower, isLowerAscii only with the string input, not char or Rune input. See this summary: https://scripter.co/notes/string-fns-nim-vs-python/#is-upper -- The Nim implemention of the above 4 deviates from Python only for string input, and that too, only for strings that contain alphabetical chars too.

(Below "WRONG" is with respect to Python implementation.)

echo "[Wrong] ", "A B".isUpperAscii()  # Returns false! BUG https://github.com/nim-lang/Nim/issues/7963
echo "[Wrong] ", "A B".isUpper() # from unicode, Returns false! BUG https://github.com/nim-lang/Nim/issues/7963
echo "[Wrong] ", "AB?!".isUpper() # from unicode, Returns false! BUG https://github.com/nim-lang/Nim/issues/7963
echo "[Wrong] ", "1, 2, 3 GO!".isUpper() # from unicode, Returns false! BUG https://github.com/nim-lang/Nim/issues/7963
echo ' '.isUpperAscii() # checking this proc on a non-alphabet char
echo ' '.Rune.isUpper() # checking this proc on a non-alphabet Rune
echo "(*&#@(^#$ ".isUpper() # checking this proc on a non-alphabet string

Output:

[Wrong] false
[Wrong] false
[Wrong] false
[Wrong] false
false
false
false

kaushalmodi on 6 Jun 2018

👍1

@data-man

Of course, I disagree. Spaces aren't letters.

That's a good excuse for violating general email/forum etiquette of not using all-caps :D

If someone wants to check whole sentences, he can use split and isUpper.

In real-world parsing, sentences don't have just spaces.. you have to consider commas, fullstops, other punc chars, digits, spaces with variety of ascii codes, hyphens with variety of ascii codes, so much more.

My point is plain and simple.. the current isUpper (and others) implementation is too specialized.. my proposal allows isUpper to be better composable with other functions.

If you don't care if there are non-alphabetical chars in the parsed string, and just want to check if whatever alphabetical chars are upper cased, use isUpper (my proposed version).
If additionally you want to check if the parsed string does not have any non-alphabetical chars, and that with isAlpha.

kaushalmodi on 6 Jun 2018

I can of course see only the Nim code that in public on GitHub.

What about a private repos, GitLab, BitBucket, etc.?

P.S. I'll not argue anymore, because it's pointless.

data-man on 6 Jun 2018

P.S. I'll not argue anymore, because it's pointless.

I have the same feeling. I am hoping that one or more of the folks using isUpper that I pinged in my above analysis comment chimes in with their opinion.

What about a private repos, GitLab, BitBucket, etc.?

I cannot help with that, and so I made it clear about "public GitHub repos". But if anything, at least from this data-sample, I proved my understanding of current isUpper implementation:

Folks cannot use isUpper elegantly for strings. So they resort to hacks or using just str.toUpper == str to mimick isUpper. See http://exercism.io/tracks/nim/exercises/bob to see what I mean.
My proposal does not break the use of isUpper and family for chars and Runes.
Even if this breaks things, the fix is as simple as anding with isAlpha to get back the current behavior.
The proposed change fixes this single-case incompatiblity with Python isupper: the case where a string (not char or Rune) contains at least one alphabetical char with optional non-alphabetical chars. But fixing this will make str.isUpper Do The Right Thing (TM), and you will see isUpper being then used for strings too in the wild.
It's not that drastic breakage (no breakage at all based on the above small sample set).. we are still pre-1.0 release.

I'll rest my arguments till we get more opinions; I'm now just repeating all the points I made earlier.

kaushalmodi on 6 Jun 2018

We collect breaking changes and call the new thing version 0.19. Please fix it to be compatible with Python, these have been added for Python compat in the first place!

Nim's native way is s.allCharsInSet({'a'..'z', ' '}) etc

Araq on 6 Jun 2018

🎉2

IMO:

doAssert "I AM SHOUTING".isUpper

Should hold. Especially if it's consistent with Python (which from what I can tell it is?)

dom96 on 6 Jun 2018

👍1

@Araq

Please fix it to be compatible with Python, these have been added for Python compat in the first place!

Thank you!

@dom96

doAssert "I AM SHOUTING".isUpper

Should hold

Exactly! Below fails the assertion currently.

import unicode
doAssert "I AM SHOUTING".isUpper

kaushalmodi on 6 Jun 2018

Here are the tests for isUpperAscii and isLowerAscii that should pass after this behavior change:

    doAssert isLowerAscii("a b")
    doAssert isLowerAscii("ab?!")
    doAssert isLowerAscii("1, 2, 3 go!")
    doAssert(not isLowerAscii("(*&#@(^#$ ")) # None of the string chars are alphabets

    doAssert isUpperAscii("A B")
    doAssert isUpperAscii("AB?!")
    doAssert isUpperAscii("1, 2, 3 GO!")
    doAssert(not isUpperAscii("(*&#@(^#$ ")) # None of the string chars are alphabets

kaushalmodi on 6 Jun 2018

tagging @krux02 since I also had some ideas on where strutils should go forward (the trash? :P).

For me spaces shouldn't false isUpper or isLower, otherwise those are completely useless.

mratsim on 8 Jun 2018

👍1

@kaushalmodi I think a better metric would be checking what other languages do. I doubt many libraries make use of isUpper/isLower, the only time I used them was for case swapping, and that was 'cause there is no swapCase(Rune). That said, I think this should be always valid: assert isUpper(toUpper(mystring)) == true which currently is not true.

nitely on 8 Jun 2018

That said, I think this should be always valid: assert isUpper(toUpper(mystring)) == true which currently is not true.

That is NOT true if we copy Python's design! Because isUpper("") is false.

Araq on 8 Jun 2018

Right, in python the string must contain at least one cased character, which makes sense, otherwise "123" would also return true. But Nim doesn't has to follow python, see what other languages with a rich stdlib do.

nitely on 8 Jun 2018

Same for isUpper(";: "), there are plenty of special cases here...

Araq on 8 Jun 2018

I realized that and edited my comment.

nitely on 8 Jun 2018

see what other languages with a rich stdlib do

Python is the only language I personally know that has isUpper/isLower for strings in its standard library (perhaps it's forced to since it's dynamic). Other languages only have it for chars and codepoints. In Apache Commons Lang however the StringUtils class has isAllUpperCase/isAllLowerCase methods that do what the Nim methods do currently. As bloaty as it is, my suggestion would be to change isUpper/isLower behaviour and add isAllUpper/isAllLower for the old behaviour.

hlaaftana on 9 Jun 2018

@hlaaftana I didn't think that would be the case, but you are right, I checked: go, java, rust, swift, ruby, C++, C# and none of them has isUpper for strings, only for code points :open_mouth:

nitely on 9 Jun 2018

@nitely
Julia has is* for strings.
With the correct behavior. :-)

data-man on 9 Jun 2018

@data-man Except they're deprecated now with all(isupper, str) being the alternative.

hlaaftana on 9 Jun 2018

IMO if those methods are not compatible with python they might as well be removed, they are nearly useless anyway. The fact that not even a single statically typed language has been found that implements this proc should be an indication that it might not be that useful...

No. Passwords and usernames almost always can't contain spaces.

In the entire discussion about this, I think this is the only actual use case that has been mentioned for either of the implementations, and this use case seems strange. Why would a website care if my username only contains uppercase letter? Passwords sometimes has rules, but they are typically much more complex than just "not entirely uppercase" (and fwiw, it's very uncommon that passwords can't contain whitespace).

@data-man Except they're deprecated now with all(isupper, str) being the alternative.

That's what Nim should do as well IMO. There are like ~20 procs in strutils that are just bloat on top of allCharsInSet and the in operator.

GULPF on 9 Jun 2018

👍2

krux02 on 10 Jun 2018

👍4

@krux02

So also isUpper("A B") is false

see https://github.com/nim-lang/Nim/issues/8600 in which I argue we can completely avoid this senseless debate by deprecating isUpper(string) and only have isUpper on char, Rune

timotheecour on 10 Aug 2018

Was this page helpful?

0 / 5 - 0 ratings

Related issues

deprecation syntax inconsistent for objects, not possible for const/let/var

timotheecour · 3Comments

feature: have 'if' treat nil as false

kobi2187 · 4Comments

try expression doesn't work with return on expect branch

alaviss · 3Comments

Casting a pointer to an UncheckedArray[uint8] crashes the compiler

liquid600pgm · 3Comments

Support default arguments in sugar.`=>` and sugar.`->`

hlaaftana · 3Comments