Rust: <ctype.h> functions for AsciiExt

Created on 8 Feb 2017 · 30Comments · Source: rust-lang/rust

As discussed at https://internals.rust-lang.org/t/improvements-to-asciiext/4689 I have added is_ascii_* equivalents of most (not quite all) of C's ctype.h's isxxx functions to the AsciiExt trait.

The new functions are

is_ascii_alphabetic
is_ascii_uppercase
is_ascii_lowercase
is_ascii_alphanumeric
is_ascii_digit
is_ascii_hexdigit
is_ascii_punctuation
is_ascii_graphic
is_ascii_whitespace (matches the Infra Standard definition of ASCII whitespace, not POSIX)
is_ascii_control

They are implemented for char, u8, str, and [u8], and, for backward compatibility with external trait implementations, have default implementations that call unimplemented!().

PR to follow, but I needed to file this bug in order to get an issue number to put in the stability annotations.

B-unstable C-tracking-issue T-libs

Source

zackw

❤8 👍5

Most helpful comment

I’ve definitely used is_ascii, to_ascii_lowercase, and eq_ignore_ascii_case on [u8] and u8.

SimonSapin on 22 Aug 2017

👍4

All 30 comments

Stabilize?

I just wrote if let b'A' ... b'Z' = b {, it’s not too bad but might have looked nicer as if b.is_ascii_uppercase() {

SimonSapin on 1 May 2017

👍4

There was a request over in #39659 to deprecate AsciiExt and convert all the methods to intrinsics on str, [u8], etc. but I never got around to it and don't have time in the near future either. I dunno how a plan to do that would affect the stabilization schedule.

zackw on 1 May 2017

What's the status here? Could I try to "deprecate AsciiExt and convert all the methods to intrinsics on str, [u8]", create a PR and it might be accepted? Or are there other plans? If there is nothing except "lack of time" blocking this, I might be able to help.

LukasKalbertodt on 14 Aug 2017

From my end it's just lack of time. I think the team would be glad to at
least look at a patch if you made one.

zackw on 14 Aug 2017

Right now AsciiExt is implemented for &[u8] and u8. When copying all AsciiExt methods to the individual types, shall I also copy them for &[u8] and u8? I kinda doubt those implementations are really useful. And it also doesn't make sense not to implement it for u16 (e.g. an array of utf16 codepoints), too.

What do you think?

LukasKalbertodt on 22 Aug 2017

I’ve definitely used is_ascii, to_ascii_lowercase, and eq_ignore_ascii_case on [u8] and u8.

SimonSapin on 22 Aug 2017

👍4

CC @alexcrichton / @rust-lang/libs for FCP to stabilize as inherent methods: #44042

SimonSapin on 3 Oct 2017

@rfcbot fcp merge

aturon on 3 Oct 2017

Team member @aturon has proposed to merge this. The next step is review by the rest of the tagged teams:

[x] @BurntSushi
[x] @Kimundi
[x] @alexcrichton
[x] @aturon
[x] @dtolnay
[x] @sfackler

Concerns:

str and [u8] methods (https://github.com/rust-lang/rust/issues/39658#issuecomment-333977502)

Once these reviewers reach consensus, this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

rfcbot on 3 Oct 2017

The semantics of e.g. str::is_ascii_hexdigit as meaning every byte is a hex digit is a bit of a stretch. Now that these are inherent methods rather than a common extension trait across char and str, we should consider that we no longer need all the methods across all the types.

Is "A" an ASCII hex digit? Sure.
Is "AA" an ASCII hex digit? Not obvious to me.
Is "" an ASCII hex digit?

Would it be bad to require the user to be more explicit in some of these cases?

s.bytes().all(|b| b.is_ascii_hexdigit())

@rfcbot concern str and [u8] methods

dtolnay on 3 Oct 2017

👍2

I think I agree with @dtolnay, and given https://github.com/rust-lang/rust/pull/44042 I think we should "soon" have the ability to implement that relatively easily?

alexcrichton on 4 Oct 2017

Discussed at the libs triage meeting today, we're going to scope this issue to be blocked on https://github.com/rust-lang/rust/pull/44042 and then we'll stabilize these methods only on the u8 and char types.

alexcrichton on 31 Oct 2017

44042 has landed

SimonSapin on 5 Nov 2017

I would like to prepare a PR for stabilization. A few things:

I would remove the inherent ascii_ctype methods of [u8] and str entirely
I would replace the #[unstable] attribute with a #[stable] one on the inherent ascii_ctype methods for u8 and char. I should keep the feature name ascii_ctype, right?
I will remove all ascii_ctype methods of the AsciiExt trait, ok? They are still unstable so we can remove them. Or do you want to keep them around for a bit and maybe deprecate them later? Is that even possible, to have an unstable and a stable feature with the same name? Anyway, removing them from AsciiExt seems like the best idea to me.

LukasKalbertodt on 7 Nov 2017

Sounds pretty good to me! It's ok to tweak the feature name if necessary (not critical to remain the same). I'd probably give the stable methods a new feature name to leave the unstable methods. Removing the methods from [u8] and str makes sense to me as well.

For now I'd also leave the methods on the trait and not delete anything just yet, we can do that once this has all hit stable.

alexcrichton on 7 Nov 2017

@alexcrichton Right now, the AsciiExt trait is implemented by just calling the inherent methods. It's done by a macro to avoid duplicate code. If we don't remove all ascii_ctype methods from AsciiExt, I need to move the real implementation back to the AsciiExt impl and cannot use the macro for str and [u8] anymore.

It's not a tragedy, but it would be a bit annoying and produces a lot of diff noise. I'd prefer removing the ascii_ctype methods on AsciiExt immediately. But I can understand if we want to keep them around for a few cycles still. I just wanted to ask before I would move all the code around. What do you think?

LukasKalbertodt on 9 Nov 2017

@LukasKalbertodt yeah keeping code around for awhile to minimize the diff seems fine by me! I think we'll be leaving AsciiExt for awhile anyway

alexcrichton on 10 Nov 2017

Any update here?

codeyash on 22 Feb 2018

Rust 1.24 was recently released with these methods available as inherent methods of u8 and char. For example: https://doc.rust-lang.org/std/primitive.char.html#method.is_ascii_alphabetic

I think we can now deprecate the AsciiExt trait. (And, a release cycle or two after that, perhaps remove its unstable methods.) This issue should probably be re-opened to track this.

SimonSapin on 22 Feb 2018

👍1

Reopened as per request on IRC.

nagisa on 22 Feb 2018

Is it still the plan to get at least is_ascii_alphabetic, is_ascii_uppercase ,is_ascii_lowercase, is_ascii_digit and is_ascii_alphanumeric on str? It's a bit verbose to have to create an iterator over chars for each string.

zbraniecki on 7 Mar 2018

@zbraniecki We decided that we prefer the explicit version of s.chars().all(|c| c.is_foo()) (see comment). I'd doubt that those methods will be added to str and [u8] anytime soon...

LukasKalbertodt on 7 Mar 2018

👎2 👍1

Deprecation PR for the AsciiExt trait: https://github.com/rust-lang/rust/pull/49109

SimonSapin on 17 Mar 2018

Shouldn't this issue also track "removing of unstable AsciiExt methods"? Otherwise we might forget to remove those. Or was there already a PR doing that which I missed?

LukasKalbertodt on 23 Mar 2018

I don’t know how useful it is to track such features individually. We can occasionally do a pass on the entire standard library looking for items that are both unstable and have been deprecated for a while.

SimonSapin on 23 Mar 2018

@SimonSapin Ah, I missed the fact that the unstable methods were also marked deprecated. That makes sense then!

LukasKalbertodt on 23 Mar 2018

I'm disappointed that there is now no officially-supported version of is_ascii_lowercase() and friends for str. This was functionality I was using, and it's annoying to go back and write out fancy iterators everywhere or provide my own crate for that functionality. I understand moving this stuff out of AsciiExt, but could we please have the full functionality back?

BartMassey on 12 Jun 2018

@BartMassey IMO str::is_ascii_lowercase might have an unexpected behavior. On char, the same method tells whether the character is in the range a to z, whether it is an ASCII lower-case letter. The "simple" generalization of that to strings is: is this string made entirely of ASCII lower-case letters?

However, what one might be looking for, and not realize that str::is_ascii_lowercase is different is: if I were to call to_ascii_lowercase, would I obtain the same string? which really means does this string not contain any upper-case letter?

Forcing users to write s.chars().all(something something) makes it apparent what a call actually does.

(Maybe char::is_ascii_lowercase (respectively uppercase) should be named something like is_ascii_lowercase_alphabetic instead…)

SimonSapin on 12 Jun 2018

That is an interesting point. There are certainly a number of fine distinctions that can be drawn here: one might reasonably expect that str::is_ascii_lowercase() might return true if and only if the target contained only lowercase alphabetic ASCII characters, for example.

That said, I still feel like the AsciiExt crate already defined a behavior for this function, and that the right thing to do would be to preserve that functionality across the move, at least for now. It feels a bit iffy to me to remove existing functionality without going through the RFC process (as far as I can tell?) in the interest of protecting users from using functions wrong. Do we have evidence that people have routinely misunderstood or misused is_ascii_lowercase()? Its documentation seemed pretty clear to me.

There's an underlying problem here in that anything we do to make ASCII more of a first-class citizen is a step toward encouraging folks to routinely write ASCII-only programs. Nobody wants that.

Having said that, tutorials and examples almost always use and pretty much depend on ASCII: things like Exercism and the Advent Of Code series have caused me an occasional "any other language" kind of feeling because the exercises just assume quality ASCII support. Asking newer users to understand how to write the appropriate iterators to complete these exercises seems to me like it might have its own set of issues.

I don't know. I guess I just need to identify or write a crate that has good support for ASCII-only strings and chars and start using it in situations where that's what I'm working with. AsciiExt was never that, and so maybe removing it makes a hole that can be filled with something better. (A quick survey of the ascii crate looks promising, although I'd still have to add the string functions, looks like. It's been downloaded quite a lot, which might indicate a real demand for this kind of thing.)

For now, s.chars().all(|c| c.is_ascii() && !c.is_ascii_uppercase()) it is, I guess.

BartMassey on 12 Jun 2018

to remove existing functionality without going through the RFC process

For what it’s worth the addition of AsciiExt::to_ascii_lowercase never went through the RFC process in the first place.

For now, s.chars().all(|c| c.is_ascii() && !c.is_ascii_uppercase()) it is, I guess.

Interesting, because this behavior is neither <str as AsciiExt>::is_ascii_lowercase (that’s self.bytes().all(|b| b.is_ascii_lowercase())) nor what I had in mind for the alternative behavior (.bytes().all(|b| !b.is_ascii_uppercase()), without checking that the string is entirely ASCII).

Anyway, it’s always possible to add more methods but a closed issue may not be the best place to propose them.

SimonSapin on 12 Jun 2018

👍2

Was this page helpful?

0 / 5 - 0 ratings