As discussed at https://internals.rust-lang.org/t/improvements-to-asciiext/4689 I have added is_ascii_*
equivalents of most (not quite all) of C's ctype.h
's isxxx
functions to the AsciiExt trait.
The new functions are
is_ascii_alphabetic
is_ascii_uppercase
is_ascii_lowercase
is_ascii_alphanumeric
is_ascii_digit
is_ascii_hexdigit
is_ascii_punctuation
is_ascii_graphic
is_ascii_whitespace
(matches the Infra Standard definition of ASCII whitespace, not POSIX)is_ascii_control
They are implemented for char
, u8
, str
, and [u8]
, and, for backward compatibility with external trait implementations, have default implementations that call unimplemented!()
.
PR to follow, but I needed to file this bug in order to get an issue number to put in the stability annotations.
Stabilize?
I just wrote if let b'A' ... b'Z' = b {
, it’s not too bad but might have looked nicer as if b.is_ascii_uppercase() {
There was a request over in #39659 to deprecate AsciiExt
and convert all the methods to intrinsics on str
, [u8]
, etc. but I never got around to it and don't have time in the near future either. I dunno how a plan to do that would affect the stabilization schedule.
What's the status here? Could I try to "deprecate AsciiExt and convert all the methods to intrinsics on str
, [u8]
", create a PR and it might be accepted? Or are there other plans? If there is nothing except "lack of time" blocking this, I might be able to help.
From my end it's just lack of time. I think the team would be glad to at
least look at a patch if you made one.
Right now AsciiExt
is implemented for &[u8]
and u8
. When copying all AsciiExt
methods to the individual types, shall I also copy them for &[u8]
and u8
? I kinda doubt those implementations are really useful. And it also doesn't make sense not to implement it for u16
(e.g. an array of utf16 codepoints), too.
What do you think?
I’ve definitely used is_ascii
, to_ascii_lowercase
, and eq_ignore_ascii_case
on [u8]
and u8
.
CC @alexcrichton / @rust-lang/libs for FCP to stabilize as inherent methods: #44042
@rfcbot fcp merge
Team member @aturon has proposed to merge this. The next step is review by the rest of the tagged teams:
Concerns:
Once these reviewers reach consensus, this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!
See this document for info about what commands tagged team members can give me.
The semantics of e.g. str::is_ascii_hexdigit
as meaning every byte is a hex digit is a bit of a stretch. Now that these are inherent methods rather than a common extension trait across char and str, we should consider that we no longer need all the methods across all the types.
"A"
an ASCII hex digit? Sure."AA"
an ASCII hex digit? Not obvious to me.""
an ASCII hex digit?Would it be bad to require the user to be more explicit in some of these cases?
s.bytes().all(|b| b.is_ascii_hexdigit())
@rfcbot concern str and [u8] methods
I think I agree with @dtolnay, and given https://github.com/rust-lang/rust/pull/44042 I think we should "soon" have the ability to implement that relatively easily?
Discussed at the libs triage meeting today, we're going to scope this issue to be blocked on https://github.com/rust-lang/rust/pull/44042 and then we'll stabilize these methods only on the u8
and char
types.
I would like to prepare a PR for stabilization. A few things:
ascii_ctype
methods of [u8]
and str
entirely#[unstable]
attribute with a #[stable]
one on the inherent ascii_ctype
methods for u8
and char
. I should keep the feature name ascii_ctype
, right?ascii_ctype
methods of the AsciiExt
trait, ok? They are still unstable so we can remove them. Or do you want to keep them around for a bit and maybe deprecate them later? Is that even possible, to have an unstable and a stable feature with the same name? Anyway, removing them from AsciiExt
seems like the best idea to me.Sounds pretty good to me! It's ok to tweak the feature name if necessary (not critical to remain the same). I'd probably give the stable methods a new feature name to leave the unstable methods. Removing the methods from [u8]
and str
makes sense to me as well.
For now I'd also leave the methods on the trait and not delete anything just yet, we can do that once this has all hit stable.
@alexcrichton Right now, the AsciiExt
trait is implemented by just calling the inherent methods. It's done by a macro to avoid duplicate code. If we don't remove all ascii_ctype
methods from AsciiExt
, I need to move the real implementation back to the AsciiExt
impl and cannot use the macro for str
and [u8]
anymore.
It's not a tragedy, but it would be a bit annoying and produces a lot of diff noise. I'd prefer removing the ascii_ctype
methods on AsciiExt
immediately. But I can understand if we want to keep them around for a few cycles still. I just wanted to ask before I would move all the code around. What do you think?
@LukasKalbertodt yeah keeping code around for awhile to minimize the diff seems fine by me! I think we'll be leaving AsciiExt
for awhile anyway
Any update here?
Rust 1.24 was recently released with these methods available as inherent methods of u8
and char
. For example: https://doc.rust-lang.org/std/primitive.char.html#method.is_ascii_alphabetic
I think we can now deprecate the AsciiExt
trait. (And, a release cycle or two after that, perhaps remove its unstable methods.) This issue should probably be re-opened to track this.
Reopened as per request on IRC.
Is it still the plan to get at least is_ascii_alphabetic
, is_ascii_uppercase
,is_ascii_lowercase
, is_ascii_digit
and is_ascii_alphanumeric
on str
? It's a bit verbose to have to create an iterator over chars for each string.
@zbraniecki We decided that we prefer the explicit version of s.chars().all(|c| c.is_foo())
(see comment). I'd doubt that those methods will be added to str
and [u8]
anytime soon...
Deprecation PR for the AsciiExt
trait: https://github.com/rust-lang/rust/pull/49109
Shouldn't this issue also track "removing of unstable AsciiExt
methods"? Otherwise we might forget to remove those. Or was there already a PR doing that which I missed?
I don’t know how useful it is to track such features individually. We can occasionally do a pass on the entire standard library looking for items that are both unstable and have been deprecated for a while.
@SimonSapin Ah, I missed the fact that the unstable methods were also marked deprecated. That makes sense then!
I'm disappointed that there is now no officially-supported version of is_ascii_lowercase()
and friends for str
. This was functionality I was using, and it's annoying to go back and write out fancy iterators everywhere or provide my own crate for that functionality. I understand moving this stuff out of AsciiExt
, but could we please have the full functionality back?
@BartMassey IMO str::is_ascii_lowercase
might have an unexpected behavior. On char
, the same method tells whether the character is in the range a
to z
, whether it is an ASCII lower-case letter. The "simple" generalization of that to strings is: is this string made entirely of ASCII lower-case letters?
However, what one might be looking for, and not realize that str::is_ascii_lowercase
is different is: if I were to call to_ascii_lowercase
, would I obtain the same string? which really means does this string not contain any upper-case letter?
Forcing users to write s.chars().all(something something)
makes it apparent what a call actually does.
(Maybe char::is_ascii_lowercase
(respectively uppercase
) should be named something like is_ascii_lowercase_alphabetic
instead…)
That is an interesting point. There are certainly a number of fine distinctions that can be drawn here: one might reasonably expect that str::is_ascii_lowercase()
might return true if and only if the target contained only lowercase alphabetic ASCII characters, for example.
That said, I still feel like the AsciiExt
crate already defined a behavior for this function, and that the right thing to do would be to preserve that functionality across the move, at least for now. It feels a bit iffy to me to remove existing functionality without going through the RFC process (as far as I can tell?) in the interest of protecting users from using functions wrong. Do we have evidence that people have routinely misunderstood or misused is_ascii_lowercase()
? Its documentation seemed pretty clear to me.
There's an underlying problem here in that anything we do to make ASCII more of a first-class citizen is a step toward encouraging folks to routinely write ASCII-only programs. Nobody wants that.
Having said that, tutorials and examples almost always use and pretty much depend on ASCII: things like Exercism and the Advent Of Code series have caused me an occasional "any other language" kind of feeling because the exercises just assume quality ASCII support. Asking newer users to understand how to write the appropriate iterators to complete these exercises seems to me like it might have its own set of issues.
I don't know. I guess I just need to identify or write a crate that has good support for ASCII-only strings and chars and start using it in situations where that's what I'm working with. AsciiExt
was never that, and so maybe removing it makes a hole that can be filled with something better. (A quick survey of the ascii
crate looks promising, although I'd still have to add the string functions, looks like. It's been downloaded quite a lot, which might indicate a real demand for this kind of thing.)
For now, s.chars().all(|c| c.is_ascii() && !c.is_ascii_uppercase())
it is, I guess.
to remove existing functionality without going through the RFC process
For what it’s worth the addition of AsciiExt::to_ascii_lowercase
never went through the RFC process in the first place.
For now,
s.chars().all(|c| c.is_ascii() && !c.is_ascii_uppercase())
it is, I guess.
Interesting, because this behavior is neither <str as AsciiExt>::is_ascii_lowercase
(that’s self.bytes().all(|b| b.is_ascii_lowercase())
) nor what I had in mind for the alternative behavior (.bytes().all(|b| !b.is_ascii_uppercase())
, without checking that the string is entirely ASCII).
Anyway, it’s always possible to add more methods but a closed issue may not be the best place to propose them.
Most helpful comment
I’ve definitely used
is_ascii
,to_ascii_lowercase
, andeq_ignore_ascii_case
on[u8]
andu8
.