Looking at https://docs.julialang.org/en/latest/manual/unicode-input/#Unicode-Input-1, there are a few characters that would make excellent identifiers for linear algebra and probability DSLs:
Code point | Character | Tab completion | Unicode name
-- | -- | -- | --
U+1D7CE | 𝟎 | \bfzero | Mathematical Bold Digit Zero
U+1D7CF | 𝟏 | \bfone | Mathematical Bold Digit One
U+1D7D8 | 𝟘 | \bbzero | Mathematical Double-Struck Digit Zero
U+1D7D9 | 𝟙 | \bbone | Mathematical Double-Struck Digit One
Note that this is conservative in that it leaves as many of the other Unicode digit variants as possible as invalid identifiers. In particular, \bsanszero and \bsansone look similar, but are left as invalid identifiers for now.
The main use case for these is to add automatically reshaping matrices/vectors of 1s and 0s to https://github.com/JuliaArrays/FillArrays.jl, in the spirit of the UniformScaling operator, currently denoted by I. Of course, this library would not intend to lay claim to that notation, but would want to use it. The 𝟘 and 𝟙 might be useful for people who wish to define a const alias for them to match their LaTeX notation, or could allow writing new indicator functions, etc. I know I would use 𝟙(a > b) for that to match algebra.
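As a rough illustration of the indicator-function use case, here is a minimal sketch. It assumes a Julia version in which double-struck digits are accepted as identifier characters (the change this issue asks for); the definition of 𝟙 below is hypothetical and is not provided by Base or FillArrays.jl.

```julia
# Hypothetical indicator function; not part of Base or any package.
# Returns 1 when the condition holds and 0 otherwise.
𝟙(cond::Bool) = cond ? 1 : 0

a, b = 3, 2
x = 𝟙(a > b)                        # the condition is true here

# Count how many elements of 1:5 are greater than 2.
total = sum(𝟙(v > 2) for v in 1:5)
```

Written this way, code such as `sum(𝟙(v > 2) for v in xs)` reads much like the corresponding sum of indicators on paper.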
I can see the appeal of the idea, but I think there's too little benefit for the potential readability and maintenance costs with this. Between font variations and (anti-)aliasing and rendering choices and syntax highlighting, the distinctions between the different zeros (0 𝟎 𝟘) or ones (𝟏 1 𝟙) can get pretty blurry. The idea of potential gotchas in such basic entities as 0s and 1s (and the confused Stack Overflow questions resulting from them) is not an appealing prospect.
I think you can make that case about almost all unicode characters that have a similar ascii character. Whether it makes sense in a particular case or not is a very reasonable question, and library specific.
Libraries like ApproxFun.jl use symbols like 𝒟, which looks like a D and matches the math notation of using script letters to denote differential operators.
The only difference with what I am suggesting is that (right now) library writers don't have the option to make an alias to variable names that start with the fancy number looking characters.
If this was changed, then the discussion could come to what you are bringing up: is introducing those aliases a good idea (since they never should be required)? Your perspective is reasonable, but it may be domain specific.
There are three choices here:
1. Disallow digit variants like 𝟘 entirely (the current behavior).
2. Allow digit variants like 𝟘 to be used as letters in identifiers.
3. Treat 𝟘 as another way of writing 0.

The last option seems confusing and fairly pointless to me: unlike characters like μ and µ, which are different Unicode characters that look exactly alike, these are not likely to be somehow accidentally input when plain digits were intended. Why allow weird digit variants when literally every keyboard ever created has plain digits directly on it? The only way 𝟘 is likely to end up in a program is if someone intended to enter it.
The current behavior of disallowing digit variants entirely seems like a waste of potentially nice syntax. I have yet to encounter a font where these digit variants render and are not visually distinguishable from the corresponding digits.
That leaves option 2: allowing digit-variant characters to be used as letters, which is what this issue proposes. I can understand that people might not want to use these bindings, which is fine; in that case, don't use them. But why should we prevent people who want to from doing so? Especially given that the only other potential use for them is not really sensible.
> I think you can make that case about almost all unicode characters that have a similar ascii character.
True, that's why I mentioned "such basic entities as 0s and 1s". 'Is this identifier a 𝒟 or a D' is a very different sort of question from 'is this thing here a literal or an identifier'. It's a small mental cost when going through a codebase, but such costs add up pretty quickly.
> If this was changed, then the discussion could come to what you are bringing up: is introducing those aliases a good idea (since they never should be required)? Your perspective is reasonable, but it may be domain specific.
I'm a fan of DSLs and would in theory love to have custom infix operators (#16985) and even custom infix named functions, hoping the users use them wisely. But sometimes the guardrails have to be in the language, and in my opinion this is one of those cases.
> I can understand that people might not want to use these, in which case, simply don't. But why should we prevent people who want to from doing so?
The same reason the codepoints were restricted in the first place (#5936) - code gets passed down and across teams and people, and sometimes it's more important to prevent "crazy things" being introduced by someone, than to provide a minor nicety.
As far as I can tell, any argument that this is confusing applies equally to ℯ (euler). So whatever discussion led to changing e to ℯ in Base applies here.
Agreed; we're way past the point of having any sort of policy against potentially-confusable characters. I agree with Stefan that when fonts have 𝟘 and 𝟙 they tend to be more distinguishable than some other examples like e and ℯ.
> The same reason the codepoints were restricted in the first place
The reason to restrict code points was to allow for implementing sane uses of code points in the future without breaking code, not to prevent people from doing silly things. If people want to write unreadable code, they will, no matter what we do to try to prevent it.
I think the de facto policy with potentially-confusable characters is that we identify characters that are easily confused both on input and appearance so there's a real chance that someone may input one when they intended to input the other _and_ not be easily able to tell that this is what has happened. The normal "e" versus Euler's "ℯ" fails this test on both counts: there's little chance that anyone will have input "ℯ" by accident when they meant "e" since "e" is on every keyboard and "ℯ" is on none; they also look fairly distinct in most fonts so even if someone managed to do this somehow, they'd be able to notice what's going on. The case of "μ" and "µ" satisfies this criterion since neither character is on a standard keyboard and some input methods give you one while others give you the other and they look _identical_ so it's extremely hard to discover that this is what's going on after the fact. Applying this test to the "1" versus "𝟙" case leads to the same conclusion as "e" versus "ℯ", i.e. that they should be considered distinct characters.
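To make the three pairs in this test concrete, the sketch below inspects their code points. It uses only the Unicode standard library; the variable names are illustrative. It shows what the characters are, and makes no claim about how Julia's parser treats them.

```julia
using Unicode  # standard library; provides Unicode.normalize

# (common character, look-alike) pairs discussed above, written as escapes
# so the code points are unambiguous regardless of font.
lookalikes = [
    ('e',      '\u212f'),   # LATIN SMALL LETTER E vs SCRIPT SMALL E
    ('\u03bc', '\u00b5'),   # GREEK SMALL LETTER MU vs MICRO SIGN
    ('1',      '\U1d7d9'),  # DIGIT ONE vs MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
]

# Every pair consists of two different code points, however similar they look.
all_distinct = all(a != b for (a, b) in lookalikes)

# Hex code points for each pair, e.g. for printing in a report.
codes = [(string(UInt32(a), base = 16), string(UInt32(b), base = 16))
         for (a, b) in lookalikes]

# Unicode compatibility normalization (NFKC) maps the micro sign to Greek mu.
micro_folds_to_mu = Unicode.normalize("\u00b5", :NFKC) == "\u03bc"
```

Only the μ/µ pair combines "not on a standard keyboard" with "visually identical", which is the combination the test singles out.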
> we identify characters that are easily confused both on input and appearance so there's a real chance that someone may input one when they intended to input the other and not be easily able to tell that this is what has happened
My concern was more about later readability than about ambiguity during input ("code is read a lot more than it's written" and all that). But since this is probably going in, can we have just one canonical identifier zero (not multiple) to go alongside the one canonical literal 0 (and similarly for 1)? My vote is for \bbzero and \bbone to be the allowed identifiers, since they're easier to distinguish visually from 0 and 1 (especially in the presence of syntax highlighting, which often makes a bold vs non-bold distinction not so clear).
I see no reason to limit this to just one, when so many of the "1"s are easily distinguished. No one is going to confuse any of the following for each other or for 1, and so at the very least they all should be legitimate identifiers: 𝟙, ⒈, ❶, ⓵, ①, 1️⃣
@JeffBezanson @StefanKarpinski (cc @dlfivefifty ) I realized that a feature freeze is coming soon and was wondering if you would still support having a PR that implements this? It would be very nice to sneak into the 1.3 release.
For the record, 1.3 has a lot of exciting stuff in it already, and so postponing this to 1.4+ makes sense to me.
Oh for sure. This would not be the highlight of the release by any means! But if it is a low "cost" and low probability of side effect issue, it would mean I can write some cool DSLs 6 months earlier.
Triage is ok with this.
Explicitly, triage is ok with option 2: Allow digit-variant characters to be used as letters, distinct from the digits they correspond to. Now it merely needs an implementation.
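A small sketch of what option 2 means in practice, assuming a Julia version that includes the change (on versions without it, parsing a digit variant is expected to fail instead). The variable names are illustrative only.

```julia
# The plain digit parses as an integer literal.
plain = Meta.parse("1")

# Under option 2, the double-struck digit parses as an identifier (a Symbol).
fancy = Meta.parse("𝟙")

# Binding a digit-variant identifier leaves the literal 0 untouched.
𝟘 = zeros(3)
n = 0 + length(𝟘)

# Unicode still classifies the character as numeric,
# while Julia's ASCII-only isdigit does not treat it as a digit.
is_numeric = isnumeric('𝟙')
is_ascii_digit = isdigit('𝟙')
```

The point of the design is that 𝟙 and 1 never mean the same thing to the parser: one is always a name and the other is always a literal.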
Fixed by #32838