Julia: Make 𝟏, 𝟎, 𝟙, 𝟘 into valid identifiers for DSLs

Created on 14 Apr 2018 · 16 comments · Source: JuliaLang/julia

Looking at https://docs.julialang.org/en/latest/manual/unicode-input/#Unicode-Input-1, there are a few characters that would make excellent identifiers for linear algebra and probability DSLs.

Note that this proposal is conservative: it leaves most of the other Unicode digit variants as invalid identifiers. In particular, \bsanszero and \bsansone look similar, but are left as invalid identifiers for now.

The main use case for these is to be able to add automatically reshaping matrices/vectors of 1s and 0s to https://github.com/JuliaArrays/FillArrays.jl, in the spirit of the UniformScaling operator, currently denoted by I. Of course, this library would not intend to lay claim to that notation, but would want to use it. The 𝟘 and 𝟙 might be useful for people who wish to use const 𝟙 = 𝟏 to match their LaTeX notation, or could allow writing new indicator functions, etc. I know I would use 𝟙(a > b) for that to match algebra.

help wanted unicode


jlperla

👍10
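A minimal sketch of the indicator-function use case described above. It assumes a Julia version that includes the fix referenced at the end of this thread (so that 𝟙 parses as an identifier); the definition and the sample data are illustrative, not part of FillArrays.jl or Base.

```julia
# Hypothetical indicator function: 1 when the condition holds, 0 otherwise.
# Requires a Julia version in which 𝟙 (double-struck one) is a valid identifier.
𝟙(cond::Bool) = cond ? 1 : 0

a, b = 3, 2
𝟙(a > b)                                   # 1, because 3 > 2

# Counting how many samples exceed a threshold, written as in the algebra.
xs = [0.5, 1.5, 2.5]
count_above = sum(𝟙(x > 1) for x in xs)    # 2
```

Because 𝟙 is an ordinary identifier here, it has no built-in meaning; a library or user has to define it.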

Most helpful comment

There are three choices here:

1. Disallow all digit-variant characters entirely (what we do now).
2. Allow digit-variant characters to be used as letters, distinct from the digits they correspond to.
3. Allow digit-variant characters to be used as if they were simply the plain digit, i.e. make 𝟘 another way of writing 0.

The last option seems confusing and fairly pointless to me: unlike characters like μ and µ, which are different Unicode characters that look exactly alike, these are not likely to be somehow accidentally input when plain digits were intended. Why allow weird digit variants when literally every keyboard ever created has plain digits directly on it? The only way 𝟘 is likely to end up in a program is if someone intended to enter it.

The current behavior of disallowing digit variants entirely seems like a waste of potentially nice syntax. I have yet to encounter a font where these digit variants render and are not visually distinguishable from the corresponding digits.

That leaves option 2: allowing digit-variant characters to be used as letters, which is what this issue proposes. I can understand that people might not want to use these bindings, which is fine; in that case, don't use them. But why should we prevent people who want to from doing so? Especially given that the only other potential use for them is not really sensible.

StefanKarpinski on 14 Apr 2018

👍9 👀1
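To make the second choice concrete, here is a small sketch of what "used as letters, distinct from the digits" looks like from the parser's point of view. It assumes a Julia version where that choice has been implemented for the bold and double-struck digits; the variable names are made up for illustration. Code points are spelled out so the characters cannot be confused in any font.

```julia
# Hypothetical variable names; code points written out explicitly.
double_struck_zero = string(Char(0x1D7D8))   # 𝟘, typed as \bbzero
sans_bold_zero     = string(Char(0x1D7EC))   # 𝟬, typed as \bsanszero

# Under choice 2 the double-struck digit can start an identifier,
# while the sans-serif bold digit is deliberately left invalid.
dz_ok = Base.isidentifier(double_struck_zero)
sb_ok = Base.isidentifier(sans_bold_zero)

# The plain digit still parses as a literal, the variant as a name.
plain   = Meta.parse("0")                    # an integer literal
variant = Meta.parse(double_struck_zero)     # a Symbol, i.e. a variable name
```

Under the third choice, by contrast, both parses would have produced the same integer literal.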

All 16 comments

I can see the appeal of the idea, but I think there's too little benefit for the potential readability and maintenance costs with this. Between font variations and (anti-)aliasing and rendering choices and syntax highlighting, the distinctions between the different zeros (0 𝟎 𝟘) or ones (𝟙 1 𝟏) can get pretty blurry. The idea of potential gotchas in such basic entities as 0s and 1s (and the confused stackoverflow questions resulting from them) is not an appealing prospect.

digital-carver on 14 Apr 2018

👍3

I think you can make that case about almost all unicode characters that have a similar ascii character. Whether it makes sense in a particular case or not is a very reasonable question, and library specific.

Libraries like ApproxFun.jl use symbols like 𝒟, which looks like a D and matches the math notation of using script letters to denote differential operators.

The only difference with what I am suggesting is that (right now) library writers don't have the option to make an alias to variable names that start with the fancy number looking characters.

If this was changed, then the discussion could come to what you are bringing up: is introducing those aliases a good idea (since they never should be required)? Your perspective is reasonable, but it may be domain specific.

jlperla on 14 Apr 2018


> I think you can make that case about almost all unicode characters that have a similar ascii character.

True, that's why I mentioned "such basic entities as 0s and 1s". 'Is this identifier a 𝒟 or a D' is a very different sort of question from 'is this thing here a literal or an identifier'. It's a small mental cost when going through a codebase, but such costs add up pretty quickly.

> If this was changed, then the discussion could come to what you are bringing up: is introducing those aliases a good idea (since they never should be required)? Your perspective is reasonable, but it may be domain specific.

I'm a fan of DSLs and would in theory love to have custom infix operators (#16985) and even custom infix named functions, hoping the users use them wisely. But sometimes the guardrails have to be in the language, and in my opinion this is one of those cases.

> I can understand that people might not want to use these, in which case, simply don't. But why should we prevent people who want to from doing so?

The same reason the codepoints were restricted in the first place (#5936) - code gets passed down and across teams and people, and sometimes it's more important to prevent "crazy things" being introduced by someone, than to provide a minor nicety.

digital-carver on 14 Apr 2018

As far as I can tell, any argument that this is confusing applies equally to ℯ (euler). So whatever discussion led to changing e to ℯ in Base applies here.

dlfivefifty on 15 Apr 2018

Agreed; we're way past the point of having any sort of policy against potentially-confusable characters. I agree with Stefan that when fonts have 𝟘 and 𝟙 they tend to be more distinguishable than some other examples like e and ℯ.

JeffBezanson on 15 Apr 2018

👍2

> The same reason the codepoints were restricted in the first place

The reason to restrict code points was to allow for implementing sane uses of code points in the future without breaking code, not to prevent people from doing silly things. If people want to write unreadable code, they will, no matter what we do to try to prevent it.

I think the de facto policy with potentially-confusable characters is that we identify characters that are easily confused both on input and appearance so there's a real chance that someone may input one when they intended to input the other _and_ not be easily able to tell that this is what has happened. The normal "e" versus Euler's "ℯ" fails this test on both counts: there's little chance that anyone will have input "ℯ" by accident when they meant "e" since "e" is on every keyboard and "ℯ" is on none; they also look fairly distinct in most fonts so even if someone managed to do this somehow, they'd be able to notice what's going on. The case of "μ" and "µ" satisfies this criterion since neither character is on a standard keyboard and some input methods give you one while others give you the other and they look _identical_ so it's extremely hard to discover that this is what's going on after the fact. Applying this test to the "1" versus "𝟙" case leads to the same conclusion as "e" versus "ℯ"—i.e. that they should be considered distinct characters.

StefanKarpinski on 16 Apr 2018

👍3
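The confusability test described above can be checked against the parser. This is a sketch that relies on Julia's documented behavior of treating the micro sign µ (U+00B5) as equivalent to the Greek letter μ (U+03BC) in identifiers; the variable names are illustrative, and escape sequences are used because the two characters look identical on screen.

```julia
micro = "\u00b5"   # MICRO SIGN, what some input methods produce
mu    = "\u03bc"   # GREEK SMALL LETTER MU, what others produce

# Easily confused on input and in appearance: parsed as the same identifier.
same_mu = Meta.parse(micro) === Meta.parse(mu)

# Hard to type by accident and visually distinct: kept as separate identifiers.
euler  = "\u212f"  # SCRIPT SMALL E
same_e = Meta.parse(euler) === Meta.parse("e")
```

Here same_mu is true and same_e is false, which matches the two outcomes of the test: merge the characters that fail it, keep apart the ones that pass.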

> we identify characters that are easily confused both on input and appearance so there's a real chance that someone may input one when they intended to input the other and not be easily able to tell that this is what has happened

My concern was more about later readability than about ambiguity during input ("code is read a lot more than it's written" and all that). But since this is probably going in, can we have it so that there's one canonical identifier zero (not multiple) to go alongside the one canonical literal 0 (and similarly for 1)? My vote is for \bbzero and \bbone to be the allowed identifiers, since they're easier to distinguish visually from 0 and 1 (especially in the presence of syntax highlighting, which often makes a bold vs non-bold distinction not so clear).

digital-carver on 16 Apr 2018

👎2

I see no reason to limit this to just one, when so many of the "1"s are easily distinguished. No one is going to confuse any of the following for each other or for 1 and so at the very least they all should be legitimate identifiers: 𝟙, ₁, ❶, ⓵, ①, 1️⃣

dlfivefifty on 16 Apr 2018

👍3

ref: https://github.com/JuliaLang/julia/issues/10762

sbromberger on 15 Sep 2018

@JeffBezanson @StefanKarpinski (cc @dlfivefifty ) I realized that a feature freeze is coming soon and was wondering if you would still support having a PR that implements this? It would be very nice to sneak into the 1.3 release.

jlperla on 6 Aug 2019

For the record, 1.3 has a lot of exciting stuff in it already, and so postponing this to 1.4+ makes sense to me.

dlfivefifty on 7 Aug 2019

Oh for sure. This would not be the highlight of the release by any means! But if it is a low "cost" and low probability of side effect issue, it would mean I can write some cool DSLs 6 months earlier.

jlperla on 7 Aug 2019

Triage is ok with this.

JeffBezanson on 8 Aug 2019

Explicitly, triage is ok with option 2: Allow digit-variant characters to be used as letters, distinct from the digits they correspond to. Now it merely needs an implementation.

StefanKarpinski on 8 Aug 2019

👍5

Fixed by #32838

JeffBezanson on 12 Aug 2019
