Mu: Fix unicode error with dodgy pasted text

Created on 17 Jul 2018 · 19 Comments · Source: mu-editor/mu

See log file here. On OSX.

https://pastebin.com/eVMQcrKA

ntoll

All 19 comments

@ntoll would you like me to look at this?

tjguk on 18 Jul 2018

@tjguk yes please (if you have time). I'm up to my eyes in trying to get cryptographic certificates out of certification authorities in time for Friday! :-)

ntoll on 18 Jul 2018

@tjguk apologies for the prompt, but is this going to happen any time soon? I'm conscious of time (it's running out!) and if you won't be able to do it soon then I'll step up and attempt a fix. No pressure, just trying to work out what needs to be done, by when and by whom. :-)

ntoll on 19 Jul 2018

I've pushed a fix in #570 without tests so that (a) it's there; and (b) others can test. Just about to see how to test. Fortunately it's easily reproducible.

However... we have a secondary problem, especially if this is a cut-and-paste from a popular site: it's almost impossible to guess what's wrong when the checker (or [Run]) does complain, because the offending space is visually indistinguishable from an ordinary 0x20 space.

Should we add some specific check or automatically normalise spaces?

tjguk on 19 Jul 2018
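
For anyone following along, here is a minimal sketch of the "specific check" idea raised above. It is not code from Mu. The function name `find_odd_whitespace` is hypothetical, and it assumes the offending character is Unicode whitespace such as a no-break space (U+00A0); the thread doesn't say which character was actually pasted.

```python
import unicodedata

# Whitespace that Python source normally contains.
ALLOWED = {" ", "\t", "\r", "\n"}


def find_odd_whitespace(text):
    """Return (line, column, description) for each whitespace character
    that is not a plain space, tab, carriage return or newline.

    Hypothetical helper for illustration only.
    """
    hits = []
    # Split on "\n" only, so other Unicode line separators stay visible.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch.isspace() and ch not in ALLOWED:
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((line_no, col, "U+%04X %s" % (ord(ch), name)))
    return hits


# Assumed sample: a no-break space hiding after the opening bracket.
sample = "x = [\u00a01, 2]\ny = 3"
print(find_odd_whitespace(sample))
```

A check like this only reports where the invisible characters are; it leaves the pasted text untouched.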

> Should we add some specific check or automatically normalise spaces?

When it's pasted text I'd vote to normalise automatically (as you say, the difference is invisible), because the text is already broken.

ZanderBrown on 19 Jul 2018

My vote would be to normalise to space.

My guess is someone wrote the example code, copied it somewhere horrible like "Word" (which munged the whitespace in incomprehensible ways) and then copied the whole document from Word into the website's CMS. Or perhaps it's even the CMS being "clever".

The point is, it's easy to correctly display code in the browser (hint: use the <code> tag).

@carlosperate, you have visibility on this... what could it be from micro:bit's point of view? If your example code contains odd unicode whitespace characters, is there something you could do to clean them before publication..?

Finally, my main concern is that Mu doesn't crash. Whether we should handle weird whitespace characters accidentally copied and pasted from someone else's website is moot (i.e. it's the third party publisher's code that's wrong).

ntoll on 19 Jul 2018

Thanks, both. In principle, although I've been saying "normalise" I don't think this is the place for actual unicode normalisation; rather I think a regex ought to take care of it. [*]

However, I'm a tiny bit concerned that if I convert any whitespace to a single U+0020 space, which is the obvious operation, someone, somewhere will be surprised that their code has changed. Do we want to do this?

[*] I know, I know...

tjguk on 19 Jul 2018
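
A minimal sketch of the regex approach described above, under stated assumptions. It is not the fix pushed in #570 (the thread doesn't show that code), and the names `ODD_WHITESPACE` and `normalise_spaces` are hypothetical. It relies on the fact that, for `str` patterns in Python 3, `\s` matches Unicode whitespace.

```python
import re

# Matches any whitespace character except space, tab, CR and LF.
# [^\S ...] reads as "not non-whitespace, and not one of the listed characters".
ODD_WHITESPACE = re.compile(r"[^\S \t\r\n]")


def normalise_spaces(text):
    """Replace each unusual whitespace character with one U+0020 space.

    Hypothetical helper for illustration only. Note that it also rewrites
    whitespace inside string literals, which is the surprise discussed above.
    """
    return ODD_WHITESPACE.sub(" ", text)


# Assumed sample: a no-break space and an em space in otherwise valid code.
print(repr(normalise_spaces("a\u00a0=\u20031\n\tb")))
```

Indentation made of ordinary tabs and spaces is left alone, so only the invisible characters change.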

@tjguk upon reflection... I think "no" is the simplest / easiest answer to "do we want to do this?"

All Mu needs to do is not crash. If we get reports of strange results from the checker because of munged-up third party example code, then we should reach out to said third party and help them address this (as we've done above by pinging @carlosperate).

Does this make sense..? A classic case of KISS. ;-)

ntoll on 19 Jul 2018

As long as we don't change whitespace in literals [*], and only do it when text is pasted (not when it's loaded from a file), I wouldn't have thought the user would notice (as mentioned, it was probably pasted from Word).

[*] Here we go with the r̵̡̀e̸̡͆g̷͈͊͘ḛ̴͋x̵̢̟̓...

ZanderBrown on 19 Jul 2018
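
To make the "don't change whitespace in literals" idea concrete, here is a rough sketch, again hypothetical and not taken from Mu. The names `TOKEN` and `normalise_outside_literals` are made up for illustration. It only recognises simple single-line string literals; triple-quoted strings and apostrophes in comments are not handled, which hints at how quickly this kind of context-sensitive replacement gets complicated.

```python
import re

# Either a simple one-line string literal (kept as-is) or a single
# unusual whitespace character (replaced with a plain space).
TOKEN = re.compile(
    r"""
    (?P<literal>
        "(?:\\.|[^"\\\n])*"     # double-quoted string on one line
      | '(?:\\.|[^'\\\n])*'     # single-quoted string on one line
    )
  | (?P<odd>[^\S \t\r\n])       # whitespace other than space/tab/CR/LF
    """,
    re.VERBOSE,
)


def normalise_outside_literals(text):
    """Replace unusual whitespace with U+0020, except inside simple literals.

    Hypothetical helper: triple-quoted strings and comments are not handled.
    """
    def _replace(match):
        if match.group("literal") is not None:
            return match.group("literal")  # leave the literal untouched
        return " "
    return TOKEN.sub(_replace, text)


# Assumed sample: no-break spaces both inside and outside a string literal.
print(repr(normalise_outside_literals('x\u00a0= "a\u00a0b"\u00a0+ 1')))
```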

Agree with @ntoll. I would have said that this was a corner-case, except for the fact that it's a cut-and-paste from what should be a reliable site, so people will assume that the fault is in Mu. I've pushed the test now and all checks pass locally.

tjguk on 19 Jul 2018

@ZanderBrown I think that's exactly the issue: as soon as you start trying to make some kind of context-sensitive replacement, we're creating a rod for our own backs.

tjguk on 19 Jul 2018

> @carlosperate, you have visibility on this... what could it be from micro:bit's point of view? If your example code contains odd unicode whitespace characters, is there something you could do to clean them before publication..?

Do you mean online? It should be the responsibility of the website to make sure that the code can be copied and pasted.
Did something on microbit.org contain odd characters?

I also agree: normalising in Mu is a can of worms (as it would be in a website publishing system). Python/MicroPython will throw an error pointing at the offending line/character and then, as Nicholas has said, if a user reports an issue here we can try to contact the right person to fix the resource.

carlosperate on 19 Jul 2018

@carlosperate yeah... the original reporter (on Gitter, you'll see it in the chat history) said he cut and pasted it from one of the projects on the micro:bit website.

Of course, it could be something about the settings in his computer too...

Ugh... computers suck. ;-)

ntoll on 19 Jul 2018

@tjguk thanks for your work with this. I've only two more tutorials to write today (shouldn't take long) and then I'm going to merge all the things and start testing in preparation for tomorrow.

ntoll on 19 Jul 2018

Seems to be this one: https://www.microbit.co.uk/musicfest/that-bass

tjguk on 19 Jul 2018

And, sure enough, search for

yeahmymomma=[

and you can see the somewhat messy text behind the scenes.

tjguk on 19 Jul 2018

Oh god... that's code I wrote for the BBC. I emailed it to them and they put it on the website in the form of screenies and badly pasted code.

Oh god... that's the old website which I'm not sure is under the control of micro:bit foundation AFAICT.

Oh god... the link at the end is to the ancient "Touch Develop" website which is buggy.

;-)

ntoll on 19 Jul 2018