Mu: Fix unicode error with dodgy pasted text

Created on 17 Jul 2018 · 19 Comments · Source: mu-editor/mu

See log file here. On OSX.

https://pastebin.com/eVMQcrKA

All 19 comments

@ntoll would you like me to look at this?

@tjguk yes please (if you have time). I'm up to my eyes in trying to get cryptographic certificates out of certification authorities in time for Friday! :-)

@tjguk apologies for the prompt, but is this going to happen any time soon? I'm conscious of time (it's running out!) and if you won't be able to do it soon then I'll step up and attempt a fix. No pressure, just trying to work out what needs to be done, by when and by whom. :-)

I've pushed a fix in #570 without tests so that (a) it's there; and (b) others can test. Just about to see how to test. Fortunately it's easily reproducible.

However... we have a secondary problem, especially if this is a cut-and-paste from a popular site: it's almost impossible to guess what's wrong when the checker (or [Run]) does complain, because the space is indistinguishable from an 0x20 space.

Should we add some specific check or automatically normalise spaces?
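
For illustration, a "specific check" could look something like the sketch below. It is not code from Mu: the function name `find_odd_spaces` is made up for this example, and it assumes that "odd space" means any character in Unicode category Zs other than U+0020.

```python
import unicodedata


def find_odd_spaces(text):
    """Return (line, column, code point, name) for each space-like
    character (Unicode category Zs) that is not a plain U+0020 space."""
    found = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch != " " and unicodedata.category(ch) == "Zs":
                found.append(
                    (line_no, col, "U+%04X" % ord(ch), unicodedata.name(ch))
                )
    return found


# Hypothetical pasted snippet with a U+2005 hiding on line 2.
sample = "x = 1\ny\u2005= 2\n"
for hit in find_odd_spaces(sample):
    print(hit)
```

A report like this would at least tell the user which invisible character is the culprit, without changing their text.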

Should we add some specific check or automatically normalise spaces?

When it's pasted text I'd vote for automatically normalising (as you say, the difference is invisible) because the text is already broken.

My vote would be to normalise to space.

My guess is someone wrote the example code, copied somewhere horrible like "Word" (which munged the whitespace in incomprehensible ways) and then they copied the whole document from Word into the website's CMS. Or perhaps it's even the CMS being "clever".

The point is, it's easy to correctly display code in the browser (hint: use the <code> tag).

@carlosperate, you have visibility on this... what could it be from micro:bit's point of view? If your example code contains odd unicode whitespace characters, is there something you could do to clean them before publication..?

Finally, my main concern is that Mu doesn't crash. Whether we should handle weird whitespace characters accidentally copied and pasted from someone else's website is moot (i.e. it's the third party publisher's code that's wrong).

Thanks, both. In principle, although I've been saying "normalise" I don't think this is the place for actual unicode normalisation; rather I think a regex ought to take care of it. [*]

However, I'm a tiny bit concerned that if I convert any whitespace to a single U+0020 space, which is the obvious operation, someone, somewhere will be surprised that their code has changed. Do we want to do this?

[*] I know, I know...
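
A minimal sketch of the regex idea, assuming "any whitespace" means the Unicode space separators (category Zs) other than U+0020, with tabs and newlines left alone. The names `ODD_SPACES` and `normalise_spaces` are invented for this example and are not taken from #570.

```python
import re

# Unicode "space separator" (Zs) characters other than the plain U+0020.
ODD_SPACES = re.compile("[\u00a0\u1680\u2000-\u200a\u202f\u205f\u3000]")


def normalise_spaces(text):
    """Replace each odd space with one ordinary U+0020 space."""
    return ODD_SPACES.sub(" ", text)


# Hypothetical pasted line: a U+2005 after the colon and a
# no-break space (U+00A0) inside the string literal.
pasted = "if x:\u2005print('a\u00a0b')\n"
cleaned = normalise_spaces(pasted)
print(repr(cleaned))
```

Note that the no-break space inside the string literal is rewritten too, which is exactly the kind of change that might surprise someone.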

@tjguk upon reflection... I think "no" is the simplest / easiest answer to "do we want to do this?"

All Mu needs to do is not crash. If we get reports of strange results from the checker because of munged-up third party example code, then we should reach out to said third party and help them address this (as we've done above by pinging @carlosperate).

Does this make sense..? A classic case of KISS. ;-)

As long as we don't change whitespace in literals [*], and only do it when text is pasted (not when it's loaded from a file), I wouldn't have thought the user would notice (as mentioned, it was probably pasted from Word).

[*] Here we go with the regex...

Agree with @ntoll. I would have said that this was a corner-case, except for the fact that it's a cut-and-paste from what should be a reliable site, so people will assume that the fault is in Mu. I've pushed the test now and all checks pass locally.

@ZanderBrown I think that's exactly the issue: as soon as you start trying to make some kind of context-sensitive replacement, we're creating a rod for our own backs.

@carlosperate, you have visibility on this... what could it be from micro:bit's point of view? If your example code contains odd unicode whitespace characters, is there something you could do to clean them before publication..?

Do you mean online? It should be the responsibility of the website to make sure that the code can be copy/pasted.
Did something on microbit.org contain odd characters?

I also agree: normalising in Mu is a can of worms (as it would be in a website publishing system). Python/MicroPython will throw an error pointing at the offending line/character and then, as Nicholas has said, if a user reports an issue here we can try to contact the right person to fix the resource.
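
To show what "pointing at the trouble line" looks like, here is a small sketch using CPython's built-in `compile()`. The sample source is made up, the exact wording of the error message varies between Python versions, and MicroPython's output may differ.

```python
# Hypothetical pasted code: line 2 has a U+2005 where a space should be.
source = "x = 1\ny\u2005= 2\n"

line_no = None
try:
    compile(source, "<pasted>", "exec")
except SyntaxError as err:
    # lineno and offset locate the offending character.
    line_no = err.lineno
    print("line", err.lineno, "column", err.offset, "-", err.msg)
```

The location is accurate, but the character still looks like an ordinary space in the editor.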

@carlosperate yeah... the original reporter (on Gitter, you'll see it in the chat history) said he cut and pasted it from one of the projects on the micro:bit website.

Of course, it could be something about the settings in his computer too...

Ugh... computers suck. ;-)

@tjguk thanks for your work with this. I've only two more tutorials to write today (shouldn't take long) and then I'm going to merge all the things and start testing in preparation for tomorrow.

And, sure enough, search for

yeahmymomma=[

and you can see the somewhat messy text behind the scenes.

Oh god... that's code I wrote for the BBC. I emailed it to them and they put it on the website in the form of screenies and badly pasted code.

Oh god... that's the old website which I'm not sure is under the control of micro:bit foundation AFAICT.

Oh god... the link at the end is to the ancient "Touch Develop" website which is buggy.

;-)

The space character it's using is the somewhat arcane U+2005 http://decodeunicode.org/en/u+2005

I wouldn't have any real qualms about translating that to a single space on the way through. (Somewhere, somehow; probably inside save-and-encode)
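
As a rough sketch of "translating on the way through", assuming a single choke point where text is encoded before being written out: `encode_for_save` is a hypothetical name and this is not necessarily what #570 does.

```python
def encode_for_save(text):
    """Hypothetical save hook: map U+2005 to a plain space, then encode."""
    return text.translate({0x2005: " "}).encode("utf-8")


print(encode_for_save("y\u2005= 2"))
```

Limiting the table to the one character actually seen in the wild keeps the change narrow; more entries could be added if other culprits turn up.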

Fixed in #570.

Thanks @tjguk
