Tdesktop: Proposal: Better RTL detection

Created on 30 Aug 2017 · 12 Comments · Source: telegramdesktop/tdesktop

Howdy!

A recurring theme I've noticed in most chat programs today (Telegram included 😢) is that text direction of a given message is determined by the first (relevant) character, which is a shame because:

- If I start my message with a link, the whole message is instantly LTR.
- If I mix languages in a message (insert English words inside Hebrew/Arabic/whatever) and the first word is of the wrong direction, the message looks very bad.

So I propose a better algorithm for detecting the desired direction of a message, one that isn't much more complicated: count the number of characters in each direction (excluding links), and the direction with more characters wins. More formally:

1. Strip the message of links.
2. Let x be the number of RTL characters in the message.
3. Let y be the number of LTR characters in the message.
4. If x > y, set direction to RTL, finish.
5. If x < y, set direction to LTR, finish.
6. If x equals y, use the current algorithm (based on the first character).
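The steps above can be sketched roughly as follows. This is only an illustration in Python, not Telegram's code: `LINK_RE` is a deliberately simplified, hypothetical link pattern, the function names are made up, and "RTL/LTR character" is assumed to mean a character whose Unicode bidi class is `R`/`AL` or `L` respectively.

```python
import re
import unicodedata

# Hypothetical, simplified link pattern (an assumption for this sketch);
# a real client would reuse its own link detection.
LINK_RE = re.compile(r"(?:https?://|www\.)\S+")

RTL_CLASSES = {"R", "AL"}  # strong right-to-left bidi classes
LTR_CLASSES = {"L"}        # strong left-to-right bidi class


def first_strong_direction(text, default="ltr"):
    """Current behaviour: the first strong character decides."""
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in RTL_CLASSES:
            return "rtl"
        if cls in LTR_CLASSES:
            return "ltr"
    return default  # no strong character at all


def majority_direction(message):
    """Proposed behaviour: the direction with more strong characters wins."""
    stripped = LINK_RE.sub("", message)  # step 1: strip links
    x = sum(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in stripped)
    y = sum(unicodedata.bidirectional(ch) in LTR_CLASSES for ch in stripped)
    if x > y:
        return "rtl"
    if x < y:
        return "ltr"
    # Tie: fall back to the first-strong rule on the unmodified message.
    return first_strong_direction(message)


print(majority_direction("https://example.com שלום עולם"))
print(majority_direction("OK שלום לכולם"))
```

Both sample messages start with Latin characters, so the first-character rule would render them LTR, while the counting rule picks RTL for them.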

Examples of this algorithm can be seen in Google Hangouts (which is the only chat app I know of that actually has a smarter algorithm than "look at the first character").

Of course, this is a proposal and the concrete algorithm is open to change, but I think that it's a very good compromise between code complexity and correctness.

auto closed

MadaraUchiha

👍5 👎1

Most helpful comment

Using the percent of RTL or LTR characters to determine the direction is unpredictable and very bad UX unless you are working with large text paragraphs. This can be seen on Twitter which seems to implement a similar algorithm and you can’t really tell (without counting the characters in your head) if a tweet will end up left to right or right to left.

The first strong character algorithm is at least predictable and can be controlled without rewriting the text to have a different character count. The accessibility of control characters should not be an issue: the application can easily have an RTL/LTR button/shortcut/whatever that inserts RLM/LRM in front of the text before rendering it.
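A minimal sketch of that idea, in Python for illustration only: a hypothetical `force_direction` helper (what a direction button might call) prefixes the text with RLM or LRM, and a simple first-strong check then sees the mark first. The function names are assumptions, not part of any Telegram API.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK, bidi class L
RLM = "\u200f"  # RIGHT-TO-LEFT MARK, bidi class R


def first_strong_direction(text, default="ltr"):
    """Return the direction of the first strong (L, R or AL) character."""
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in ("R", "AL"):
            return "rtl"
        if cls == "L":
            return "ltr"
    return default


def force_direction(text, direction):
    """Hypothetical button handler: prefix the text with a directional mark."""
    mark = RLM if direction == "rtl" else LRM
    return mark + text


message = "https://example.com שלום"
print(first_strong_direction(message))
print(first_strong_direction(force_direction(message, "rtl")))
```

The marks are invisible when rendered, so the user controls the direction without changing the visible text.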

khaledhosny on 4 Sep 2017

👍3

All 12 comments

This would be slow tho, and not easy to implement.
Also, what about emoji, or numbers, or other symbols used in both layouts?

stek29 on 30 Aug 2017

Once per message? (or even once per keystroke?) doesn't sound too slow. It's not something you continually need to do per frame, and if you choose the more UX-y per keystroke, you can just "remember" the current counters for the message on the side, and just increase the relevant one.
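A rough sketch of the "remember the counters" idea, in Python for illustration. The `DirectionCounter` class is hypothetical, and it ignores link stripping, which would need extra bookkeeping in an incremental setting.

```python
import unicodedata


class DirectionCounter:
    """Keeps running counts of strong RTL and LTR characters (hypothetical)."""

    def __init__(self):
        self.rtl = 0
        self.ltr = 0

    def _apply(self, text, sign):
        # Adjust the relevant counter for each strong character in `text`.
        for ch in text:
            cls = unicodedata.bidirectional(ch)
            if cls in ("R", "AL"):
                self.rtl += sign
            elif cls == "L":
                self.ltr += sign

    def insert(self, text):
        """Call when characters are typed or pasted."""
        self._apply(text, +1)

    def delete(self, text):
        """Call with the characters that were removed."""
        self._apply(text, -1)

    def direction(self, fallback="ltr"):
        """Majority direction; `fallback` stands in for the first-character rule."""
        if self.rtl > self.ltr:
            return "rtl"
        if self.rtl < self.ltr:
            return "ltr"
        return fallback


counter = DirectionCounter()
counter.insert("שלום")
counter.insert(" ok")
print(counter.direction())
```

Each keystroke touches a single counter, so the cost per edit does not depend on the message length.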

As for numbers/emoji/media/symbols, they do not count for either RTL or LTR. (just like they don't today, if the first character is a number, it would look at the second character, etc)
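To make "do not count for either" concrete, here is a small Python sketch that looks up the Unicode bidi class of a few sample characters. Treating only the classes `L`, `R` and `AL` as strong is an assumption of this sketch; the helper names are made up.

```python
import unicodedata

STRONG_CLASSES = {"L", "R", "AL"}  # everything else is weak or neutral


def bidi_class(ch):
    """Unicode bidirectional class of a single character."""
    return unicodedata.bidirectional(ch)


def is_strong(ch):
    """True if the character would count towards RTL or LTR."""
    return bidi_class(ch) in STRONG_CLASSES


# Latin letter, Hebrew letter, Arabic letter, digit, punctuation, space, emoji
for ch in ["a", "ש", "ع", "7", "!", " ", "😢"]:
    print(repr(ch), bidi_class(ch), is_strong(ch))
```

Digits come out as `EN`, the space as `WS`, and the punctuation mark and emoji as `ON`, so none of them would move either counter.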

Can you point me towards the relevant piece of code where the direction is selected? I'm willing to PR this.

MadaraUchiha on 30 Aug 2017

I can try to alter the way it works for messages in bubbles, but not in the message input field. Is the problem there as well?

john-preston on 30 Aug 2017

The problem is there as well, yes. And I do think that it's good if we can change it there as well, but the bubbles are more prominent (you write once, read many times and by many people). ~~I can prepare an example of how that can be done in a fiddle if you'd like, I don't think it would be more than 10 LoC~~

Here's an example illustrating what I'm after: https://jsfiddle.net/cv45ku2s/.

This could be further optimized to only consider one character at a time, etc. If you have proper abstractions for the input/transcript combo, it might be a tiny bit trickier, but likely not by much.

MadaraUchiha on 30 Aug 2017

It is standardized that the base direction of a given message is determined by the first strong directional character. If you do not like the direction in a specific case you can fix it by adding a U+200E LEFT-TO-RIGHT MARK or U+200F RIGHT-TO-LEFT MARK. See Unicode Standard Annex #9 or the German Wikipedia article on bidirectional control characters for other possibilities.

The maintainers should check whether the Unicode bidirectional algorithm is implemented correctly rather than invent their own. After all, it is inevitable that the people at Unicode have planned more shrewdly than any chat application developer can ever do, and there are also libraries for displaying BiDi text well. It is quite disappointing that you can run ldd on the Telegram executable and not find any references to HarfBuzz or Pango in it.

And I am serious in this case, go check it, for I have been rather unlucky using the bidirectional control characters from the General Punctuation block inside of Telegram.

Other possibilities would arise if Telegram would support full HTML at least by explicit enabling (not only Markdown), because HTML contains tags and attributes to manipulate bidirectional positioning.

Socialdarwinist on 30 Aug 2017

👍1

@Socialdarwinist Harfbuzz should be used internally for text shaping in Qt, so it should be used in Telegram as well — perhaps no references because everything is linked statically in the executable.

john-preston on 30 Aug 2017

@Socialdarwinist Your links are broken (Connection Refused), which is a bit alarming on unicode.org. Aside from that,

> If you do not like the direction in a specific case you can fix it by adding a U+200E LEFT-TO-RIGHT MARK or U+200F RIGHT-TO-LEFT MARK

Really? "Enter a character that's not found in any keyboard" is your solution? What about mobile (if/when it reaches there)?

> it is inevitable that the people at Unicode have planned more shrewdly than any chat application developer

How did you come to that conclusion? The people in unicode are omniscient now?

I completely disagree that just because there's a standard you have to implement it, even if it's suboptimal. Especially in a chat environment, it doesn't necessarily make sense to only take the first (relevant) character into account.

MadaraUchiha on 30 Aug 2017

👎1

> How did you come to that conclusion? The people in unicode are omniscient now?

It is for the same reason that open-source software is supposed to be better: if many have an interest in it working, many people look at it. And for Unicode, a great many proficient people review things at many stages before publication. If there is something wrong in Unicode, the whole world is to blame. I point out that you claim the bidirectionality standard is suboptimal while offering no better one, at least none visible from your side: cocky. If you know an improvement, you can surely initiate the Unicode process to get it adopted.

> Enter a character that's not found in any keyboard

As you might know, keyboards do not contain characters, but keyboard layouts do. /usr/share/X11/xkb/symbols/ara can easily get a new layout, especially as the default keyboard layout has plenty of free room. I have been toying for some weeks with the idea of adding bidi characters to it. If somebody wants to beat me to it, this is what I have collected so far to add to the symbols/ara file in XKB:

U+066B ARABIC DECIMAL SEPARATOR, U+066C ARABIC THOUSANDS SEPARATOR, U+0640 ARABIC TATWEEL, U+200C ZERO WIDTH NON-JOINER

U+202A LEFT-TO-RIGHT EMBEDDING, U+202B RIGHT-TO-LEFT EMBEDDING, U+202C POP DIRECTIONAL FORMATTING, U+200E LEFT-TO-RIGHT MARK, U+200F RIGHT-TO-LEFT MARK, U+061C ARABIC LETTER MARK

U+2066 LEFT‑TO‑RIGHT ISOLATE, U+2067 RIGHT‑TO‑LEFT ISOLATE, U+2068 FIRST STRONG ISOLATE, U+2069 POP DIRECTIONAL ISOLATE

U+2019 RIGHT SINGLE QUOTATION MARK, U+00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK, U+00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK, U+2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK, U+203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK

Additionally, there is direct Unicode Input possible in GTK+ (Ctrl+Alt+U) and even better in IMEs like there is in IBus and Fcitx with search by name.

For macOS it is also possible to add keyboard layouts. For Windows, people are doomed for using that system, as keyboard layouts there are binary. One can find keyboard layouts installable on Windows, but the compatibility does not appear to last. People choose to be dependent on Microsoft’s grace, that is what they get.

Socialdarwinist on 30 Aug 2017

👍1

khaledhosny on 4 Sep 2017

👍3

@khaledhosny That's actually a really good option as well. Some button or a control to insert the appropriate Unicode character to control direction, assuming those are supported by all major systems, is something I can definitely get behind.

MadaraUchiha on 4 Sep 2017

I have now written up my scheme for a new default Arabic keyboard layout. Contriving it has taken my day, and I have yet to put the real Arabic characters into the comments instead of (or in addition to?) the transcriptions used now. But given my experience with XKB layouts it worked at the first try, so I have published it this evening; I will just leave it for a few days to digest it and to give you all the opportunity to evaluate it. The new version of xkeyboard-config is scheduled for the 31th of September.

@khaledhosny @behdad or I don’t know who else, call your polyglot mates to have a look at it! I have mapped the bidirectional control characters to it except the overriding ones (I don’t think LRO and RLO are supposed to be regularly used for text?) and as there has been much unused room on four levels I have mapped all characters additionally used in the Arabic scripts of the Pashto, Sindhi, Punjabi, Urdu, Kashmiri, Turkic and other languages next to the Arabic and Persian letters that have been present in the keyboard layout before my engagement (the same way I have mapped virtually the whole Cyrillic to a Russian-based layout).

I think you can comment at that gist for specific remarks about the layout, as those would be beyond the topic here.

As for this issue here: once that keyboard layout is shipped, the issue is solved on Linux. As I have just seen while writing this comment, the default Persian layout already includes the embedding and override characters, and the default Hebrew one the RIGHT-TO-LEFT MARK and the LEFT-TO-RIGHT MARK, and my edition of the Arabic default keyboard layout stretches the signs out.

I dare assume that the suggestion of writing bidirectional characters directly via the keyboard is sore persuasive. But the OP is from Israel according to his profile, so it becomes even more amusing to hear him complain about propositions to “Enter a character that's not found in any keyboard”, as the Hebrew base layout contains:

 key <AE09> { [     9,  parenright, U200E   ]}; // LRM; Paren Mirrored
 key <AE10> { [     0,  parenleft,  U200F   ]}; // RLM; Paren Mirrored

What, people can’t help themselves because they use Windows or macOS? Then this issue has to be closed because it is an operating system issue.

Socialdarwinist on 5 Sep 2017

Hey there!

We're automatically closing this issue since there has been no activity in it for 398 days. We therefore assume that the user has lost interest or resolved the problem on their own. Closed issues that remain inactive for a long period may get automatically locked.

Don't worry though; if this is in error, let us know with a comment and we'll be happy to reopen the issue.

Thanks!

(Please note that this is an automated comment.)