Tldr: Support for multiple (human) languages

Created on 17 Sep 2018 · 52 comments · Source: tldr-pages/tldr

It would be nice if tldr pages existed in other (human) languages too, and not just English. Is there any plan for tldr pages to support multiple languages?

architecture decision translation

All 52 comments

I am sorry, but the entire page parsing logic is based around English punctuation marks. Like :, >. We cannot _fully_ shift to other languages. English punctuation marks are compulsory.

Just simply converting the pages to other languages is possible. Although it will take architectural changes, which need to be implemented by every client out there. So, even if we make pages in other languages, until the client you are using implements it, you won't be able to use tldr.

But more importantly, someone has to _write_ the pages. This is the first request to support multiple languages. If there is sufficient interest, then we can decide on how to lay out the pages, and then each client can implement it gradually.

For the most part, technical documentation is non-existent in most
Indian languages. And, owing to their size, translating existing
technical documentation requires a lot of labour to achieve even a small
degree of completeness. So, my thinking was that due to its small size,
the tldr pages could be translated with less effort and quickly reach
sufficient completeness to attract more translators and bootstrap the
process of translating bigger manuals. There are large sections of
society that would come forward to learn more about computers if that
knowledge was presented to them in their mother tongue.

Sounds like a great idea, @arunisaac! What architecture / folder structure would you suggest? The current structure for reference is to have a pages folder, with the individual platform folders such as common, windows, linux, etc. inside there.

Just thought of this also: We'll need someone who can speak the target languages to review the PRs as we do for English. Quite often we suggest wording and grammatical changes etc. to improve the clarity of the page.

I wonder if something such as GitLocalize could help with this?

We could have it in a structure such as:

pages/
  en/
    common/
    linux/
    osx/
    windows/
  etc/


But yes, as @sbrl mentioned, this would require more reviewers for other languages.

Breaking down into sub-folders will break all clients immediately. I would go with the usual github idiom for storing READMEs in multiple languages.

Like - <filename>_<2-letter-lang-code>.md. And leave the default as english. So that we can add new files without breaking anything. And clients can implement new language support gradually at ease.

Hmm. That would definitely make more sense. I wasn't really thinking.

Like - <filename>_<2-letter-lang-code>.md. And leave the default as
english.

This sounds great. May I send pull requests for translated tldr pages
now?

We'll need someone who can speak the target languages to review the
PRs as we do for English. Quite often we suggest wording and
grammatical changes etc. to improve the clarity of the page.

Review is good to have. But, what do we do if somebody sends in a PR for
a language with no reviewers? Will that PR be indefinitely stalled?

I am not against this idea, but could you please give an example of what language you would choose to translate in? Although it has been said above in the comments, I would request some specific information. For starters I have some questions:

  • What problems would the translated pages solve? Who is our target audience?

  • What is the value add of translating as it would be very painful to keep all the translation copies in sync? Just as an example, what if I add/modify a page and am not able to update the translation?

Just a thought. I support the idea, provided you can convince me :)

/cc @sbrl @agnivade @pxgamer @arunisaac

I am not against this idea, but could you please give an example of
what language you would choose to translate in?

I would translate to Tamil, my mother tongue. It is the only language
other than English that I have any proficiency in. I am ready to send in
at least a few PRs the moment there is some consensus and approval for
this proposal.

  • What problems would the translated pages solve? Who is our target audience?

The translated pages would make computer knowledge more accessible to
a large section of the population who don't have a good command over
English and would learn much better in their mother tongue.

Personally, for my translated Tamil tldr pages, my target audience is
young people who have schooled in Tamil medium, and find the English
language to be a barrier for them acquiring computer knowledge with
ease.

  • What is the value add of translating as it would be very painful to
    keep all the translation copies in sync? Just as an example, what if I
    add/modify a page and am not able to update the translation?

Translating and keeping the translations in sync would no doubt be a
painful exercise, and require a lot of labour. But, the amount of labour
that needs to be invested into translation is negligible compared to the
amount of labour that can be saved for non-English speakers.

Also, translations need not be immediately synced. Consider Wikipedia
for example. Often, the English Wikipedia has the most up to date
articles. The other languages eventually sync up based on the
availability and willingness of translators. And, sometimes they don't
sync and have their own unique versions of the same article. Such
divergence is also acceptable.

The need to speak and learn English is not going to vanish just because
tldr pages are translated to other languages. And, people might still be
required to learn English to acquire more advanced computer
proficiency. But, if we can break down at least some barriers to
knowledge access, and pique the curiosity of at least a few people, our
labour is well worth it.

PS: This proposal occurred to me when I was considering translating man
pages. I then realized that translating tldr pages would be much easier
and more worth the trouble.

You can send PRs, but note that without any client support, those pages will be useless. Unless you want people to manually open github pages.

I would recommend that if you want this, fork this repo and add pages in your own repo. And also add support to the client that you are using. Clients can be configured to point to a custom repo instead of the main repo. That way, you can use a client to point to your own repo and get the translated pages.

Once we have at least one or two clients supporting multi-language pages, we can easily shift those pages to the main repo.

The reason I ask this is because I don't want to add pages to the main repo unless there are clients which support it. Extra pages mean extra bytes that need to be downloaded when you update the client. And if those pages are useless, then those bytes are entirely wasted. Plus, we have this great feature to be able to use custom repos. So let's use that for now, and when clients start to have support for this, we can move pages here.

@agnivade sounds like a plan. Do keep us up-to-date with your progress, @arunisaac! When you've got a collection of pages, perhaps we can start implementing support in some of the clients?

I am somewhat concerned about keeping 'tldr-pages' unified. It would be awkward if it turned out in the long run that we end up with an English tldr-pages, a French tldr-pages, etc.

The other awkward question is about the language of the pull requests themselves. Should we standardise on English in the pull requests - even though the page in question is actually in French? Or should we request that pull requests for French translations of a page should also be in French?

If this takes off, it might also be necessary to translate the README / the style guide / etc., but that's a whole different problem for a separate issue I think.


Regarding the structure, what about a structure like this:

  • pages/ - Master English copy
  • pages.fr/ - French pages directory
  • pages.de/ - German pages directory
  • etc.

Then all the pages for a language are neatly tied together in a single directory. I wonder if GitHub lets you download a single folder as a .zip, rather than the whole repo.


What problems would the translated pages solve?

I think that @arunisaac answered this in this comment, @mfrw.


I'd love to get @waldyrious's opinion on this.

Regarding the PR reviewing process:

We'll need someone who can speak the target languages to review the
PRs as we do for English. Quite often we suggest wording and
grammatical changes etc. to improve the clarity of the page.

Review is good to have. But, what do we do if somebody sends in a PR for
a language with no reviewers? Will that PR be indefinitely stalled?

Anyone can review PRs :) I would suggest applying a similar policy to what we currently have (2 approvals by repo maintainers). The adaptation could be that, for translation PRs, if it has approvals from two regular users other than the author (besides passing syntax checks, etc.), a maintainer can merge the PR, perhaps running it through Google Translate to perform a basic sanity check and maybe ask for clarifications. This, of course, assumes that no maintainer speaks the language introduced in that PR.

I believe this workflow would fit our current processes well, and ensure that translation PRs would not remain indefinitely stalled.

You can send PRs, but note that without any client support, those pages will be useless. Unless you want people to manually open github pages.

Yes, this is an important point (unless users manually pass the name of the command suffixed with the language code, which is not practical). However, although I understand the motivation for suggesting forking the repo, I think it would be more interesting for the whole community in the long term if @arunisaac produced only a few pages as a proof of concept, and spent his energy in contacting and coordinating with the maintainers of a couple clients to add support for this feature, aiming to have the translated pages in the main repo.

To do that, the user interface and default behavior of the client would need to be specified, e.g. the client could choose to automatically show pages in the system language, if they are available, or start by a manual trigger, say with a configuration setting, environment variable, or a dedicated command option (say, --lang=XX).
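One possible precedence order for those triggers, as a minimal Python sketch. Nothing here is an agreed standard: the function name and the TLDR_LANG variable name are hypothetical, and the order (explicit option, then environment variable, then system locale, then English) is just one reasonable choice.

```python
import locale
import os


def choose_language(cli_lang=None, environ=None, system_locale=None):
    """Pick a language code: CLI option first, then environment, then system locale."""
    environ = os.environ if environ is None else environ
    if cli_lang:
        return cli_lang.lower()
    env_lang = environ.get("TLDR_LANG")  # hypothetical variable name
    if env_lang:
        return env_lang.lower()
    if system_locale is None:
        # Typically something like "fr_FR", or None when no locale is set.
        # The value format differs between platforms, so treat this as best effort.
        system_locale = locale.getlocale()[0]
    if system_locale and system_locale not in ("C", "POSIX"):
        return system_locale.split("_")[0].lower()
    return "en"  # English stays the default
```

A client would call `choose_language(args.lang)` once at startup and pass the result to whatever looks up the page.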

If the simplest approach is taken (which would depend on the client's author and the tools used to build it), it seems to me that it wouldn't be too hard to add support for this feature. And once we have clients supporting the pages, it would make sense to have the translated pages in this repo, which would encourage other clients to also add support for them. I think that this fits very well with tldr-pages' mission to demystify command-line tools :)

Sorry everyone for the long message. @sbrl, is there anything in particular you'd like me to comment on?

It's a complex feature! :P I think you've about covered everything I was going to ask you about, @waldyrious.

Yeah, that policy for PRs works for me. I've already got a browser extension that can translate a selection of the page, so that's easy for me to do.

Yeah, standardising a CLI flag & behaviour for clients is definitely a thing we should do. We could then add it to the tldr page for tldr :D

Regarding the structure of the translated pages, I've created a strawpoll for voting, since there's more than one structure and we haven't decided yet.

I think that a simple approach is best to start with. If we encounter issues down the line, we can always adjust / improve our strategy.

although I understand the motivation for suggesting forking the repo,
I think it would be more interesting for the whole community in the
long term if @arunisaac produced only a few pages as a proof of
concept, and spent his energy in contacting and coordinating with the
maintainers of a couple clients to add support for this feature,
aiming to have the translated pages in the main repo.

I strongly prefer having the translated pages in the main repo. If I
must fork this repo and maintain clients for the forked repo, I gain
very little from the tldr project. I might as well start my own separate
project leading to language based fragmentation of the community.

Would 10 translated pages suffice as a proof of concept before the pages
can be accepted into the main repo?

I think that this fits very well with tldr-pages' mission to demystify
command-line tools :)

Exactly, thank you! :-)

  • pages/ - Master English copy
  • pages.fr/ - French pages directory
  • pages.de/ - German pages directory
  • etc.

This structure is good. I have voted for this in the strawpoll.

Yeah, standardising a CLI flag & behaviour for clients is definitely a thing we should do. We could then add it to the tldr page for tldr :D

That's #1065, btw :) Note, however, that I'm not suggesting adopting a global standard as a pre-requisite for this functionality, otherwise we'd block progress unnecessarily. I was speaking in the context of the specific client(s) that decide to implement this feature.

Regarding the structure of the translated pages, I've created a strawpoll for voting, since there's more than one structure and we haven't decided yet.

I don't actually like the idea of a straw poll. By our community principles, decisions are not supposed to be made by simple majority of individual preferences, but rather by collective, informed discussion among interested community members. I'd prefer a table comparing the pros and cons of each proposal currently under consideration, which could be edited/completed if further information surfaces in the discussion.

By the way, this sort of change also touches on #190, although I'm only mentioning this FYI -- let's not attempt to bite more than we can chew all at once :)

Would 10 translated pages suffice as a proof of concept before the pages can be accepted into the main repo?

IMO that would be more than enough -- 3 or 4 would do the job already. That said, please note that in my previous comment I agreed with @agnivade that without at least one or two clients implementing support for such pages, it wouldn't make sense to add those pages to the main repo. So I'd suggest starting by working with one of the client authors to implement support for such pages, produce one or two pages (locally, even) to test the functionality, and then PRs to the main repo would make sense. Of course, this is my opinion -- we can discuss alternative approaches if there are other ideas that the community deems more reasonable.

If I must fork this repo and maintain clients for the forked repo, I gain very little from the tldr project.

I am not asking you to maintain clients. Rather, get in touch with client authors, or send PRs. Unless the clients have the ability to show the pages, there is not much benefit in creating the pages.

What's the behavior when doing something like tldr --lang=fr git and there is no French translation of the git page? Fall back to the English version? Display an error? What about when doing something like TLD_LANG=fr tldr git (or whatever the way to set settings is for a given client)?

I think it should display an error saying no page was found, but the same page exists in English language.

I think it could automatically show the English page, with the warning @agnivade mentioned at the top. That way it is still clear what's happening (no silent "smart" workaround), but convenience for the user is not sacrificed (e.g. by requiring them to run the command twice).

Similar suggestion for the operating system setting, it would be nice to provide a list.

tldr --lang=fr,en tar like tldr -o mac,linux. That way if you do not want a fallback (for example, to autocreate a github issue when not found), you can use the error code.
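The two fallback ideas above (warn and show English, or let the user give an ordered list and rely on the exit code) can be combined. The following is only a sketch with hypothetical names; `available` stands in for whatever index a real client would consult.

```python
import sys


def find_page(command, languages, available):
    """Return (lang, warning) for the first language that has the page.

    `available` maps a language code to the set of commands translated into it.
    """
    for lang in languages:
        if command in available.get(lang, set()):
            warning = None
            if lang != languages[0]:
                # Not the user's first choice: say so instead of switching silently.
                warning = f"no '{languages[0]}' page for {command}; showing '{lang}' instead"
            return lang, warning
    raise LookupError(f"no page for {command} in any of: {', '.join(languages)}")


def main(command, languages, available):
    """Exit-code wrapper: 0 when a page is found, 1 otherwise."""
    try:
        lang, warning = find_page(command, languages, available)
    except LookupError as err:
        print(err, file=sys.stderr)
        return 1
    if warning:
        print(f"warning: {warning}", file=sys.stderr)
    print(f"[{lang}] {command}")
    return 0
```

With this shape, passing only `["fr"]` gives the strict behaviour (an error and a non-zero exit code), while `["fr", "en"]` gives the fallback with a warning.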

OS automatically falls back to common. If the page does not exist in common, but only in multiple platforms, then it becomes an issue. Which page do we show ? We could complicate the logic further by checking if there are multiple pages, or only one page. If one, then show, else show a message. But not sure if we want to go to that extent.

Anyways, this issue is about languages. So let's keep it to that. Feel free to open a new issue on the client repo if you want this behavior.

Anyways, this issue is about languages. So let's keep it to that. Feel free to open a new issue on the client repo if you want this behavior.

Agreed -- let's have that discussion at #1065 or in a dedicated issue -- this thread is long enough already :)

I was suggesting that whatever fallback decision we have for one, we keep it consistent for the other.

I have submitted a pull request with proof of concept Tamil pages. I have created a separate pages.ta directory for Tamil pages. Do we have consensus on this directory organization structure?

If these translated pages are fine, can they be merged into master? That way, it would be easier to convince clients to implement multi-language support.

We also need someone who knows Tamil to review these translated pages. I can get somebody I know to do this. Will that be sufficient?

I don't think a consensus has emerged regarding the structure of the directories. I'd like to hear the thoughts from those who commented on that topic, i.e. @agnivade, @sbrl and @pxgamer.

Ideally we'd have each language in their own folder, as @pxgamer suggested, but even if we left the English pages in the root level (so as to not disrupt existing clients), we'd still have to duplicate the language folders for each of the platform folders common/, linux/, osx/, windows/.

So I'd be more inclined towards @agnivade's suggestion of simply placing the translated pages in the same folders as the English ones, but with a language code suffix in the filename.

The separator between the page name and the language suffix should be chosen in a way that doesn't conflict with existing characters. There are a few pages using underscore (e.g. pg_ctl, wpa_supplicant), many using hyphen (e.g. most of the git subcommand ones), and at least one using a period (update-rc.d). I'm not sure what else we could use that would be unambiguous. Perhaps an @ sign? It isn't used in any page name AFAICT, and I think it kinda makes sense. That means a translated page would be named like common/<filename>@<2-letter-lang-code>.md. What do you guys think?

I don't think a consensus has emerged regarding the structure of the
directories. I'd like to hear the thoughts from those who commented on
that topic, i.e. @agnivade, @sbrl and @pxgamer.

Ok. I'm fine with anything. I'll wait for a consensus to emerge.

The separator between the page name and the language suffix should be chosen in a way that doesn't conflict with existing characters.

Yeah, I actually suggested using the 2 letter ISO language code. So the length of the suffix is known. So even if a page has pg_ko_ko.md, we can construct the target page name from the command and the given language.

The only ambiguity is that, just by looking at a file name like pg_ko.md, it is not possible to tell whether the command name is pg_ko or pg. But one can just open the file if there is such confusion, and cases like these should be rare.

It's just that having a separator like @ is a bit odd for a file name.
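To illustrate the suffix scheme and its one ambiguity, here is a small Python sketch. The function names are hypothetical and not taken from any client; it assumes only the <filename>_<2-letter-lang-code>.md convention discussed above, with English pages left unsuffixed.

```python
def suffixed_page_name(command: str, lang: str = "en") -> str:
    """Build the file name for a command under the proposed suffix scheme."""
    if len(lang) != 2 or not lang.isalpha():
        raise ValueError(f"expected a 2-letter language code, got {lang!r}")
    # English stays unsuffixed so that existing clients keep working.
    if lang == "en":
        return f"{command}.md"
    return f"{command}_{lang.lower()}.md"


def possible_readings(file_name: str) -> list[tuple[str, str]]:
    """List every (command, lang) pair a file name could stand for."""
    stem = file_name.removesuffix(".md")
    readings = [(stem, "en")]  # the whole stem may be an English command name
    head, sep, tail = stem.rpartition("_")
    if sep and head and len(tail) == 2 and tail.isalpha():
        readings.append((head, tail))  # or a shorter command plus a language code
    return readings
```

Going from command and language to a file name is always unambiguous; only the reverse direction (reading a bare file name) can yield two candidates.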

I'd prefer the directory-based approach, as used in #2512. It provides a clear separation between different languages, so they don't get muddled up.

Are you okay with the platform folders being duplicated ?

I think replicating the folder structure is the best bet for retaining both backwards compatibility for existing clients and not having to ever worry about weird edge cases like "what happens if there's a command cp and cp_es, how would I translate that page to Spanish?".

Absolutely, @agnivade, and good point @MasterOdin!

We also need standards for commit messages. How about cp.ta: add page
instead of cp: add Tamil page like I did in the referenced pull
request? The former is shorter and a little more structured.

I don't think @ is a problematic character in any of the common OSes, but if you guys don't like the idea, I agree that the folder duplication is better than using _ as a separator.

As for the commit messages, I personally prefer the cp: add Tamil page format, as it is more explicit, but I'll go with whatever the majority prefers.

Ok then, let's go with the replicated folder structure, and commit message as cp: add Tamil page
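For reference, the replicated layout maps a page to a path roughly as follows. This is a sketch with a hypothetical helper name; it assumes English stays in pages/ and every other language lives in pages.<code>/, as in the pages.ta proof of concept.

```python
from pathlib import PurePosixPath


def page_path(command: str, platform: str = "common", lang: str = "en") -> PurePosixPath:
    """Return the repo-relative path of a page under the replicated layout."""
    # English keeps the original "pages" folder; other languages get "pages.<code>".
    root = "pages" if lang == "en" else f"pages.{lang}"
    return PurePosixPath(root) / platform / f"{command}.md"
```

Because the language lives in the directory name, a command called cp_es needs no special handling: its Spanish page would simply be pages.es/common/cp_es.md.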

We need to modify the Travis script scripts/build-index.js to also
index language while creating pages/index.json. Since we are going
with a replicated folder structure, would it be better to put index.json
outside the pages folder, that is, common to all language folders? I
imagine this would instantly break a lot of clients.

We could instead go with separate index.json files, one inside each
pages* folder. But, that would complicate things on the client side
and seems to defeat the purpose of having a unified index file.

What to do? Ideas?

For index.json - we can add another key called "languages" at the same level as "platform", and add whatever extra language support we have. English will be the default.

For shortIndex.json - it is difficult without making a breaking change. Maybe we can have a shortIndex2.json, which adds the language code for the page name.

"git-svn": {"platform": [], "languages": []}.

Then each client can add support gradually.
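A rough sketch of how a build step could produce entries of that shape. It is written in Python for brevity rather than as a patch to scripts/build-index.js, and the input format (a list of language, platform, command triples) is an assumption made for the example.

```python
def build_index(pages):
    """Group (language, platform, command) triples into index entries."""
    index = {}
    for lang, platform, command in pages:
        # Existing clients keep reading "platform"; "languages" is the new key.
        entry = index.setdefault(command, {"platform": [], "languages": []})
        if platform not in entry["platform"]:
            entry["platform"].append(platform)
        if lang not in entry["languages"]:
            entry["languages"].append(lang)
    return index
```

Clients that ignore unknown keys would be unaffected by the extra "languages" list, which is what makes the change gradual.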

It might make things more complex than necessary, but we could have index.ta.json for Tamil, and then an "index of indexes"?

Where are these index json files located?

It might make things more complex than necessary, but we could have index.ta.json for Tamil, and then an "index of indexes"?

That will still break all clients. Which we should ideally avoid.

Where are these index json files located?

In the assets folder of tldr-pages.github.io

(just some ideas :) )

There is still no index.json file in pages.ta.
I think the folder structure is fine but there should be some index file for languages "lang.json", probably next to the pages.* folders and each language should have their own index.json in their pages folder. As mentioned before.
(Currently I'm just scanning the folders for .md files because it's unclear how the language feature is implemented.)

I think the lang.json should be something like this:

{ "languages": [
    { "englishName": "Tamil", "2LetterISO": "ta" },
    { "englishName": "English", "2LetterISO": "en" }
]}

Maybe not use englishName but I think that's the easiest thing to do but native name could be fine too.
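Such a file could be derived from the folder names rather than maintained by hand. A minimal sketch, assuming the pages / pages.<code> layout; the ENGLISH_NAMES table and the function name are hypothetical.

```python
from pathlib import Path

# Hypothetical lookup table; a real one would cover every language in the repo.
ENGLISH_NAMES = {"en": "English", "ta": "Tamil"}


def list_languages(repo_root):
    """Build the proposed lang.json structure from the pages* folders."""
    codes = []
    for folder in sorted(Path(repo_root).iterdir()):
        if not folder.is_dir():
            continue
        if folder.name == "pages":
            codes.append("en")  # the unsuffixed folder holds the English pages
        elif folder.name.startswith("pages."):
            codes.append(folder.name.split(".", 1)[1])
    return {
        "languages": [
            # Fall back to the bare code when no English name is known.
            {"englishName": ENGLISH_NAMES.get(code, code), "2LetterISO": code}
            for code in codes
        ]
    }
```

The result can be written out with `json.dump` next to the pages.* folders.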

I managed to generate a POT file and PO files for each language and write the translations back to folder pages.xx (where xx is the 2-letter code of the language, e.g. pages.hu for Hungarian). I'm using po4a for this.

I put a po4a.conf file into the root directory of the repo. The content of the config file is:

[po4a_langs] hu
[po4a_paths] i18n/tldr.pot $lang:i18n/tldr.$lang.po

[type: asciidoc] pages/common/7za.md $lang:pages.$lang/common/7za.md
[type: asciidoc] pages/common/7z.md $lang:pages.$lang/common/7z.md
[type: asciidoc] pages/common/7zr.md $lang:pages.$lang/common/7zr.md
... (here are listed all Markdown files)

[po4a_langs] lists the supported languages separated by spaces.
I'm using asciidoc type, because po4a doesn't support Markdown format.
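Since the config has to list every Markdown file, it is probably easier to generate than to write by hand. A sketch in Python that emits lines in the same shape as the config above; the function name is hypothetical and it assumes English pages live in pages/<platform>/<name>.md.

```python
from pathlib import Path


def po4a_config(repo_root, langs):
    """Render a po4a.conf in the shape shown above for every English page."""
    lines = [
        f"[po4a_langs] {' '.join(langs)}",
        "[po4a_paths] i18n/tldr.pot $lang:i18n/tldr.$lang.po",
        "",
    ]
    pages_dir = Path(repo_root) / "pages"
    # One entry per page, sorted so that the generated file is stable across runs.
    for page in sorted(pages_dir.glob("*/*.md")):
        rel = page.relative_to(pages_dir).as_posix()
        lines.append(f"[type: asciidoc] pages/{rel} $lang:pages.$lang/{rel}")
    return "\n".join(lines) + "\n"
```

Writing the returned string to po4a.conf in the repo root would reproduce the file described here, with one line per page.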

I'm using this command to generate the i18n/tldr.pot file and i18n/tldr.xx.po files (where xx is the 2-letter code of the language, e.g. tldr.hu.po for Hungarian):

po4a -v -k 0 po4a.conf

After the translation is ready, I execute the command again to generate the translated Markdown files from the translated PO files. Everything works perfectly, there is only a small problem with the generated Markdown files: lines are wrapped at column 80. Unfortunately this needs to be fixed manually (or create a beautifier script for this).
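A beautifier for the wrapping problem could look something like the sketch below. It is a heuristic, not a tested fix for po4a output: it assumes every real line in a tldr page starts with one of the markers #, >, - or a backtick, so any other non-blank line directly after a non-blank line is treated as a wrapped continuation. A continuation that itself happens to start with one of those characters would be missed.

```python
MARKERS = ("#", ">", "-", "`")


def unwrap(text):
    """Join lines that look like continuations of a wrapped line."""
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        is_continuation = (
            out and out[-1] != "" and stripped != "" and not stripped.startswith(MARKERS)
        )
        if is_continuation:
            # Glue the continuation back onto the previous line with one space.
            out[-1] = f"{out[-1]} {stripped}"
        else:
            out.append(line.rstrip())
    return "\n".join(out) + "\n"
```

It would be run over each generated file in pages.xx after po4a finishes.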

po4a sounds like a good idea to me. Earlier in this thread, the issue of
keeping translations up to date with the English version was
raised. po4a solves that problem.

Sounds great to me! I assume po4a is a tool that lets you tell when a translation needs updating or something?

Either way, could you send a PR for that @urbalazs please? :smiley_cat:

Shouldn't this be closed? The support for multiple human languages is already here. The related issues are technical and they have already been opened as new issues.

I am fine with this being closed. Please go ahead unless somebody else
has an objection.

I think the only pending item was the discussion on which translation service to use. We've had some discussion on gitter, but is https://github.com/tldr-pages/tldr/issues/3591 the issue for it ?

I would close this issue in any case, as its goal has been achieved. If #3591 doesn't serve the purpose, we can open a new issue regarding the migration to a translation service (but I don't think that's needed, we already have a few such issues floating around).

Yes the purpose of this issue has been achieved, but I don't want to lose track of any new discussions that sprouted from this issue. That's why I was checking to see if we have other issues filed.

we already have a few such issues floating around

Can you point me to those other than #3591 ? My search-fu is failing me.

With related issues I initially meant:

#3886 #3860 #3796 #3591 #190

But it seems only #3591 serves the purpose you described, you're right. In any case, 5 open issues about translations are enough to discuss anything related to the topic.

Oh yeah, the translation hosting issue. That's a really nasty and complicated one.

Perhaps if we get some donations we can rent a server and self-host something lol

Thanks, I have updated #3591 appropriately. With that done, I don't believe there is anything pending to be done here. Closing.
