Pandoc: Feature Request: Auto-Detect Languages and Automatically Create the Needed Pandoc Native Spans and Divs

Created on 10 Apr 2016 · 11 comments · Source: jgm/pandoc

In particular, I am referring to this part of the user guide:

Native pandoc spans and divs with the lang attribute (value in BCP 47) can be used to switch the language in that range.

It seems that some kind of automation could be done, at least to a certain extent. Different languages that share an alphabet are difficult to tell apart, but scripts such as CJK, Greek, or Hebrew can be auto-detected from the characters used alone.
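As a rough illustration of that character-based detection (a minimal sketch, not from this thread; it uses Unicode character names as a crude proxy, where a real detector would use Unicode script properties):

```python
import unicodedata


def detect_script(ch):
    """Classify a single character by script, using its Unicode name.

    A crude heuristic for illustration only: character names like
    "GREEK SMALL LETTER ALPHA" or "CJK UNIFIED IDEOGRAPH-6F22"
    happen to start with the script name.
    """
    try:
        name = unicodedata.name(ch)
    except ValueError:
        return None  # unnamed character (e.g. some controls)
    for script in ("GREEK", "HEBREW", "CJK", "HIRAGANA",
                   "KATAKANA", "HANGUL", "LATIN"):
        if name.startswith(script):
            return script
    return None  # spaces, digits, punctuation: no single script


print(detect_script("α"))  # GREEK
print(detect_script("漢"))  # CJK
print(detect_script("a"))  # LATIN
```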

This approach is used by Pomax/ucharclasses in LaTeX. I know ucharclasses can already be used with pandoc's LaTeX output; what I am suggesting is that pandoc could offer a similar tool so that every output format benefits, not only LaTeX (and it could benefit LaTeX output as well, since polyglossia and ucharclasses conflict with each other).

There is one difficulty ucharclasses didn't solve: the spaces between Greek words are the same characters as the spaces in Latin text. This seems solvable by having the algorithm that auto-generates the lang divs ignore a list of such known script-neutral characters.
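A minimal Python sketch of that idea (hypothetical, again using Unicode character names as a crude classifier): split text into script runs, folding script-neutral characters such as spaces into the current run instead of letting them break it.

```python
import unicodedata


def script_of(ch):
    # Crude name-based classifier; whitespace, digits, and punctuation
    # shared between scripts come out as None (script-neutral).
    try:
        name = unicodedata.name(ch)
    except ValueError:
        return None
    for script in ("GREEK", "HEBREW", "CJK", "LATIN"):
        if name.startswith(script):
            return script
    return None


def segment(text):
    """Split text into (script, run) pairs, attaching neutral
    characters to the surrounding run rather than splitting on them."""
    runs = []
    current_script = None
    buf = ""
    for ch in text:
        s = script_of(ch)
        if s is None or s == current_script:
            buf += ch  # neutral or same script: extend the run
        else:
            if buf:
                runs.append((current_script, buf))
            current_script, buf = s, ch
    if buf:
        runs.append((current_script, buf))
    return runs


print(segment("ab γδ ef"))
# [('LATIN', 'ab '), ('GREEK', 'γδ '), ('LATIN', 'ef')]
```

Note how the space after "γδ" stays inside the Greek run, so no separate div would be generated for it.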

Another difficulty is different languages that use the same alphabet. Some kind of dictionary "attack" might help here, but it would take more time, so maybe it should be treated as a separate extension (or a different level of the same extension).

I apologize if someone has suggested this before. A quick search didn't seem to find it.

And if this feature is too complicated, or of no interest, feel free to close it.

Thanks.

All 11 comments

I think it's too complicated. Building language-detection algorithms into pandoc would add a lot of code and data. (And it's not at all trivial -- even if you detect Greek script, for example, is it ancient or modern Greek?)

Someone might want to create a filter that automatically inserts spans and divs with language marked based on automatic language recognition.
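As a sketch of what such a filter might look like (hypothetical, not an existing filter): it operates directly on pandoc's JSON AST, handles only `Str` inlines made entirely of Greek characters, and hard-codes `el` as the guessed tag, which is itself a guess, since script alone does not decide between ancient and modern Greek.

```python
import unicodedata


def is_greek(word):
    # Hypothetical helper: a word counts as "Greek" if every character
    # has a Unicode name starting with GREEK.
    return bool(word) and all(
        unicodedata.name(ch, "").startswith("GREEK") for ch in word
    )


def tag_greek(inlines):
    """Wrap Greek Str elements in a Span carrying lang="el".

    Works on a list of pandoc JSON AST inlines, e.g.
    {"t": "Str", "c": "λόγος"}.
    """
    out = []
    for el in inlines:
        if el.get("t") == "Str" and is_greek(el["c"]):
            # Attr in pandoc's JSON AST is [identifier, classes, key-value pairs]
            out.append({"t": "Span",
                        "c": [["", [], [["lang", "el"]]], [el]]})
        else:
            out.append(el)
    return out
```

A real filter would also walk nested blocks and merge adjacent runs (as discussed later in this thread), and would more conveniently be built on a helper library such as panflute.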

@ickc, I don’t see the gain in that improvement for a markup language (even if it’s lightweight).

What we need is a proper syntax for languages (#895), besides divisions and spans. Please, feel free to comment there.

  1. Keyboard layouts!
    Limiting such a process to only the user's languages (keyboard layouts), and thus the user's dictionaries, may simplify it.
  2. Also wondering whether some libraries in the Debian repositories could help, like libexttextcat-2.0-0.

I don’t completely understand, but it wouldn’t help. Sometimes you don’t even have control over what language appears in a document.

In any case, this should probably be done at the pre-processing or filtering stage (I opened this issue when I was a pandoc beginner).

Ignoring possible performance issues, auto-detecting the language is not too difficult. Writing a pandoc filter that does so without tagging the language of every single word is more complicated (or maybe tagging every word is actually fine, relying on pandoc to normalize it afterwards?). I just don’t intermix enough languages to have the incentive to do this myself.

First of all, I'm talking here about a simple case: setting a language metadata tag for just an entire document.

---
lang: fr
---

# Heading…

Sometimes you don’t even have control on what language is in a document.

Yes, sometimes, but in most cases, an individual user understands and deals with only 1-3 languages.

Suppose we are talking about a Spanish user who understands and deals with only English and Spanish, and who wants to set a language metadata tag on a set of documents he/she has.

The process could be like this:

  1. Pandoc will check the user's keyboard layouts. In this example, the result will probably be es and en.
  2. Pandoc will scan these documents, spell-checking each against the available English and Spanish dictionaries.

    • If most of the words in a specific document belong to the English dictionary, Pandoc will set:

      lang: en

    • If most of the words in this document belong to the Spanish dictionary, Pandoc will set:

      lang: es

For better performance, I think checking only the document's title and headings might be enough.

Also, code blocks and inline code should be excluded from this scanning process.
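The dictionary-voting step described above could be sketched like this (the word sets are toy placeholders standing in for real en/es dictionaries, and `guess_lang` is a hypothetical helper, not anything pandoc provides):

```python
def guess_lang(text, dictionaries):
    """Pick the language whose dictionary matches the most words.

    `dictionaries` maps BCP 47 tags to word sets; a real tool would
    load hunspell/aspell word lists for the user's languages.
    """
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    scores = {lang: sum(w in vocab for w in words)
              for lang, vocab in dictionaries.items()}
    return max(scores, key=scores.get)


# Toy dictionaries standing in for real English/Spanish word lists.
dicts = {
    "en": {"the", "house", "is", "big"},
    "es": {"la", "casa", "es", "grande"},
}
print(guess_lang("La casa es grande.", dicts))  # es
```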

@anasram Making assumptions about the content of a document based on the system environment (operating system region, keyboard settings, etc.) is fraught with problems. I haven't even seen an operating system that gets this right consistently for its own built-in text editor yet!

Pandoc is a document format conversion tool and should work the same way given the same inputs no matter where it is run from. Making guesses based on the environment should not be a thing.

Likewise, trying to guess based on document content is pretty problematic. I say that as a long-time LaTeX user and now one of the authors of a typesetting engine that does a lot of smart things, like switching directions and setting font fallbacks based on character sets. This is hard stuff to get right and should be handled explicitly by the reader/writer themselves, not mucked with automatically during conversion.

For best results, hard-code your lang in the document's metadata or provide it on the command line. If you want heuristic-based guessing, that should be done by another tool, separate from Pandoc.

For best results, hard-code your lang in the document's metadata or provide it on the command line.

In fact, what led me to comment on this issue is issue #162.

I mean, I'm trying here to find a solution for an already existing Pandoc issue, and I think finding a _fuzzy_ solution for it is better than nothing.

I think it is not about being fuzzy or not; it is about being stateful, and what you suggested is very stateful (meaning it depends heavily on the environment). Not to mention it would be difficult to make cross-platform. I think it is fundamentally against pandoc’s philosophy. As pandoc is written in Haskell, purity is extra important (not that pandoc is completely pure).

But the beauty of pandoc, as a tool following the UNIX philosophy, is that it lets you chain different things together. What you suggested may best fit in a GUI text editor, which is already very stateful and can be smart enough to auto-guess things for you on the fly. Or maybe write a pre-processor (that’s what I’d do if I needed it).

In fact, what led me to comment on this issue is issue #162.

I mean, I'm trying here to find a solution for an already existing Pandoc issue, and I think finding a fuzzy solution for it is better than nothing.

@anasram, I’m the original reporter of #162.

Simpler language tagging is required to mix languages in documents.

There is a compromise using pure HTML tagging:

This is <a lang="de">Deutsch</a>

But this doesn’t work with LaTeX.

I proposed the simpler syntax (#3451):

This is [Deutsch]{:de}

But I have no idea about the implementation timeline on this particular feature.

@ickc

I think it is not about being fuzzy or not; it is about being stateful, and what you suggested is very stateful (meaning it depends heavily on the environment). Not to mention it would be difficult to make cross-platform,

Now I see.

But the beauty of pandoc, as a tool following the UNIX philosophy, is that it lets you chain different things together.

So, my suggestion could be implemented and _shipped_ in an independent package, and Pandoc may then _depend_ on it. Is this possible? Does this go against pandoc’s philosophy?

So, my suggestion could be implemented and shipped in an independent package, and Pandoc may then depend on it.

See the pandoc GitHub wiki for existing preprocessors and filters. They are often packaged using standard package managers like PyPI, Stack, etc.; however, pandoc won’t depend on them. It is the users’ responsibility to install them properly. Currently it is a limitation of pandoc that “reproducible builds” of pandoc documents cannot also rely on third-party packages. There’s an effort called pandocpm trying to address that, but it is now half dead. You are encouraged to pick it up if you want to.

The main problem with having a package manager like this is that it is very stateful (trying to be smart about figuring out defaults and auto-installing on any OS), and at the same time there are security implications. Maybe I’m just too paranoid, so I didn’t push pandocpm through in the early days, and now I’m too busy to pick it up again.
