Pandoc: md->latex: spaces in abbreviations e.g. and i.e.

Created on 11 May 2017 · 15 Comments · Source: jgm/pandoc

Hello,

Pandoc turns "e.g. some word" in Markdown into "e.g.~some word" in LaTeX; that is, it inserts a non-breaking space. It would be nice if pandoc could also add a narrow non-breaking space inside the abbreviation, to eventually get "e.\,g.~some word". The same applies to i.e.

Further, when I add a unicode narrow non-breaking space, pandoc does not add the non-breaking space (~) afterwards.

I tried to dig into the code to provide a patch but could not identify the responsible code.

I use:
pandoc 1.19.2.1
Compiled with pandoc-types 1.17.0.5, texmath 0.9.1.1, skylighting 0.1.1.5

All 15 comments

See #3641. If we created a general AST filter for abbreviations, it could include this.
We'd have to modify the format of the custom abbreviation file as well.

Please keep support for different languages (for example German) in mind.

We currently have an option to specify a file with a list of
abbreviations, so you can tailor it to your language and
needs.

+++ Clemens Prill [May 18 17 10:56 ]:

Please keep support for different languages (for example German) in
mind.

@jgm, would you be okay with abbreviations for other languages being added to this repo, next to the English ones? That way it would be a community-maintained list.

We don't have a system for localization set up yet. See #3559 for another place this might be useful. Until we have localization, though, there's nowhere to put data for other languages -- though you could always put it in the wiki.

Here's a first step:

diff --git a/src/Text/Pandoc/App.hs b/src/Text/Pandoc/App.hs
index 845146f..4c74de5 100644
--- a/src/Text/Pandoc/App.hs
+++ b/src/Text/Pandoc/App.hs
@@ -43,6 +43,7 @@ import qualified Control.Exception as E
 import Control.Monad.Except (throwError)
 import Control.Monad
 import Control.Monad.Trans
+import Control.Monad.State
 import Data.Aeson (eitherDecode', encode, ToJSON(..), FromJSON(..),
                    genericToEncoding, defaultOptions)
 import qualified Data.ByteString as BS
@@ -83,6 +84,7 @@ import Text.Pandoc.SelfContained (makeSelfContained, makeDataURI)
 import Text.Pandoc.Shared (isURI, headerShift, openURL, readDataFile,
                            readDataFileUTF8, safeRead, tabFilter)
 import qualified Text.Pandoc.UTF8 as UTF8
+import Text.Pandoc.Walk (walkM)
 import Text.Pandoc.XML (toEntities)
 import Text.Printf
 #ifndef _WINDOWS
@@ -307,7 +309,6 @@ convertWithOpts opts = do
                       , readerDefaultImageExtension =
                          optDefaultImageExtension opts
                       , readerTrackChanges = optTrackChanges opts
-                      , readerAbbreviations = abbrevs
                       }

   highlightStyle <- lookupHighlightStyle $ optHighlightStyle opts
@@ -373,8 +374,8 @@ convertWithOpts opts = do
             "Cannot write " ++ format ++ " output to stdout.\n" ++
             "Specify an output file using the -o option."

-
-  let transforms = case optBaseHeaderLevel opts of
+  let transforms = [handleAbbreviations abbrevs] ++
+                   case optBaseHeaderLevel opts of
                         x | x > 1 -> [headerShift (x - 1)]
                           | otherwise -> []

@@ -1555,3 +1556,28 @@ splitField s =
   case break (`elem` ":=") s of
        (k,_:v) -> (k,v)
        (k,[])  -> (k,"true")
+
+handleAbbreviations :: Set.Set String -> Pandoc -> Pandoc
+handleAbbreviations abbrevs doc = evalState (walkM doAbbrev doc) False
+  where doAbbrev :: Inline -> State Bool Inline
+        doAbbrev (Str xs) = do
+          if xs `Set.member` abbrevs
+            then do
+              put True
+              return (Str $ insertThinSpaces xs)
+            else do
+              put False
+              return (Str xs)
+        doAbbrev Space = do
+          lastWasAbbrev <- get
+          put False
+          if lastWasAbbrev
+             then return (Str "\160")
+             else return Space
+        doAbbrev x = do
+          put False
+          return x
+        insertThinSpaces :: String -> String
+        insertThinSpaces [] = []
+        insertThinSpaces ('.':x:xs) = '.' : '\x202F' : insertThinSpaces (x:xs)
+        insertThinSpaces (x:xs) = x : insertThinSpaces xs
diff --git a/src/Text/Pandoc/Options.hs b/src/Text/Pandoc/Options.hs
index c7211c8..1fb673e 100644
--- a/src/Text/Pandoc/Options.hs
+++ b/src/Text/Pandoc/Options.hs
@@ -64,7 +64,6 @@ data ReaderOptions = ReaderOptions{
        , readerApplyMacros           :: Bool -- ^ Apply macros to TeX math
        , readerIndentedCodeClasses   :: [String] -- ^ Default classes for
                                        -- indented code blocks
-       , readerAbbreviations         :: Set.Set String -- ^ Strings to treat as abbreviations
        , readerDefaultImageExtension :: String -- ^ Default extension for images
        , readerTrackChanges          :: TrackChanges
 } deriving (Show, Read, Data, Typeable, Generic)
@@ -77,7 +76,6 @@ instance Default ReaderOptions
                , readerTabStop               = 4
                , readerApplyMacros           = True
                , readerIndentedCodeClasses   = []
-               , readerAbbreviations         = defaultAbbrevs
                , readerDefaultImageExtension = ""
                , readerTrackChanges          = AcceptChanges
                }
diff --git a/src/Text/Pandoc/Readers/Markdown.hs b/src/Text/Pandoc/Readers/Markdown.hs
index af75885..7a9d55a 100644
--- a/src/Text/Pandoc/Readers/Markdown.hs
+++ b/src/Text/Pandoc/Readers/Markdown.hs
@@ -42,7 +42,6 @@ import Data.Maybe
 import Data.Monoid ((<>))
 import Data.Ord (comparing)
 import Data.Scientific (base10Exponent, coefficient)
-import qualified Data.Set as Set
 import Data.Text (Text)
 import qualified Data.Text as T
 import qualified Data.Vector as V
@@ -1624,20 +1623,7 @@ str :: PandocMonad m => MarkdownParser m (F Inlines)
 str = do
   result <- many1 (alphaNum <|> try (char '.' <* notFollowedBy (char '.')))
   updateLastStrPos
-  (do guardEnabled Ext_smart
-      abbrevs <- getOption readerAbbreviations
-      if not (null result) && last result == '.' && result `Set.member` abbrevs
-         then try (do ils <- whitespace <|> endline
-                      lookAhead alphaNum
-                      return $ do
-                        ils' <- ils
-                        if ils' == B.space
-                           then return (B.str result <> B.str "\160")
-                           else -- linebreak or softbreak
-                                return (ils' <> B.str result <> B.str "\160"))
-                <|> return (return (B.str result))
-         else return (return (B.str result)))
-     <|> return (return (B.str result))
+  return (return (B.str result))

 -- an endline character that can be treated as a space, not a structural break
 endline :: PandocMonad m => MarkdownParser m (F Inlines)

Note, however, that this causes the abbreviation transforms to be applied in every reader, no matter what. This may also cause poor results for markdown -> markdown, because the markdown writer doesn't "undo" the abbreviation transformations. It would be good to tie this feature to smart (as before) or perhaps to smart_abbrevs. Doing this may require rewriting getReader so that the list of extensions is returned separately.

Please note that both i.e. and e.g. are normally followed by commas. I'm not sure this makes a difference in @jgm's code.

+++ John Muccigrosso [May 27 17 12:20 ]:

Please note that both i.e. and e.g. are normally followed by commas.
I'm not sure this makes a difference in @jgm's code.

The code does two things: it changes a following space (if
there is one) to a nonbreaking space -- nothing will happen
if it's followed by a comma -- and it inserts a nonbreaking
thin space after the internal period -- this will still
happen with the comma.
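
To make those two effects concrete, here is a minimal, self-contained sketch of the same logic. It is not the patch itself: `Inline` below is a two-constructor stand-in for pandoc's real type, the abbreviations are a plain list rather than a `Set`, and the "was the last element an abbreviation" flag is threaded with `mapAccumL` instead of `State`. It also assumes that a trailing comma arrives as its own `Str` element, which may not match what a given reader actually produces.

```haskell
import Data.List (mapAccumL)

-- Simplified stand-in for pandoc's Inline type (illustration only).
data Inline = Str String | Space
  deriving (Eq, Show)

-- Insert U+202F (narrow no-break space) after every period that is
-- followed by at least one more character; a final period is untouched.
insertThinSpaces :: String -> String
insertThinSpaces ('.':x:xs) = '.' : '\x202F' : insertThinSpaces (x:xs)
insertThinSpaces (x:xs)     = x : insertThinSpaces xs
insertThinSpaces []         = []

-- Walk the inlines left to right. The Bool accumulator records whether
-- the previous element was a known abbreviation.
handleAbbreviations :: [String] -> [Inline] -> [Inline]
handleAbbreviations abbrevs = snd . mapAccumL step False
  where
    step _ (Str s)
      | s `elem` abbrevs = (True, Str (insertThinSpaces s))
      | otherwise        = (False, Str s)
    step lastWasAbbrev Space
      | lastWasAbbrev = (False, Str "\160")  -- U+00A0, no-break space
      | otherwise     = (False, Space)
```

With `["e.g."]` as the abbreviation list, `[Str "e.g.", Space, Str "some"]` becomes a `Str` holding `e.` + U+202F + `g.`, then `Str "\160"`, then `Str "some"`. If a `Str ","` sits between the abbreviation and the `Space`, the thin space is still inserted but the `Space` is left alone.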

On further reflection, I'm not sold on the thin space in these abbreviations. I don't actually see this much in printed material.

I guess this is because many authors nowadays do their own typesetting without any prior training in the topic. You should look at the printed material of professional typesetters, e. g. the monthly newsletter of your local LaTeX interest group. ;)

@rriemann I'm not talking about people doing their own typesetting. I'm talking about professionally typeset books from Oxford University Press, etc. I checked a few cases and never saw the thin space, which probably explains why it looks wrong to me.

I'd be more convinced if you could point me to an authoritative source that recommends this practice (the Chicago Manual of Style or something similar).

I have not found any source for English. However, in Germany, we have some standards that specify that:

DIN 5008 (Part 4.5). Example abbreviations: a. a. O., d. h., v. l. n. r., z. B. usw.
It is taught in university lectures/guidelines: https://www2.informatik.hu-berlin.de/sv/lehre/typographie.pdf

I asked in the latex IRC channel and got forwarded to https://en.wikipedia.org/wiki/De_gustibus_non_est_disputandum (saying it's a matter of taste).

Oh yes, I'm aware that German wants the spaces. But I don't think we want to do this automatically/globally.

Perhaps for now you could use a filter to do these transformations.

A more ambitious change would be to modify the format of the abbreviations file to allow it to specify transformations inside the abbreviations, instead of just having a list.
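
Purely as an illustration of what such a format could look like: the sketch below assumes one abbreviation per line, optionally followed by a tab and the text it should be rewritten to. None of this exists in pandoc; the tab-separated layout and the names `parseAbbrevLine`, `parseAbbrevFile` and `rewriteWord` are hypothetical.

```haskell
import Data.Maybe (fromMaybe)

-- HYPOTHETICAL file format, not something pandoc reads today:
--   abbreviation              (plain entry, left unchanged)
--   abbreviation<TAB>output   (entry with an explicit rewrite)

-- Parse one line into (abbreviation, replacement).
parseAbbrevLine :: String -> (String, String)
parseAbbrevLine line =
  case break (== '\t') line of
    (abbr, _ : repl) -> (abbr, repl)  -- tab found: use the given rewrite
    (abbr, [])       -> (abbr, abbr)  -- no tab: the entry maps to itself

-- Parse a whole file, skipping empty lines.
parseAbbrevFile :: String -> [(String, String)]
parseAbbrevFile = map parseAbbrevLine . filter (not . null) . lines

-- Rewrite a word if the table has an entry for it; otherwise keep it.
rewriteWord :: [(String, String)] -> String -> String
rewriteWord table w = fromMaybe w (lookup w table)
```

Under this assumption a line without a tab behaves like a plain list entry, so a German file could spell out the thin spaces per abbreviation while an English one stays a simple list.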

@jgm (just a note concerning "German wants the spaces"):
The benefit really depends on the output capabilities. The half space produced by LaTeX (shown on page 1 of https://www2.informatik.hu-berlin.de/sv/lehre/typographie.pdf, the guideline @rriemann referred to) looks good; the full space doesn't.

IOW: German wants the thin spaces ;-)

Note that you can simply enter a thin space manually in your Markdown (or other format) source, and it should be translated properly into LaTeX. Example:

% pandoc -t latex
z.&#x202F;B.
^D
z.\,B.

or

% pandoc -t latex
z. B.
z.\,B.

If I were writing German, I'd simply program my text editor so I had an easy way to enter a U+202F nonbreaking thin space character, and I'd use that in abbreviations such as z. B. Then the source will look good, and pandoc will handle it fine as it is, without modifications.
