When converting Markdown to docx, double spacing after sentence punctuation is replaced with a single space. To comply with APA6 formatting standards, I need to be able to preserve this double spacing.
I want to fork the code and try to update it, but I'm not sure where this is handled? How does the spacing get parsed and replaced? Or is there already a way to preserve spacing?
That's what APA 6e wants (“Space twice after punctuation marks at the end of a sentence”, 4.01).
However, the Chicago Manual, 16e, requires a _single_ space: “Like most publishers, Chicago advises leaving a single character space, not two spaces, between sentences and after colons used within a sentence …” (2.9).
So I would _not_ want pandoc to make two spaces the default. Indeed I'd prefer making one-space the default in general. For latex and beamer this would merely require adding \frenchspacing to the default templates.
The only option I see for APA then – provided that pandoc _can_ reliably distinguish end-of-sentence periods from others – is to make the insertion of two spaces conditional upon the presence of csl: apa.csl in a document's metadata. (Or rather, since AFAICT English is the only language that sometimes uses two spaces, presence of apa.csl _and_ absence of a non-English (main) lang tag.)
In LaTeX output, the author can control this (as you note)
by adding \frenchspacing or not. I don't think it's
important to detect apa.sty and en-US locale and switch
this automatically. Let's just leave it up to the author.
Unfortunately, in other output formats it's harder to
control this. For example, I have no idea how one controls
it in HTML.
The problem is that pandoc doesn't parse things as
sentences. You could look for the pattern
Str (something ending with a period), Space
and insert an additional space, but you may get bad
results with abbreviations that end with periods.
Anyway, if you just want to change the docx output,
the thing to look at is src/Text/Pandoc/Writers/Docx.hs.
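For illustration, the `Str`-then-`Space` pattern described above could be sketched like this. It is not pandoc code: it works on plain Python dicts shaped like the inline elements in pandoc's JSON output, the abbreviation list is a hypothetical and incomplete stand-in, and using a `Str` holding U+00A0 as the "additional space" is an assumption.

```python
# Sketch only: operates on inline elements shaped like pandoc's JSON AST,
# e.g. {"t": "Str", "c": "end."} and {"t": "Space"}.

# Hypothetical, deliberately incomplete list of abbreviations to skip.
ABBREVIATIONS = {"al.", "e.g.", "i.e.", "etc.", "Dr.", "Mr.", "Mrs.", "vs."}

SENTENCE_END = (".", "!", "?")


def widen_sentence_spaces(inlines):
    """Return a new list with an extra no-break space inserted after
    every Str that looks like the end of a sentence."""
    result = []
    for i, el in enumerate(inlines):
        result.append(el)
        if el.get("t") != "Str":
            continue
        text = el["c"]
        followed_by_space = (
            i + 1 < len(inlines) and inlines[i + 1].get("t") == "Space"
        )
        if (
            followed_by_space
            and text.endswith(SENTENCE_END)
            and text not in ABBREVIATIONS
        ):
            # Assumption: a Str holding U+00A0 stands in for the second space.
            result.append({"t": "Str", "c": "\u00a0"})
    return result
```

Any abbreviation missing from the list (or a sentence that really does end in "etc.") is handled wrongly, which is exactly the weakness mentioned above.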
+++ James Santiago [Feb 27 16 00:57 ]:
When converting markdown to docx double spacing after sentence
punctuation is placed with a single space. To comply with APA6
formatting standards I need to be able to preserve this double spacing.
I want to fork the code and try to update it, but I'm not sure where
this is handled? How does the spacing get parsed and replaced? Or is
there already a way to preserve spacing?
I guess I'll need to learn Haskell first. I'm looking at the docx writer, but I'm thinking maybe a type of "punctuation" needs to be defined in pandoc-types, then detected by a reader, and written out by a writer. It seems that in the docx writer "inline" elements are simply combined in "blocks" such as paragraphs by a single space. Spaces aren't preserved between words either, as everything appears to be split by spaces into inlines.
I don't think I could simply modify the docx writer because, as @jgm said, an inline of "al." as in "Joe et al. (2013)" would look the same as an inline of "end." as in "sentence end. Start". I'd need to first detect it when reading and then preserve it.
Well for now I'm making the spacing changes in DocxMerge. I've added a simple regex hack so I can at least continue on with my current workflow which is:
I just ran into this problem converting from Markdown to HTML—all the double-space sequences got converted to single spaces. I’m not clear from the previous comments what I can do to prevent this—explanations, or pointers to same, would be welcome!
The conclusion I came to was that it couldn't be prevented, but had to be fixed after conversion either by modifying the pandoc writer components or completely outside of pandoc (which I did with my docx merge tool).
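As a rough idea of what a fix "completely outside of pandoc" can look like, here is a minimal regex sketch over plain text. It is not the code from the docx merge tool; the function name and the heuristic (sentence punctuation, one space, then an uppercase letter) are assumptions made for illustration.

```python
import re

# Sentence-ending punctuation, exactly one space, then an uppercase
# letter that plausibly starts the next sentence.
_SENTENCE_GAP = re.compile(r"([.!?]) (?=[A-Z])")


def double_space_sentences(text):
    """Widen the single space between sentences to two spaces."""
    return _SENTENCE_GAP.sub(r"\1  ", text)
```

The heuristic leaves "Joe et al. (2013)" alone, but it would wrongly widen "Dr. Smith", so the same abbreviation problem applies here too.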
There’s no way to convince pandoc to just leave whitespace alone? That seems… an odd omission.
You can always escape the second space to make it a NO-BREAK SPACE:
echo 'foo \ bar' | pandoc
<p>foo  bar</p>
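If the Markdown source already has two spaces between sentences, the escaping could in principle be applied mechanically before the text is handed to pandoc. A small sketch, assuming a hypothetical helper name and that only double spaces directly after `.`, `!` or `?` should be touched:

```python
import re

# Two spaces right after sentence punctuation, followed by a non-space
# character (so trailing double spaces at the end of a line are left alone).
_DOUBLE_SPACE = re.compile(r"(?<=[.!?])  (?=\S)")


def escape_second_space(markdown_text):
    """Rewrite 'end.  Start' as 'end. \\ Start' (space, backslash, space)."""
    return _DOUBLE_SPACE.sub(r" \\ ", markdown_text)
```

This is a blunt instrument: it knows nothing about Markdown structure, so it would also rewrite matching text inside code blocks.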
+++ Eric A. Meyer [Feb 24 17 17:54 ]:
There’s no way to convince pandoc to just leave whitespace alone? That
seems… an odd omission.
Not really, given pandoc's aims.
Pandoc tries to capture the structural elements of a
document. Semantically insignificant differences --
for example, the number of spaces, the precise bullet
character used, whether the bullet was indented one
or two spaces -- are not represented in its internal model
of the document (the Pandoc structure which serves as an
intermediary in all its conversions).
See #3466.
An alternative would be to change Pandoc's Space element to take a String argument, so that the original spacing can be preserved. (This wouldn't always work well, though -- e.g. what counts as an interword space in HTML might be a paragraph break if translated into LaTeX.)
I think a string attribute to the space element would be a good thing. That way we could have a --whitespace=preserve|collapse option so people can choose for themselves.
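To make the proposal concrete, here is a toy model of it. Nothing here exists in pandoc: the `Str` and `Space` classes, the `raw` field, and the `whitespace` parameter are all hypothetical stand-ins for a `Space` element that carries its original text and a preserve/collapse switch.

```python
from dataclasses import dataclass


@dataclass
class Str:
    text: str


@dataclass
class Space:
    # Hypothetical field: the whitespace exactly as it appeared in the source.
    raw: str = " "


def render(inlines, whitespace="collapse"):
    """Join inlines into plain text, keeping or collapsing original spacing."""
    parts = []
    for el in inlines:
        if isinstance(el, Str):
            parts.append(el.text)
        elif whitespace == "preserve":
            parts.append(el.raw)
        else:
            parts.append(" ")
    return "".join(parts)
```

With `[Str("end."), Space("  "), Str("Start")]`, the "preserve" mode keeps both spaces and the "collapse" mode emits one, which matches the current behaviour.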
+++ Benct Philip Jonsson [Feb 26 17 01:15 ]:
I think a string attribute to the space element would be a good thing.
That way we could have a --whitespace=preserve|collapse option so
people can choose for themselves.
Drawbacks:
This would be a nice feature to have. I'm using Pandoc to convert HTML to Org-mode, and Emacs, by default, considers sentences to be separated by two spaces. Of course, most web pages don't do that, but when you run into one that does, it would be nice to preserve the two spaces to make navigating by sentence easier. In this case, I'm specifically converting the non-breaking spaces in the HTML to plain spaces, but Pandoc then collapses the spaces between sentences, which is annoying.