While going through Part 6 of the Gatsby tutorial, I noticed that generated excerpt output is displayed without spaces between paragraphs. So instead of receiving "Pandas are really sweet. Here's a video of a panda eating sweets." the excerpt is "Pandas are really sweet.Here's a video of a panda eating sweets."
Create any markdown file with multiple paragraphs. For example:
Paragraph one sentence one. Sentence two.
Paragraph two sentence one.
Paragraph three sentence one.
Query all remark excerpts in GraphiQL:
query {
allMarkdownRemark {
edges {
node {
excerpt
}
}
}
}
Witness the excerpt is generated without spaces between paragraphs:
Paragraph one sentence one. Sentence two.Paragraph two sentence one.Paragraph three sentence one.
https://github.com/mxxk/gatsby-transformer-remark-excerpt-issue
{
"data": {
"allMarkdownRemark": {
"edges": [
{
"node": {
"excerpt": "Paragraph one sentence one. Sentence two. Paragraph two sentence one. Paragraph three sentence one."
}
}
]
}
}
}
{
"data": {
"allMarkdownRemark": {
"edges": [
{
"node": {
"excerpt": "Paragraph one sentence one. Sentence two.Paragraph two sentence one.Paragraph three sentence one."
}
}
]
}
}
}
Output of gatsby info --clipboard:
System:
OS: macOS High Sierra 10.13.6
CPU: (8) x64 Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
Shell: 3.2.57 - /bin/bash
Binaries:
Node: 10.15.3 - ~/.nvm/versions/node/v10.15.3/bin/node
npm: 6.4.1 - ~/.nvm/versions/node/v10.15.3/bin/npm
Languages:
Python: 2.7.16 - /usr/local/bin/python
Browsers:
Chrome: 74.0.3729.169
Firefox: 67.0
Safari: 12.1.1
npmPackages:
gatsby: ^2.8.2 => 2.8.2
gatsby-plugin-emotion: ^4.0.7 => 4.0.7
gatsby-plugin-typography: ^2.2.13 => 2.2.13
gatsby-source-filesystem: ^2.0.38 => 2.0.38
gatsby-transformer-remark: ^2.3.12 => 2.3.12
npmGlobalPackages:
gatsby-cli: 2.6.4
Hi!
Thanks a lot for reporting this! The issue seems to be in getConcatanatedValue in here that is used here. It's probably possible to fix by checking if there are no leading or trailing spaces in text nodes and then adding a space between if it's the case. One could also check if it's a paragraph break between nodes and then add a space. Would you like to make a PR to do that?
Thanks,
Mikhail
Thanks for triaging @freiksenet. I'm on the fence about making what would be my first commit into this codebase as I don't want to introduce any regressions, but let's talk. 馃槃
Regarding the algorithm, instead of having to check for leading/trailing spaces do you see any issues with this alternative version?
For non-text nodes with children, find the resulting text of each child, trim the result to remove leading/trailing spaces, and then join together with a single space, excluding empty text elements. In other
wordscode,return node.children .map(child => getConcatenatedValue(child).trim()) .filter(value => value) .join(` `)
@mxxk That looks good, great idea!
@freiksenet thanks! How would you recommend to test the change? I modified node_modules/gatsby-transformer-remark/hast-processing.js directly in my test website but Gatsby is not picking up the changes during gatsby develop or gatsby build. I checked the docs on Gatsby Debugging but can't find any info on debugging a change in Gatsby plugins (such as this one).
@mxxk You'd need to edit babel processed files in node_modules to make it work. However, it's better to use gatsby-dev-cli. Check out this guide to set up your local dev env.
Thanks @freiksenet. On debugging this further I've discovered that the fix is not in just in
but also in
Furthermore, the simplistic algorithm I proposed turned out to be flawed, as with my changes the test case
returned Where oh where is that pony ? rather than the expected Where oh where is that pony?.
Back to the drawing board, I guess! Perhaps what you said earlier with
One could also check if it's a paragraph break between nodes and then add a space.
is the way to go here?
@mxxk Thank you for working on this! You could check for AST node types and only add spaces when concatenating paragraph AST nodes, preserving whitespace otherwise. Alternatively you could try checking if the node is top level (so child of a root document node).
@freiksenet at first I thought this would be a quick fix but unexpected challenges always find their way. 馃檪
You could check for AST node types and only add spaces when concatenating paragraph AST nodes, preserving whitespace otherwise. Alternatively you could try checking if the node is top level (so child of a root document node).
Good suggestions! When generating plain format (as opposed to html format) excerpts,
the visit function is used to traverse the AST. What has me stumped is how, in this visitor pattern, to determine (if the current node is a paragraph) whether it is followed by another paragraph and _only then_ add a space. 馃 (The visitor pattern abstracts us from how the graph is traversed, so it seems more difficult to look outside of the current node than the recursive pattern seen here.)
Your second alternative of simply checking if the parent is the root is much simpler. However, the following two cases came to mind:
I am a paragraph.
I am another paragraph.
I am a paragraph with a [link][1].
[1]: http://and-i-am-a-link-in-the-footer.com
Correct me if I'm wrong, but both will have two top-level nodes, yet the former needs to have a space between them and the latter does not. You can even make this crazier:
[1]: http://i-am-a-weirdly-placed-link.com
I am a paragraph with a [link][1] and another [link][2].
[2]: http://and-i-am-a-link-in-the-footer.com
A bit verbose, but hopefully illustrates the point. 馃槃 How would you handle these different cases in the proposed algorithm?
What if we combine the approach 1 and 2? We could only inspect root children and use only paragraphs (and similar nodes that are mostly text like lists), but ignoring the nodes that are not like that.
I understand that it is getting very complicated :man_facepalming: and I'm sorry for suggesting couple dead end solutions. This seems to be a much more complex problem than anticipated :sweat_smile: Thank you for taking your time to work on it! :purple_heart: :muscle:
@freiksenet I've made some progress on this fix! 馃槃 There's still work in terms of formalizing it in a PR, but I'll elaborate my findings here for now.
For this test Markdown document,
Paragraph
# Heading
Broken
Line
| TableCell |
| --------- |
| TableCell |
the current excerpt logic spits out some a single (obviously German 馃槣) word:
ParagraphHeadingBrokenLineTableCellTableCell
This test Markdown document is crafted to be comprehensive for all cases affected by this issue. Why? Because according to https://github.com/syntax-tree/mdast, which lists out all Markdown node types, there is a minimal set of nodes which require special handling: paragraph, heading, tableCell, break. All other nodes (e.g. listItem and blockQuote) do not require special handling, as they have paragraph node children.
_Interestingly, though the break node (i.e. representing the Broken [line break] Line in the above example) is different from the other node types since it does not contain text, elegantly enough it is handled in an identical way with the other node types._
The diff below is just for demonstration; I will make a PR for it once formalized.
+ const SpaceNodeTypesSet = new Set([
+ `paragraph`,
+ `heading`,
+ `tableCell`,
+ `break`,
+ ])
+
async function getExcerptPlain(
markdownNode,
pruneLength,
truncate,
excerptSeparator
) {
const text = await getAST(markdownNode).then(ast => {
let excerptNodes = []
let isBeforeSeparator = true
visit(
ast,
node => isBeforeSeparator,
node => {
if (excerptSeparator && node.value === excerptSeparator) {
isBeforeSeparator = false
- return
- }
- if (node.type === `text` || node.type === `inlineCode`) {
- excerptNodes.push(node.value)
- }
- if (node.type === `image`) {
- excerptNodes.push(node.alt)
+ } else if (node.type === `text` || node.type === `inlineCode`) {
+ excerptNodes.push(node.value.trim())
+ } else if (node.type === `image`) {
+ excerptNodes.push(node.alt.trim())
+ } else if (SpaceNodeTypesSet.has(node.type)) {
+ // Add a space when encountering one of these node types.
+ excerptNodes.push(` `)
}
}
)
- const excerptText = excerptNodes.join(``)
+ const excerptText = excerptNodes.join(``).trim()
if (excerptSeparator) {
return excerptText
}
if (!truncate) {
return prune(excerptText, pruneLength, `鈥)
}
return _.truncate(excerptText, {
length: pruneLength,
omission: `鈥,
})
})
return text
}
With these fixes in place, the excerpt becomes:
Paragraph Heading Broken Line TableCell TableCell
This is a _plain-text_ excerpt. There is still more work to be done for HTML excerpts generated in
Notably, plain-text excerpt generation ignores HTML nodes, which may need special handling for HTML excerpt generation.
Published [email protected]. Thanks @mxxk for fixing this ;)
Most helpful comment
Published
[email protected]. Thanks @mxxk for fixing this ;)