Gatsby: Excerpt lacks space between paragraphs

Created on 4 Jun 2019 · 11Comments · Source: gatsbyjs/gatsby

Description

While going through Part 6 of the Gatsby tutorial, I noticed that generated excerpt output is displayed without spaces between paragraphs. So instead of receiving "Pandas are really sweet. Here's a video of a panda eating sweets." the excerpt is "Pandas are really sweet.Here's a video of a panda eating sweets."

Steps to reproduce

Create any markdown file with multiple paragraphs. For example:

Paragraph one sentence one. Sentence two.

Paragraph two sentence one.

Paragraph three sentence one.

Query all remark excerpts in GraphiQL:

query {
  allMarkdownRemark {
    edges {
      node {
        excerpt
      }
    }
  }
}

Witness the excerpt is generated without spaces between paragraphs:

Paragraph one sentence one. Sentence two.Paragraph two sentence one.Paragraph three sentence one.

Demo project

https://github.com/mxxk/gatsby-transformer-remark-excerpt-issue

Expected result

{
  "data": {
    "allMarkdownRemark": {
      "edges": [
        {
          "node": {
            "excerpt": "Paragraph one sentence one. Sentence two. Paragraph two sentence one. Paragraph three sentence one."
          }
        }
      ]
    }
  }
}

Actual result

{
  "data": {
    "allMarkdownRemark": {
      "edges": [
        {
          "node": {
            "excerpt": "Paragraph one sentence one. Sentence two.Paragraph two sentence one.Paragraph three sentence one."
          }
        }
      ]
    }
  }
}

Environment

Output of gatsby info --clipboard:


  System:
    OS: macOS High Sierra 10.13.6
    CPU: (8) x64 Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    Shell: 3.2.57 - /bin/bash
  Binaries:
    Node: 10.15.3 - ~/.nvm/versions/node/v10.15.3/bin/node
    npm: 6.4.1 - ~/.nvm/versions/node/v10.15.3/bin/npm
  Languages:
    Python: 2.7.16 - /usr/local/bin/python
  Browsers:
    Chrome: 74.0.3729.169
    Firefox: 67.0
    Safari: 12.1.1
  npmPackages:
    gatsby: ^2.8.2 => 2.8.2 
    gatsby-plugin-emotion: ^4.0.7 => 4.0.7 
    gatsby-plugin-typography: ^2.2.13 => 2.2.13 
    gatsby-source-filesystem: ^2.0.38 => 2.0.38 
    gatsby-transformer-remark: ^2.3.12 => 2.3.12 
  npmGlobalPackages:
    gatsby-cli: 2.6.4

good first issue help wanted confirmed bug

Source

mxxk

👍4

Most helpful comment

Published [email protected]. Thanks @mxxk for fixing this ;)

pieh on 6 Jul 2019

😄3 👍3

All 11 comments

Hi!

Thanks a lot for reporting this! The issue seems to be in getConcatanatedValue in here that is used here. It's probably possible to fix by checking if there are no leading or trailing spaces in text nodes and then adding a space between if it's the case. One could also check if it's a paragraph break between nodes and then add a space. Would you like to make a PR to do that?

Thanks,
Mikhail

freiksenet on 4 Jun 2019

👍1

Thanks for triaging @freiksenet. I'm on the fence about making what would be my first commit into this codebase as I don't want to introduce any regressions, but let's talk. 😄

Regarding the algorithm, instead of having to check for leading/trailing spaces do you see any issues with this alternative version?

For non-text nodes with children, find the resulting text of each child, trim the result to remove leading/trailing spaces, and then join together with a single space, excluding empty text elements. In other ~~words~~ code,
return node.children
  .map(child => getConcatenatedValue(child).trim())
  .filter(value => value)
  .join(` `)

mxxk on 5 Jun 2019

@mxxk That looks good, great idea!

freiksenet on 5 Jun 2019

@freiksenet thanks! How would you recommend to test the change? I modified node_modules/gatsby-transformer-remark/hast-processing.js directly in my test website but Gatsby is not picking up the changes during gatsby develop or gatsby build. I checked the docs on Gatsby Debugging but can't find any info on debugging a change in Gatsby plugins (such as this one).

mxxk on 5 Jun 2019

@mxxk You'd need to edit babel processed files in node_modules to make it work. However, it's better to use gatsby-dev-cli. Check out this guide to set up your local dev env.

freiksenet on 5 Jun 2019

Thanks @freiksenet. On debugging this further I've discovered that the fix is not in just in

https://github.com/gatsbyjs/gatsby/blob/30da56e65431e5aad3ae56ad30744fafda951a3c/packages/gatsby-transformer-remark/src/hast-processing.js#L11-L24

but also in

https://github.com/gatsbyjs/gatsby/blob/30da56e65431e5aad3ae56ad30744fafda951a3c/packages/gatsby-transformer-remark/src/extend-node-type.js#L432-L451

Furthermore, the simplistic algorithm I proposed turned out to be flawed, as with my changes the test case

https://github.com/gatsbyjs/gatsby/blob/30da56e65431e5aad3ae56ad30744fafda951a3c/packages/gatsby-transformer-remark/src/__tests__/extend-node.js#L387-L392

returned Where oh where is that pony ? rather than the expected Where oh where is that pony?.

Back to the drawing board, I guess! Perhaps what you said earlier with

One could also check if it's a paragraph break between nodes and then add a space.

is the way to go here?

mxxk on 7 Jun 2019

@mxxk Thank you for working on this! You could check for AST node types and only add spaces when concatenating paragraph AST nodes, preserving whitespace otherwise. Alternatively you could try checking if the node is top level (so child of a root document node).

freiksenet on 7 Jun 2019

@freiksenet at first I thought this would be a quick fix but unexpected challenges always find their way. 🙂

You could check for AST node types and only add spaces when concatenating paragraph AST nodes, preserving whitespace otherwise. Alternatively you could try checking if the node is top level (so child of a root document node).

Good suggestions! When generating plain format (as opposed to html format) excerpts,

https://github.com/gatsbyjs/gatsby/blob/30da56e65431e5aad3ae56ad30744fafda951a3c/packages/gatsby-transformer-remark/src/extend-node-type.js#L432-L451

the visit function is used to traverse the AST. What has me stumped is how, in this visitor pattern, to determine (if the current node is a paragraph) whether it is followed by another paragraph and _only then_ add a space. 🤔 (The visitor pattern abstracts us from how the graph is traversed, so it seems more difficult to look outside of the current node than the recursive pattern seen here.)

Your second alternative of simply checking if the parent is the root is much simpler. However, the following two cases came to mind:

I am a paragraph.

I am another paragraph.

I am a paragraph with a [link][1].

[1]: http://and-i-am-a-link-in-the-footer.com

Correct me if I'm wrong, but both will have two top-level nodes, yet the former needs to have a space between them and the latter does not. You can even make this crazier:

[1]: http://i-am-a-weirdly-placed-link.com

I am a paragraph with a [link][1] and another [link][2].

[2]: http://and-i-am-a-link-in-the-footer.com

A bit verbose, but hopefully illustrates the point. 😄 How would you handle these different cases in the proposed algorithm?

mxxk on 9 Jun 2019

What if we combine the approach 1 and 2? We could only inspect root children and use only paragraphs (and similar nodes that are mostly text like lists), but ignoring the nodes that are not like that.

I understand that it is getting very complicated :man_facepalming: and I'm sorry for suggesting couple dead end solutions. This seems to be a much more complex problem than anticipated :sweat_smile: Thank you for taking your time to work on it! :purple_heart: :muscle:

freiksenet on 11 Jun 2019

@freiksenet I've made some progress on this fix! 😄 There's still work in terms of formalizing it in a PR, but I'll elaborate my findings here for now.

For this test Markdown document,

Paragraph 

# Heading

Broken  
Line

| TableCell |
| --------- |
| TableCell |

the current excerpt logic spits out some a single (obviously German 😜) word:

ParagraphHeadingBrokenLineTableCellTableCell

This test Markdown document is crafted to be comprehensive for all cases affected by this issue. Why? Because according to https://github.com/syntax-tree/mdast, which lists out all Markdown node types, there is a minimal set of nodes which require special handling: paragraph, heading, tableCell, break. All other nodes (e.g. listItem and blockQuote) do not require special handling, as they have paragraph node children.

_Interestingly, though the break node (i.e. representing the Broken [line break] Line in the above example) is different from the other node types since it does not contain text, elegantly enough it is handled in an identical way with the other node types._

The diff below is just for demonstration; I will make a PR for it once formalized.

+    const SpaceNodeTypesSet = new Set([
+      `paragraph`,
+      `heading`,
+      `tableCell`,
+      `break`,
+    ])
+
     async function getExcerptPlain(
       markdownNode,
       pruneLength,
       truncate,
       excerptSeparator
     ) {
       const text = await getAST(markdownNode).then(ast => {
         let excerptNodes = []
         let isBeforeSeparator = true
         visit(
           ast,
           node => isBeforeSeparator,
           node => {
             if (excerptSeparator && node.value === excerptSeparator) {
               isBeforeSeparator = false
-              return
-            }
-            if (node.type === `text` || node.type === `inlineCode`) {
-              excerptNodes.push(node.value)
-            }
-            if (node.type === `image`) {
-              excerptNodes.push(node.alt)
+            } else if (node.type === `text` || node.type === `inlineCode`) {
+              excerptNodes.push(node.value.trim())
+            } else if (node.type === `image`) {
+              excerptNodes.push(node.alt.trim())
+            } else if (SpaceNodeTypesSet.has(node.type)) {
+              // Add a space when encountering one of these node types.
+              excerptNodes.push(` `)
             }
           }
         )
-        const excerptText = excerptNodes.join(``)
+        const excerptText = excerptNodes.join(``).trim()

         if (excerptSeparator) {
           return excerptText
         }
         if (!truncate) {
           return prune(excerptText, pruneLength, `…`)
         }
         return _.truncate(excerptText, {
           length: pruneLength,
           omission: `…`,
         })
       })
       return text
     }

With these fixes in place, the excerpt becomes:

Paragraph Heading Broken Line TableCell TableCell

This is a _plain-text_ excerpt. There is still more work to be done for HTML excerpts generated in

https://github.com/gatsbyjs/gatsby/blob/8df6f299db04fd875cf6d7d8193899ac53be6815/packages/gatsby-transformer-remark/src/hast-processing.js#L11-L24

Notably, plain-text excerpt generation ignores HTML nodes, which may need special handling for HTML excerpt generation.

mxxk on 21 Jun 2019

Published [email protected]. Thanks @mxxk for fixing this ;)

pieh on 6 Jul 2019

😄3 👍3

Was this page helpful?

0 / 5 - 0 ratings

Related issues

[1.0] Can you set the layout component when creating pages?

hobochild · 3Comments

Graphql: Fix better error message when duplicate query name

timbrandin · 3Comments

Error: Minified React error #130

jimfilippou · 3Comments

help making yaml files more powerful

andykais · 3Comments

GraphQL Error Unknown argument `slug`

3CordGuy · 3Comments