Silverstripe-framework: [ORM] Heading tags don't get a new line in DBHTMLText::Plain()

Created on 7 Jun 2018 · 5Comments · Source: silverstripe/silverstripe-framework

Affected Version

I'm observing this in 4.1 but I believe I've seen this going way back in 3.x

Description

With this content:

<h2>This is a H2 tag</h2>
<p><strong>This is a p tag.</strong>&nbsp;Integer posuere erat a ante venenatis dapibus posuere velit aliquet. Praesent commodo cursus magna, vel scelerisque nisl consectetur et.</p>
<p><strong>This is another p tag.</strong>&nbsp;Integer posuere erat a ante venenatis dapibus posuere velit aliquet. Praesent commodo cursus magna, vel scelerisque nisl consectetur et.</p>

$myDBHTMLText->Plain() renders as:

new_page_ _recoverwell_injury_law_llp

Solutions

Overhaul

DBHTMLText::Plain() looks fragile in general because it makes assumptions about how the HTML will be formatted. Only paragraph tags get special treatment for line breaks but headings need them too. Interestingly, list items seem to render out okay - perhaps because they're wrapped in a ul or ol the line breaks work out.

It looks to me like this function might fall apart if the HTML content was formatted in a valid but unexpected way (e.g. not TinyMCE style), because meaning is assigned to line breaks in the html where there may not be any meaning. Overhauling this conversion to be more robust and predictable (perhaps there is a library to handle this?) would be great.

Band-aid

To treat this specific symptom the regex could be tweaked to include heading tags, from:

// Convert paragraph breaks to multi-lines
$text = preg_replace('/\<\/p\>/i', "\n\n", $text);

to something like (I'm not a regex expert):

// Convert paragraph breaks to multi-lines
$text = preg_replace('/\<\/(p|h[1-6])\>/i', "\n\n", $text);

I can provide a pull request for the band-aid solution, but if someone with more time available wanted to explore a more robust solution, that would be better! Please let me know your thoughts.

affectv4 changpatch efformedium impaclow typbug

Source

jonom

All 5 comments

I guess one option could be to add a ->MD() method to convert to markdown with this, then ->Plain() could start with that output and strip out any undesired elements, like heading indicators or links. A Markdown version could be considered plain text in its own right but it probably wouldn't be appropriate for something like a MetaDescription or blog summary.

jonom on 7 Jun 2018

This was a problem I bumped into recently with a client as well @jonom. He had a series of articles with the summary being automatically generated from the article content, which resulted in the same issue. I ended up putting a workaround in my own code, as the built-in strip_tags() wasn't cutting it.

colintucker on 7 Jun 2018

I guess the proper fix for this would be for Plain to treat any block level tags (e.g. <p> <div> <blockquote> <h1>) as a paragraph that require a break line and ignore all inline level elements.

This page has a list of all block-level tags: https://www.w3schools.com/html/html_blocks.asp

maxime-rainville on 13 Aug 2020

👍1

Yeah I guess you could throw in a line break or two wherever you hit a block level tag, then run through the result again so that anywhere you have more than two consecutive line breaks they are condensed to two so you don't get a bunch more breaks than you need.

jonom on 13 Aug 2020

I'd suggest a minor fix - break on a whitelist of block level tags - rather than an overhaul. If someone wants a very robust solution for this kind of functionality then they're better placed to pull in some dedicated package for the purpose. Making a dedicated package a core requirement would bloat core.

Eg you can probably auto-generate a regex given a whitelist of block level tags.

If you wanted we could shift to make it use SimpleXMLElement but I wouldn't say that it's critical we do so.

sminnee on 18 Aug 2020

👍1

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Enum without default is selecting a default in DropdownField that has an empty default

zanderwar · 6Comments

Global middleware + custom controller + Controller::curr() triggers `No current controller available`

jakxnz · 5Comments

Epic: Scaling decoupled CMS usage (GraphQL performance)

chillu · 3Comments

Suspect comparison in handleRequest

ntd · 4Comments

Simple theme assets not exposed / composer.json is different when re-installing themes

micmania1 · 6Comments