I'm observing this in 4.1 but I believe I've seen this going way back in 3.x
<h2>This is a H2 tag</h2>
<p><strong>This is a p tag.</strong> Integer posuere erat a ante venenatis dapibus posuere velit aliquet. Praesent commodo cursus magna, vel scelerisque nisl consectetur et.</p>
<p><strong>This is another p tag.</strong> Integer posuere erat a ante venenatis dapibus posuere velit aliquet. Praesent commodo cursus magna, vel scelerisque nisl consectetur et.</p>

DBHTMLText::Plain() looks fragile in general because it makes assumptions about how the HTML will be formatted. Only paragraph tags get special treatment for line breaks but headings need them too. Interestingly, list items seem to render out okay - perhaps because they're wrapped in a ul or ol the line breaks work out.
It looks to me like this function might fall apart if the HTML content was formatted in a valid but unexpected way (e.g. not TinyMCE style), because meaning is assigned to line breaks in the html where there may not be any meaning. Overhauling this conversion to be more robust and predictable (perhaps there is a library to handle this?) would be great.
To treat this specific symptom the regex could be tweaked to include heading tags, from:
// Convert paragraph breaks to multi-lines
$text = preg_replace('/\<\/p\>/i', "\n\n", $text);
to something like (I'm not a regex expert):
// Convert paragraph breaks to multi-lines
$text = preg_replace('/\<\/(p|h[1-6])\>/i', "\n\n", $text);
I can provide a pull request for the band-aid solution, but if someone with more time available wanted to explore a more robust solution, that would be better! Please let me know your thoughts.
I guess one option could be to add a ->MD() method to convert to markdown with this, then ->Plain() could start with that output and strip out any undesired elements, like heading indicators or links. A Markdown version could be considered plain text in its own right but it probably wouldn't be appropriate for something like a MetaDescription or blog summary.
+1
This was a problem I bumped into recently with a client as well @jonom. He had a series of articles with the summary being automatically generated from the article content, which resulted in the same issue. I ended up putting a workaround in my own code, as the built-in strip_tags() wasn't cutting it.
I guess the proper fix for this would be for Plain to treat any block level tags (e.g. <p> <div> <blockquote> <h1>) as a paragraph that require a break line and ignore all inline level elements.
This page has a list of all block-level tags: https://www.w3schools.com/html/html_blocks.asp
Yeah I guess you could throw in a line break or two wherever you hit a block level tag, then run through the result again so that anywhere you have more than two consecutive line breaks they are condensed to two so you don't get a bunch more breaks than you need.
I'd suggest a minor fix - break on a whitelist of block level tags - rather than an overhaul. If someone wants a very robust solution for this kind of functionality then they're better placed to pull in some dedicated package for the purpose. Making a dedicated package a core requirement would bloat core.
Eg you can probably auto-generate a regex given a whitelist of block level tags.
If you wanted we could shift to make it use SimpleXMLElement but I wouldn't say that it's critical we do so.