Cheerio: text() preserve line breaks

Created on 4 Apr 2016 · 9 Comments · Source: cheeriojs/cheerio

I need to get an element's text from its innerHTML while preserving line breaks, i.e. wherever there is a <br>.
In jQuery, a solution could be this one:

(function ($) {
   // Getter/setter plugin: innerText() returns text with <br> turned into "\n".
   $.fn.innerText = function (msg) {
      var hasInnerText = 'innerText' in document.body;
      if (msg !== undefined) {
         // Setter: assign the text to every matched element.
         this.each(function () {
            if (hasInnerText) {
               this.innerText = msg;
            } else {
               this.textContent = msg; // fallback for browsers without innerText
            }
         });
         return this;
      }
      // Getter: read from the first matched element only.
      if (!this[0]) {
         return undefined;
      }
      if (hasInnerText) {
         return this[0].innerText;
      }
      // Fallback: turn <br> into "\n", then strip the remaining tags.
      return this[0].innerHTML.replace(/<br\s*\/?>/gi, "\n").replace(/<[^>]+>/gi, "");
   };
})(jQuery);

so that you can use the innerText() function like this:

$(this).find('a').each(function (index, item) {
   console.log($(this).innerText());
});

But how can this be done in cheerio?

All 9 comments

Another solution for client-side code that I have found is this:

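function htmlDecodeWithLineBreaks(html) {
  var breakToken = '_______break_______',
      // Mark <br> and <p>...</p> boundaries with a token that survives .text()
      lineBreakedHtml = html.replace(/<br\s*\/?>/gi, breakToken).replace(/<p\b[^>]*>([\s\S]*?)<\/p>/gi, breakToken + '$1' + breakToken);
  // Let the DOM strip tags and decode entities, then turn the tokens into newlines
  return $('<div>').html(lineBreakedHtml).text().replace(new RegExp(breakToken, 'g'), '\n');
}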

It can be made to work with cheerio too:

var $ = cheerio.load(html);
$(this).find('description').each(function(j,item) {
  var entities = htmlDecodeWithLineBreaks($,$(this).html() );
  console.log(entities)
})

Just add $ to the signature: function htmlDecodeWithLineBreaks($, html)

+1

@loretoparisi Your solution didn't work for my HTML. This is what I came up with:

var he = require('he'); // he for decoding html entities
var myhtml = $('#description').html().replace(/<(?:.|\n)*?>/gm, '\n') // replace every HTML tag with a line break
var mytext = he.decode(myhtml)
console.log(mytext)

+1

I would like to see this supported too. For now, I'm using the html-to-text package.

Why not just add a "preserve line breaks" option to the text() method?

html-to-text is definitely a solution for this right now. This would also be a candidate for a major release. Only taking care of <br>s and <p>s should take us most of the way there.
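
To illustrate how far handling only <br> and <p> gets you, here is a minimal sketch that works on a plain HTML string and needs no dependencies. The function name htmlToTextWithBreaks and the small entity table are assumptions made for this example; they are not part of cheerio or html-to-text. Regex-based tag stripping is fragile (comments, scripts, and a ">" inside an attribute will trip it up), so treat this as a quick fallback rather than a replacement for a real parser.

```javascript
// Hypothetical helper, not part of cheerio or html-to-text.
// Converts an HTML string to text, turning <br> and paragraph ends into "\n".
function htmlToTextWithBreaks(html) {
  // Only a handful of common entities are decoded in this sketch.
  const entities = {
    '&amp;': '&', '&lt;': '<', '&gt;': '>',
    '&quot;': '"', '&#39;': "'", '&nbsp;': '\u00A0'
  };
  return html
    .replace(/<br\s*\/?>/gi, '\n')   // <br>, <br/>, <br /> become a newline
    .replace(/<\/p\s*>/gi, '\n')     // the end of a paragraph becomes a newline
    .replace(/<[^>]+>/g, '')         // drop every remaining tag
    // Decode in a single pass so "&amp;lt;" becomes "&lt;", not "<".
    .replace(/&(?:amp|lt|gt|quot|#39|nbsp);/g, (m) => entities[m])
    .replace(/\n{3,}/g, '\n\n')      // collapse runs of blank lines
    .trim();
}

console.log(htmlToTextWithBreaks('<p>Hello<br>world</p><p>Tom &amp; Jerry</p>'));
```

For anything beyond simple markup, the html-to-text package mentioned above is the more robust choice.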

I'm not sure if this workaround is good enough for your purposes, but I needed to quickly fetch some text from an HTML document with <br> converted to linebreaks. I ended up doing this:

$('.cc-newsbody').find('br').replaceWith('\n')
const text = $('.cc-newsbody').text() // can't chain

Turns <br> into linebreaks, converts everything else into plain text.
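
One caveat with this workaround is that replaceWith changes the loaded document in place, so the <br> elements are gone afterwards. A minimal sketch that works on a copy instead, assuming `$` is the function returned by cheerio.load; the helper name textWithLineBreaks is made up for illustration:

```javascript
// Hypothetical helper (not part of cheerio). `$` is assumed to be the
// function returned by cheerio.load(html).
function textWithLineBreaks($, selector) {
  const copy = $(selector).clone();    // deep copy, so the document stays untouched
  copy.find('br').replaceWith('\n');   // swap each <br> for a newline text node
  return copy.text();
}

// Usage, assuming cheerio is installed:
//   const cheerio = require('cheerio');
//   const $ = cheerio.load('<div class="cc-newsbody">one<br>two</div>');
//   console.log(textWithLineBreaks($, '.cc-newsbody'));
```

Wrapping the two statements in a function also sidesteps the "can't chain" limitation, since the caller gets the text back from a single call.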

I have three ideas to fix this. We could either:

  1. Implement a .text(fn) where fn is a map function that takes html and pre-processes it before passing it on to the normal routine.
  2. Implement a .textWith(fn) along the same lines. The only real advantage of this is that .text() stays less voodoo and doesn't have to type-test its argument.
  3. Implement a boolean argument to .text() so its signature becomes .text( [textString], true|false ), with the boolean argument defaulting to false for now but to true in the next major release. This argument would handle a smart conversion to text: perhaps not just <br> to newline, but also &nbsp; to the Unicode U+00A0.

None of these requires mucking with the entire HTML, and none breaks .text() as it currently stands.
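
To make the second idea concrete, here is a rough sketch written as a standalone function rather than a cheerio method. Everything in it is hypothetical: cheerio has no .textWith(), and the names textWith, brToNewline and fakeSelection exist only for this example. It assumes the selection exposes .html(), and it strips tags with a regex instead of walking the DOM the way the real .text() does, so entities are not decoded.

```javascript
// Hypothetical sketch of idea 2; cheerio does not provide .textWith().
// `selection` is anything exposing .html() (for example a cheerio selection),
// and `fn` pre-processes the HTML string before the tags are stripped.
function textWith(selection, fn) {
  const html = selection.html() || '';
  return fn(html).replace(/<[^>]+>/g, '');
}

// A pre-processor that turns <br> into newlines.
function brToNewline(html) {
  return html.replace(/<br\s*\/?>/gi, '\n');
}

// Stand-in object so the sketch runs without cheerio installed.
const fakeSelection = { html: () => 'first<br>second <b>bold</b>' };
console.log(textWith(fakeSelection, brToNewline));
```

Keeping the pre-processing in a caller-supplied function is what leaves the existing .text() untouched: callers who pass nothing special still get the current behaviour.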

This unfortunately won't be implemented, as it diverges from browsers.
