Cheerio: text() preserve line breaks

Created on 4 Apr 2016 · 9 Comments · Source: cheeriojs/cheerio

I need to access innerHTML and preserve line breaks, so that <br> is treated like \n.
In jQuery, a solution could be this one:

(function($){
   $.fn.innerText = function(msg) {
         if (msg) {
            // Setter: write the text to every matched element
            this.each(function() {
               if (document.body.innerText) {
                  this.innerText = msg;
               } else {
                  // Fallback where innerText is unsupported
                  this.textContent = msg;
               }
            });
            return this;
         } else {
            // Getter: read from the first matched element
            if (document.body.innerText) {
               return this[0].innerText;
            } else {
               // Fallback: turn <br> into "\n", then strip the remaining tags
               return this[0].innerHTML.replace(/<br\s*\/?>/gi,"\n").replace(/(<([^>]+)>)/gi, "");
            }
         }
   };
})(jQuery);

so that you can easily use the innerText() function like this:

 $(this).find('a').each(function(index,item) {
      console.log( $(this).innerText() )
   })

But how to do this in cheerio?

loretoparisi

👍14

Most helpful comment

I'm not sure if this workaround is good enough for your purposes, but I needed to quickly fetch some text from an HTML document with <br> converted to linebreaks. I ended up doing this:

$('.cc-newsbody').find('br').replaceWith('\n')
const text = $('.cc-newsbody').text() // can't chain

Turns <br> into linebreaks, converts everything else into plain text.

msikma on 9 Apr 2018

👍13 👀1 ❤1 🎉1

All 9 comments

Another solution for client-side code that I have found is this:

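function htmlDecodeWithLineBreaks(html) {
  var breakToken = '_______break_______',
      // Swap <br> and <p>...</p> for a placeholder so the breaks survive .text()
      lineBreakedHtml = html.replace(/<br\s*\/?>/gi, breakToken).replace(/<p\b[^>]*>([\s\S]*?)<\/p>/gi, breakToken + '$1' + breakToken);
  // .text() strips the remaining tags; then the placeholder becomes a real newline
  return $('<div>').html(lineBreakedHtml).text().replace(new RegExp(breakToken, 'g'), '\n');
}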

So that it works with cheerio too:

var $ = cheerio.load(html);
$(this).find('description').each(function(j,item) {
  var entities = htmlDecodeWithLineBreaks($,$(this).html() );
  console.log(entities)
})

This only requires adding $ to the signature: function htmlDecodeWithLineBreaks($, html)
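
For reference, a minimal sketch of the helper with `$` added to the signature. It assumes `$` is the function returned by `cheerio.load()` (or jQuery itself). The regular expressions are slightly widened from the client-side version so that `<p>` tags with attributes and multi-line paragraphs also match.

```javascript
// Sketch: token-based helper that receives `$` instead of relying on a global.
function htmlDecodeWithLineBreaks($, html) {
  var breakToken = '_______break_______';
  var lineBreakedHtml = html
    .replace(/<br\s*\/?>/gi, breakToken)
    .replace(/<p\b[^>]*>([\s\S]*?)<\/p>/gi, breakToken + '$1' + breakToken);
  // Let the parser strip the remaining tags and decode entities,
  // then swap the placeholder back to real newlines.
  return $('<div>').html(lineBreakedHtml).text().split(breakToken).join('\n');
}

// Usage (assumes the cheerio package is installed):
//   var cheerio = require('cheerio');
//   var $ = cheerio.load('<description>one<br>two</description>');
//   console.log(htmlDecodeWithLineBreaks($, $('description').html()));
```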

loretoparisi on 4 Apr 2016

👎1 👍1

@loretoparisi Your solution didn't work for my HTML; this is what I came up with:

var he = require('he'); // he for decoding html entities
var myhtml = $('#description').html().replace(/<(?:.|\n)*?>/gm, '\n') // replace every html tag with a newline
var mytext = he.decode(myhtml)
console.log(mytext)
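
Note that the regex above turns every tag, inline ones such as <b> included, into a newline. If that yields too many breaks, one possible variant is sketched below. `tagsToLineBreaks` is a hypothetical helper, not part of cheerio or he; it only converts <br> and a few closing block tags into newlines and drops everything else. Entity decoding is left out here and could still be done with `he.decode()` afterwards. The simple tag pattern will misfire on attribute values that contain `>`.

```javascript
// Hypothetical helper (not part of cheerio or he): regex-only tag handling.
function tagsToLineBreaks(html) {
  return html
    .replace(/<br\s*\/?>/gi, '\n')                 // <br> becomes a newline
    .replace(/<\/(?:p|div|li|h[1-6])\s*>/gi, '\n') // closing block tags become a newline
    .replace(/<[^>]*>/g, '')                       // every other tag is dropped
    .replace(/\n{3,}/g, '\n\n')                    // collapse long runs of blank lines
    .trim();
}

console.log(tagsToLineBreaks('<p>Hello <b>world</b></p><p>Second<br>line</p>'));
```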

lucaswxp on 1 Jun 2016

👍1

thealexbaron on 16 Jun 2016

👍9

Would like to see this supported too. For now, I'm using the html-to-text package.

aysark on 26 Oct 2016

👍1

Why not just add a "preserve line breaks" option to the text() method?

Dublerq on 27 Mar 2017

html-to-text is definitely a solution for this right now. This would also be a candidate for a major release. Only taking care of <br>s and <p>s should take us most of the way there.
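
As a rough illustration of what handling only those two tags could look like, here is a sketch that walks parsed nodes and emits a newline for each <br> and after each <p>. `textWithBreaks` is a hypothetical helper, not cheerio's implementation. It assumes nodes shaped like `{ type, name, data, children }`, which is the shape of the nodes cheerio exposes through `.get()` or `.toArray()`; the tree below is built by hand so the example runs without any dependency.

```javascript
// Hypothetical helper: collects text from parser nodes, adding newlines for <br> and <p>.
function textWithBreaks(nodes) {
  var out = '';
  nodes.forEach(function (node) {
    if (node.type === 'text') {
      out += node.data;
    } else if (node.type === 'tag') {
      if (node.name === 'br') {
        out += '\n';
      } else {
        out += textWithBreaks(node.children || []);
        if (node.name === 'p') out += '\n'; // paragraph end counts as a break
      }
    }
    // other node types (comments, scripts, styles) are skipped
  });
  return out;
}

// Hand-built tree standing in for: <p>one<br>two</p><p>three</p>
var tree = [
  { type: 'tag', name: 'p', children: [
    { type: 'text', data: 'one' },
    { type: 'tag', name: 'br', children: [] },
    { type: 'text', data: 'two' }
  ] },
  { type: 'tag', name: 'p', children: [{ type: 'text', data: 'three' }] }
];
console.log(JSON.stringify(textWithBreaks(tree))); // prints "one\ntwo\nthree\n"
```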

fb55 on 2 Apr 2017

I have three ideas to fix this. We could:

1. Implement a .text(fn), where fn is a map function that takes the HTML and pre-processes it before passing it on to the normal routine.
2. Implement a .textWith(fn) along the same lines. The only real advantage is that .text() involves less voodoo and doesn't have to type-test its argument.
3. Implement a boolean argument to .text() so its signature becomes .text( [textString], true|false ), with the boolean defaulting to false for now but to true in the next major release. This argument would handle a smart conversion to text: perhaps not just newlines, but also the Unicode U+00A0.

None of these requires mucking with the entire HTML or breaking .text() as it currently works.
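
To make the second idea concrete, here is a sketch of textWith written as a standalone function rather than a method. Nothing here exists in cheerio: the name and signature are hypothetical, `$` is assumed to be the function returned by `cheerio.load()`, and `selection` is assumed to be a cheerio selection. The callback receives the inner HTML and returns pre-processed HTML, which then goes through the normal `.text()` routine.

```javascript
// Hypothetical standalone version of the proposed .textWith(fn); not a cheerio API.
function textWith($, selection, fn) {
  var html = selection.html() || '';        // .html() may return null for an empty selection
  return $('<div>').html(fn(html)).text();  // re-parse the mapped HTML, then extract its text
}

// Example map function: keep <br> as a newline.
function brToNewline(html) {
  return html.replace(/<br\s*\/?>/gi, '\n');
}

// Usage (assumes the cheerio package is installed):
//   var cheerio = require('cheerio');
//   var $ = cheerio.load('<div class="cc-newsbody">one<br>two</div>');
//   console.log(textWith($, $('.cc-newsbody'), brToNewline));
```

Unlike replacing the <br> elements in place, this leaves the loaded document untouched, since the mapped HTML is parsed into a separate element.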