I need to get the text out of innerHTML while preserving line breaks, i.e. wherever there is a <br>.
In jQuery, a solution could be this one:
(function($){
$.fn.innerText = function(msg) {
if (msg) {
if (document.body.innerText) {
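// Iterate by index: for...in would also visit non-element properties such as length.
for (var i = 0; i < this.length; i++) {
this[i].innerText = msg;
}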
} else {
// Fallback when innerText is unsupported: textContent also sets plain text.
for (var i = 0; i < this.length; i++) {
this[i].textContent = msg;
}
}
return this;
} else {
if (document.body.innerText) {
return this[0].innerText;
} else {
return this[0].innerHTML.replace(/<br>/gi, "\n").replace(/(<([^>]+)>)/gi, "");
}
}
};
})(jQuery);
so that you can easily use the innerText() function like this:
$(this).find('a').each(function(index,item) {
console.log( $(this).innerText() )
})
But how to do this in cheerio?
Another solution for client-side code that I have found is this:
function htmlDecodeWithLineBreaks(html) {
var breakToken = '_______break_______',
lineBreakedHtml = html.replace(/<br\s*\/?>/gi, breakToken).replace(/<p\b[^>]*>([\s\S]*?)<\/p>/gi, breakToken + '$1' + breakToken);
return $('<div>').html(lineBreakedHtml).text().replace(new RegExp(breakToken, 'g'), '\n');
}
So that it works with cheerio too:
var $ = cheerio.load(html);
$(this).find('description').each(function(j,item) {
var entities = htmlDecodeWithLineBreaks($,$(this).html() );
console.log(entities)
})
This only requires adding $ to the signature: function htmlDecodeWithLineBreaks($, html)
+1
@loretoparisi Your solution didn't work for my HTML. This is what I came up with:
var he = require('he'); // he for decoding html entities
var myhtml = $('#description').html().replace(/<(?:.|\n)*?>/gm, '\n') // replace every html tag with a newline
var mytext = he.decode(myhtml)
console.log(mytext)
+1
I would like to see this supported too. For now, I'm using the html-to-text package.
Why not just add a "preserve line breaks" option to the text() method?
html-to-text is definitely a solution for this right now. This would also be a candidate for a major release. Only taking care of <br>s and <p>s should take us most of the way there.
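As a rough illustration of the "only <br>s and <p>s" idea, here is a dependency-free sketch. The function name htmlToTextWithBreaks and its small entity table are assumptions made for this example; they are not part of cheerio or html-to-text, and regex-based tag stripping is only adequate for simple, well-formed fragments.

```javascript
// Hypothetical helper, not part of cheerio: converts <br> and </p> to newlines,
// strips the remaining tags, and decodes a handful of common entities.
function htmlToTextWithBreaks(html) {
  var entities = {
    '&lt;': '<', '&gt;': '>', '&quot;': '"',
    '&#39;': "'", '&nbsp;': '\u00A0', '&amp;': '&'
  };
  return html
    .replace(/<br\s*\/?>/gi, '\n')   // <br>, <br/>, <br /> become a newline
    .replace(/<\/p\s*>/gi, '\n')     // a closing </p> ends the line
    .replace(/<[^>]+>/g, '')         // drop every remaining tag
    // Decode in a single pass so "&amp;lt;" is not decoded twice.
    .replace(/&(?:lt|gt|quot|#39|nbsp|amp);/g, function (m) { return entities[m]; })
    .replace(/\n+$/, '');            // trim trailing newlines
}

console.log(htmlToTextWithBreaks('<p>one<br>two</p><p>a &amp; b</p>'));
```

For anything beyond simple fragments (nested block elements, comments, scripts), a real parser or the html-to-text package is the safer choice.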
I'm not sure if this workaround is good enough for your purposes, but I needed to quickly fetch some text from an HTML document with <br> converted to linebreaks. I ended up doing this:
$('.cc-newsbody').find('br').replaceWith('\n')
const text = $('.cc-newsbody').text() // can't chain
Turns <br> into linebreaks, converts everything else into plain text.
I have three ideas to fix this. We could either:
1. Support .text(fn), where fn is a map function that takes the HTML and pre-processes it before passing it on to the normal routine.
2. Add .textWith(fn) along the same lines. The only real advantage to this is that .text() is less voodoo and doesn't have to type-test its argument.
3. Change .text() so its signature becomes .text( [textString], true|false ), with the boolean argument defaulting to false now, but defaulting to true in the next major release. This argument would handle a smart conversion to text: perhaps not just newlines, but also the Unicode U+00A0 (no-break space).
None of these requires mucking with the entire HTML, nor do they break .text() as it currently sits.
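To make the first two ideas concrete, here is a minimal sketch of a pre-processing hook. It works on plain HTML strings rather than on a cheerio selection, and both plainText (a crude stand-in for the normal .text() routine) and textWith are hypothetical names introduced for this example; nothing like this exists in cheerio.

```javascript
// Crude stand-in for the "normal routine": strips tags from an HTML string.
function plainText(html) {
  return html.replace(/<[^>]+>/g, '');
}

// Hypothetical textWith: fn pre-processes the HTML before the normal routine runs.
function textWith(html, fn) {
  return plainText(typeof fn === 'function' ? fn(html) : html);
}

// Example map function: turn <br> into a newline before tags are stripped.
function brToNewline(html) {
  return html.replace(/<br\s*\/?>/gi, '\n');
}

console.log(textWith('line one<br>line <b>two</b>', brToNewline));
```

Keeping the hook in its own method means the plain text routine never has to inspect the type of its argument, which is the advantage claimed for .textWith(fn).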
This unfortunately won't be implemented, as it diverges from browsers.