We require both HTML and Plain Text content for improved deliverability. In some cases, only HTML may be available. We need a function that automatically converts a given HTML string to the Plain Text equivalent.
I'll take a stab at it. Shouldn't be too bad.
Thanks @cmckni3!
I wonder what the scope of this is. Should it look for h tags and the p tags that follow, or should it just grab whatever plain text it can find on the page? Maybe somebody is doing some weird HTML?
Nevertheless
echo strip_tags(preg_replace('#<script(.*?)>(.*?)</script>#is', '', $HTMLCODE));
I think something along those lines would suffice as a starting point. However, a really good solution would extract only the parts that need to be displayed in plain text. Otherwise, you may end up getting the nav bar before the content you need.
Hi,
I made a PR here. I hope I am making the PR in the right place, since this seems like something for the new refactored version. However, there is currently no parsing of the data in case HTML is being input.
You can run a test like:
$html = file_get_contents('http://sandbox.onlinephpfunctions.com/');
$html = preg_replace(
    array(
        // Remove invisible or non-text content entirely
        '@<head[^>]*?>.*?</head>@siu',
        '@<style[^>]*?>.*?</style>@siu',
        '@<script[^>]*?>.*?</script>@siu',
        '@<object[^>]*?>.*?</object>@siu',
        '@<embed[^>]*?>.*?</embed>@siu',
        '@<applet[^>]*?>.*?</applet>@siu',
        '@<noframes[^>]*?>.*?</noframes>@siu',
        '@<noscript[^>]*?>.*?</noscript>@siu',
        '@<noembed[^>]*?>.*?</noembed>@siu',
        // Put a line break before block-level tags
        '@</?((address)|(blockquote)|(center)|(del))@iu',
        '@</?((div)|(h[1-6])|(ins)|(isindex)|(p)|(pre))@iu',
        '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
        '@</?((table)|(th)|(td)|(caption))@iu',
        '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
        '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
        '@</?((frameset)|(frame)|(iframe))@iu',
    ),
    array(
        // One replacement per pattern above: 9 removals, then 7 line breaks
        ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
        "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
    ),
    $html
);
echo strip_tags($html);
I don't agree with using a regex. I think it should use libxml2 and grab all of the text nodes.
Maybe have some sort of way to filter out certain nodes and their children as well.
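To make the suggestion above concrete, here is a minimal sketch of grabbing text nodes with PHP's DOM extension (which is built on libxml2) while skipping certain nodes and their children. The function name htmlToTextViaDom and the default skip list are hypothetical and only for illustration; they are not part of the SendGrid library.

```php
<?php
// Hypothetical helper for illustration; not an existing SendGrid function.
// Collects every text node that is not inside one of the skipped elements.
function htmlToTextViaDom($html, $skipTags = array('head', 'script', 'style', 'noscript'))
{
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();
    // Real-world HTML is often malformed; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration prefix is a common trick to make loadHTML read UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Build an XPath predicate that excludes descendants of skipped tags.
    $conditions = array();
    foreach ($skipTags as $tag) {
        $conditions[] = 'not(ancestor::' . $tag . ')';
    }
    $query = '//text()';
    if (count($conditions) > 0) {
        $query .= '[' . implode(' and ', $conditions) . ']';
    }

    $xpath = new DOMXPath($doc);
    $parts = array();
    foreach ($xpath->query($query) as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $parts[] = $text;
        }
    }

    // One line per non-empty text node.
    return implode("\n", $parts);
}
```

Passing a different $skipTags array (for example adding 'nav') is one possible way to filter out more of the page, which touches on the nav bar concern raised earlier.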
Sorry, I didn't get a chance to look at this over the weekend as intended.
My bad on my last PR. I made a new one.
I think regex has its place in PHP life. While some people may not agree with using regex to parse HTML, I think it helps with parsing HTML on the fly. I looked into this issue because it was bugging me. There are repos out there that are made for HTML to text; however, a lot of them aren't exactly open source. Also, the SendGrid code is very clean, and adding 50 functions just for parsing HTML isn't exactly the best option.
libxml2 would require adding it to your server. I am not sure that is the best way either, as it would be an additional requirement just to use the SendGrid mail API.
@cmckni3, @jopanel,
Generally, we want to limit dependencies and stick to the standard PHP library.
I think a simple regex is a good start, then we can build upon that functionality and possibly add an optional, more industrial strength tool.
I think libxml2 is included in newer versions of PHP unless it is explicitly disabled. It's a hard dependency in some web frameworks, e.g. Symfony.
@cmckni3,
Thanks for the follow up!
Perhaps we can check whether that dependency exists. If so, we use it; otherwise, we fall back to the regex.
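A rough sketch of that "use the dependency if present, otherwise fall back" idea might look like the following. All three function names are hypothetical, and the regex path is a deliberately simplified stand-in rather than the code from either PR.

```php
<?php
// All names below are hypothetical, for illustration only.

// DOM path: join text nodes that are not inside <script> or <style>.
function htmlToTextWithDom($html)
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $parts = array();
    foreach ($xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]') as $node) {
        $parts[] = $node->nodeValue;
    }
    return implode(' ', $parts);
}

// Regex path: drop script/style blocks, then strip the remaining tags.
function htmlToTextWithRegex($html)
{
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
    // Add a space before each tag so adjacent blocks do not run together.
    $text = strip_tags(str_replace('<', ' <', $html));
    // strip_tags leaves entities such as &amp; untouched, so decode them.
    return html_entity_decode($text, ENT_QUOTES, 'UTF-8');
}

// Dispatcher: prefer the DOM extension when it is loaded.
function htmlToPlainText($html)
{
    if (trim($html) === '') {
        return '';
    }
    $text = extension_loaded('dom')
        ? htmlToTextWithDom($html)
        : htmlToTextWithRegex($html);
    // Collapse runs of whitespace so both paths produce comparable output.
    return trim(preg_replace('/\s+/', ' ', $text));
}
```

Callers would only use htmlToPlainText($html); which path runs depends on the PHP build, so the two paths should be kept close enough in output that the difference does not matter for the plain text part of an email.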
I checked my local XAMPP: it includes the XML extension, which reports libxml2 version 2.8.0. My work server, running PHP 5.6, also has it, with version 2.7. I have not checked PHP 7 or PHP 5.4. However, it appears it may be standard with a PHP install. I will leave my pull request open for comments, and I hope to see your version of the HTML to plain text parser.
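For anyone who wants to repeat that check on another PHP version, a small sketch like this should report the same information. The function name libxmlSupport is made up for this example; extension_loaded() and the LIBXML_DOTTED_VERSION constant are standard PHP.

```php
<?php
// Hypothetical helper: reports libxml2 and DOM availability in this PHP build.
function libxmlSupport()
{
    $hasLibxml = extension_loaded('libxml');
    return array(
        // LIBXML_DOTTED_VERSION is defined only when the libxml extension is loaded.
        'libxml_version' => $hasLibxml ? LIBXML_DOTTED_VERSION : null,
        'dom_loaded'     => extension_loaded('dom'),
    );
}

$support = libxmlSupport();
echo 'libxml2 version: ' . ($support['libxml_version'] !== null ? $support['libxml_version'] : 'not available') . PHP_EOL;
echo 'DOM extension: ' . ($support['dom_loaded'] ? 'loaded' : 'missing') . PHP_EOL;
```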
Seems like it's pretty standard from the docs
Nice find @cmckni3 :)
Will you be taking up the PR?
You want me to look at the PR @thinkingserious? I don't understand your question.
My apologies for my lack of clarity @cmckni3.
I am asking if you will create a PR to solve this issue.
Sure, when I have a chance. Moved to a new apartment this week. Not sure when I'll get to it. Probably this weekend.
Sounds good, thank you!
@cmckni3 @jopanel @thinkingserious I am checking in to see if there are any questions or things we can do to help out here?
@mbernier
I am just hanging around to see if @cmckni3 is going to come through with his libxml2 version. For now, I have made the regex version in my pull request. I am hoping he comes through this weekend. I would love to check out his method.
I created a simple xpath version of it to get all text nodes and return the text content. It's at least a start down the road of handling links and other special cases.
Since there has been no activity on this issue since March 1, 2020, we are closing this issue. Please feel free to reopen or create a new issue if you still require assistance. Thank you!