Sendgrid-php: Auto Generate Plain Text Content from HTML

Created on 4 Oct 2017 · 21 Comments · Source: sendgrid/sendgrid-php

We require both HTML and Plain Text content for improved deliverability. In some cases, only HTML may be available. We need a function that automatically converts a given HTML string to the Plain Text equivalent.

medium work in progress sendgrid enhancement

thinkingserious

All 21 comments

I'll take a stab at it. Shouldn't be too bad.

cmckni3 on 4 Oct 2017

🎉1

Thanks @cmckni3!

thinkingserious on 4 Oct 2017

I wonder what the scope of this is. Should it look for h tags and then the p tags that follow? Should it just grab whatever plain text it can find on the page? Maybe somebody is doing some weird HTML?

Nevertheless

echo strip_tags(preg_replace('#<script(.*?)>(.*?)</script>#is', '', $HTMLCODE));

I think something along those lines would suffice as a starting idea. However, a really good solution would extract only the parts that need to be displayed in plain text. Otherwise you may end up getting the nav bar before the content you need.
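To make the regex idea a bit more concrete, here is a minimal sketch that also drops style blocks, keeps line breaks between block-level tags, and decodes entities. The function name htmlToPlainTextRegex is hypothetical and not part of sendgrid-php; it only illustrates the approach and will not cope with every kind of markup.

```php
<?php
// Hypothetical helper for illustration; not part of the sendgrid-php API.
function htmlToPlainTextRegex(string $html): string
{
    // Drop <script> and <style> elements together with their contents.
    $text = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html);
    // Turn <br> and common block-level closing tags into newlines.
    $text = preg_replace('#<br\s*/?>|</(p|div|h[1-6]|li|tr)>#i', "\n", $text);
    // Remove the remaining tags, then decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($text), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of spaces and tabs, tidy whitespace around newlines,
    // and limit consecutive newlines to one blank line.
    $text = preg_replace('/[ \t]+/', ' ', $text);
    $text = preg_replace('/ ?\n ?/', "\n", $text);
    $text = preg_replace('/\n{3,}/', "\n\n", $text);
    return trim($text);
}

echo htmlToPlainTextRegex(
    '<h1>Hello</h1><script>alert(1)</script><p>Tom &amp; Jerry<br>say hi</p>'
), PHP_EOL;
```

This still returns everything on the page in document order, so a nav bar would come out ahead of the body text.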

jopanel on 5 Oct 2017

Hi,

I made a PR here. I hope I am making the PR in the right place. This seems like something for the new refactored version. However, there is currently no parsing of the data in case the input is HTML.

You can run a test like:

jopanel on 11 Oct 2017

I don't agree with using a regex. I think it should use libxml2 and grab all of the text nodes.

Maybe have some sort of way to filter out certain nodes and their children as well.
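A rough sketch of that idea, assuming PHP's DOM extension (which is built on libxml2) is available. The function name htmlToPlainTextDom and the default skip list are assumptions made for illustration, not anything the project has agreed on.

```php
<?php
// Hypothetical helper; the default skip list is an assumption.
function htmlToPlainTextDom(
    string $html,
    array $skipTags = ['script', 'style', 'head', 'nav']
): string {
    $blockTags = ['p', 'div', 'br', 'li', 'tr', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'];

    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them;
    // the XML prolog hints to libxml2 that the input is UTF-8.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $out = '';
    $walk = function (DOMNode $node) use (&$walk, &$out, $skipTags, $blockTags) {
        $name = strtolower($node->nodeName);
        if ($node instanceof DOMElement && in_array($name, $skipTags, true)) {
            return; // Skip this element and all of its children.
        }
        if ($node instanceof DOMText) {
            $out .= $node->nodeValue;
            return;
        }
        if ($node->hasChildNodes()) {
            foreach ($node->childNodes as $child) {
                $walk($child);
            }
        }
        if (in_array($name, $blockTags, true)) {
            $out .= "\n"; // Keep a line break after block-level elements.
        }
    };
    $walk($doc);

    // Collapse horizontal whitespace and drop blank lines.
    $out = preg_replace('/[ \t\r]+/', ' ', $out);
    $out = preg_replace('/ ?\n[ \n]*/', "\n", $out);
    return trim($out);
}

echo htmlToPlainTextDom(
    '<nav>Home | About</nav><h1>Welcome</h1><p>Thanks for <b>signing up</b>.</p>'
), PHP_EOL;
```

Passing a different $skipTags array is one way to filter out other nodes and their children.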

cmckni3 on 11 Oct 2017

Sorry, I didn't get a chance to look at this over the weekend as intended.

cmckni3 on 11 Oct 2017

My bad on my last PR. I made a new one.

I think regex has its place in PHP life. While some people may not agree with using regex to parse HTML, I think it helps with parsing HTML on the fly. I looked into this issue because it was bugging me. There are repos out there that are made for HTML to text. However, a lot of them aren't exactly open source. Also, the sendgrid code is very clean, and adding 50 functions just for parsing HTML isn't exactly the best option.

libxml2 would require adding it to your server. I am not sure that is the best way either, as it would be an additional requirement just to use the sendgrid mail API.

jopanel on 11 Oct 2017

@cmckni3, @jopanel,

Generally, we want to limit dependencies and stick to the standard PHP library.

I think a simple regex is a good start, then we can build upon that functionality and possibly add an optional, more industrial strength tool.

thinkingserious on 11 Oct 2017

👍1

I think libxml2 is included in newer versions of PHP unless it is explicitly disabled. It's a hard dependency in some web frameworks, e.g. Symfony.

cmckni3 on 11 Oct 2017

👍1

@cmckni3,

Thanks for the follow up!

Perhaps we can check if that dependency exists, if so, we use it, otherwise we fall back to the regex.
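As a sketch of what that check could look like: the three function names below are hypothetical, and both conversions are deliberately minimal (they flatten the result to a single line) so that the dispatch logic stays in focus.

```php
<?php
// All names here are hypothetical; none of this is existing sendgrid-php code.

// DOM-based conversion: removes script/style nodes, then reads the text content.
function htmlToTextDom(string $html): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->documentElement;
    if (!$loaded || $root === null) {
        return null; // Signal the caller to use the fallback.
    }
    $xpath = new DOMXPath($doc);
    foreach (iterator_to_array($xpath->query('//script | //style')) as $node) {
        $node->parentNode->removeChild($node);
    }
    return trim(preg_replace('/\s+/', ' ', $root->textContent));
}

// Regex-based fallback that needs nothing beyond the PHP core.
function htmlToTextRegex(string $html): string
{
    $text = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html);
    $text = html_entity_decode(strip_tags($text), ENT_QUOTES, 'UTF-8');
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Use the DOM extension when it is loaded, otherwise fall back to the regex.
function htmlToText(string $html): string
{
    if (extension_loaded('dom') && class_exists('DOMDocument')) {
        $text = htmlToTextDom($html);
        if ($text !== null) {
            return $text;
        }
    }
    return htmlToTextRegex($html);
}

echo htmlToText('<p>Hello <b>world</b></p><script>track()</script>'), PHP_EOL;
```

Callers only ever see htmlToText, so the DOM path could later be swapped for something stronger without changing the public surface.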

thinkingserious on 11 Oct 2017

👍2

I checked my local XAMPP; it does include the XML extension, which reports libxml2 version 2.8.0. My work environment server running PHP 5.6 also has it, with version 2.7. I have not checked PHP 7 or PHP 5.4. However, it appears it may be standard with a PHP install. I will leave my pull request open for comments, and I hope to see your version of the HTML-to-plain-text parser.
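For anyone who wants to run the same check on their own install, a small sketch follows. The function name libxmlSupport is hypothetical; extension_loaded() and the LIBXML_DOTTED_VERSION constant are standard PHP.

```php
<?php
// Hypothetical helper that reports whether libxml2 and the DOM extension are present.
function libxmlSupport(): array
{
    $hasLibxml = extension_loaded('libxml');
    return [
        // The constant is only defined when the libxml extension is loaded.
        'libxml' => $hasLibxml ? LIBXML_DOTTED_VERSION : null,
        'dom'    => extension_loaded('dom'),
    ];
}

$info = libxmlSupport();
echo 'libxml2 version: ', $info['libxml'] ?? 'not available', PHP_EOL;
echo 'DOM extension: ', $info['dom'] ? 'loaded' : 'missing', PHP_EOL;
```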

jopanel on 11 Oct 2017

👍1

Seems like it's pretty standard from the docs

cmckni3 on 11 Oct 2017

👍1

Nice find @cmckni3 :)

Will you be taking up the PR?

thinkingserious on 11 Oct 2017

You want me to look at the PR @thinkingserious? I don't understand your question.

cmckni3 on 11 Oct 2017

My apologies for my lack of clarity @cmckni3.

I am asking if you will create a PR to solve this issue.

thinkingserious on 11 Oct 2017

Sure, when I have a chance. Moved to a new apartment this week. Not sure when I'll get to it. Probably this weekend.

cmckni3 on 11 Oct 2017

👍1

Sounds good, thank you!

thinkingserious on 11 Oct 2017

@cmckni3 @jopanel @thinkingserious I am checking in to see if there are any questions, or anything we can do to help out here.

mbernier on 20 Oct 2017

@mbernier

I am just hanging around seeing if @cmckni3 is going to come through with his libxml2 version. For now I have made the regex version on my pull request. I am hoping he comes through this weekend. I would love to check out his method.

jopanel on 21 Oct 2017

I created a simple xpath version of it to get all text nodes and return the text content. It's at least a start down the road of handling links and other special cases.
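For readers following along, an XPath-based approach of this kind could be sketched as below. This is not the code from the PR; the function name htmlToTextXPath and the "text (URL)" link format are assumptions chosen for illustration.

```php
<?php
// Hypothetical sketch; not the implementation from the pull request.
function htmlToTextXPath(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Special case for links: append the target after the link text
    // so the URL survives the conversion to plain text.
    foreach ($xpath->query('//a[@href]') as $link) {
        $suffix = ' (' . $link->getAttribute('href') . ')';
        $link->appendChild($doc->createTextNode($suffix));
    }

    // Select every text node in the body that is not inside script or style.
    $query = '//body//text()[not(ancestor::script) and not(ancestor::style)]';
    $parts = [];
    foreach ($xpath->query($query) as $textNode) {
        $parts[] = $textNode->nodeValue;
    }

    // Join the pieces and collapse whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', implode('', $parts)));
}

echo htmlToTextXPath(
    '<p>Visit <a href="https://example.com">our site</a> today.</p>'
), PHP_EOL;
```

Other special cases, such as images with alt text or list items, could be handled the same way by rewriting the matching nodes before the text nodes are collected.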

cmckni3 on 1 Nov 2017

Since there has been no activity on this issue since March 1, 2020, we are closing this issue. Please feel free to reopen or create a new issue if you still require assistance. Thank you!

thinkingserious on 11 Mar 2021