Amphtml: Validation should warn, not fail, for malformed hrefs

Created on 29 Nov 2016  路  17Comments  路  Source: ampproject/amphtml

Currently an invalid or malformed href value in a link will cause a validation error for AMP pages rather than a warning ("Prohibited or invalid use of HTML tag", specifically.) This seems overly aggressive, since the consequence of such a link is simply that the link won't be clickable. Currently only javascript: hrefs are explicitly disallowed in AMP I believe, so unclear why malformed hrefs are rejected as well.

blob

Could this be relaxed to produce a warning rather than a complete rejection of the document?

cc @cramforce

When Possible Feature Request caching

Most helpful comment

@rafaelhbarros , the 'empty' href case is now valid. It's the easier case as it doesn't require cache changes. Leaving the issue open for the 'invalid' href case.

All 17 comments

@src-code Could you paste actual URL that was rejected. It is very hard to see the details in the blurry screenshot.

It looks like http://www.heritagehockey.com,/. I agree with the sentiment that a broken link shouldn't cause the entire page to be invalidated. Unfortunately it's not clear that it's safe to assume that javascript: and similar are the only forms that can inject javascript into link hrefs.

@Gregable what Soy (Closure Templates) does in this case is instead of declaring the doc invalid is to sanitize the link (Basically rewriting to about:blank)

I'm wondering if we can just accept input starting with http: and https:, though. Seems safe.

@cramforce Would that mean that we would also allow javascript: and remove it only in the cache? If not, why would we differ?

@Gregable I think it makes sense to treat things that are not desired differently from things where we fail out of an abundance of caution.

Extending on invalid hrefs, I'd say that empty hrefs (<a href="">link</a>) should be valid as well (or be a warning, not a failure).

@rafaelhbarros We have tried to avoid using validator warnings for 'linter' type issues, instead using them sparingly only for issues that will become errors in the future (deprecation). The rationale is to keep the signal-to-noise ratio high.

If we implement this, it would be neither an error nor a warning, at least in validation.

@rafaelhbarros , the 'empty' href case is now valid. It's the easier case as it doesn't require cache changes. Leaving the issue open for the 'invalid' href case.

@Gregable who is working on the 'invalid' href case? Thanks

This issue seems to be in Pending Triage for awhile. @Gregable Please triage this to an appropriate milestone.

This issue seems to be in Pending Triage for awhile. @Gregable Please triage this to an appropriate milestone.

any update on this one, making page amp-invalid for just malformed url should not happen

Since this was originally filed, we have relaxed the rules quite a bit for malformed URLs. The requirement has not been removed, but very few documents now fail due to this. I did a quick test and it's in the range of 1/5,000 documents fail because of this issue.

The ones I did find were pretty horribly mangled, with URLs like
http://<iframe width="560" height="315" src="https://www.youtube.com/embed/abc123" frameborder="0" allowfullscreen></iframe>

It's not clear to me that replacing these links with an empty href would actually fix the document in all cases.

Yes replacing with empty href sounds valid and adding these pages under warnings is better then making pages invalid
And below is one of the malformed url that is making the page invalid in my case

Malformed URL 'http://Facethebook.co.in (Under Process)' for attribute 'href' in tag 'a'.

I suppose this malformed url should not create validation failure

Edit: I read it now the warning cannot be added for such cases

Replacing with empty href only solves the security risk of turning the URL into a vector for arbitrary javascript.

This is still a malformed URL. Replacing with an empty href does not solve the problem that the document is malformed. The author's intent was clearly <a href="http://facethebook.co.in/">, not <a href>.

There are a number of validator rules that catch issues like these. We fail on some unparseable CSS, malformed tags, etc. For common cases where the correct fix is clear and easy to make, I fully agree. For example, <div><span></div> can be unambiguously corrected to <div><span><span></div>, which the cache does. Uncorrectable hrefs are both rare and ambiguous. It's unclear why this particular issue would be something we'd correct, while simultaneously not allowing + removing onclick attributes.

I am closing this issue. The changes to accept a wider range of URLs seem to have resolved most of the concerns folks have. Feel free to reopen if there are remaining concerns.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

lisotton picture lisotton  路  52Comments

retornam picture retornam  路  52Comments

ericlindley-g picture ericlindley-g  路  60Comments

zhouyx picture zhouyx  路  103Comments

zhouyx picture zhouyx  路  60Comments