Amphtml: Provide a programmatic validation API

Created on 12 Feb 2016 · 19 Comments · Source: ampproject/amphtml

Could https://cdn.ampproject.org/v0/validator.js be used to provide programmatic validation?

If all the validation logic is there, it would make things (#937, #1967) a lot easier if we could call an endpoint like

https://cdn.ampproject.org/validate?url=...&callback=...

Automating the validation of lists of URLs would become a lot easier from anywhere that can make an HTTP request, such as Google Apps scripts that parse a spreadsheet of AMP URLs.

CC @pietergreyling, @jmadler, @adewale

Feature Request caching


dandv

All 19 comments

@dandv please sketch out the API you'd like to see. What would be the output of this endpoint?

adewale on 12 Feb 2016

+1 for wanting an API.

I'd like to see it accept either the URL of an AMP page or the AMP HTML in a POST body.

GET https://cdn.ampproject.org/v/?url= .... &callback=jsonpcallback
POST https://cdn.ampproject.org/v/  /*data in post body */

Result - something like

{
     "version": 0,   /* The amp version detected/validated */
     "source": "https://some.amp.site/page.html",   /* or "POST" if post data */
     "canonical": "http:// ...",  /* Canonical url from amp page */
     "valid":  false, /* or true */
     "extensions": [ /* array of amp extension detected */ ],   /* Nice to have */
     "errors": [
        {
              "position":  "10:2",
              "error": "The text (CDATA) inside tag 'author stylesheet' matches 'CSS !important', which is disallowed.",
              "help" : "https://github.com/ampproject/amphtml/blob/master/spec/amp-html-format.md#stylesheets"
        },
        {
              "position": "9:0",
              "warning": "AMP deprecated <style> boilerplate text detected",
              "help": "http:// ... "
        }
     ]
}

Nice to have: JSONP support, optional HTML output.

jpettitt on 12 Feb 2016

👍6
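As an illustration of the response shape jpettitt proposes above, here is a minimal Python sketch of how a client might consume it. No such endpoint exists, so the sample is hand-built from the proposal with the comments removed, and the `Issue` type and `parse_issues` helper are hypothetical names. It assumes each entry in "errors" carries either an "error" or a "warning" key, and that "position" is always "line:col".

```python
from typing import List, NamedTuple


class Issue(NamedTuple):
    severity: str  # "error" or "warning"
    line: int
    col: int
    message: str
    help_url: str


def parse_issues(result: dict) -> List[Issue]:
    """Flatten the proposed "errors" array into typed records."""
    issues = []
    for entry in result.get("errors", []):
        # Assumption: an entry has an "error" key or, failing that, a "warning" key.
        severity = "error" if "error" in entry else "warning"
        line, col = (int(part) for part in entry["position"].split(":"))
        issues.append(
            Issue(severity, line, col, entry[severity], entry.get("help", ""))
        )
    return issues


# Hand-built sample in the proposed shape (not output from a real service).
sample = {
    "version": 0,
    "source": "https://some.amp.site/page.html",
    "valid": False,
    "errors": [
        {
            "position": "10:2",
            "error": "The text (CDATA) inside tag 'author stylesheet' "
                     "matches 'CSS !important', which is disallowed.",
            "help": "https://github.com/ampproject/amphtml/blob/master/spec/amp-html-format.md#stylesheets",
        },
        {
            "position": "9:0",
            "warning": "AMP deprecated <style> boilerplate text detected",
        },
    ],
}

issues = parse_issues(sample)
for issue in issues:
    print(f"{issue.severity} at {issue.line}:{issue.col}: {issue.message}")
```

Splitting "position" into integers makes it easy to sort issues or map them back to source lines, which a combined "10:2" string does not.
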

I think this might be a good idea, but nothing like that exists at the moment.

For now, it is possible to run the javascript validator in other contexts. I left a comment here:
https://github.com/ampproject/amphtml/issues/999#issuecomment-171787638 with an example.

You can also build it from source. There are instructions here:
https://github.com/ampproject/amphtml/tree/master/validator

Gregable on 12 Feb 2016

Following up, @dandv @jpettitt. What is the use case in mind here? It seems like using either the Node.js library (https://www.npmjs.com/package/amphtml-validator) or using validator.js directly is going to be faster in all cases than shipping an HTML document over the wire to cdn.ampproject.org and waiting for a response.

Gregable on 11 Aug 2016

@Gregable: at the time, I was working on a piece of Google Apps Script attached to a trix with partner AMP URLs to test. In that environment, you can't import Node modules, so fetching the validation result over HTTP is the only realistic option.

The HTML document doesn't have to be shipped over; the URLs are publicly accessible, so the call would submit the URL for validation and receive a JSON result similar to what @jpettitt described.

dandv on 10 Sep 2016
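To make dandv's "submit the URL, receive JSON" flow concrete, here is a small Python sketch that only builds the request URL. The `/validate` endpoint is the one proposed at the top of this issue and does not exist, and `build_validation_request` is a hypothetical helper name, so treat this as a sketch of the calling convention rather than a working client.

```python
from urllib.parse import urlencode

# Hypothetical endpoint from the issue description; it was only ever proposed.
VALIDATE_ENDPOINT = "https://cdn.ampproject.org/validate"


def build_validation_request(amp_url: str, callback: str = "") -> str:
    """Return the GET URL a client would call to validate amp_url."""
    params = {"url": amp_url}
    if callback:
        # Optional JSONP callback name, as in the proposal.
        params["callback"] = callback
    # urlencode percent-encodes the page URL so it is safe inside a query string.
    return VALIDATE_ENDPOINT + "?" + urlencode(params)


request_url = build_validation_request("https://some.amp.site/page.html")
print(request_url)
```

Any environment that can issue an HTTP GET, including a spreadsheet script, could call a URL built this way without importing the validator itself.
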

CloudFlare has added an endpoint to our beta cache with JSON output. We are happy to tweak the output based on feedback.

curl 'https://cdn.edgeamp.org/q/' -X POST --data-binary @invalid_amp.html -H 'Content-Type: text/html; charset=UTF-8'
curl https://cdn.edgeamp.org/q/cfedgeorigin.com/amp/invalid_amp.html

curl https://cdn.edgeamp.org/q/cfedgeorigin.com/amp/invalid_amp.html 2>/dev/null | python -mjson.tool
{
    "errors": [
        {
            "code": "MANDATORY_TAG_MISSING",
            "col": 0,
            "error": "The mandatory tag 'noscript enclosure for boilerplate' is missing or incorrect.",
            "help": "https://github.com/ampproject/amphtml/blob/master/spec/amp-boilerplate.md",
            "line": 12
        },
        {
            "code": "MANDATORY_TAG_MISSING",
            "col": 0,
            "error": "The mandatory tag 'head > style[amp-boilerplate]' is missing or incorrect.",
            "help": "https://github.com/ampproject/amphtml/blob/master/spec/amp-boilerplate.md",
            "line": 12
        },
        {
            "code": "MANDATORY_TAG_MISSING",
            "col": 0,
            "error": "The mandatory tag 'noscript > style[amp-boilerplate]' is missing or incorrect.",
            "help": "https://github.com/ampproject/amphtml/blob/master/spec/amp-boilerplate.md",
            "line": 12
        }
    ],
    "source": "http://cfedgeorigin.com/amp/invalid_amp.html",
    "valid": false,
    "version": "1471559432224"
}

dknecht on 10 Sep 2016
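For readers who want to post-process output like dknecht's, here is a hedged Python sketch that turns such a response into one line per error. The sample string is a trimmed copy of the response shown above (two of the three errors), and the `summarize` helper is a hypothetical name. It assumes every error carries "line", "col", "code", and "error" fields, as in that sample.

```python
import json
from typing import List


def summarize(report_json: str) -> List[str]:
    """Format each error as 'line:col CODE message'."""
    report = json.loads(report_json)
    return [
        f'{err["line"]}:{err["col"]} {err["code"]} {err["error"]}'
        for err in report.get("errors", [])
    ]


# Trimmed copy of the response shown in the comment above.
sample_response = """
{
    "errors": [
        {
            "code": "MANDATORY_TAG_MISSING",
            "col": 0,
            "error": "The mandatory tag 'noscript enclosure for boilerplate' is missing or incorrect.",
            "help": "https://github.com/ampproject/amphtml/blob/master/spec/amp-boilerplate.md",
            "line": 12
        },
        {
            "code": "MANDATORY_TAG_MISSING",
            "col": 0,
            "error": "The mandatory tag 'head > style[amp-boilerplate]' is missing or incorrect.",
            "help": "https://github.com/ampproject/amphtml/blob/master/spec/amp-boilerplate.md",
            "line": 12
        }
    ],
    "source": "http://cfedgeorigin.com/amp/invalid_amp.html",
    "valid": false,
    "version": "1471559432224"
}
"""

summary = summarize(sample_response)
for line in summary:
    print(line)
```
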

Looking into this. A few questions for those interested in this thread:

One of the main issues is avoiding abuse (i.e., DoS) via Google's fetching. We could solve this by:

1) Requiring that the API request provide the contents, rather than the URL, and we perform validation on the provided string. This wouldn't work for @dandv's case. Would this still be useful?
2) Only serving the cached validation result, which would be similar to just fetching the document via cdn.ampproject.org. This wouldn't support testing changes to a site, only seeing a snapshot in time. Would this still be useful?

Gregable on 22 Sep 2016

1) Requiring that the API request provide the contents, rather than the URL, and we perform validation on the provided string. This wouldn't work for @dandv's case. Would this still be useful?

This could be useful especially in pre-production scenarios where the candidate AMP page is not yet hosted.

2) Only serving the cached validation result, which would be similar to just fetching the document via cdn.ampproject.org. This wouldn't support testing changes to a site, only seeing a snapshot in time. Would this still be useful?

In a development scenario, which is iterative, this might not be so useful due to the time lag (unless there is a mechanism to trigger a refresh, which might bring us back full circle).
In a production troubleshooting use case, this could be pretty useful if I understand correctly.
- This would imply caching validation results in FAIL cases as well as PASS regardless of whether the document was actually rejected (not cached) as in the FAIL case - correct?
- What would we store? The full validator output as at the time of validation?
- Would it make sense to also tag this with the validator spec file revision?

pietergreyling on 22 Sep 2016

This would imply caching validation results in FAIL cases as well as PASS regardless of whether the document was actually rejected (not cached) as in the FAIL case - correct?

What would we store? The full validator output as at the time of validation?

This is true. I don't think the cache currently stores FAIL cases; I think it retries on every request.

Would it make sense to also tag this with the validator spec file revision?

It wouldn't hurt, but I don't get the feeling that most developers are using this revision number for much. It's really used only for development of the validator itself; we are quite careful about backwards-incompatible changes these days.

Gregable on 22 Sep 2016

One of the main issues is avoiding abuse (ie: DOS) via Google's fetching.

Could DoS attacks via the "validate AMP URL" API be avoided via other anti-DoS mechanisms?

For example, if we employ rate limiting, we can reasonably limit requests to a pretty low rate, given that:

- one AMP document is unlikely to realistically change more often than once per ~5 seconds, so validating more often than that wouldn't be useful
- multiple documents across a site tend to be generated from a small number of templates, so repeated sweeping validations across a domain can also be limited. API users would choose a representative set of AMP pages that exercise the templates.

If we implement domain whitelisting, we could have a verification mechanism similar to verifying site ownership in Google Webmaster.

dandv on 23 Sep 2016

👍1
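The per-document limit dandv describes above could be sketched as follows in Python. This is only an illustration of the idea, not how any AMP service works: the `PerUrlRateLimiter` class is a hypothetical name, the 5-second interval is taken from the comment's rough estimate, and a real service would also need per-domain limits and eviction of old entries.

```python
import time
from typing import Callable, Dict


class PerUrlRateLimiter:
    """Allow at most one validation per URL every min_interval_s seconds."""

    def __init__(
        self,
        min_interval_s: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._min_interval_s = min_interval_s
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._last_allowed: Dict[str, float] = {}

    def allow(self, url: str) -> bool:
        now = self._clock()
        last = self._last_allowed.get(url)
        if last is not None and now - last < self._min_interval_s:
            return False  # too soon since the last accepted request for this URL
        self._last_allowed[url] = now
        return True


# Demonstration with a fake clock instead of real waiting.
fake_now = [0.0]
limiter = PerUrlRateLimiter(min_interval_s=5.0, clock=lambda: fake_now[0])
first = limiter.allow("https://some.amp.site/page.html")
fake_now[0] = 2.0
second = limiter.allow("https://some.amp.site/page.html")
fake_now[0] = 5.0
third = limiter.allow("https://some.amp.site/page.html")
print(first, second, third)
```

Rejected requests do not reset the timer here, so a client that polls too often is not locked out indefinitely.
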

I got here as I was looking for a process that could trigger a validation check when a git push occurs. For now I'll probably implement the validator locally, but having an official URL, even if it just returned pass/fail, would be great!

brianlayman on 4 Nov 2016

@brianlayman
This could be of interest:
https://github.com/ampproject/ampbench
Walkthrough article: Debug AMP pages with AMPBench, an open source app from the AMP Project.
Look for this section in the walkthrough article: "Experimental AMPBench JSON APIs"

The code is here:
https://github.com/ampproject/ampbench/blob/master/ampbench_routes.js#L393

AMPBench also supports this kind of thing:

$ curl https://ampbench.appspot.com/raw?url=https://ampbyexample.com
PASS

pietergreyling on 4 Nov 2016

🎉5
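For the git-push use case brianlayman mentions, the AMPBench raw endpoint shown above could be wrapped roughly like this. It is a sketch under assumptions: the thread only shows the PASS response, so any other body is treated as a failure, the helper names are hypothetical, and the public ampbench.appspot.com instance may not be available, in which case the base URL would point at a self-hosted instance.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base URL from the curl example in the comment above.
AMPBENCH_RAW = "https://ampbench.appspot.com/raw?url="


def raw_check_url(amp_url: str) -> str:
    """Build the raw-check URL, keeping ':' and '/' readable as in the curl example."""
    return AMPBENCH_RAW + quote(amp_url, safe=":/")


def exit_code_for(raw_body: str) -> int:
    """Map the raw response to a shell-style exit code.

    Assumption: only the exact body "PASS" means success.
    """
    return 0 if raw_body.strip() == "PASS" else 1


def check(amp_url: str, timeout_s: float = 30.0) -> int:
    """Fetch the raw result over HTTP and return an exit code (needs network access)."""
    with urlopen(raw_check_url(amp_url), timeout=timeout_s) as response:
        return exit_code_for(response.read().decode("utf-8"))


print(raw_check_url("https://ampbyexample.com"))
```

A hook script could call `sys.exit(check(url))` for each representative page so that a failing validation blocks the push.
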

@pietergreyling That looks to be perfect! Thank you for the supporting detail too. Time to start coding...

brianlayman on 4 Nov 2016

👍1

@pietergreyling — this looks great! Are there plans to move the Experimental AMPBench JSON API to a stable status? We're seeing folks ask about this in the context of integrating into their platforms for faster in-context validation.

ericlindley-g on 2 Dec 2016

@ericlindley-g At this point the recommended way of integrating AMPBench into a custom validation workflow using the JSON API is to run an instance of AMPBench on a dedicated hosting platform. The latter can be public or on an in-house server behind a firewall restricting access to internal clients.

Instructions to do this are here:
https://github.com/ampproject/ampbench#getting-the-code-and-running-it
https://github.com/ampproject/ampbench#deploying-ampbench-to-the-cloud

This also has the advantages of allowing the source code of the API to be tuned according to custom needs and the creation of Pull Requests to the AMPBench repository for any improvements based on such changes.

pietergreyling on 6 Dec 2016

@pietergreyling Sounds good—thanks!

ericlindley-g on 10 Dec 2016

I think we're pretty happy with the options available now, so I'm going to go ahead and close this out. Feel free to reopen if I'm wrong.

Gregable on 8 Feb 2017

I'd still really appreciate this. There is currently no way to know if a page is valid using standard JS from the console log.

You have to use the npm package / CLI or be a human being and use the developer tools and/or read the console to find out if a page is valid. I think a curl request is okay, but markup or a JS object on window would be much more useful.

Could there be a class selector on the HTML (html.amp-valid vs html.amp-invalid) and/or an object accessible via window which holds the validation state when developers append #development=1 (i.e., window.AMP.validator.status)?

This would be really useful for E2E testing via things like Robot Framework and Ghost Inspector. There's no documented JS object we can reliably count on to store the validation status, and the JS console doesn't have access to previous logs in order to check for "AMP Validation Successful".

johnnyshankman on 16 May 2017

@Gregable are there any new options for what I've described since this issue was opened? Looking for a reliable way to check validation status using just vanilla JS after appending #development=1

There's a window.amp.validator object, but it doesn't seem to hold the current state. validateUrlAndLog doesn't return anything because it's async and doesn't return the promise. validateString isn't useful because the document has already been modified by AMP, and the AMP validator won't take in HTML that has clearly already had AMP run over it.

johnnyshankman on 16 May 2017
