Amphtml: Question: Blocking bots and scrapers from reading AMP pages from third-party caches

Created on 5 Mar 2017 · 13 comments · Source: ampproject/amphtml

(Sorry for not sticking to the issue guidelines, as this is a generic concern.)

Context:
I am impressed by the way AMP content is delivered. However, as a niche content creator, I am a little concerned about the scraping door that AMP caches open up. I don't want scrapers and bots (except a few) to crawl the site, or to store and _re-use content_.

Question:

  1. Is there documentation/a reference/a case study describing how the _Google AMP Cache_ or a third-party cache validates humans vs. bots (e.g. via reCAPTCHA)?

    • based on this I can choose to allow only legitimate third-party caches to crawl my origin server for AMP content

    • it would help propagate standards to third-party caches

Feature Request:

  1. A way to configure rate limiting or other access controls on reads from a third-party AMP cache. Something like an amp-manifest embedded in AMP pages that declares rate limits, humans-only access-control logic, bot controls (like robots.txt), etc.

    • allow publishers to be in control of cache access.

As a standard practice, third-party cache providers should also document how they distinguish bots and scrapers, to ease publishers' concerns about their content.

PS:

I have gone through the amp-access docs and found that currently only a browser-visibility-based option is available; a server-side access-control option is still being contemplated.

Labels: Discussion, Question, caching

All 13 comments

I can't answer for Google, but I can tell you that cdn.ampproject.org rate limits, because we hit it the other day loading our own pages (we were forcing updates by reading expired pages).

Whoo, don't know who to assign this to, so @adelinamart gets it.

@dvoytenko @rudygalfi is this something amp-access handles?

@jalajc the AMP team is planning to add support for reCAPTCHA in AMP as per #2273
Let me know if that answers your question. Also, you can read more about how the AMP cache works at https://github.com/ampproject/amphtml/blob/master/spec/amp-cache-guidelines.md

It sounds like the use case here is somewhat different. The desire is to protect against bot scraping, which is not what amp-access/reCAPTCHA do - they do client-side obfuscation or user-action guarding.

I believe all AMP Caches are required to fully respect robots.txt and thus have some protection from scraping. /to @vitaliybl to clarify.
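
As a side note, here is a minimal sketch of how a well-behaved crawler could honor a cache's robots.txt before fetching, using Python's standard-library parser; the example document URL format below is an assumption for illustration:

```python
# Minimal sketch: consult the AMP Cache's robots.txt the way a well-behaved
# crawler would, using Python's standard-library parser. The example document
# URL format is an assumption for illustration.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://cdn.ampproject.org/robots.txt")
rp.read()  # fetch and parse the cache's robots.txt

doc_url = "https://cdn.ampproject.org/c/s/example.com/amp/article.html"
print(rp.can_fetch("MyCrawlerBot", doc_url))  # False if the cache disallows bots
```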

> @jalajc the AMP team is planning to add support for reCAPTCHA in AMP as per #2273
> Let me know if that answers your question. Also, you can read more about how the AMP cache works at

Not completely. Scrapers and bots are not practically affected by component visibility in the UI. JavaScript is always 'optional': good browsers and people will mostly allow scripts, but scrapers certainly find ways to fetch content without JavaScript.
@dvoytenko's comment is worth noting in this regard.

So a workable solution is an _intermediate service_ (from the cache provider) that first validates that a call is legitimate and only then allows content to be fetched, rather than content being fetched from the cache and the browser (viewer) validating whether the call is legitimate? Not sure how reCAPTCHA will work in AMP.

Server-side rate limits are also desired, in case a scraper can abuse the cache faster than it could abuse the origin.
Note that if a scraper/bot reads the AMP caches _programmatically_, the publisher cannot even find a trace (JavaScript-based analytics won't run, and the cache's server logs are not available to the publisher). This is scary. However, if a scraper/bot reads from my server, I can always introduce rate limits and block access at the webserver level (such as Apache or nginx), because I have _visibility_ and _control_ over who is visiting my server.
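
To make the origin-side option above concrete, here is a minimal sketch of a per-IP rate limiter a publisher could run in front of their own webserver; the class name, thresholds, and integration point are hypothetical, not part of AMP or of any particular server:

```python
# Hypothetical sketch of the origin-side control described above: a fixed-window,
# per-IP rate limiter a publisher could run in front of their own webserver.
import time
from collections import defaultdict

class PerIpRateLimiter:
    def __init__(self, max_requests: int = 60, window_seconds: int = 60):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(list)  # ip -> timestamps of recent requests

    def allow(self, ip: str) -> bool:
        now = time.time()
        # Drop timestamps that have fallen out of the current window.
        self.hits[ip] = [t for t in self.hits[ip] if now - t < self.window]
        if len(self.hits[ip]) >= self.max_requests:
            return False  # over the limit: block or respond with 429
        self.hits[ip].append(now)
        return True

limiter = PerIpRateLimiter(max_requests=60, window_seconds=60)
# In a request handler: if not limiter.allow(client_ip): return a 429 response.
```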

@dvoytenko the problem with robots.txt is that, except for the top few, I have seldom found bots that respect robots.txt. And in many cases you won't know whether it is a bot/scraper, as the UA can be faked.
But yes, if strict enforcement comes from AMP cache implementations, along with rate limits, this could work.

> Note that if a scraper/bot reads the AMP caches programmatically, the publisher cannot even find a trace (JavaScript-based analytics won't run, and the cache's server logs are not available to the publisher). This is scary.

This is a hard one to solve, particularly as AMP caches proliferate. If I were making an evil bot I'd distribute its reads across all the AMP caches, since they can all serve the same content. This makes it really hard to prevent scraping if the entity doing it has good browser emulation.

Adding black/white lists would be a nightmare to manage, and would be very brittle and error-prone.

For security reasons, AMP caches can't report page requests to publishers. Doing so would give the publisher visibility into their page being pre-rendered in a search result when the user hasn't yet requested to view it. That would defeat one of the main reasons the cache exists.

@jalajc there are a lot of issues here, I'll try to unpack.

  1. The Google AMP Cache is closed to well-behaved bots via its robots.txt.
  2. Today, we don't do anything special to distinguish between bots and real users.
  3. The Google AMP Cache tries to respect user privacy and does not use cookies. That makes it near-impossible to persist a user/bot classification, and performing classification on every request via a captcha or the like is obviously not possible.
  4. amp-access server-side, by design, would mask the information that you'd want to see to make access decisions (you'd just see the Google UA in your logs).
  5. Finally, you can authenticate the Google AMP Cache just like you can authenticate Googlebot, by performing a reverse-DNS lookup on the connecting IP.

I will try to find time to talk to folks who fight scrapers in Google Search, and see if they have any ideas that could be implemented within the product constraints (particularly, 3).
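
On item 5, a minimal sketch, assuming the documented Googlebot-style verification flow (reverse DNS, domain check, forward confirmation) also applies to AMP Cache fetches; the domain suffixes and function name are assumptions rather than an official API:

```python
# Minimal sketch of item 5: reverse-DNS the connecting IP, check the hostname is
# under a Google-controlled domain, then forward-resolve it to confirm it maps
# back to the same IP. Domain suffixes and function name are assumptions.
import socket

def connecting_ip_is_google(ip: str) -> bool:
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse (PTR) lookup
    except socket.herror:
        return False
    if not hostname.endswith((".google.com", ".googlebot.com")):
        return False
    try:
        # Forward-confirm; production code might compare against all addresses
        # returned by socket.getaddrinfo() rather than a single A record.
        return socket.gethostbyname(hostname) == ip
    except socket.gaierror:
        return False

# Example: only relax origin-side access controls for requests where
# connecting_ip_is_google(request_ip) is True.
```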

One more thing before I forget, @dvoytenko: the Google AMP Cache itself does not obey robots.txt, because it acts on behalf of a user.

This issue seems to have been in Pending Triage for a while. @vitaliybl Please triage this to an appropriate milestone.

I was looking for the same. I tend to use an AJAX call after N seconds. Now I see you can do something using triggers - trackPageview - visibilitySpec - https://ampbyexample.com/components/amp-analytics/ (the cat sample).

Closing this as it's an AMP Cache issue, not AMP Project related, and there is no actionable recommendation for what to implement.
