Using a JS-based robots.txt parser (like this one), validate the file itself and apply existing SEO audits whenever applicable.
This integration has two parts:
Audit group: Crawling and indexing
Description: robots.txt is valid
Failure description: robots.txt is not valid
Help text: If your robots.txt file is malformed, crawlers may not be able to understand how you want your website to be crawled or indexed. Learn more.
Success conditions:
all, noindex

Add the following success condition:
Note that directives may be applied to the site as a whole or a specific page. Only fail if the current page is blocked from indexing (directly or indirectly).
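A rough sketch of what the check could look like (the names and result shape here are made up for illustration and are not the actual Lighthouse audit; `validateRobotsTxt` stands in for the JS-based parser mentioned above):

```js
// Sketch only: fetch /robots.txt for the page's origin and report whether it
// exists and parses cleanly. A missing file should not fail the audit.
async function auditRobotsTxt(pageUrl, validateRobotsTxt) {
  const robotsUrl = new URL('/robots.txt', pageUrl).href;
  const response = await fetch(robotsUrl);

  if (response.status === 404) {
    return {passed: true, notApplicable: true};
  }

  const content = await response.text();
  const errors = validateRobotsTxt(content); // e.g. [{line: 3, message: 'Unknown directive "disalow"'}]
  return {passed: errors.length === 0, errors, content};
}
```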
Do we want to let the user know if /robots.txt fails with something like HTTP 500? IMO, if the response code is in the 500-599 range we can safely report it as an issue.
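A tiny sketch of that check, assuming the response status is available to the audit:

```js
// Treat any 5xx response for /robots.txt as something worth surfacing:
// crawlers can't tell how the site wants to be crawled if the file errors out.
function isRobotsTxtServerError(status) {
  return status >= 500 && status < 600;
}
```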
User-agent: Googlebot
Disallow: / # everything is blocked for googlebot
User-agent: *
Disallow: # but allowed for everyone else
Should we fail in such a case? What about a robots.txt that only blocks, e.g., Googlebot-Image, or Bing, Yandex, DuckDuckGo?
For consistency with https://github.com/GoogleChrome/lighthouse/issues/3182, let's try to avoid distinguishing between crawlers. If the common case is * UAs, then that's the one we should check. If possible, it would be great to show a warning like "you passed, but you're blocking googlebot".
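A sketch of that approach, with a deliberately simplified notion of "blocked" (a bare "Disallow: /" in the group, ignoring path patterns and Allow precedence); the function names are made up:

```js
// Sketch: base the verdict on the `*` group only, but warn when googlebot
// specifically is blocked.
function parseGroups(robotsTxt) {
  const groups = [];
  let current = null;
  let collectingAgents = false;

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim(); // drop comments
    if (!line) continue;

    const idx = line.indexOf(':');
    if (idx === -1) continue;
    const name = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();

    if (name === 'user-agent') {
      // Consecutive user-agent lines share the record that follows them.
      if (!collectingAgents) {
        current = {agents: [], disallow: []};
        groups.push(current);
        collectingAgents = true;
      }
      current.agents.push(value.toLowerCase());
    } else {
      collectingAgents = false;
      if (current && name === 'disallow') current.disallow.push(value);
    }
  }
  return groups;
}

function blocksEverything(group) {
  return group !== undefined && group.disallow.includes('/');
}

function checkBlocking(robotsTxt) {
  const groups = parseGroups(robotsTxt);
  const starGroup = groups.find(g => g.agents.includes('*'));
  const googlebotGroup = groups.find(g => g.agents.includes('googlebot'));
  return {
    passed: !blocksEverything(starGroup),                   // verdict: the * group
    warnGooglebotBlocked: blocksEverything(googlebotGroup), // warning only
  };
}
```

Run against the example above, this would pass but set warnGooglebotBlocked.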
The alternative is to fail the audit when seeing anything resembling noindex, which seems too strict.
I'd also love to see the contents echoed back in the extra info table or similar. Just showing it to users is a sort of manual validation, even if the audit passes. As a secondary benefit, this would be great for data mining later.
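One possible shape for that, sketched with made-up field names rather than the real report format:

```js
// Sketch: echo the fetched file back alongside any errors so users can eyeball
// it even when the audit passes.
function buildExtraInfo(content, errors) {
  return {
    type: 'table',
    headings: [
      {key: 'line', text: 'Line'},
      {key: 'message', text: 'Error'},
    ],
    items: errors.map(e => ({line: e.line, message: e.message})),
    rawContent: content, // full robots.txt contents, also useful for data mining later
  };
}
```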
For the record, here is the full set of rules I've put together from various sources and implemented in the robots.txt validator:
only directives from the safelist are allowed:
'user-agent', 'disallow', // standard
'allow', 'sitemap', // universally supported
'crawl-delay', // yahoo, bing, yandex
'clean-param', 'host', // yandex
'request-rate', 'visit-time', 'noindex' // not officially supported, but used in the wild
there are no 'allow' or 'disallow' directives before 'user-agent'
I ran my validator against the top 1000 domains and got the following errors for 39 of them: https://gist.github.com/kdzwinel/b791967eb66d0e2925ea22c8ca14233a
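A condensed sketch of how the two rules above could be checked (an illustration in the same shape as the earlier sketch, not the actual validator):

```js
const KNOWN_DIRECTIVES = new Set([
  'user-agent', 'disallow',                // standard
  'allow', 'sitemap',                      // universally supported
  'crawl-delay',                           // yahoo, bing, yandex
  'clean-param', 'host',                   // yandex
  'request-rate', 'visit-time', 'noindex', // not officially supported, but used in the wild
]);

// Sketch of the two rules quoted above; the real validator has more checks.
function validateRobotsTxt(content) {
  const errors = [];
  let seenUserAgent = false;

  content.split(/\r?\n/).forEach((rawLine, index) => {
    const line = rawLine.split('#')[0].trim(); // comments and blank lines are fine
    if (!line) return;

    const idx = line.indexOf(':');
    if (idx === -1) {
      errors.push({line: index + 1, message: 'Syntax not understood (no ":")'});
      return;
    }
    const directive = line.slice(0, idx).trim().toLowerCase();

    // Rule 1: only directives from the safelist are allowed.
    if (!KNOWN_DIRECTIVES.has(directive)) {
      errors.push({line: index + 1, message: `Unknown directive "${directive}"`});
      return;
    }

    // Rule 2: no allow/disallow before the first user-agent.
    if (directive === 'user-agent') seenUserAgent = true;
    if ((directive === 'allow' || directive === 'disallow') && !seenUserAgent) {
      errors.push({line: index + 1, message: `"${directive}" found before any user-agent`});
    }
  });

  return errors;
}
```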