Links in docs get regularly broken (example), and it should be possible to have a script that iterates over all links in the docs and checks for an HTTP response code of < 400.
Probably not something we want to run as part of the CI, but I could see the script being run on-demand regularly.
I wonder if this might be something for @nodejs/website to figure out.
IIRC @mikeal once said he created something for crawling every link found on a website?
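In the meantime, a minimal sketch of the status-code part could look like this (the hard-coded URL list is only illustrative; a real script would collect the links from the rendered docs):

```js
'use strict';
// Sketch: issue a HEAD request per link and flag anything that answers
// with a status code >= 400. Some servers reject HEAD; a GET fallback
// is omitted here for brevity.
const https = require('https');

// Illustrative list only; a real run would extract links from the docs.
const links = [
  'https://nodejs.org/api/fs.html',
  'https://nodejs.org/api/http.html',
];

for (const link of links) {
  https.request(link, { method: 'HEAD' }, (res) => {
    if (res.statusCode >= 400)
      console.log(`${res.statusCode} ${link}`);
    res.resume(); // drain the response so the socket is freed
  }).on('error', (err) => {
    console.log(`ERROR ${link}: ${err.message}`);
  }).end();
}
```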
> a script that iterates over all links in the docs and checks for an HTTP response code of < 400
This is probably not enough, changing headings within a page causes the #hash to change and links won't jump to the correct section anymore. So we will need to parse the retrieved documents as well.
Maybe we can use puppeteer for this.
Strawman with puppeteer for detecting simple wrong hashes (intra-page links only):
```js
'use strict';
const { URL } = require('url');
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const { href, origin, pathname } = new URL('https://nodejs.org/api/all.html');
  await page.goto(href);
  // Find same-page links whose #hash has no matching element in the document.
  const wrongLinks = await page.evaluate((mainOrigin, mainPathname) => {
    return [...document.body.querySelectorAll('a[href]')]
      .filter(link => link.origin === mainOrigin &&
                      link.pathname === mainPathname &&
                      link.hash !== '' &&
                      document.body.querySelector(link.hash) === null)
      .map(link => `${link.innerText} : ${link.href}`)
      .join('\n');
  }, origin, pathname);
  console.log(wrongLinks);
  await browser.close();
})();
```
Currently, it detects these links:
```
cluster.settings : https://nodejs.org/api/all.html#clustersettings
verify.update() : https://nodejs.org/api/all.html#crypto_verifier_update_data_inputencoding
verify.verify() : https://nodejs.org/api/all.html#crypto_verifier_verify_object_signature_signatureformat
verify.update() : https://nodejs.org/api/all.html#crypto_verifier_update_data_inputencoding
verify.verify() : https://nodejs.org/api/all.html#crypto_verifier_verify_object_signature_signatureformat
TCP-based protocol : https://nodejs.org/api/all.html#debugger_tcp_based_protocol
Http2Session and Sockets : https://nodejs.org/api/all.html#http2_http2sesion_and_sockets
ALPN negotiation : https://nodejs.org/api/all.html#alpn-negotiation
ServerRequest : https://nodejs.org/api/all.html#http2_class_server_request
stream.pushStream() : https://nodejs.org/api/all.html#http2_stream-pushstream
readable._destroy : https://nodejs.org/api/all.html#stream_readable_destroy_err_callback
readable._destroy : https://nodejs.org/api/all.html#stream_readable_destroy_err_callback
stream._destroy() : https://nodejs.org/api/all.html#stream_readable_destroy_err_callback
```
I was thinking more of external links when I opened this, but it's good to have checks for those relative links too. For external ones, I think a simple status code check should be enough.
For the internal ones, I could see a check like the one above being part of the CI, but I don't think we can include puppeteer in the repository; it's just too heavy.
Also, there is a tool called html-proofer that can be used for such things. I use it on some of my static pages to check that all the resources (images, stylesheets, etc.) exist, and it also checks for any broken links on your website.
A more meticulous and involved variant for internal link checking (covering both hash-only links and inter-document links inside the doc site). It still uses puppeteer, so it is not suitable for the repo or the CI, but it can occasionally be used locally.
The current run has resulted in https://github.com/nodejs/node/pull/15293 and https://github.com/nodejs/node/issues/15291.
Could we use something Node.js-based like jsdom or cheerio instead of Puppeteer? The latter sounds a lot like overkill to me, while cheerio might even be small enough to be bundled in core.
Or even better, a Markdown-based solution that could possibly be integrated with the doctool.
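For comparison, a rough cheerio-based equivalent of the puppeteer strawman above might look like this (an untested sketch, assuming all.html has been downloaded beforehand):

```js
'use strict';
// Statically parse all.html and report same-page links whose #hash has
// no element with a matching id. Assumes a local copy of the document.
const fs = require('fs');
const cheerio = require('cheerio');

const $ = cheerio.load(fs.readFileSync('all.html', 'utf8'));

// Collect every id defined in the document.
const ids = new Set();
$('[id]').each((i, el) => ids.add($(el).attr('id')));

// Flag hash-only links that point to a missing id.
$('a[href^="#"]').each((i, el) => {
  const hash = $(el).attr('href').slice(1);
  if (hash !== '' && !ids.has(hash))
    console.log(`${$(el).text()} : #${hash}`);
});
```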
@TimothyGu I had someone PR Danger as a CI/CD tool for a markdown-only project of mine to detect broken links - it may be useful to run on docs updates?
http://danger.systems/js/
https://github.com/danger/danger-js
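Not sure it fits this repo's CI, but a Dangerfile along these lines could at least nudge people when docs change (a sketch; the checker script it mentions is hypothetical):

```js
// dangerfile.js — warn on PRs that touch the docs. The link-checker
// script referenced in the message is hypothetical, not an existing tool.
import { danger, warn } from 'danger';

const docsChanged = danger.git.modified_files
  .concat(danger.git.created_files)
  .filter((f) => f.startsWith('doc/'));

if (docsChanged.length > 0) {
  warn(`${docsChanged.length} doc file(s) changed; consider running a ` +
       'link check (e.g. a hypothetical tools/doc/checklinks.js) before landing.');
}
```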
There's been zero activity on this in 11 months. I recommend closing.
I wrote a tool, with an API similar to html-proofer's, based on Node.
I wrote it because html-proofer is too slow for > 1000 pages.
Since my tool uses cheerio, which uses htmlparser2, it's super fast.
FWIW, the internal doc system is checked now (see https://github.com/nodejs/node/pull/21889), so we only need external link validation.
Still no actual activity on this, should we keep it open?
I agree, better to close it then.