Dvc.org: seo: monitor and fix broken links

Created on 26 Oct 2019 · 13 Comments · Source: iterative/dvc.org

Use this script or similar manually to double check which links are broken in the docs: https://github.com/iterative/dvc.org/pull/690/files#diff-a5173e320dcf100fc3ff5b32ba2ea911

The last run (see https://github.com/iterative/dvc.org/pull/690#issuecomment-542015419) reported the following problems:

static/docs/changelog/0.18.md: 'discuss.dvc.org'
static/docs/changelog/0.35.md: 'https://plugins.jetbrains.com/plugin/11368-data-version-control-dvc-support'
static/docs/command-reference/add.md: '/docs/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache'
static/docs/command-reference/checkout.md: '/docs/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache'
static/docs/command-reference/config.md: '/docs/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache'
static/docs/command-reference/config.md: 'https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html'
static/docs/command-reference/destroy.md: '/doc/user-guide/dvc-files-and-directories'
static/docs/command-reference/get-url.md: 'https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html'
static/docs/command-reference/remote/add.md: 'https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html'
static/docs/command-reference/remote/add.md: 'https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html'
static/docs/command-reference/remote/add.md: 'https://minio.io/'
static/docs/command-reference/remote/add.md: 'https://docs.microsoft.com/en-us/azure/storage/common/storage-create-storage-account'
static/docs/command-reference/remote/index.md: 'https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html'
static/docs/command-reference/remote/modify.md: 'https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html'
static/docs/command-reference/remote/modify.md: 'https://minio.io/'
static/docs/command-reference/update.md: 'https://github.com/iterative/example-get-started'
static/docs/get-started/add-files.md: '/docs/user-guide/large-dataset-optimization'
static/docs/get-started/experiments.md: '/docs/user-guide/large-dataset-optimization'
static/docs/get-started/index.md: '/chat'
static/docs/get-started/pipeline.md: '/doc/tutorial'
static/docs/tutorials/deep/define-ml-pipeline.md: 'https://data.dvc.org/tutorial/ver/data.zip'
static/docs/tutorials/deep/preparation.md: 'https://code.dvc.org/tutorial/nlp/code.zip'
static/docs/tutorials/pipelines.md: '/doc/tutorial'
static/docs/tutorials/versioning.md: '/chat'
static/docs/understanding-dvc/collaboration-issues.md: '<https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning'
static/docs/understanding-dvc/related-technologies.md: '/docs/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache'
static/docs/understanding-dvc/resources.md: 'https://www.kaggle.com/rtatman/kerneld4769833fe'
static/docs/use-cases/share-data-and-model-files.md: 'https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html'
static/docs/use-cases/share-data-and-model-files.md: 'https://docs.aws.amazon.com/cli/latest/reference/s3/mb.html'
static/docs/user-guide/contributing-docs.md: 'https://github.com/iterative/dvc.org/tree/master/src/Documentation/sidebar.json'
static/docs/user-guide/contributing-docs.md: 'https://github.com/iterative/dvc.org.git'
static/docs/user-guide/contributing-docs.md: 'https://nodejs.org/'
static/docs/user-guide/contributing-docs.md: 'https://marketplace.visualstudio.com/items?itemName=stkb.rewrap'
static/docs/user-guide/contributing-docs.md: 'https://raw.githubusercontent.com/iterative/dvc.org/master/static/docs/user-guide/contributing-doc.md'
static/docs/user-guide/contributing.md: 'https://github.com/iterative/dvc.git'
static/docs/user-guide/contributing.md: '/chat'
static/docs/user-guide/contributing.md: 'https://docs.aws.amazon.com/en_us/cli/latest/userguide/cli-chap-install.html'
static/docs/user-guide/contributing.md: 'https://cloud.google.com/sdk/docs/quickstarts'
static/docs/user-guide/contributing.md: 'https://github.com/ambv/black'
static/docs/user-guide/dvc-files-and-directories.md: '/docs/user-guide/large-dataset-optimization'
static/docs/user-guide/large-dataset-optimization.md: '/docs/user-guide/update-tracked-files'
static/docs/user-guide/plugins.md: 'https://plugins.jetbrains.com/plugin/11368-dvc-support-poc'
static/docs/user-guide/running-dvc-on-windows.md: '<https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2003/cc778996(v=ws.10'
static/docs/user-guide/update-tracked-files.md: '/docs/user-guide/large-dataset-optimization'

UPDATE: Scroll to https://github.com/iterative/dvc.org/issues/746#issuecomment-548686225 and below for latest pending work here.

hacktoberfest help wanted website

All 13 comments

I would clarify that there are a lot of false positives here. We only need to fix the very few that go through redirects like dvc.org/docs/something -> dvc.org/doc/something

Running script. Only seeing a few issues after turning on --max-redirect=10 and --method=GET. Lots of false positives w/ redirects=0 and method=HEAD.

Thanks @taylorlee1! Yes, I'm not sure that script is completely flawless. Please use your own criteria and fix the broken links you are able to find 🙂 You may open a PR with the fixes and say "Fix #746" in its description. More info in https://dvc.org/doc/user-guide/contributing/docs

@taylorlee1 some false positives on redirects=0 should be fixed. Mostly those which are redirects we keep for backward compatibility (docs -> doc), etc. They are not external links, they automatically transform one docs link into another.
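
As a rough illustration of the docs -> doc cleanup mentioned above, here is a minimal sketch that rewrites internal Markdown link targets so they no longer go through the backward-compatibility redirect. The function name `fix_legacy_prefix` and the regular expression are assumptions made for this example; they are not part of the script from the PR.

```python
import re

# Hypothetical helper, not taken from the dvc.org repo.
# Matches the start of an internal Markdown link target that uses the
# legacy "/docs" prefix, e.g. "](/docs/..." or "](/docs#...".
# External URLs never match because the target must start with "/docs".
_LEGACY_LINK = re.compile(r"\]\(/docs(?=[/#)])")


def fix_legacy_prefix(markdown: str) -> str:
    """Rewrite internal link targets from /docs... to /doc..."""
    return _LEGACY_LINK.sub("](/doc", markdown)


if __name__ == "__main__":
    sample = "See [link types](/docs/user-guide/large-dataset-optimization)."
    print(fix_legacy_prefix(sample))
```

The lookahead keeps paths that merely begin with the same letters (for example a hypothetical `/docsify`) untouched.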

static/docs/install/linux.md: 'https://docs.conda.io/en/latest/miniconda.htm'
static/docs/install/macos.md: 'https://docs.conda.io/en/latest/miniconda.htm'
static/docs/install/windows.md: 'https://docs.conda.io/en/latest/miniconda.htm'

All three errors can be fixed by using .html instead of .htm suffix.

I fixed the easy redirects (docs -> doc), but there are a few I am not sure about:

`KO 200 https://towardsdatascience.com/why-git-and-git-lfs-is-not-enough-to-solve-the-machine-learning-reproducibility-crisis-f733b49e96e8 https://towardsdatascience.com/why-git-and-git-lfs-is-not-enough-to-solve-the-machine-learning-reproducibility-crisis-f733b49e96e8?gi=48427509caa5
--> ['./static/docs/understanding-dvc/resources.md']

KO 200 https://dvc.org/chat https://discordapp.com/invite/dvwXA2N
--> ['./static/docs/get-started/index.md', './static/docs/user-guide/contributing/core.md', './static/docs/tutorials/versioning.md']

KO 200 https://towardsdatascience.com/how-to-use-data-version-control-dvc-in-a-machine-learning-project-a78245c0185 https://towardsdatascience.com/how-to-use-data-version-control-dvc-in-a-machine-learning-project-a78245c0185?gi=bcd31aa1f168
--> ['./static/docs/tutorials/community.md']

KO 200 https://www.python.org/dev/peps/pep-0008/? https://www.python.org/dev/peps/pep-0008/
--> ['./static/docs/user-guide/contributing/core.md']

KO 200 https://help.github.com/en/articles/resolving-a-merge-conflict-using-the-command-line https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/resolving-a-merge-conflict-using-the-command-line
--> ['./static/docs/tutorials/deep/reproducibility.md']

KO 200 https://git-scm.com https://git-scm.com/
--> ['./static/docs/tutorials/deep/preparation.md', './static/docs/tutorials/pipelines.md', './static/docs/tutorials/versioning.md']

KO 200 https://stackoverflow.com/a/42120328/761963 https://stackoverflow.com/questions/25925752/uninstall-packages-in-mac-os-x/42120328#42120328
--> ['./static/docs/install/macos.md']

ERROR 403 https://blogs.windows.com/windowsdeveloper/2016/03/30/run-bash-on-ubuntu-on-windows/ https://blogs.windows.com/windowsdeveloper/2016/03/30/run-bash-on-ubuntu-on-windows/
--> ['./static/docs/install/windows.md', './static/docs/user-guide/running-dvc-on-windows.md']

KO 200 https://nodejs.org/ https://nodejs.org/en/
--> ['./static/docs/user-guide/contributing/docs.md', './README.md']

KO 200 https://help.github.com/en/articles/fork-a-repo https://help.github.com/en/github/getting-started-with-github/fork-a-repo
--> ['./static/docs/user-guide/contributing/docs.md']

KO 200 https://github.com/iterative/dvc.org.git https://github.com/iterative/dvc.org
--> ['./static/docs/user-guide/contributing/docs.md']

KO 200 https://github.com/ambv/black https://github.com/psf/black
--> ['./static/docs/user-guide/contributing/core.md']

KO 200 https://codeclimate.com/github/iterative/dvc.org/maintainability https://codeclimate.com/github/iterative/dvc.org
--> ['./README.md']

ERROR 403 http://studio.ml/ http://www.studio.ml/
--> ['./static/docs/understanding-dvc/related-technologies.md']

KO 200 https://towardsdatascience.com/the-data-science-workflow-43859db0415 https://towardsdatascience.com/the-data-science-workflow-43859db0415?gi=1bfdb7f61eb7
--> ['./static/docs/understanding-dvc/resources.md']

KO 200 https://plugins.jetbrains.com/plugin/11368-dvc-support-poc https://plugins.jetbrains.com/plugin/11368-data-version-control-dvc-support
--> ['./static/docs/install/plugins.md']

KO 200 https://towardsdatascience.com/data-version-control-with-dvc-what-do-the-authors-have-to-say-3c3b10f27ee https://towardsdatascience.com/data-version-control-with-dvc-what-do-the-authors-have-to-say-3c3b10f27ee?gi=508f8e55e673
--> ['./static/docs/understanding-dvc/resources.md']

KO 200 https://github.com/iterative/dvc.git https://github.com/iterative/dvc
--> ['./static/docs/user-guide/contributing/core.md']

KO 200 https://stackoverflow.com https://stackoverflow.com/
--> ['./static/docs/tutorials/deep/preparation.md']

KO 200 https://github.com/iterative/dvc.org/tree/master/src/Documentation/sidebar.json https://github.com/iterative/dvc.org/blob/master/src/Documentation/sidebar.json
--> ['./static/docs/user-guide/contributing/docs.md']

KO 200 https://dvc.org/chat https://discordapp.com/invite/dvwXA2N
--> ['./README.md']
`

My 2c on this:

Keep redirects that just add or remove a trailing slash (/), or that append a ? query string after the redirect. Fixing those would be a never-ending game.

Definitely fix redirects like https://plugins.jetbrains.com/plugin/11368-dvc-support-poc, where it looks like the site owners moved the page (it should probably be returning 301?).

https://dvc.org/chat https://discordapp.com/invite/dvwXA2N - this redirect exists on purpose, so that we can change the invite when needed.

  • We should def. keep https://dvc.org/chat (all occurrences) (actually just /chat) as mentioned by Ivan.
  • The ones that just add a query string e.g. ?gi=48427509caa5 (all the towardsdatascience.com ones) we can also ignore, also as mentioned by Ivan.
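
The "ignore" rules above could be sketched as a small filter over the checker's (source, final URL) pairs. This is only an illustration under assumptions: the function name `is_trivial_redirect` is made up here, and it treats any redirect that keeps the same scheme, host, and path (ignoring a trailing slash, the query string, and the fragment) as safe to skip. Intentional redirects such as /chat would still need their own allowlist.

```python
from urllib.parse import urlsplit


def is_trivial_redirect(source: str, target: str) -> bool:
    """Return True when target differs from source only by a trailing
    slash and/or an added query string (hypothetical helper)."""
    src, dst = urlsplit(source), urlsplit(target)
    # A different scheme or host means the content moved somewhere else.
    if (src.scheme, src.netloc.lower()) != (dst.scheme, dst.netloc.lower()):
        return False
    # Compare paths with trailing slashes stripped; query and fragment
    # are deliberately not compared.
    return src.path.rstrip("/") == dst.path.rstrip("/")


if __name__ == "__main__":
    pairs = [
        ("https://stackoverflow.com", "https://stackoverflow.com/"),
        ("https://github.com/ambv/black", "https://github.com/psf/black"),
    ]
    for source, target in pairs:
        verdict = "ignore" if is_trivial_redirect(source, target) else "review"
        print(verdict, source, "->", target)
```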

KO 200 https://www.python.org/dev/peps/pep-0008/? https://www.python.org/dev/peps/pep-0008/ ,
KO 200 https://stackoverflow.com https://stackoverflow.com/

  • No difference here. Leave them please.

KO 200 https://help.github.com/en/articles/resolving-a-merge-conflict-using-the-command-line https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/resolving-a-merge-conflict-using-the-command-line ,
KO 200 https://help.github.com/en/articles/fork-a-repo https://help.github.com/en/github/getting-started-with-github/fork-a-repo ,
KO 200 https://github.com/ambv/black https://github.com/psf/black ,
KO 200 https://codeclimate.com/github/iterative/dvc.org/maintainability https://codeclimate.com/github/iterative/dvc.org ,
KO 200 https://github.com/iterative/dvc.org/tree/master/src/Documentation/sidebar.json https://github.com/iterative/dvc.org/blob/master/src/Documentation/sidebar.json

Yes, please update since we know the content has moved.

KO 200 https://git-scm.com https://git-scm.com/

Yes, please add / to any base URL that is missing it. @shcheklein I don't think any redirect removes slashes from base URLs (with no path) like this one.

KO 200 https://stackoverflow.com/a/42120328/761963 ,
KO 200 https://nodejs.org/ https://nodejs.org/en/

Keep the short versions. Easier to read in docs. Plus the extra paths may change later but not the parts we have.

ERROR 403 https://blogs.windows.com/windowsdeveloper/2016/03/30/run-bash-on-ubuntu-on-windows/

I get a 200 OK. Leave it.

KO 200 https://github.com/iterative/dvc.org.git https://github.com/iterative/dvc.org ,
KO 200 https://github.com/iterative/dvc.org/tree/master/src/Documentation/sidebar.json https://github.com/iterative/dvc.org/blob/master/src/Documentation/sidebar.json
KO 200 https://github.com/iterative/dvc.git https://github.com/iterative/dvc

Please update.

ERROR 403 http://studio.ml/ http://www.studio.ml/

I get 301. But yes, please update.

KO 200 https://plugins.jetbrains.com/plugin/11368-dvc-support-poc https://plugins.jetbrains.com/plugin/11368-data-version-control-dvc-support

I get 302. Anyway, please just use https://plugins.jetbrains.com/plugin/11368 and let them redirect accordingly.

Thanks again @taylorlee1!

Another issue I just detected is that we can't really check all the anchors of links for example [How to report a problem](/doc/user-guide/contributing/core#how-to-report-a-problem) (in https://dvc.org/doc/user-guide/contributing, related to #727 BTW) or even just (#anchor) links, not to mention external anchored ones like [Link](https://web.site/page#anchor)... Would this be a nightmare to test with a script? As in... Should we just avoid #anchored links in general (and remove them all)?

Don't think we need to remove them. It's more or less fine to have some of them broken and update them from time to time. We can make a script that analyzes the content of the page to see if there is anchor there. SSR would be helpful in this case, but can be done w/o that as well.
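
A minimal sketch of what such a content check might look like, assuming the page HTML has already been fetched (or rendered) and that anchors are exposed as `id` attributes or legacy `<a name="...">` tags. The names `has_anchor` and `_AnchorCollector` are hypothetical; only the standard-library `html.parser` module is used.

```python
from html.parser import HTMLParser


class _AnchorCollector(HTMLParser):
    """Collects every value that a #fragment could point to."""

    def __init__(self):
        super().__init__()
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # id="..." on any element, or the older <a name="..."> form.
            if value and (name == "id" or (tag == "a" and name == "name")):
                self.anchors.add(value)


def has_anchor(html: str, fragment: str) -> bool:
    """Return True if the page defines the given anchor (hypothetical helper)."""
    collector = _AnchorCollector()
    collector.feed(html)
    return fragment.lstrip("#") in collector.anchors


if __name__ == "__main__":
    page = '<h2 id="how-to-report-a-problem">How to report a problem</h2>'
    print(has_anchor(page, "#how-to-report-a-problem"))
```

This only sees anchors present in the markup it is given, so pages that build their headings client-side would need to be rendered first.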

> It's more or less fine to have some of them broken

Agree, as long as the page before the anchor exists. But finding broken ones may indicate that the original content has changed, so the link may no longer be relevant. (Unlikely)

> SSR would be helpful in this case...

Only for internal links. Any external links to dynamic sites will also be hard to detect broken anchors for. (A crawler would throw false positives, which we could simply review manually when reported.)

Anyway, yeah not a huge deal, just something I realized and wanted to note here.

Should be solved by @casperdcl's fix and a few commits that removed/fixed broken links.

My implementation (#958) was quick and dirty but probably does the job. It didn't actually fix the current broken links, but it will find any future ones and keep warning about the current ones.