Gitea: Make the default /robots.txt reject all crawlers

Created on 20 Jan 2017 · 14 comments · Source: go-gitea/gitea

I am wondering whether this is going too far or not. In my mind, privately set-up Gitea instances should be private by default, and that entails rejecting crawlers too, as a way to reduce surprise for the user.

kind/proposal

Most helpful comment

I agree with some comments I've read: Gitea should come with some sensible default robots.txt for public sites, not as a sample but installed as default. The users will of course be able to replace it as they see fit.

BTW: what are robots.txt used for in private sites?

EDIT: I thought it meant intranet sites, sorry!

All 14 comments

Not to mention being as secure as possible by default while still staying easy to use. That would entail hiding version numbers, disabling Gravatar and other information-leaking features, making repositories private by default, etc., but that is a separate discussion.

Maybe not make it the default, but if REQUIRE_SIGNIN_VIEW is set to true and /robots.txt isn't found, Gitea could provide a default "block all" robots.txt 🙂

(Since if REQUIRE_SIGNIN_VIEW is set, it seems moot for a crawler to crawl it 😛)
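
As a rough sketch of that idea (not Gitea's actual routing code; the custom path and flag name here are illustrative assumptions), the handler would fall back to a block-all response only when sign-in is required and no custom file exists:

package main

import (
	"net/http"
	"os"
)

// blockAll is the "reject every crawler" policy discussed in this thread.
const blockAll = "User-agent: *\nDisallow: /\n"

// robotsHandler serves a custom robots.txt if the admin provided one, and
// otherwise answers with the block-all default when sign-in is required to
// view the site. Path and flag are illustrative, not Gitea's real names.
func robotsHandler(customPath string, requireSignInView bool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if _, err := os.Stat(customPath); err == nil {
			http.ServeFile(w, r, customPath)
			return
		}
		if requireSignInView {
			w.Header().Set("Content-Type", "text/plain; charset=utf-8")
			w.Write([]byte(blockAll))
			return
		}
		http.NotFound(w, r)
	}
}

func main() {
	http.HandleFunc("/robots.txt", robotsHandler("custom/robots.txt", true))
	http.ListenAndServe(":3000", nil)
}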

Well, yes, but at that point it doesn't really matter, so I think that doesn't go far enough.

IMHO we should not block everything by default. There are surely enough instances that don't want to block everything. Private repositories are already excluded anyway, because they aren't accessible. If somebody really wants to enforce that, they can add a robots.txt to the custom folder.

I've just had a problem with robots, but in my case the service is running from a suburl so serving a robots.txt from Gitea would not have helped. Unless I'm missing a specification allowing for that. What I've been reading (not much) came from http://www.robotstxt.org/robotstxt.html

For top-level installs, generating a robots.txt would indeed be good, as it would, for example, prevent bots from downloading archives for every committish, which in turn fills up disk space (see #769). According to the reference above (robotstxt.org), you cannot use globs in a robots.txt file, so having it generated automatically helps on instances where everyone can create new repos...
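
As a sketch of what such generation could look like (a hypothetical helper, not Gitea code; in a real instance the repository list would come from the database), emitting one Disallow line per repository's archive path:

package main

import (
	"fmt"
	"strings"
)

// buildRobotsTxt emits one Disallow line per repository archive path, since
// plain robots.txt rules have no wildcard that would cover every repo.
func buildRobotsTxt(repos []string) string {
	var b strings.Builder
	b.WriteString("User-agent: *\n")
	for _, full := range repos { // full is "owner/repo"
		fmt.Fprintf(&b, "Disallow: /%s/archive/\n", full)
	}
	return b.String()
}

func main() {
	// Hypothetical repository names; a real instance would query its database.
	fmt.Print(buildRobotsTxt([]string{"alice/project", "bob/tools"}))
}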

We could have two examples: one for private sites and another for public sites.

I agree with some comments I've read: Gitea should come with some sensible default robots.txt for public sites, not as a sample but installed as default. The users will of course be able to replace it as they see fit.

BTW: what are robots.txt used for in private sites?

EDIT: I thought it meant intranet sites, sorry!

I agree it is probably reasonable to provide a sensible example robots.txt for a basic public site; that's specific knowledge that's appropriate for Gitea. For private sites, we could put something in the website documentation, but it's basically:

User-agent: * 
Disallow: /

I guess we have to decide what level of basic support we think we should give - but our documentation is supposed to cover specific Gitea information. This would probably class as basic hardening and therefore just about appropriate.

I just had the unpleasant surprise of seeing my private Gitea repository indexed. I naively thought search engines would not find the subfolder on my website, but they did.
Based on my experience, I would suggest:

  1. Create a default robots.txt that rejects all crawlers.
  2. Make it clear in the installation documentation that, by default, the Gitea installation will be indexed by search engines.

Private repositories won't get indexed. You simply have public repos, which will quite obviously be indexed if they are found by Google or other search engines.

@tboerger What I mean is that I have a repo I need to share with fellow team members but don't want indexed by search engines. For ease of use, I also opted to make the URL publicly accessible provided you know the address.

Then you should add a custom robots.txt. Not everybody wants to hide all their repos. If something is private, make it private. Everything else is generally fine to index.

For the few exceptions that want to avoid it: add a robots.txt to the customization folder.

A config option should be added to the installation page to let users choose whether to allow crawlers.
