I am wondering whether this is going too far or not. In my mind, privately set-up Gitea instances should be private by default, and that entails rejecting crawlers too, as a way to reduce surprises for the user.
Not to mention being as secure as possible by default while still remaining easy to use, which would entail hiding version numbers, disabling Gravatar and other information-leaking features, making repositories private by default, etc. But that is a separate discussion.
Maybe not make it the default, but if REQUIRE_SIGNIN_VIEW is set to true and /robots.txt isn't found, Gitea could provide a default "block all" robots.txt 🙂
(Since if REQUIRE_SIGNIN_VIEW is set, it seems moot for a crawler to crawl it anyway 😛)
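For illustration, here is a rough sketch in Go of how that fallback could behave. This is not Gitea's actual routing code; the custom/robots.txt location and the way the setting is read are assumptions made for the example.

// Sketch only: serve an admin-provided robots.txt if one exists,
// otherwise fall back to "block all" when sign-in is required.
package main

import (
    "net/http"
    "os"
)

const defaultRobots = "User-agent: *\nDisallow: /\n"

// requireSigninView mirrors the [service] REQUIRE_SIGNIN_VIEW setting.
var requireSigninView = true

// customRobotsPath is an assumed location for an admin-supplied file.
const customRobotsPath = "custom/robots.txt"

func robotsHandler(w http.ResponseWriter, r *http.Request) {
    // Prefer the admin-provided file if it exists.
    if _, err := os.Stat(customRobotsPath); err == nil {
        http.ServeFile(w, r, customRobotsPath)
        return
    }
    // Otherwise block all crawlers when the instance requires sign-in.
    if requireSigninView {
        w.Header().Set("Content-Type", "text/plain; charset=utf-8")
        w.Write([]byte(defaultRobots))
        return
    }
    http.NotFound(w, r)
}

func main() {
    http.HandleFunc("/robots.txt", robotsHandler)
    http.ListenAndServe(":3000", nil)
}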
Well yes, but at that point it doesn't really matter, so I think that doesn't go far enough.
IMHO we should not block everything by default. There are certainly enough instances that don't want to block everything. Private repositories are blocked anyway because they aren't accessible without signing in. If somebody really wants to enforce that, they can add a robots.txt to the custom folder.
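For reference, that override is just a plain text file dropped into the custom folder; the exact path below is an assumption and depends on how the instance was installed:

custom/robots.txt:
User-agent: *
Disallow: /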
I've just had a problem with robots, but in my case the service is running from a sub-URL, so serving a robots.txt from Gitea would not have helped. Unless I'm missing a specification allowing for that. What I've been reading (not much) came from http://www.robotstxt.org/robotstxt.html
For top-level installs, generating a robots.txt would indeed be good, as it would for example allow preventing bots from downloading archives for every committish, which in turn fills up disk space (see #769). According to the reading above (robotstxt.org), you cannot use globs in a robots.txt file, so having it automatically generated helps with instances where everyone can create new repos...
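To make that concrete, a generated robots.txt for a top-level install might look roughly like this, with one Disallow per repository since the original spec has no wildcard support. The repository names are invented and the /owner/repo/archive/ layout is assumed for illustration:

User-agent: *
Disallow: /alice/website/archive/
Disallow: /alice/dotfiles/archive/
Disallow: /bob/tools/archive/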
We could have two examples, one for private sites another for public sites.
I agree with some comments I've read: Gitea should come with some sensible default robots.txt for public sites, not as a sample but installed as default. The users will of course be able to replace it as they see fit.
BTW: what are robots.txt used for in private sites?
EDIT: I thought it meant intranet sites, sorry!
I agree it is probably reasonable to provide a sensible example robots.txt for a basic public site - that's specific knowledge appropriate for Gitea. For private sites, we could put something in the website documentation, but it's basically:
User-agent: *
Disallow: /
I guess we have to decide what level of basic support we think we should give - but our documentation is supposed to cover Gitea-specific information. This would probably class as basic hardening and is therefore just about appropriate.
I just had the unpleasant surprise of seeing my private Gitea repository indexed. I naively thought search engines would not find the subfolder on my website, but they did.
Based on my experience, I would suggest:
Private repositories won't get indexed. You simply have public repos, which will obviously get indexed if they are found by Google or other search engines.
@tboerger what I mean is that I have a repo I need to share with fellow team members but don't want indexed by search engines. For ease of use, I also opted to keep the URL publicly accessible to anyone who knows the address.
Then you should add a custom robots.txt. Not everybody wants to hide all their repos. If something is private, make it private. Everything else is generally fine to index.
The few exceptions that want to avoid it can add a robots.txt to the customization.
A config option should be available on the installation page to let users choose whether to allow crawlers.
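Something like the following app.ini entry could express that choice; the section and key names here are purely hypothetical and do not exist in Gitea today:

[server]
; Hypothetical setting: when false, serve a default "block all" robots.txt
; unless the admin provides a custom one.
ALLOW_CRAWLERS = true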