Core: Prevent crawling of nojs pages

Created on 4 Apr 2016  ·  24 Comments  ·  Source: flarum/core

Reported by @dav-is via Gitter chat:

It looks like Google crawled the community site while it was experiencing JS issues and ended up recording some nojs URLs.

Some sort of countermeasures are needed to keep search engines away from these URLs.

It might also be a good idea to use Google web tools to get the site recrawled, so people won't follow the links to these URLs.

type/bug

All 24 comments

Seems more like an issue with the server configuration (and possibly the robots.txt file) than an issue with Flarum itself.
In any case, this post looks helpful

robots.txt

Disallow: /*?nojs=1
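
To see which URLs a rule like this would cover, here is a minimal Python sketch of the wildcard matching that major crawlers document for robots.txt (`*` matches any run of characters, a trailing `$` anchors the end). The function names are hypothetical and this is not any crawler's actual implementation. In a real robots.txt the Disallow line also needs to sit under a `User-agent:` line.

```python
import re


def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    Everything else is matched literally, starting at the beginning of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_disallowed(path_and_query: str, pattern: str = "/*?nojs=1") -> bool:
    """Return True if the path (including its query string) matches the pattern."""
    return robots_pattern_to_regex(pattern).match(path_and_query) is not None
```

Note that under these rules the pattern only catches URLs where `nojs=1` is the first query parameter. Something like `/d/5?page=2&nojs=1` would not match and would need a second rule such as `Disallow: /*&nojs=1`.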

I agree it's not a core issue. But it nicely shows the kind of problems that you get when doing single-page applications... _sigh_

We can leave this one open as a reminder to fix it, though.

It's the sort of thing that admins won't know to fix unless we tell them.

If the robots.txt syntax suggested by @kulga will do the trick (I hadn't realized robots.txt could be used that way) then this could be handled by a super-slick installer. It could create a robots.txt file containing this line ... problem solved, yes?

If that sounds good, I'll add a note to the installer discussion issue pointing at this one.
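
A rough sketch of what such an installer step could do, in Python for illustration only (Flarum itself is PHP, and `with_nojs_rule` is a made-up name). It assumes the rule is appended as its own `User-agent: *` group. A real installer might prefer to merge it into an existing group, and would still have to write the result to the web root.

```python
NOJS_RULE = "Disallow: /*?nojs=1"


def with_nojs_rule(existing: str) -> str:
    """Return robots.txt text that contains the nojs rule.

    Existing content is left untouched; if the rule is already present,
    the text is returned unchanged, so the step is safe to run twice.
    """
    lines = existing.splitlines()
    if any(line.strip() == NOJS_RULE for line in lines):
        return existing
    if lines:
        # A blank line separates the new group from whatever was there before.
        lines.append("")
    return "\n".join(lines + ["User-agent: *", NOJS_RULE]) + "\n"
```

The installer would read the current robots.txt (or start from an empty string if there is none), pass it through this function, and write the result back.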

Couldn't the robots.txt just be placed in Flarum's root directory?

Would that work even if Flarum is installed in a subdirectory? ... I've been under the impression that robots.txt has to be in the web root, or it may be ignored.

Maybe we should just add this to the installation docs, instead? It's a simple enough procedure.

What about using canonical URL meta tags?

@dcsjapan Yeah. Just looked it up. It has to be at the root.

Another option: Google has crawl settings in its webmaster tools.

@dav-is But Google probably isn't the only search engine we need to fix this for :)

As discussed in flarum/flarum#36, we will build a middleware that sends a special header on these pages.

P.S.: @sijad wants to do this.
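
For illustration, the middleware idea could look roughly like this. It is a Python WSGI sketch standing in for the PHP middleware, not the actual Flarum code. The thread only says "a special header"; `X-Robots-Tag: noindex` is an assumption here, as is the class name.

```python
from urllib.parse import parse_qs


class NoJsNoIndexMiddleware:
    """WSGI middleware that marks nojs responses as not indexable."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        # The page counts as a nojs page when the query string carries
        # a non-empty nojs parameter.
        query = parse_qs(environ.get("QUERY_STRING", ""))
        is_nojs = "nojs" in query

        def patched_start_response(status, headers, exc_info=None):
            if is_nojs:
                # Assumed header; tells crawlers not to index this response.
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return self.app(environ, patched_start_response)
```

Unlike robots.txt, a header like this works wherever the forum is installed, since it travels with each response.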

@franzliedke Is it a good idea to add those tags in the app view header?

Would it be better to be using a canonical URL meta tag rather than a noindex directive?

@tobscure Yes, that seems like a better solution. Should I open a new PR?

Or maybe in addition to noindex. Because canonical URL should probably be constructed and set by individual controllers (eg, DiscussionController) rather than just by stripping ?nojs=1 off the end of the current URL...

Yes, we should allow controllers to generate a canonical URL, and if we have one, the view will print the meta tag.
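
A small Python sketch of the two approaches discussed here, purely illustrative (both function names are hypothetical): `strip_nojs` is the naive "strip the parameter off the current URL" version, and `canonical_tag` is the view-side part that prints a tag only when a controller supplied a canonical URL.

```python
from html import escape
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_nojs(url: str) -> str:
    """Naive canonical URL: the current URL with the nojs parameter removed."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "nojs"
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


def canonical_tag(canonical_url: Optional[str]) -> str:
    """Render the canonical tag, or nothing if the controller set no URL."""
    if canonical_url is None:
        return ""
    return f'<link rel="canonical" href="{escape(canonical_url, quote=True)}">'
```

A controller such as DiscussionController would build its own canonical URL and hand it to the view; pages without one simply get no tag.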

Speaking of canonical urls, should the discussion list have a canonical url that doesn't have the post number in it? For example: forum.example/d/5/77 => forum.example/d/5

Not sure I understand, @dav-is?

Well, the page forum.example/d/5 would have the same content as forum.example/d/5/8 and the same content as forum.example/d/5/9. There's overlap in content, so should we just remove the post number from the URL using the canonical tag? When google crawls, doesn't it scroll down and load all the posts?

When google crawls, doesn't it scroll down and load all the posts?

I don't think so? We do know that Google will wait for AJAX content to be loaded... but I can't imagine it would trigger a scroll event. So I think it will only see whatever content is preloaded.

You're right that certain discussion URLs will have similar content though. Maybe we should make the canonical URL use a number rounded down to the nearest multiple of 20 (or however many posts per page there are)?
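
As a worked example of the rounding idea, assuming 20 posts per page and the forum.example/d/{id}/{number} URL shape used above (the function name is made up for this sketch):

```python
def canonical_discussion_url(
    base: str, discussion_id: int, post_number: int, per_page: int = 20
) -> str:
    """Round the post number down to a multiple of per_page for the canonical URL."""
    page_start = (post_number // per_page) * per_page
    url = f"{base}/d/{discussion_id}"
    # Anything below the first multiple collapses to the bare discussion URL.
    return url if page_start == 0 else f"{url}/{page_start}"
```

With this, forum.example/d/5/8 and forum.example/d/5/9 both map to forum.example/d/5, while forum.example/d/5/77 maps to forum.example/d/5/60.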

I was under the impression that indexer bots can understand pagination and follow next, prev, and other links.

http://stackoverflow.com/questions/35952151/javascript-pagination-and-seo

Since this topic came up quite often... here's what I understand:

  • discussion pages should have rel=canonical meta tags - we have an issue for this, and I've bumped it so that it gets done soon
  • this canonical tag should link to the first page
  • we should have additional pagination links (for search engines) that lead to search-engine-friendly pages, that - important! - can be used to access exactly that range with JS in a browser, too
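
The pagination links in the last point could be generated along these lines. This is only a sketch under assumptions: 20 posts per page, contiguous post numbers, pages addressed by their starting post number, and `rel="prev"` / `rel="next"` link tags as the mechanism. None of this is confirmed as the eventual Flarum design, and the function name is hypothetical.

```python
from html import escape
from typing import List


def pagination_link_tags(
    discussion_url: str, page_start: int, total_posts: int, per_page: int = 20
) -> List[str]:
    """Build prev/next link tags for the page that starts at page_start."""

    def page_url(start: int) -> str:
        # The first page is the bare discussion URL.
        return discussion_url if start <= 0 else f"{discussion_url}/{start}"

    tags = []
    if page_start > 0:
        tags.append(f'<link rel="prev" href="{escape(page_url(page_start - per_page))}">')
    if page_start + per_page <= total_posts:
        tags.append(f'<link rel="next" href="{escape(page_url(page_start + per_page))}">')
    return tags
```

Because each link is an ordinary discussion URL with a post number, the same address also works for a JS-enabled browser, which is the "important" part of that point.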