Brought to you by jekyll/jekyll-archives#24
When you have tags with different cases in different posts, the archives plugin loses some entries on archive pages. There is some feeling that this issue needs to be resolved in Jekyll itself.
This will be a 3.0 change.
Why do you have tags with different cases? They should be unique based on lowercase character content.
Why do you have tags with different cases?
Because I don't remember in which case I've written some tags.
They should be unique based on lowercase character content.
Yes, but they are not.
They should be unique based on lowercase character content.
Yes, but they are not.
Perhaps I'm just plain wrong about that. For example, apple and Apple differ only by one character, but it carries a lot of semantic meaning: :apple: is _very_ different from the Apple logo.
Why do you have tags with different cases?
Because I don't remember in which case I've written some tags.
Maybe this is a problem inherent to tags, not to our system. Maybe we abolish tags altogether. Maybe we make it easier to make tags more transparent to the writer, like jekyll compose --list-tags or something.
Yes, but there is a problem with plugins like "archive", which use tags incorrectly(?), so "apple" and "Apple" get the same URL slug and the content of one of the tags is missing from the archive page. That is the actual problem.
Problems with jekyll-archive should be filed as issues on the jekyll/jekyll-archive repository. But yes, there are semantic issues where uniquifying the tags based on their downcased variants would be the wrong thing to do.
The problem here is that with two tags, let's say "Apple" and "apple", Utils.slugify produces the same slug...
Should we be changing Utils.slugify, or make our own slug function to deal with tag letter casing?
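The collision described above can be sketched as follows. `simple_slugify` is a hypothetical stand-in that only approximates the default behavior of Utils.slugify (hyphenate non-alphanumeric runs, trim, downcase); it is an assumption for illustration, not Jekyll's implementation.

```ruby
# Hypothetical stand-in for Jekyll's Utils.slugify (an approximation, not the real code).
def simple_slugify(string)
  string.to_s
        .gsub(/[^[:alnum:]]+/, "-") # replace runs of non-alphanumerics with a hyphen
        .gsub(/\A-+|-+\z/, "")      # trim leading and trailing hyphens
        .downcase                   # the step that makes "Apple" and "apple" collide
end

tags = ["Apple", "apple", "trackpad"]
slugs = tags.map { |tag| simple_slugify(tag) }

# Two distinct tags end up with one slug, so only one archive page can be written for both.
puts slugs.inspect
```

Because the slug is what names the archive page, the second tag written under that slug replaces the first, which would explain the missing entries.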
As the description says, the issue originated in archives.
The problem here is that with two tags, let's say "Apple" and "apple", Utils.slugify produces the same slug...
@alfredxing Perhaps this is an archives issue? Or we make slugify case sensitive.
Archives isn't doing anything with tags whatsoever, other than iterating them and calling slugify... I guess it depends on whether you want tags to be case sensitive?
Tags should be case sensitive, yes.
Define "case sensitive". They should certainly be normalized by Jekyll as they go in (downcased, spaces converted to "-", and most special characters stripped), which is the standard behavior of most if not all normal tagging systems that care about multi-tenant situations.
What about windows (and others with case insensitive file system) users?
The relevance of that should be nil, since we make the tag pages, and how your operating system handles case is not relevant to the internet and how its de facto standards work.
The point is that if you give something "Hello" as a tag, it needs to output "hello" as the tag and "hello.html" as the tag page, like most systems with tagging behave (except that here it gets written to HTML, not stored in a transaction).
I don't think we should downcase the tag, but :+1: on converting spaces and such.
I don't think tags are being normalized right now.
After thinking about it some more, I came up with an example where case sensitivity would be nice:
title: Testing out Apple's new Force Touch trackpad
category: technology
tags:
- Apple
- trackpad
title: My super yummy apple pie recipe
category: food
tags:
- apple
- pie
- recipe
Given @parkr's suggestion above, I think it would be a good idea to change Utils#slugify, or make another version of it.
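A minimal sketch of what "another version" could look like for the two posts above: the same slug steps but without downcasing. `cased_slug` is a hypothetical name and its behavior is an assumption, not an existing Jekyll API.

```ruby
# Hypothetical case-preserving slug helper; name and behavior are assumptions.
def cased_slug(tag)
  tag.to_s
     .gsub(/[^[:alnum:]]+/, "-") # replace runs of non-alphanumerics with a hyphen
     .gsub(/\A-+|-+\z/, "")      # trim leading and trailing hyphens
end

puts cased_slug("Apple") # the company tag keeps its capital letter
puts cased_slug("apple") # the fruit tag stays lowercase
```

The two tags now get distinct slugs, although on a case-insensitive file system the resulting page files would still map to the same path.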
Why go against the grain?
And btw, Apple's name is "Apple Inc."
@parkr I agree that we shouldn't downcase tags but we should make them case-insensitive (case-preserving as well). Ideally they would be case-sensitive but in a world where the default file system on most dev machines is case-insensitive I think we should stick to that. If we wanted to get fancy we could have a semantic title and a display title for each tag but I don't think we need that.
EDIT: Wow I meant we should make them case-insensitive. I think autocorrect bit me.
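As a rough illustration of "case-insensitive, case-preserving", the sketch below compares tags by their downcased form but keeps the first spelling seen as the display name. The post hashes and the `group_by_tag` helper are hypothetical and do not reflect Jekyll's internal data structures.

```ruby
# Group post titles by tag, comparing tags case-insensitively while keeping
# the first spelling seen as the display name. All names here are hypothetical.
def group_by_tag(posts)
  groups = {}
  posts.each do |post|
    post[:tags].each do |tag|
      key = tag.downcase
      groups[key] ||= { display: tag, titles: [] }
      groups[key][:titles] << post[:title]
    end
  end
  groups
end

posts = [
  { title: "First post",  tags: ["MySQL"] },
  { title: "Second post", tags: ["mysql", "ruby"] }
]

group_by_tag(posts).each do |key, group|
  puts "#{key} (shown as #{group[:display]}): #{group[:titles].join(', ')}"
end
```

The cost of this design is the one raised earlier in the thread: "Apple" and "apple" would be merged into a single tag as well.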
How does WordPress do this, again? I don't like the idea of transforming content you don't ask us to transform... The URLs should be case-sensitive, I think, and match the case of the inputted tag. Maybe that could change.
/cc @spraints for whether Pages is case-sensitive.
Wordpress allows you to do whatever you want with hooks that allow you to enforce what you want. At least if: https://wordpress.org/support/topic/force-tags-to-lowercase is any indication. We could do the same, we have a hook system we could create a hook for that point that allows people to transform and normalize however they want, which gives the opportunity for tons of plugins such as ones that do spell-checking, normalizing and all sorts of junk we probably don't want in core.
It looks like I misread that; that's for a plugin. I'll dig through their source and see tomorrow.
Is there any good reason the rules for tags should be different from those for post titles (case-insensitive, case-preserving IIRC)? Since, as @envygeeks mentioned, tags end up as their own pages, it would cause havoc on case-insensitive dev filesystems.
is there a way I can downcase all of the tags in my Jekyll repo?
whether Pages is case-sensitive.
Pages runs on linux machines, so anything file-related is case-sensitive.
is there a way I can downcase all of the tags in my Jekyll repo?
@d3netxer Yes! But it requires going through all your posts and manually modifying tags. You could easily write a program to do this.
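Such a program might look like the sketch below. It assumes front matter delimited by `---` lines and tags written either inline after `tags:` or as a `- item` list; `downcase_tags` is a hypothetical helper, and other front matter layouts would need more care.

```ruby
# Downcase the tags in a post's YAML front matter, leaving everything else
# untouched. A sketch: assumes "---" delimiters and a top-level "tags:" key.
def downcase_tags(text)
  lines = text.lines
  return text unless lines.first && lines.first.strip == "---"

  # Find the closing front matter delimiter.
  closing = lines[1..-1].index { |l| l.strip == "---" }
  return text if closing.nil?
  closing += 1 # convert to an index into `lines`

  in_tags = false
  lines.each_with_index.map do |line, i|
    next line if i.zero? || i >= closing # outside the front matter body

    if line.start_with?("tags:")
      in_tags = true
      "tags:" + line[5..-1].downcase     # handles inline forms such as "tags: [A, B]"
    elsif in_tags && line.match?(/\A\s*-\s/)
      line.downcase                      # a "- item" entry under "tags:"
    else
      in_tags = false
      line
    end
  end.join
end

# Possible usage (commented out so nothing is rewritten by accident):
# Dir.glob("_posts/**/*.{md,markdown}") do |path|
#   File.write(path, downcase_tags(File.read(path)))
# end
```

Running it on a repository with no uncommitted changes makes it easy to review the resulting diff before keeping it.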
whether Pages is case-sensitive.
Pages runs on linux machines, so anything file-related is case-sensitive.
Cool, thanks @spraints! I figured but wanted to check.
Windows and Linux are case-sensitive, but Mac is case-insensitive, which makes it the outlier. Do we cater to case-sensitivity or case-insensitivity?
@parkr I'm not a Windows expert but I thought it was case-insensitive. http://superuser.com/questions/165975/are-all-versions-of-windows-case-insensitive
If we go with case-sensitive, I think we should make sure the error/warning thrown on case-insensitive filesystems because of two tags that are the same except for case is very clear, or else we might end up with a lot of questions on here. Might be something to think about for jekyll doctor.
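A check of that kind could be sketched like this. `case_conflicts` is a hypothetical helper and is not part of jekyll doctor; the warning text is only an example.

```ruby
# Return groups of tags that differ only by letter case.
# `case_conflicts` is a hypothetical helper, not part of `jekyll doctor`.
def case_conflicts(tags)
  tags.uniq
      .group_by(&:downcase)
      .values
      .select { |spellings| spellings.size > 1 }
end

case_conflicts(["Apple", "apple", "pie", "MySQL", "mysql"]).each do |spellings|
  warn "Tags differ only by case and may overwrite each other's page: #{spellings.join(', ')}"
end
```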
It depends on if you enable POSIX compliance and set ObCaseInsensitive to 0... Windows can go either way like OS X used to be able to (I don't know if it does anymore?) and Linux can with some file systems but by default NTFS is Case Insensitive and by default Linux is case sensitive (by way of EXT4 -- that is if you opt to use it, on Linux we do have our choices.)
One more thing to add to the discussion. I'm not sure if Jekyll is concerned with WordPress compatibility/importing, but I would like to bring up that WordPress is case-insensitive when it comes to tags.
Is Tag the same as tag?
Yes. Capital letters do not change a tag. Blogging is the same as blogging.
https://en.support.wordpress.com/posts/categories-vs-tags/
Also, OS X can go either way though a lot of apps break when switching from the default. The default is case-insensitive and case-preserving.
That's apples to oranges. IMO.
I'm not sure if Jekyll is concerned with WordPress compatibility/importing, but I would like to bring up that WordPress is case-insensitive when it comes to tags.
We aren't all that worried about this, no. We have a jekyll-import WordPress importer which we point people to and which should downcase the tags.
That said, WordPress is prior art and can help inform our decision, notwithstanding their pedantic documentation.
That said, WordPress is prior art and can help inform our decision, notwithstanding their pedantic documentation.
When you got a billion users and don't want a billion tickets that happens :cry:
I just noticed this thread, I want to add my input: case-insensitive.
Seeing this issue is still open I assume the case is not decided yet...
I ran into this problem, expecting the tag and The Tag to be the same tag. Coming from Wordpress I just assumed this as the default behavior. Generally speaking, people have been trained to expect this behavior thanks to the URL spec. (e.g. Google.com resolves to google.com)
According to RFC 3986, the canonical URL scheme is lower-cased. (see https://tools.ietf.org/html/rfc3986#section-3.1) Quoting:
Although schemes are case-insensitive, the canonical form is lowercase and documents that specify schemes must do so with lowercase letters. An implementation should accept uppercase letters as equivalent to lowercase in scheme names (e.g., allow "HTTP" as well as "http") for the sake of robustness but should only produce lowercase scheme names for consistency.
So it follows that Jekyll should accept apple and Apple but normalize both to a single tag of apple. Unless I'm mistaken, a Jekyll tag is a user-defined resource locator, so it should conform to the URL spec.
If a user wants to refer to the company founded by Steve Jobs instead of the fruit, a more explicit tag should be used, perhaps Apple Inc or Apple Corp which would normalize to the tags apple-inc or apple-corp. Encouraging explicitness over ambiguity is generally a good thing.
Perhaps later, Jekyll core could be extended to allow tag manipulation as suggested by envygeeks but by default I suggest it makes sense to conform to the canonical URL scheme.
@xHN35RQ Only the scheme is case-insensitive; the path isn't. Some URL normalizations even require uppercase characters, for example in escape sequences. https://en.wikipedia.org/wiki/URL_normalization
Only the scheme is case-insensitive, the path isn't.
Right, but why is this relevant? Since Jekyll is using tags to generate URLs, then it should follow RFC 3986 and by default "should accept uppercase letters as equivalent to lowercase in scheme names (e.g., allow "HTTP" as well as "http") for the sake of robustness but should only produce lowercase scheme names for consistency." I don't understand why path casing is important here? Maybe I'm looking at this the wrong way.
@xHN35RQ in the URL https://example.com/tags/my-tag, the only part that the RFC 3986 passage you mention applies to is "https", and it doesn't actually apply to us; it only applies to your browser and server, both of which normalize it on your behalf. The RFC does not (and probably never will) dictate that we make things case-insensitive, that we force downcasing (even if I feel this is the standard of the web), or that we make it all capitals. It only dictates what matters to it: https://, gopher://, ftp://, http://, all of which must be case-insensitive, and only to the extent that a scheme is allowed to come in case-insensitive but will be enforced as lowercase in a browser and, in some crazy applications, all uppercase.
That said, Jekyll is a static site generator; this is a problem for your server, not for us. If you want case-insensitive URLs in Jekyll, you'll increase your files exponentially to an uncontrollable number quickly. My arguments (the ones that you linked to) were so that people can flag tags any way they want: they can grep out a tag and alter a file's tag to that tag, normalizing and optimizing their site in a manner that means they don't have to care one iota about whether they got the case right.
On Wordpress being able to do it (it being case-insensitive tags/urls,) it can offer you that because it is an application that is always on. They can offer something like that, we cannot, not feasibly anyhow because _technically_ we can but realistically we won't and shouldn't because again, this creates an exponential problem that we do not want. We write files on your behalf to be served by a server of your choosing (like Nginx) whereas with Wordpress all URL's land to them to deal with and they need not create thousands of files to do what you want... we would need to do that, which is the unrealistic situation we don't want to create. Yes, Wordpress takes in URL's via Nginx or Apache too, but again, they are serving you the data through Wordpress, Nginx is serving your data directly for us.
Fair enough. Thanks for your replies. Your explanation makes sense.
And correct, the spec shouldn't dictate your behavior but it seemed to me that since Jekyll generates URL structures they should conform to expected web usage. For example, my clients tend to expect case-insensitivity, (apple = Apple = /apple) But I see this is largely just an opinion.
I'm still left with the problem that given the tag and The Tag Jekyll treats these as two different tags, but only generates one tag URL the-tag and this page only contains the posts tagged with The Tag. Is there a way I can work around this? Is it currently pluggable?
@xHN35RQ in theory it should be: https://jekyllrb.com/docs/plugins/#hooks you can hook into :pre_render and edit the front-matter they supply to normalize it to a grep of all known tags.
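A hedged sketch of such a plugin follows. The comment above suggests :pre_render; this version registers on the site-level :post_read hook instead, on the assumption (not verified against a real site here) that tags have to be rewritten before generators such as jekyll-archives build their pages. `normalize_tags` and the first-spelling-wins rule are hypothetical choices.

```ruby
# Map each tag to the first spelling seen for it (compared case-insensitively).
# `canonical` is a Hash shared across posts; all names here are hypothetical.
def normalize_tags(tags, canonical)
  tags.map { |tag| canonical[tag.downcase] ||= tag }.uniq
end

# Only register the hook when running inside Jekyll (e.g. from _plugins/).
if defined?(Jekyll::Hooks)
  Jekyll::Hooks.register :site, :post_read do |site|
    canonical = {}
    site.posts.docs.each do |post|
      post.data["tags"] = normalize_tags(Array(post.data["tags"]), canonical)
    end
  end
end
```

With this rule, a post tagged "the tag" that is read after one tagged "The Tag" would be rewritten to use "The Tag", so both posts end up under the same tag.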
@hazzik I think this is a jekyll-archive issue with how it reads in tag and category hashes, and I've proposed a possible solution on https://github.com/jekyll/jekyll-archives/issues/43, so this issue can be closed.
Just wanted to note for anyone who finds their way to this thread that I built a tool which can help manage tag case sensitivity. It is part of a project I created called jekyll-pre-commit, which uses git pre-commit hooks to run checks before allowing you to commit. I added a check called NoDuplicateTags which, once you've installed the gem, you declare in your _config.yml. You'll then get a warning if, for example, you try to commit a post using the tag "mysql" but you already have a post that's using the tag "MySQL". Hope someone else also finds this useful.