https://drafts.csswg.org/css-text-3/#hyphens-property
See
https://bugs.webkit.org/show_bug.cgi?id=166485
https://bugs.chromium.org/p/chromium/issues/detail?id=676270
WebKit and Chromium hyphenate text with 'hyphens: auto' when no language is declared. Gecko does not.
MDN says:
Hyphenation rules are language-specific. In HTML, the language is determined by the lang attribute, and browsers will hyphenate only if this attribute is present and if an appropriate hyphenation dictionary is available.
Spec says:
Correct automatic hyphenation requires a hyphenation resource appropriate to the language of the text being broken. The UA is therefore only required to automatically hyphenate text for which the content language is known and for which it has an appropriate hyphenation resource.
Authors should correctly tag their content’s language (e.g. using the HTML lang attribute) in order to obtain correct automatic hyphenation. UAs may refuse to automatically hyphenate untagged content regardless of the hyphens property value.
https://drafts.csswg.org/css-text-3/#valdef-hyphens-auto
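For discussion, here is a minimal sketch of the stricter reading of that spec text, in which untagged content is never auto-hyphenated. All names (`HyphensValue`, `availableDictionaries`, `mayAutoHyphenate`) are hypothetical, the dictionary list is made up, and matching on the primary subtag only is a simplification, not what any engine is known to do.

```typescript
// Hypothetical sketch; not taken from any browser engine.
type HyphensValue = "none" | "manual" | "auto";

// Languages for which this imaginary UA ships a hyphenation resource (made up).
const availableDictionaries: ReadonlySet<string> = new Set(["en", "de", "fr"]);

// Returns true only when automatic hyphenation is requested, the content
// language is known, and a matching hyphenation resource exists.
// `contentLanguage` is null when the content is untagged.
function mayAutoHyphenate(
  hyphens: HyphensValue,
  contentLanguage: string | null,
  dictionaries: ReadonlySet<string> = availableDictionaries
): boolean {
  if (hyphens !== "auto") return false;
  // Stricter reading: refuse to hyphenate untagged content.
  if (contentLanguage === null || contentLanguage === "") return false;
  // Simplification: compare the primary subtag only ("en-US" -> "en").
  const tag = contentLanguage.toLowerCase();
  const dash = tag.indexOf("-");
  const primary = dash === -1 ? tag : tag.slice(0, dash);
  return dictionaries.has(primary);
}
```

Under this sketch, the untagged `div` in a test case like the one below would not be hyphenated, while the `lang=en-US` one would.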
Now the spec doesn't forbid it, but I think the intent is that UAs should not hyphenate untagged content. (Correct?)
Test case/demo:
http://software.hixie.ch/utilities/js/live-dom-viewer/saved/4761
<!DOCTYPE html>
<style> div { border:solid; width:150px; -webkit-hyphens:auto; hyphens:auto; } </style>
No lang
<div>Long words like implementation, initialization, realization, and hyphenation.</div>
lang=en-US
<div lang=en-US>Long words like implementation, initialization, realization, and hyphenation.</div>
In the WebKit bug there is resistance to changing their behavior to that of Gecko.
I think it would be good to figure out what behavior we ideally want browsers to have when the language is not declared, and put that in the spec so we can achieve interoperable behavior for this case.
Now the spec doesn't forbid it, but I think the intent is that UAs should not hyphenate untagged content. (Correct?)
👍 as this is the most reliable behavior that also gives the authors most control.
Worse than that but maybe an option: hyphenate based on the primary browser/OS language if none is declared on dom/subdom.
The UA might be able to do something useful based on guessing from various heuristics, but it seems very brittle. I'd rather use this as a carrot to encourage authors to always language-tag their content, which they should do anyway.
I don't have a strong opinion one way or the other, but a few things we should consider when discussing:
lang attribute, HTML defines the language of a node. I guess people are fine to include this?
CSS leaves it to the host language to define human language, so for HTML that is indeed the rules that should be used as far as I can tell.
'hyphens' uses https://drafts.csswg.org/css-text-3/#content-language
':lang()' uses https://drafts.csswg.org/selectors/#language
(I guess there should only be one concept defined here that css-text and selectors should both hook into... [edit: https://github.com/w3c/csswg-drafts/issues/879 ])
F2F discussion: https://logs.csswg.org/irc.w3.org/css/2017-01-11/#e757028
in httparchive hyphens:auto appears in 27,173 resources from 494,891 pages
Did some more analysis today. (The January data set is 494,956 pages.)
The number of pages using hyphens: auto in total is 24,246. (~4.9%)
138,458 pages specify a language in <html lang>. (~28.0%)
Of those that do not specify a language, 18,282 resources in 16,402 pages use hyphens: auto. (~3.3% of total; ~67.6% of all hyphens: auto pages).
So 3.3% of the top 500,000 pages, or two thirds of pages using hyphens: auto, are affected here, which is quite a lot. At the f2f it was argued that this is most likely to have a negative impact for European users with non-English system language reading English pages.
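As a quick sanity check of the percentages above, the arithmetic can be reproduced from the quoted counts. The variable names and the `percent` helper are only illustrative; the numbers are the ones given in this comment.

```typescript
// Counts quoted in the comment above (January data set).
const totalPages = 494956;
const pagesWithHyphensAuto = 24246;
const pagesWithHtmlLang = 138458;
const untaggedPagesWithHyphensAuto = 16402;

// Percentage of `whole` represented by `part`, rounded to one decimal place.
function percent(part: number, whole: number): number {
  return Math.round((part / whole) * 1000) / 10;
}

const shareUsingAuto = percent(pagesWithHyphensAuto, totalPages);
const shareWithLang = percent(pagesWithHtmlLang, totalPages);
const shareAffected = percent(untaggedPagesWithHyphensAuto, totalPages);
const shareOfAutoPagesAffected = percent(
  untaggedPagesWithHyphensAuto,
  pagesWithHyphensAuto
);

console.log(shareUsingAuto, shareWithLang, shareAffected, shareOfAutoPagesAffected);
```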
These matches can be further analyzed to determine which behavior is a net win for users. For example applying language heuristic, selecting English-detected pages, and applying hyphenation rules for some European languages, and making a judgement if the hyphenations that happen can cause confusion or unintended meaning. I do not have the bandwidth to do this myself at this time, so up for grabs.
(As an anecdote that applying hyphenation with the wrong language can be a real problem, see http://indesignsecrets.com/words-hyphenating-wrong-indesign.php# )
cc @litherum
User statistics are one thing. But it is not a net-win to future end-users/visitors if web-authors are encouraged to be lazy about specifying language.
If a text node has multiple languages there could be lang="auto" to let authors explicitly tell the user agent to try and do its best to detect languages and thus correct hyphenation.
I really don't want to see legal, medical or scientific texts auto-hyphenated incorrectly in future just because there are existing documents created by lazy authors.
Breaking 3% of webpages is not acceptable for our browser (and I'd imagine most others).
Right, the question is just, are they more broken in Safari or in Firefox?
Webpages are only “breaking” on hyphenation if incorrect rules are applied, so browsers should not hyphenate when in doubt, i.e. when the content language is unknown, obviously. That means, Gecko is good/rightish, WebKit is bad/wrongish.
3% of webpages (if the stats are representative) use hyphens:auto without declaring the language on the html element. The problematic pages are a strict subset: those that use hyphens:auto without declaring the language on the element (or an ancestor of the element) where hyphens:auto is applied. This may be close to the same 3%, or may be substantially less.
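To make the "element or ancestor" distinction concrete, here is a rough sketch of looking up the nearest declared language. `SketchElement` is a hypothetical stand-in for a DOM element, not the real DOM API, and `declaredLanguage` is an illustrative name. The sketch treats `lang=""` as explicitly unknown, which is how HTML defines it.

```typescript
// Hypothetical stand-in for a DOM element; not the real DOM API.
interface SketchElement {
  lang?: string; // value of the lang attribute, if the attribute is present
  parent?: SketchElement; // undefined for the root element
}

// Walks up from `el` and returns the nearest declared language, or null
// when no element in the chain carries a lang attribute.
// An empty lang="" explicitly marks the language as unknown.
function declaredLanguage(el: SketchElement): string | null {
  let node: SketchElement | undefined = el;
  while (node !== undefined) {
    if (node.lang !== undefined) {
      return node.lang === "" ? null : node.lang;
    }
    node = node.parent;
  }
  return null;
}

// Example tree: root without lang, one tagged div, one untagged div.
const root: SketchElement = {};
const taggedDiv: SketchElement = { lang: "en-US", parent: root };
const untaggedDiv: SketchElement = { parent: root };
const nestedInTagged: SketchElement = { parent: taggedDiv };

console.log(declaredLanguage(taggedDiv), declaredLanguage(untaggedDiv));
```

A page like this example would be counted as "no language on the html element" in the stats, yet only `untaggedDiv` would actually be affected.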
Even if we keep the 3% as a baseline, this does not necessarily mean that changing the behavior would just break these sites. It also fixes them. Whether it fixes them more, or breaks them more, depends on 2 things:
1 - how bad is the breakage / how good is the fixing
2 - what percentage of the audience of the sites is affected
On 1, I would argue that the imp-
rovement of going from inco-
rrect hyphenation to none
is larger than the pro-
blems of going form corre-
ct hyphenation to none.
On the one hand, we have me-
aning alt-
ering behavior, while on the
other, we have a small imp-
airement to typ-
ogra-
phic quality.
As for 2, it depends on what percentage of the audience of these web sites is viewing them from an environment set up in a different language from the content's. Depending on the site, this could be close to none, or close to most.
Finally, I'd like to note that Safari does not support the unprefixed "hyphens" property, only "-webkit-hyphens". Depending on the correlation between using the prefixed and unprefixed together, declaring the language, and having an international audience, it is hard to tell if Safari extending that behavior to the unprefixed version would keep things mostly as they are, or break a substantive number of pages that so far (correctly) did not hyphenate. It may also be possible to keep Safari's current behavior with the -webkit- prefix, but not with the unprefixed property.
Marking this "Needs Data", although we have some data in https://github.com/w3c/csswg-drafts/issues/869#issuecomment-274066878 -- what's needed is analysis of the data.
If hyphens are to require language, could a 'Content-Language' HTTP header be sufficient indication of language in the absence of a lang=""?
^ Answer is yes
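A rough sketch of that fallback order, for illustration only. The function names are hypothetical, and treating a header that lists several languages (e.g. `en, de`) as giving no usable language is an assumption of this sketch, not something stated in this thread. The sketch also ignores the explicit `lang=""` case.

```typescript
// Returns the single language named by a Content-Language header value,
// or null when there is no header or no single usable language.
// Assumption of this sketch: a list of several languages yields null.
function languageFromContentLanguage(header: string | undefined): string | null {
  if (header === undefined) return null;
  const tags = header
    .split(",")
    .map((t) => t.trim())
    .filter((t) => t !== "");
  const first = tags[0];
  return tags.length === 1 && first !== undefined ? first : null;
}

// A language declared via the lang attribute wins; the HTTP header is
// only consulted as the final fallback when nothing is declared.
function effectiveLanguage(
  attributeLanguage: string | null,
  header: string | undefined
): string | null {
  return attributeLanguage ?? languageFromContentLanguage(header);
}

console.log(effectiveLanguage(null, "de"), effectiveLanguage("en-US", "de"));
```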
The Working Group just discussed Should 'hyphens: auto' work if lang is not declared?.
The full IRC log of that discussion
<dael> topic: Should 'hyphens: auto' work if lang is not declared?
<dael> github: https://github.com/w3c/csswg-drafts/issues/869
<dael> myles: Hyphenation requires a dictionary. The dictionary selected should be informed by lang. Without lang hard to pick dictionary. In WebKit we pick the OS dictionary
<dael> myles: I believe that's right.
<dael> myles: We have seen a lot of untagged content.
<dael> dbaron: I think this was intentional decision by WG to encourage content to be written in a way that works worldwide.
<dael> dbaron: Gecko impl what the spec says where we only auto-hyphenate with a declared language.
<dael> astearns: Seemed like terrible things could happen if you use a dictionary without a lang.
<dael> fantasai: If they think their page works, but only on their computer that's bad.
<fantasai> s/If/Also, if/
<dael> florian: If hyphens was meant to be applied but doesn't that's not good but it's readable. Auto hyphens would make text confusing. It's encouraging authors to not lang tag and you might hyphenate wrong which is worse than no hyphens.
<dael> myles: For the first part our thought is about existing content. For new content we should encourage correct tags. We're worried about breaking large websites.
<dael> florian: You'd have degradation.
<dael> fantasai: The pages are broken in browsers that don't have the behavior so it's making it more obvious.
<dael> myles: Aren't you concerned you get German hyphens on English text? Seems like a reason to change this preference.
<dael> ??: If you have a http header with a language?
<dael> fantasai: That counts as a language.
<myles> s/??/richr/
<astearns> s/myles: Aren't/astearns: Aren't
<dael> myles: I won't object.
<dbaron> https://html.spec.whatwg.org/multipage/dom.html#attr-lang
<dael> astearns: Anyone else?
<dael> astearns: Objections for requiring language for hyphenation to take effect?
The Working Group just discussed Lone CRs, and agreed to the following resolutions:
RESOLVED: requiring language for hyphenation to take effect
The full IRC log of that discussion
<fantasai> Topic: Lone CRs
<fantasai> https://drafts.csswg.org/css-text-3/issues-lc-2013#issue-138
<fantasai> github: https://github.com/w3c/csswg-drafts/issues/869
<dael> RESOLVED: requiring language for hyphenation to take effect
@clagnut when deriving the default content language from HTTP, browser vendors could take care to inject that language as an attribute into HTML documents (on <html>, I suppose) when saving web pages, so that the default fallback locale persists in the saved documents?
PRs submitted:
I have commented in
https://bugs.webkit.org/show_bug.cgi?id=166485
https://bugs.chromium.org/p/chromium/issues/detail?id=676270
about the resolution here.
Reported a bug for Edge.
https://developer.microsoft.com/en-us/microsoft-edge/platform/issues/19005611/
@inoas There's no need to do that. The Content-Language headers are processed by the browser without modifying the DOM, and the language is reflected in :lang().
@fantasai I think you missed @inoas' point. If you want the hyphenation to be preserved after off-lining the page (saving it locally or something), you need to preserve the information that was in HTTP headers. However, that's true for more than just the Content-Language header, and CSS is the wrong place to solve that issue. The bigger problem is that any full solution to saving / offlining / sideloading content will need to preserve a number of HTTP headers. Web Package or something similar is what you're looking for.
There are also sites with multi-language content though. Even sometimes in one sentence. Even worse, this content may be user-generated. With "hyphenate as best as we can" this case could be mitigated, but what is the way of mitigating this issue in the case of language-specified hyphenation?
There are also sites with multi-language content though
That's fine, the lang attribute can be applied to any element, and different elements in the page can have different values.
Even sometimes in one sentence. Even worse, this content may be user-generated.
That's tough. Doing correct hyphenation on an unknown mix of languages is going to be hard. Falling back to the language of the operating system wouldn't help (as was the case for the implementations that did hyphenation even without a declared language).
Maybe that's why Chrome-like browsers initially adopted a heuristics-based approach. As someone stated above, a bad hyphenation may still be better than no hyphenation at all.
Just maybe this behaviour could be made possible with a new value for the hyphens property, "all" or something like that? Meaning "hyphenate all you can find".