Rather than rejecting requests, allow configuration to either ignore UTF-8 headers or parse them, even if they're illegal.
I suggest an enum Reject, Ignore, Parse.
However on output we should still be strict, UrlEncoding cookie values etc.
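For illustration only, a minimal sketch of what such an option could look like. `NonAsciiHeaderPolicy` and `HeaderPolicySketch` are made-up names for this example, not Kestrel APIs, and the exception type is just a stand-in for "respond with 400".

```csharp
using System;
using System.Text;

// Hypothetical names: neither this enum nor this helper exists in Kestrel.
public enum NonAsciiHeaderPolicy
{
    Reject, // fail the whole request
    Ignore, // drop the offending header and keep processing the request
    Parse   // decode the value as UTF-8 even though it is not strictly legal
}

public static class HeaderPolicySketch
{
    // Returns the decoded value, or null when the header should be dropped.
    // Throws when the policy says the request must be rejected.
    public static string Apply(byte[] rawValue, NonAsciiHeaderPolicy policy)
    {
        // Pure ASCII values are always accepted.
        if (Array.TrueForAll(rawValue, b => b < 0x80))
        {
            return Encoding.ASCII.GetString(rawValue);
        }

        switch (policy)
        {
            case NonAsciiHeaderPolicy.Ignore:
                return null;
            case NonAsciiHeaderPolicy.Parse:
                return Encoding.UTF8.GetString(rawValue);
            default:
                throw new InvalidOperationException("Malformed request: invalid headers.");
        }
    }
}
```

With the bytes `0xC3 0xB6`, `Parse` would yield `ö`, `Ignore` would yield `null`, and `Reject` would throw.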
See https://github.com/aspnet/KestrelHttpServer/issues/1076 and https://github.com/aspnet/KestrelHttpServer/issues/1125
edit by @muratg: We should also consider request line when we get to this. https://github.com/aspnet/KestrelHttpServer/issues/2647
In practice I expect this to become an FAQ where most people turn on UTF-8, so we should consider making that the default.
Have we tested how these headers forward through IIS? WebListener?
I like how we say "We". Nope, please try it :p
@halter73 please work with @cesarbs, based on timing this may be load balanced to you.
cc @davidfowl
A big change (essentially a rewrite in header processing) so moving to 1.2.0.
@Tratcher could you file a corresponding bug on WebListener repo?
Should also fix: https://github.com/aspnet/KestrelHttpServer/issues/1125
Investigating this. Will check how those headers forward through IIS, and also what other servers do.
Here's what I found so far:
Server | Behavior
-- | --
IIS | accepts UTF-8 in header value, don't know if immediately decoded
IIS running ASP.NET 4 app | accepts UTF-8, decodes it as such
IIS with ANCM | rejects, not yet sure why, but the request is not forwarded at all
WebListener | accepts UTF-8, decodes it as such
nginx | accepts non-ASCII, haven't checked yet what it does with it
node.js | accepts non-ASCII, it's up to the app to decode it
Apache | accepts non-ASCII, it's up to the app to decode it
A relevant bit from the RFC:
Historically, HTTP has allowed field content with text in the
ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
through use of [RFC2047] encoding. In practice, most HTTP header
field values use only a subset of the US-ASCII charset [USASCII].
Newly defined header fields SHOULD limit their field values to
US-ASCII octets. A recipient SHOULD treat other octets in field
content (obs-text) as opaque data.
I think the most relevant part here is:
A recipient SHOULD treat other octets in field content (obs-text) as opaque data.
So it's not forbidding chars above 0x7F. obs-text is actually defined in the next section as
obs-text = %x80-FF
The most correct behavior seems to be to accept characters in the 0x80 - 0xFF range (which we reject at present), but to let the app decide the encoding. Http.sys appears to deviate from this though, by decoding as UTF-8.
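A rough sketch of that rule, assuming a hypothetical hook that sees the raw header value bytes. `FieldValueBytes` is an illustrative name. The check accepts HTAB, SP, visible ASCII and obs-text (0x80-0xFF), rejects other control characters, and makes no attempt to decode anything.

```csharp
public static class FieldValueBytes
{
    // True when every octet is allowed in an HTTP/1.1 field value:
    // HTAB (0x09), SP (0x20), VCHAR (0x21-0x7E) or obs-text (0x80-0xFF).
    public static bool IsValid(byte[] value)
    {
        foreach (byte b in value)
        {
            bool allowed = b == 0x09 || (b >= 0x20 && b <= 0x7E) || b >= 0x80;
            if (!allowed)
            {
                return false; // NUL, CR, LF, DEL and other control characters
            }
        }
        return true;
    }
}
```

The value stays a `byte[]` here, so picking an encoding is left to the app.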
For reference, which header did you test? You may get different results for a common header vs a custom header. E.g. Host and Location are often special cased.
How does Apache leave it up to the app to decode it? Does it expose the raw header bytes?
@Tratcher I was testing with Referer, as in #1125.
I tested Apache with a PHP app, which saw the header as raw bytes. You get a "string" for it, but I'd have to manually decode it as UTF-8 to get the right chars.
@blowdart We're currently rejecting if the header is UTF8, so I don't think this has security implications. Moving to 2.1.0
Won't this still reject UTF8 cookies? if so that's a problem
FYI if you care SNI says the host is utf8
https://tools.ietf.org/html/rfc3546#section-3.1
The hostname is represented as a byte
string using UTF-8 encoding [UTF8], without a trailing dot.
Now a good server should probably check the host header and SNI match, might be impossible if you can't have a utf8 header
There is even a spec on how to deal with hostnames
https://tools.ietf.org/html/rfc3490#section-4
But a lot is left as an exercise for the reader.
> Won't this still reject UTF8 cookies? if so that's a problem

Why is that a problem? We've had 2 complaints overall so far.
If you provide a service for customers who are allowed to insert JavaScript, and it accidentally inserts a Unicode character, then their customers will have problems forever.
This is a critical problem. Please hotfix.
IE doesn't encode the referer header. As a result, this breaks our .NET Core APIs when requests are coming from IE users on some of our sites with e.g. Cyrillic characters in the URL. Would be good to have support for this.
@davidfowl I think you are not getting many complaints because it takes so long to track down what the issue is. Because many different applications might set cookies with utf-8 characters on the root of the domain (.example.com) and thus some developers might blame it on user error because that encoding error is not popping up in the logs and may be hard to reproduce.
We could really use this. We have a site running on a subdomain (site.bigcompany.com) and users navigate to our site through our company site (bigcompany.com). Our company site uses cookies with non-ASCII characters that are also sent to the subdomains, which causes users who access our site through our company site to receive bad requests. We're now working with the team from our company site to have them fix their cookies.
I believe this is something that can also be misused for a DoS attack. If you manage to set a cookie in the user's browser, then your site (running ANC) will not function anymore.
As others mention, the error might originate from JavaScript cookies and not your own code, so you might not be able to correct the error.
In our situation Google Analytics set the malformed/UTF-8 cookie.
Our marketing department used the utm_campaign query string to name their campaigns (containing Danish characters), so when they made a URL to our site and posted it on Facebook or other places, all users who clicked that link were no longer able to view our website after the first page view. The only solution was to clear cookies (or wait the 6 months until cookie expiration), but not many users told us we had a problem; they just went on to our competitor instead.
It took us ages to track down, because the cookie was actually set on a parent domain and not the Kestrel site (which, at the time, was running on a subdomain). And it was not even set by our own code, but by third-party JavaScript, so searching our codebase for the cookie name came up with 0 results.
Since we discovered the error, a newer version of Google Analytics has changed its behaviour: it stores all data on their side and only stores a user-id cookie on the client side, so the malformed cookie is no longer set. See this for an explanation: https://stackoverflow.com/questions/18604715/google-analytics-missing-utmz-cookie
I think you might be able to reproduce the error if you are using the old Google Analytics scripts (ga.js) on your kestrel website (instead of the newer analytics.js), by simply loading a url with a utm_campaign querystring containing international characters (we have switched to the newer version, so can't test it on my own anymore).
It would be great to at least have some kind of option to ignore UTF-8 cookies, so we can at least get around it by programming instead of losing users forever.
We also had this problem, which was extremely difficult to narrow down to being this. We use shibboleth which is injecting request headers into our application. One of our staff had í in their name, which just resulted in them getting the 400 error and us being confused for a long time.
I also had an issue reported in context of Server-Timing header. The definition of this header allows using quoted-string for server-timing-param-value and quoted-string allows broader set of characters:
quoted-string = DQUOTE *( qdtext / quoted-pair ) DQUOTE
qdtext = HTAB / SP / %x21 / %x23-5B / %x5D-7E / obs-text
obs-text = %x80-FF
So this header would require an option of relaxing the ascii-only check also for responses.
3.2.6. Field Value Components
Most HTTP header field values are defined using common syntax
components (token, quoted-string, and comment) separated by
whitespace or specific delimiting characters. Delimiters are chosen
from the set of US-ASCII visual characters not allowed in a token
(DQUOTE and "(),/:;<=>?@[\]{}").
token = 1*tchar
tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*"
/ "+" / "-" / "." / "^" / "_" / "`" / "|" / "~"
/ DIGIT / ALPHA
; any VCHAR, except delimiters
A string of text is parsed as a single value if it is quoted using
double-quote marks.
quoted-string = DQUOTE *( qdtext / quoted-pair ) DQUOTE
qdtext = HTAB / SP / %x21 / %x23-5B / %x5D-7E / obs-text
* obs-text = %x80-FF
Comments can be included in some HTTP header fields by surrounding
the comment text with parentheses. Comments are only allowed in
fields containing "comment" as part of their field value definition.
comment = "(" *( ctext / quoted-pair / comment ) ")"
ctext = HTAB / SP / %x21-27 / %x2A-5B / %x5D-7E / obs-text
The backslash octet ("\") can be used as a single-octet quoting
mechanism within quoted-string and comment constructs. Recipients
that process the value of a quoted-string MUST handle a quoted-pair
as if it were replaced by the octet following the backslash.
quoted-pair = "\" ( HTAB / SP / VCHAR / obs-text )
A sender SHOULD NOT generate a quoted-pair in a quoted-string except
where necessary to quote DQUOTE and backslash octets occurring within
that string. A sender SHOULD NOT generate a quoted-pair in a comment
except where necessary to quote parentheses ["(" and ")"] and
backslash octets occurring within that comment.
However
3.2.4. Field Parsing
...
Historically, HTTP has allowed field content with text in the
ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
through use of [RFC2047] encoding. In practice, most HTTP header
field values use only a subset of the US-ASCII charset [USASCII].
* Newly defined header fields SHOULD limit their field values to
* US-ASCII octets. A recipient SHOULD treat other octets in field
* content (obs-text) as opaque data.
As string is UTF-16, that would suggest the correct approach would be to reject 0x00 and simply widen all other chars, converting ISO-8859-1 -> UTF-16, which also means any UTF-8 outside the ASCII range would not be interpreted correctly?
e.g. treat opaque data as 8-bit data, converting (byte)0xDD -> (char)0x00DD
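A minimal sketch of that widening idea, with a hypothetical helper name (`Latin1Widening`). Because each octet maps to exactly one char, the transform is reversible, so an app that knows the sender used UTF-8 could narrow the string back to bytes and decode it itself.

```csharp
using System;

public static class Latin1Widening
{
    // Widen each octet to one UTF-16 char: byte 0xDD becomes char 0x00DD.
    // 0x00 is rejected, as suggested above.
    public static string Widen(byte[] raw)
    {
        var chars = new char[raw.Length];
        for (int i = 0; i < raw.Length; i++)
        {
            if (raw[i] == 0x00)
            {
                throw new FormatException("NUL is not allowed in a header value.");
            }
            chars[i] = (char)raw[i];
        }
        return new string(chars);
    }

    // Reverse the widening to recover the original octets.
    public static byte[] Narrow(string widened)
    {
        var raw = new byte[widened.Length];
        for (int i = 0; i < widened.Length; i++)
        {
            if (widened[i] > 0xFF)
            {
                throw new ArgumentException("Not a widened 8-bit string.", nameof(widened));
            }
            raw[i] = (byte)widened[i];
        }
        return raw;
    }
}
```

For example, the UTF-8 bytes of `ö` (`0xC3 0xB6`) widen to `Ã¶`, which is wrong as text but loses nothing: `Encoding.UTF8.GetString(Latin1Widening.Narrow("Ã¶"))` gives back `ö`.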
@muratg you said

> Moving to 2.1.0

It's still attached to the backlog milestone while a 2.1.0 milestone exists.
Is this going to get fixed in 2.1.0 or not?
@KLuuKer bringing this back to triage.
@shirhatti @DamianEdwards what are your thoughts on this one?
Backlogging this, no work planned in 2.x.
Do you have any timeline on this? For now, I need to look for an alternative server since I cannot use my API from JavaScript in IE when a URL contains non-ASCII characters. Every HTTP request contains those characters in the Referer header and the server returns 400 (Bad Request).
Let's investigate this for 2.2
My issue was related to a cookie set on the client and sent to the server. I had this issue with not encoding cookies on the server, and they were set to empty. If the cookie is set on the client side and posted, it shouldn't give a Bad Request page with no way to handle it in code. Another gotcha with cookies in ASP.NET Core.
I have a similar problem. Our ASP.NET Core applications run behind a "kind of" reverse proxy, which adds additional headers (for whatever purpose). These headers can contain German umlauts (e.g. "ö"), which lead to "Malformed request: invalid headers." and the request ends immediately (400).
Is there any workaround to still serve these requests?
No, there's no workaround when the request can't be decoded like this.
> These headers can contain German umlauts (e.g. "ö"), which lead to "Malformed request: invalid headers." and the request ends immediately (400).

You can url encode the header? e.g. ö is %C3%B6
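As a small sketch of that suggestion, assuming both the sender and the receiving app agree to percent-encode the value (the class name `HeaderValueEscaping` is made up for the example):

```csharp
using System;

public static class HeaderValueEscaping
{
    // Sender side: percent-encode the UTF-8 bytes so the header stays pure ASCII.
    public static string Encode(string value) => Uri.EscapeDataString(value);

    // Receiver side: turn the percent-encoded value back into text.
    public static string Decode(string headerValue) => Uri.UnescapeDataString(headerValue);
}
```

`HeaderValueEscaping.Encode("ö")` returns `%C3%B6`, and `Decode` reverses it. Note that every consumer of the header has to know the value is encoded.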
@benaadams the application setting these headers isn't under my control; I could decode anything. 🤷♂️
But I'm going to ask its vendor; getting an answer there is generally a painful process and often not very helpful, though.
I am facing the same problem as @TommyRush. I am using Shibboleth which injects headers. I neither have control over the Shibboleth installation nor over the dataset used by the shibboleth installation. So now MOST users can work without any problems, BUT the poor ones with umlauts in their name or other data can't do anything.
Would appreciate this being fixed soon.
My Application is running behind IIS, is there any way IIS could parse the headers and "fix" them?
@TommyRush did you ever find a workaround?
@amrmahdi you can check your logs for info level logs of the form Connection id "<ID>" bad request data: "Invalid request target: <Invalid Chars>" and Connection id "<ID>" bad request data: "Malformed request: invalid headers."
RFC 5987 required supporting ISO-8859-1. The RFC that obsoletes 5987 removed that requirement, although it still encourages supporting ISO-8859-1 for backward compatibility. So why does Kestrel not support ISO-8859-1?
Also, given that HttpConnection on corefx allows it, isn't it strange for the server not to support it?
We're going to drop the entire header if the header value is not a valid UTF8 sequence.
@DaBeSoft I did, but it's outside of IIS, which I don't think helps you. Shibboleth can URL encode the values. This of course means all of our applications have to be aware and decode any values, which also caused a lot of problems.
In shibboleth2.xml we added encoding=URL like this
> We're going to drop the entire header if the header value is not a valid UTF8 sequence.

Do you mean ASCII? UTF8 has never been a valid header encoding for HTTP1.x. Closest is treating as opaque bytes, but don't think that helps with them as string

> Do you mean ASCII? UTF8 has never been a valid header encoding for HTTP1.x. Closest is treating as opaque bytes, but don't think that helps with them as string

We know, but people keep insisting on sending us UTF-8 headers. Some servers allow it.

> > Do you mean ASCII? UTF8 has never been a valid header encoding for HTTP1.x. Closest is treating as opaque bytes, but don't think that helps with them as string
>
> We know, but people keep insisting on sending us UTF-8 headers. Some servers allow it.

Do they, or are they sending Latin1/ISO-8859-1, as called out by the spec?
Historically, HTTP has allowed field content with text in the
ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
through use of [RFC2047] encoding. In practice, most HTTP header
field values use only a subset of the US-ASCII charset [USASCII].
For example \xD6\xD0\xCE\xC4 from https://github.com/aspnet/KestrelHttpServer/issues/2647 in UTF8 this is an invalid code sequence resulting in ���� whereas in Latin1/ISO-8859-1 this is ÖÐÎÄ
Not sure ÖÐÎÄ makes much more sense, but ���� is a one-way transform to gibberish and would fail the UTF8 test
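One way such a "UTF8 test" could be written, as a sketch rather than what Kestrel does: a `UTF8Encoding` created with `throwOnInvalidBytes: true` throws instead of substituting U+FFFD, so invalid sequences can be detected. `Utf8Check` is an illustrative name.

```csharp
using System.Text;

public static class Utf8Check
{
    // Strict decoder: throws on invalid bytes instead of emitting U+FFFD.
    private static readonly UTF8Encoding Strict =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Returns true and the decoded text when raw is well-formed UTF-8.
    public static bool TryDecode(byte[] raw, out string value)
    {
        try
        {
            value = Strict.GetString(raw);
            return true;
        }
        catch (DecoderFallbackException)
        {
            value = null;
            return false;
        }
    }
}
```

With this, `{0xD6,0xD0,0xCE,0xC4}` fails the check, while `{0xC3,0xB6}` decodes to `ö`.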
Looking at the various encodings:
```csharp
// UTF8
Encoding.UTF8.GetString(new byte[] {0xD6,0xD0,0xCE,0xC4})
// Output: ����
```
```csharp
// UTF16LE
Encoding.Unicode.GetString(new byte[] {0xD6,0xD0,0xCE,0xC4})
// Output: 탖쓎
```
```csharp
// Widen to UTF16 (i.e. Latin1)
new string(new char[] {(char)0xD6,(char)0xD0,(char)0xCE,(char)0xC4})
// Output: ÖÐÎÄ
```
@davidfowl you say

> We're going to drop the entire header if the header value is not a valid UTF8 sequence.

But what about the case when some stupid piece of script (usually not changeable because of even more stupid reasons) is inserting incorrect values in the cookies?
Sometimes people don't care about the incorrect headers, but we are still going to need that cookie.
There's no way to represent that data so the choices are:
for most headers I would choose: drop the header
for some selective headers (like cookie) I would choose: garble the value, and maybe have some way of trying to get the bits I need parsed out manually (at your own risk)
> There's no way to represent that data so the choices are ...

Suggesting this should be applied when the header contains bytes outside the printable ASCII range, rather than when it is not valid UTF-8, because decoding non-ASCII bytes with any particular encoding will likely garble the value if that encoding is not the correct one.
Or... allow an option to specify fallback Encoding to be used when ascii fast-path fails (while still rejecting control codes)
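A sketch of how that fallback option might behave. The `FallbackDecoding` class and its `fallback` parameter are hypothetical, not an existing Kestrel setting. Control codes are rejected regardless of encoding, pure ASCII takes the fast path, and anything else is handed to the configured encoding.

```csharp
using System;
using System.Text;

public static class FallbackDecoding
{
    public static string Decode(byte[] raw, Encoding fallback)
    {
        bool asciiOnly = true;
        foreach (byte b in raw)
        {
            // Control codes (other than HTAB) are rejected whatever the encoding.
            if ((b < 0x20 && b != 0x09) || b == 0x7F)
            {
                throw new FormatException("Control character in header value.");
            }
            if (b >= 0x80)
            {
                asciiOnly = false;
            }
        }

        // ASCII fast path, otherwise the caller-chosen fallback encoding.
        return asciiOnly ? Encoding.ASCII.GetString(raw) : fallback.GetString(raw);
    }
}
```

An app expecting UTF-8 would pass `Encoding.UTF8`. One expecting Latin1 could pass `Encoding.GetEncoding("iso-8859-1")`, so the single byte `0xD6` decodes to `Ö`.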
Update on the investigation and recommendations. I've tested the following with the Referer and Cookies headers:
Scenario | Extended Ascii (ISO-8859-1) | UTF-8 | Mixed (both extended ascii and utf-8)
-- | -- | -- | --
Kestrel (currently) | ❌ Request rejected | ❌Request rejected | ❌Request rejected
IIS + Managed Handler (ASP.NET 4) | ✔️ Parsed correctly | ✔️Parsed correctly | ⚠️ Parsed as Extended ASCII
IIS + ANCM In-Proc | ⚠️ Parsed as UTF-8 | ✔️ Parsed correctly | ⚠️ Parsed as UTF-8
IIS + ANCM Out-of-proc | ✔️ Raw bytes passed on to server | ❌⚠️ May reject request (see below) | ✔️ Raw bytes passed on to the server
Apache | ✔️ Raw bytes passed on to server | ✔️ Raw bytes passed on to server | ✔️ Raw bytes passed on to server
Apache + PHP | ⚠️ Parsed as UTF-8 | ✔️ Parsed correctly | ⚠️ Parsed as UTF-8
nginx | ✔️ Raw bytes passed on to server | ✔️ Raw bytes passed on to server | ✔️ Raw bytes passed on to server
node.js | ✔️ Parsed correctly | ⚠️ Parsed as Extended ASCII | ⚠️ Parsed as Extended ASCII
A few notes on the behaviours:
mb-strings. I wasn't able to get it to work in a reasonable amount of time so I didn't verify it as we were more interested in the out-of-box default behaviours here.
Based on the comparisons, here are the recommendations and proposals:
These 3 points are independent and I think it's reasonable to start with 1 and 2. We can look into 3 if there is enough demand.
Other alternatives that have been proposed:
But I don't think any of these alternatives are as desirable.
@JunTaoLuo Can you look at response headers as well?
I was hoping we could address response headers separately. Although related, apps have more control over the response headers whereas they cannot control what request headers are sent by clients.
Scenario | Name: François (Encode-able in UTF-8 and Extended ASCII) | Message: 你好 (Encode-able in UTF-8 only)
-- | -- | --
Kestrel (currently) | ❌Failed request | ❌Failed request
IIS + Managed Handler (ASP.NET 4) | ✔️ Encoded as UTF-8 | ✔️Encoded as UTF-8
IIS + ANCM In-Proc | ✔️ Encoded as UTF-8 | ✔️Encoded as UTF-8
IIS + ANCM Out-of-proc | ⚠️ Extended ASCII encoding becomes re-encoded as UTF-8, UTF-8 encoding becomes re-encoded as A UTF-8 representation of the Extended ASCII of the original UTF-8 Encoding (double encoded) | ⚠️UTF-8 encoding becomes re-encoded as A UTF-8 representation of the Extended ASCII of the original UTF-8 Encoding (double encoded)
Apache | ✔️ Raw bytes sent to client | ✔️ Raw bytes sent to client
Apache + PHP | ✔️ Encoded as UTF-8 | ✔️Encoded as UTF-8
nginx | ✔️ Raw bytes sent to client | ✔️ Raw bytes sent to client
node.js | ✔️ Encoded as Extended ASCII | ❌Failed request
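
To make the "double encoded" result in the ANCM out-of-proc row concrete, here is a small sketch that reproduces the effect in isolation. It only models the byte transformation (UTF-8 bytes misread as ISO-8859-1 and then re-encoded as UTF-8); it is not ANCM code, and `DoubleEncoding` is a made-up name.

```csharp
using System.Text;

public static class DoubleEncoding
{
    // Misread UTF-8 bytes as ISO-8859-1 text, then encode that text as UTF-8 again.
    public static byte[] ReencodeAsIfLatin1(byte[] utf8Bytes)
    {
        string misread = Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);
        return Encoding.UTF8.GetBytes(misread);
    }
}
```

Starting from the UTF-8 bytes of `François`, a client that decodes the result as UTF-8 sees `FranÃ§ois`.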
We ran into this issue as well.
Is there any temporary fix to ignore invalid request headers? (our issue is in the Referer header)
Parsing of request headers with UTF-8 encoded values has been merged and will be available in 2.2.0-preview2.
I've continued to do some additional investigation in how servers handle header names, path and query strings with non-ascii characters and this is what I found:
Header name | Extended ASCII | UTF-8
-- | -- | --
Kestrel (currently) | ❌400 Bad Request | ❌400 Bad request
IIS + Managed Handler (ASP.NET 4) | ❌400 Bad Request | ❌400 Bad Request
IIS + ANCM In-Proc | ❌400 Bad Request | ❌400 Bad Request
IIS + ANCM Out-of-proc | ❌400 Bad Request | ❌400 Bad Request
Apache | ❌400 Bad Request | ❌400 Bad Request
Apache + PHP | ❌400 Bad Request | ❌400 Bad Request
nginx | ⚠️Request forwarded with non-ascii header removed | ⚠️Request forwarded with non-ascii header removed
node.js | ❌400 Bad Request | ❌400 Bad Request
Path | Extended ASCII | UTF-8
-- | -- | --
Kestrel (currently) | ❌Failed Request | ❌Failed request
IIS + Managed Handler (ASP.NET 4) | ✔️ Decoded as Extended ASCII | ⚠️Decoded as Extended ASCII
IIS + ANCM In-Proc | ❌ Request rejected but an empty 200 response is sent | ✔️ Decoded as UTF-8
IIS + ANCM Out-of-proc | ✔️ Decoded as Extended ASCII | ⚠️Decoded as Extended ASCII
Apache | ✔️ Raw bytes sent to client | ✔️ Raw bytes sent to client
Apache + PHP | ❌404 Not Found | ✔️Decoded as UTF-8
nginx | ✔️ Raw bytes sent to client | ✔️ Raw bytes sent to client
node.js | ✔️ Decoded as Extended ASCII | ⚠️Decoded as Extended ASCII
Query String| Extended ASCII | UTF-8
-- | -- | --
Kestrel (currently) | ❌Failed Request | ❌Failed request
IIS + Managed Handler (ASP.NET 4) | ⚠️ Decoded as UTF-8 | ✔️Decoded as UTF-8
IIS + ANCM In-Proc | ✔️ Decoded as Extended ASCII | ⚠️Decoded as Extended ASCII
IIS + ANCM Out-of-proc | ❌400 Bad Request | ❌400 Bad Request
Apache | ✔️ Raw bytes sent to client | ✔️ Raw bytes sent to client
Apache + PHP | ⚠️Decoded as UTF-8 | ✔️Decoded as UTF-8
nginx | ✔️ Raw bytes sent to client | ✔️ Raw bytes sent to client
node.js | ✔️ Decoded as Extended ASCII | ⚠️Decoded as Extended ASCII
Header names with non-ascii characters are almost universally rejected, other than nginx. We should follow the same pattern and continue to reject these requests. There's less consensus in how to treat requests with non-ascii characters in path and query string but these should be URL encoded. I think we can continue to reject these requests too, unless we have a compelling reason to do otherwise.
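For reference, a sketch of the header name check this implies, based on the `token` / `tchar` grammar quoted earlier in the thread. `HeaderNameCheck` is an illustrative name and this is not Kestrel's actual parser.

```csharp
public static class HeaderNameCheck
{
    // Punctuation allowed in a token, per the tchar rule.
    private const string TokenPunctuation = "!#$%&'*+-.^_`|~";

    // True when name is a non-empty sequence of tchar octets (ASCII only).
    public static bool IsToken(byte[] name)
    {
        if (name.Length == 0)
        {
            return false;
        }

        foreach (byte b in name)
        {
            bool isDigit = b >= (byte)'0' && b <= (byte)'9';
            bool isAlpha = (b >= (byte)'A' && b <= (byte)'Z') || (b >= (byte)'a' && b <= (byte)'z');
            bool isPunct = TokenPunctuation.IndexOf((char)b) >= 0;
            if (!isDigit && !isAlpha && !isPunct)
            {
                return false; // delimiters, whitespace, controls and any octet >= 0x80
            }
        }
        return true;
    }
}
```

Any header name containing an octet in the 0x80-0xFF range fails this check, whether the sender meant it as extended ASCII or as UTF-8.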
I think it's highly critical to address this issue, at least in terms of "removing" non-ASCII headers.
Currently requests fail if any header has a non-ASCII value, which in many cases can occur due to an external factor, such as a referrer pointing to a UTF-8 URL.
@effyteva I have already made the changes to accept UTF-8 encoded characters in header values, which means UTF-8 encoded urls in the Referer header will now be accepted. However, we are still planning to reject requests containing non-ascii characters in the header names as well as path and query string values.
@JunTaoLuo sorry, I didn't understand you already committed those changes for the 2.2 release.
Thanks, and keep up the great work!
We decided to not pursue accepting non-ASCII characters in header names, path and query string at this time. If there is compelling reason to enable these scenarios, please file another issue and we will re-prioritize. A follow up issue to address UTF-8 values in response header values has been filed at https://github.com/aspnet/KestrelHttpServer/issues/2884.
Thanks for this fix. Is there an ETA for 2.2.0 release? @JunTaoLuo
See the roadmap.
We can encode, and that is okay. But what if we are migrating an old application to ASP.NET Core that already has many customers whose cookies contain non-ASCII characters?
For now, there is a workaround if you are using nginx in front of your ASP.NET Core application.
To remove all of the non ascii characters from the request header:
```nginx
server {
    set_by_lua_block $cookie_ascii {
        local cookie = ngx.var.http_cookie
        if cookie == nil or cookie == '' then return cookie end
        local cookie_ascii, n, err = ngx.re.gsub(cookie, "[^\\x00-\\x7F]", "")
        return cookie_ascii
    }
    listen 80;
    server_name example.com;
    location / {
        proxy_pass http://localhost:5000;
        ...
        proxy_set_header Cookie $cookie_ascii;
        ...
    }
}
```
We still suffer from the issue, after upgrading to ASP.NET Core 2.2.
Is there any setting required to enable this?
Comments on closed issues are not tracked, please open a new issue with the details for your scenario.