Describe the bug
It is not possible to write blob metadata containing non-ASCII characters.
Also affects Microsoft Azure Storage SDK for .NET (11.0.0).
https://github.com/Azure/azure-storage-net/issues/975
Expected behavior
You should be able to save blob metadata containing non-ASCII characters.
Actual behavior (include Exception or Stack Trace)
System.AggregateException: 'Retry failed after 6 tries.4.0,(.NET Core 3.1.1; Microsoft Windows 10.0.18363)'
RequestFailedException: Request headers must contain only ASCII characters.
This exception was originally thrown at this call stack:
System.Net.Http.HttpConnection.WriteStringAsync(string)
System.Net.Http.HttpConnection.WriteHeadersAsync(System.Net.Http.Headers.HttpHeaders, string)
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(System.Threading.Tasks.Task)
System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
System.Runtime.CompilerServices.ConfiguredTaskAwaitable.ConfiguredTaskAwaiter.GetResult()
System.Net.Http.HttpConnection.SendAsyncCore(System.Net.Http.HttpRequestMessage, System.Threading.CancellationToken)
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(System.Threading.Tasks.Task)
System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
...
[Call Stack Truncated]
To Reproduce
Write metadata to a blob containing a non-ASCII character (such as "ñ").
using System;
using System.Collections.Generic;
using Azure.Storage.Blobs;

namespace BlobExperiment
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello World!");
            const string connectionString =
                "DefaultEndpointsProtocol=https;AccountName=<account-name>;AccountKey=<account-key>;EndpointSuffix=core.windows.net";
            var client = new BlobContainerClient(connectionString, "test");
            var blobClient = client.GetBlobClient("test.jpg");
            var metadata = new Dictionary<string, string>();
            metadata.Add("Test", "ñ"); // The notorious ñ.
            blobClient.SetMetadata(metadata); // Throws: request headers must contain only ASCII characters.
        }
    }
}
Environment:
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @xgithubtriage.
@vanillajonathan The documentation mentions that metadata name/value pairs must adhere to all the restrictions that apply to HTTP headers, so only ASCII characters are allowed. To store non-ASCII characters in metadata, a transformation such as Base64 encoding or URL encoding is required. For example, when metadata is set from the Azure portal, the name/value pair is sent URL-encoded.
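To make the two transformations mentioned above concrete, here is a minimal sketch of what each one could look like on the application side. The `MetadataEncoding` class and its method names are hypothetical helpers made up for illustration; they are not part of Azure.Storage.Blobs. Only standard .NET calls (`Uri.EscapeDataString`, `Convert.ToBase64String`) are used, and the sketch assumes UTF-8 as the byte encoding.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the Azure SDK.
public static class MetadataEncoding
{
    // URL-encoding: each non-ASCII character becomes %XX escapes of its UTF-8 bytes.
    public static string UrlEncode(string value) => Uri.EscapeDataString(value);

    public static string UrlDecode(string value) => Uri.UnescapeDataString(value);

    // Base64: the UTF-8 bytes of the value are mapped onto an ASCII-only alphabet.
    public static string Base64Encode(string value) =>
        Convert.ToBase64String(Encoding.UTF8.GetBytes(value));

    public static string Base64Decode(string value) =>
        Encoding.UTF8.GetString(Convert.FromBase64String(value));
}
```

With either helper, the encoded string is what would be placed in the metadata dictionary before calling SetMetadata, and the application would decode it again after reading the metadata back. For "ñ", `UrlEncode` yields `%C3%B1` and `Base64Encode` yields `w7E=`.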
This is quite a limitation. I imagine it being painful when working with the blob indexing using Azure Cognitive Search and the metadata is stored as JSON.
I also just discovered that having a pound character "£" in blob metadata causes this error. It must be expecting the basic ASCII set. Ouch!
It seems strange if metadata is sent as "request headers" (I understand that as HTTP request headers) instead of being sent in the request body.
Can you guys fix it so that it sends the metadata in the request body instead of as HTTP headers (if that is what it is currently doing)?
Or what is the recommended way to deal with this? To URL encode it? Or to Base64 encode it?
But wouldn't that store the metadata in the blob as encoded?
Then you would have to decode the metadata every time you retrieve it?
So how would Azure Cognitive Search operate on data that is encoded?
I would like to save the metadata as JSON. Since I've got problems with non-ASCII chars in metadata I am now saving the metadata in a SQL database.
You are correct; application metadata is sent to the service via HTTP request headers.
As Azure Blob Storage is fundamentally exposed as an HTTP service, the HTTP body content is used to represent the contents of a blob while HTTP headers are used to expose the properties and metadata of a blob. Using the HTTP headers in this way to carry properties/metadata is by design, and enables the following scenarios:
Blob properties and application metadata are thus designed so they can be accessed alongside blobs via a standard HTTP interface. Using the HTTP request body to carry metadata would not support these important scenarios.
As noted earlier, the use of HTTP headers does imply certain restrictions on the supported character set for metadata, as they must be carried in an HTTP header. To conform with the behavior of most HTTP clients and the recommendation from the RFC, metadata is restricted to US-ASCII octets. (Note that metadata names are further restricted, as documented here). From the RFC:
Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data. _RFC 7230 3.2.4_
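Since the restriction comes down to US-ASCII octets, an application could check its metadata before sending it and fail with a clearer message than the one in the stack trace above. The following is a rough sketch under that assumption; `MetadataValidation` and its members are hypothetical names, not an SDK API, and the check only covers the ASCII range (it does not enforce the additional rules for metadata names).

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical pre-flight check, not part of the Azure SDK.
public static class MetadataValidation
{
    // True when every character is in the US-ASCII range (U+0000 to U+007F).
    public static bool IsAscii(string value) => value.All(c => c <= 0x7F);

    // Returns the keys of all entries whose name or value contains a non-ASCII character.
    public static List<string> FindNonAsciiEntries(IDictionary<string, string> metadata) =>
        metadata
            .Where(pair => !IsAscii(pair.Key) || !IsAscii(pair.Value))
            .Select(pair => pair.Key)
            .ToList();
}
```

An application could call `FindNonAsciiEntries` before SetMetadata and either reject or encode the offending entries.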
If you want to use the application metadata, any special characters will need to be encoded. Either URL-encoding or Base64-encoding would be acceptable; the choice is up to the application (and, in your case, depends on whether Cognitive Search supports decoding metadata in this way, which I suspect it does not).
The option I would recommend, though, would be to store the JSON metadata document alongside the original blob. For example, image.jpg could be accompanied by image.jpg.meta, which would store its metadata in a separate JSON blob.
Cognitive Search has good support for indexing blobs containing JSON. You can create a metadata field, e.g. metadata_original_blob_uri, that points back to the real file, and mark this as “Retrievable” (but not “Searchable”) in the indexer settings. This will give you full search over all the metadata, and your application can then retrieve the original URI to the source blob when you have a search match.
https://docs.microsoft.com/en-us/azure/search/search-howto-index-json-blobs
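The sidecar approach described above could be sketched roughly as follows. `SidecarMetadata` and its methods are hypothetical names used only for illustration; the `.meta` suffix and the `metadata_original_blob_uri` field are taken from the suggestion above, and the upload itself is left out because it depends on how the application uses the SDK.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper, not part of the Azure SDK.
public static class SidecarMetadata
{
    // image.jpg -> image.jpg.meta, following the naming suggested above.
    public static string SidecarName(string blobName) => blobName + ".meta";

    // Builds the JSON document to store in the sidecar blob. The extra field
    // points back to the original blob so a search hit can be resolved to it.
    public static string BuildDocument(string originalBlobUri, Dictionary<string, string> metadata)
    {
        var document = new Dictionary<string, string>(metadata)
        {
            ["metadata_original_blob_uri"] = originalBlobUri
        };
        return JsonSerializer.Serialize(document);
    }
}
```

The resulting string would be uploaded as the content of a second blob named by `SidecarName`. Because the metadata travels in the blob body rather than in HTTP headers, the ASCII restriction does not apply to it.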
I hit this issue also in the context of updating blob metadata for Azure Cognitive Search. It was a headache to debug and find a fix, but metadataDictionary.Add("myKey", Uri.EscapeDataString(metadata_value)) followed by blobClient.SetMetadata(metadataDictionary) seems to work for pretty much all Unicode content. The encoding gets removed when finally stored in the metadata and the content looks fine (character accents visible, etc.) when viewing the uploaded metadata from Storage Explorer.
One caveat is that you can still end up uploading metadata values that have trailing spaces (this is not permitted, and a warning is displayed in Storage Explorer when viewing the metadata). Before I figured out how to encode, a trailing space was raising a very cryptic error from SetMetadata:
---> AzureStorage Blob Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature
It would be helpful to improve the docs and the exception error messages above and below to suggest encoding the metadata values. Even better, the library could do the encoding for you. An additional method, or an optional parameter on the existing method such as escapeStrings=true, would help developers learn the expectations of the API.
--->Azure.RequestFailedException: Request headers must contain only ASCII characters.
---> System.Net.Http.HttpRequestException: Request headers must contain only ASCII characters.
at System.Net.Http.HttpConnection.WriteStringAsync(String s)
A related issue, linked below, concerns the lack of warning that filenames need to be escaped as well. The default behavior of the API should be revisited for all functions that generate HTTP headers and might break in this way, and developers need to be made well aware, through docs, function comments and/or optional method parameters, if they need to do the encoding themselves.
BlobClient doesn't properly escape filenames with certain characters such as '#'.
https://github.com/Azure/azure-sdk-for-net/issues/11602
Thank you @ryanerdmann.
Thank you @kganjam, I was able to use Uri.EscapeDataString to escape JSON (that I serialize with JsonSerializer.Serialize) and save that as metadata.
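For anyone wanting to do the same, the combination described above might look roughly like this. `JsonMetadata` is a hypothetical helper name, not an SDK type, and the sketch assumes the JSON document is a flat string-to-string dictionary. Keep in mind that percent-encoding makes the value longer than the original JSON.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper, not part of the Azure SDK.
public static class JsonMetadata
{
    // Serialize to JSON, then percent-encode so the result contains only ASCII
    // characters that are safe to carry in a metadata value.
    public static string Encode(Dictionary<string, string> document) =>
        Uri.EscapeDataString(JsonSerializer.Serialize(document));

    // Reverse the two steps: percent-decode, then parse the JSON.
    public static Dictionary<string, string> Decode(string encoded) =>
        JsonSerializer.Deserialize<Dictionary<string, string>>(Uri.UnescapeDataString(encoded));
}
```

The output of `Encode` is the string that would be added to the metadata dictionary before calling SetMetadata, and `Decode` would be applied to the value read back from the blob.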
Can blob index tags be used instead of metadata?
Do tags also suffer from the same ASCII limitation as metadata?
I would like to store data about the blob, such as:
| Key | Value |
|-|-|
| Author | Alice |
| Tags | cat, black, pet |
| Location | Netherlands |