Runtime: Add a string overload for ComputeHash

Created on 21 May 2020 · 11 Comments · Source: dotnet/runtime

Background and Motivation

Make it easier, more expressive and more straightforward for a developer to hash a string without first having to transform the string into a byte array.

Proposed API

An overload of the ComputeHash method on the HashAlgorithm class that takes a string as the input to compute the hash for.

namespace System.Security.Cryptography
{
    public abstract class HashAlgorithm : IDisposable, ICryptoTransform
    {
        public byte[] ComputeHash(byte[] buffer);
+        public byte[] ComputeHash(string buffer);

Usage Examples

using (var hash = new HMACSHA256(key))
{
    hashBytes = hash.ComputeHash("Hello world!");
}
api-suggestion area-System.Security untriaged

All 11 comments

Tagging subscribers to this area: @bartonjs, @vcsjones, @krwq
Notify danmosemsft if you want to be subscribed.

Strings cannot be hashed, only bytes can be hashed. Therefore this API would be assuming some sort of encoding. At minimum it would need to be

```C#
public byte[] ComputeHash(string buffer, Encoding encoding = null);
```

where the null encoding means to use some sort of default (UTF-8 because it's 2020? UTF-16LE because it's .NET? Should it tolerate malformed strings (unpaired surrogates) or throw?). Possibly the right answer is to not default the encoding and just always make someone provide it.

So the call site would be

```C#
byte[] hash = hasher.ComputeHash("Hello world!", Encoding.UTF8);
```

or, to reject unpaired surrogates:

```C#
byte[] hash = hasher.ComputeHash("Hello world!", new UTF8Encoding(false, true));
```
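To make the encoding concern concrete, here is a minimal sketch using only existing APIs (the proposed string overload does not exist). The class and method names `EncodingHashDemo`, `HashHex` and `IsRejectedByStrictUtf8` are made up for illustration. It shows that the same string hashes differently under UTF-8 and UTF-16LE, and that a strict `UTF8Encoding` throws on an unpaired surrogate where `Encoding.UTF8` silently substitutes a replacement character.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper names; only the framework calls are real APIs.
public static class EncodingHashDemo
{
    // Encodes `text` with the given encoding, hashes the bytes with SHA-256,
    // and returns the digest as lowercase hex.
    public static string HashHex(string text, Encoding encoding)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(encoding.GetBytes(text));
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }

    // Returns true when a strict UTF-8 encoder refuses the text,
    // for example because it contains an unpaired surrogate.
    public static bool IsRejectedByStrictUtf8(string text)
    {
        var strict = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            strict.GetBytes(text);
            return false;
        }
        catch (EncoderFallbackException)
        {
            return true;
        }
    }
}
```

Calling `HashHex("Hello world!", Encoding.UTF8)` and `HashHex("Hello world!", Encoding.Unicode)` yields two different digests, which is why a silent default would matter for interoperability.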

For the sake of a complete proposal I would consider:

  1. An overload taking ReadOnlySpan<char>. The string one can defer to this.
  2. Putting equivalent methods on IncrementalHash.
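As a rough sketch of what those two points could look like when built on today's APIs: the helper class `SpanHashingSketch` and its method shapes below are assumptions for illustration, not the proposed framework surface. The string overload defers to the span overload, and the hashing goes through `IncrementalHash`.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper; not part of System.Security.Cryptography.
public static class SpanHashingSketch
{
    // Encodes the characters with an explicitly supplied encoding,
    // then feeds the resulting bytes to IncrementalHash.
    public static byte[] ComputeHash(
        HashAlgorithmName algorithm, ReadOnlySpan<char> text, Encoding encoding)
    {
        byte[] bytes = new byte[encoding.GetByteCount(text)];
        encoding.GetBytes(text, bytes);

        using (IncrementalHash hasher = IncrementalHash.CreateHash(algorithm))
        {
            hasher.AppendData(bytes);
            return hasher.GetHashAndReset();
        }
    }

    // The string overload simply defers to the span overload.
    public static byte[] ComputeHash(
        HashAlgorithmName algorithm, string text, Encoding encoding)
        => ComputeHash(algorithm, text.AsSpan(), encoding);
}
```

A call would look like `SpanHashingSketch.ComputeHash(HashAlgorithmName.SHA256, "Hello world!", Encoding.UTF8)`. The encoding is deliberately not defaulted here, following the suggestion to always make the caller provide it.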

I think even ReadOnlySpan<T> is fine as long as T is a simple struct (or whatever generic restriction we've put on span casting).

@krwq my concern with T : unmanaged would be endianness. Assuming T is int for example, would it be little endian? Would whatever the native platform is be used?
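The endianness question can be illustrated without any hashing at all. In the sketch below (the class `EndiannessSketch` is a made-up name), reinterpreting a span of `int` as bytes gives whatever byte order the platform uses, while writing each value with `BinaryPrimitives` gives the same bytes everywhere. A generic hashing overload would have to pick one of these behaviours.

```csharp
using System;
using System.Buffers.Binary;
using System.Runtime.InteropServices;

// Hypothetical helper for illustration only.
public static class EndiannessSketch
{
    // Reinterprets the ints as raw bytes in the platform's native byte order.
    public static byte[] NativeBytes(ReadOnlySpan<int> values)
        => MemoryMarshal.AsBytes(values).ToArray();

    // Serializes each int explicitly as little-endian, independent of platform.
    public static byte[] LittleEndianBytes(ReadOnlySpan<int> values)
    {
        byte[] bytes = new byte[values.Length * sizeof(int)];
        for (int i = 0; i < values.Length; i++)
        {
            BinaryPrimitives.WriteInt32LittleEndian(
                bytes.AsSpan(i * sizeof(int)), values[i]);
        }
        return bytes;
    }
}
```

On a little-endian machine both methods return the same bytes, so the difference (and any resulting hash mismatch) would only show up on big-endian hardware.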

I like the current design because it taught me exactly this:

Strings cannot be hashed, only bytes can be hashed.

In my early days of programming, after having tried a few different languages including PHP, where hashing works on strings and returns the bytes as a hex-encoded string, I was surprised that I could not hash a string. Nowadays I prefer it, because it makes you think about how it works and forces you to decide which encoding to use.
A standardized hashing algorithm that takes a "non-standardized" string input and produces a "non-standardized" output, because it chooses some default encoding that might differ from other languages that take a string and is thus not interoperable, is bad design and violates the single responsibility principle.
During the course of replicating PHP's MD5 function in C#, I learnt about encodings, Base64, byte[]s, strings and hex encoding. The PHP documentation does not state how its md5 function works or which encoding it uses, which made it hard to replicate the same result in .NET, and now I curse these magic functions :wink:
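For readers attempting the same replication, a minimal sketch follows. It assumes the PHP side holds UTF-8 text (PHP's `md5()` hashes the raw bytes of the string and returns lowercase hex), and the class name `PhpStyleMd5` is invented for this example. If the PHP data is in another encoding, the `Encoding` used here would have to change to match.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper; assumes the PHP input bytes are UTF-8.
public static class PhpStyleMd5
{
    // Encodes the string as UTF-8, hashes the bytes with MD5,
    // and formats the digest as 32 lowercase hex characters.
    public static string Md5Hex(string input)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(input));
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

`PhpStyleMd5.Md5Hex("hello")` returns `5d41402abc4b2a76b9719d911017c592`. MD5 is used here only for compatibility with existing PHP output, not as a recommendation for new designs.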

@bartonjs, Oh, I did not even consider that strings cannot be hashed, only bytes. Good point. As for the optional encoding parameter, it is not all that necessary as long as there is a sane default, since you can call the GetBytes method on any Encoding class.

I want to compute the hash of a string, so I want to just pass in the string, but instead I first have to transform it into bytes, so I have to look up how to do that, and then I have to read up on all these encodings and the history of computing, so I wind up reading articles such as The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and What every programmer absolutely, positively needs to know about encodings and character sets to work with text. It is a long read, and it is not pretty. 7-bit, 8-bit, Unicode, UTF-7, UTF-8 with and without BOM, UTF-16 (little endian and big endian), UTF-32, ugh. All I wanted to do was to hash a string, now I am having an existential crisis. Pondering why I chose a life of code. As a last resort to pull myself up from the darkness I chose to refuse to acknowledge the existence of any other encodings; Thou shalt have no other encodings before UTF-8. Ignorance is bliss.

I wouldn't know what default encoding it ought to use, but someone more knowledgeable than me might have an idea what a suitable default encoding would be. Perhaps UTF-8. Then people who really need compatibility with some old, odd legacy system using some arcane encoding can still transform a string to a byte array, while most people, who either don't know, don't care, or just want to pass in whatever string and are happy to leave the choice of encoding to higher powers, can choose to do so.

@vcsjones That's a good idea. I like that.

@Suchiman With an overload you can still create interoperable systems just the way you can already. The overload is just a convenient way to compute a hash in scenarios where the encoding is irrelevant (such as when the hash is only used locally), or for developers who quickly want something without investing all that time reading about the long and boring history of encodings. Encodings are something I just want to work, and I don't want to spend too much time thinking about them. It is a long, boring story that should be solved by now, and we shouldn't have to think about it (except when working with legacy systems). I know developers who don't even know about encodings.

@vanillajonathan for local usage, why isn't "foo".GetHashCode() sufficient? Also, you can create an extension method in your own code for computing a specific hash in a specific way.

For cryptographic primitives like this, I think it's reasonable to require the caller to provide the data in the format the primitive expects. In this case, that'd be a buffer of bytes.

If we wanted to add a higher-level API which encapsulated the several steps required to accomplish some scenario, that seems like it'd be a reasonable request. But IMO those higher level APIs should be their own thing and shouldn't go directly on the primitives. Especially if there ends up being a large number of such higher-level APIs.

@krwq Because GetHashCode is not cryptographically secure, hence it is not suitable for determining data integrity against tampering.
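To illustrate the integrity use case with existing APIs, here is a hedged sketch of a keyed tamper check built on `HMACSHA256`, as in the usage example at the top of the issue. The class `TamperCheckSketch` and its methods are hypothetical names, and UTF-8 is an assumed encoding choice. The comparison uses `CryptographicOperations.FixedTimeEquals` so that it does not leak timing information.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper; UTF-8 is an assumption, not a framework default.
public static class TamperCheckSketch
{
    // Computes an HMAC-SHA256 tag over the UTF-8 bytes of the message.
    public static byte[] Sign(byte[] key, string message)
    {
        using (var hmac = new HMACSHA256(key))
        {
            return hmac.ComputeHash(Encoding.UTF8.GetBytes(message));
        }
    }

    // Recomputes the tag and compares it to the expected one in fixed time.
    public static bool Verify(byte[] key, string message, byte[] expectedTag)
        => CryptographicOperations.FixedTimeEquals(Sign(key, message), expectedTag);
}
```

A tag produced by `Sign` verifies only when both the key and the message are unchanged, which is the property `GetHashCode` cannot provide.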

The extension method is a good interim idea. Here is an implementation if anyone happens to be interested:

using System.Security.Cryptography;
using System.Text;

/// <summary>
/// Extension methods for <see cref="HashAlgorithm"/>.
/// </summary>
public static class HashAlgorithmExtensions
{
    /// <summary>
    /// Computes the hash value for the specified string. Uses UTF-8 encoding.
    /// </summary>
    /// <param name="hashAlgorithm">The <see cref="HashAlgorithm"/> instance.</param>
    /// <param name="str">The input to compute the hash code for.</param>
    /// <returns>The computed hash code.</returns>
    /// <remarks>Uses UTF-8 encoding for the string.</remarks>
    public static byte[] ComputeHash(this HashAlgorithm hashAlgorithm, string str)
    {
        var encoding = new UTF8Encoding();
        var buffer = encoding.GetBytes(str);

        return hashAlgorithm.ComputeHash(buffer);
    }
}

@GrabYourPitchforks, interesting point. It leads my thoughts to the facade pattern.

Given the support against this proposal, and the ease of a workaround, I'm going to go ahead and close this proposal.
