Go: x/text/encoding: ianaindex.Index.Encoding returns nil for valid encoding names

Created on 6 Mar 2017  Â·  12Comments  Â·  Source: golang/go

ianaindex.MIME.Encoding("us-ascii") for example returns (nil, nil).

This is quite surprising for me because the documentation does not say that it may return nil for valid names.
Would you please consider updating the documentation or changing it to return encoding.Nop?

Thank you!

What did you do?

package main

import (
    "fmt"

    "golang.org/x/text/encoding/ianaindex"
)

func main() {
    enc, err := ianaindex.MIME.Encoding("us-ascii")
    fmt.Printf("%v, %v\n", enc, err)
}

What did you expect to see?

non-nil Encoding object and nil error.

What did you see instead?

nil, nil

Most helpful comment

Opened a pull request to fix this issue: https://github.com/golang/text/pull/10

All 12 comments

cc @mpvl

I just ran into this issue as well.

Me too.
Currently Index.Encoding returns (nil, nil) for 201 out of 256 encodings, corresponding to 683 out of 1513 aliases. I assume these 201 encodings are not implemented.

The 55 implemented encodings are more than enough for me. The problem is that no function is supposed to fail with a nil error. Even if it were documented, it's not idiomatic.

(It *is* a failure, because Index.Encoding is supposed to return the requested encoding, not just to validate the alias.)

I think returning encoding.Nop is even worse than the current behavior, because that's not the encoding requested by the caller and it's harder to detect.

The best solution I can think of, is to return errUnsupported and improve the docs.
The downside is that there is no way to distinguish an unimplemented encoding (errUnsupported) from a bad alias (errInvalidName), because errors are not exported, which is inconvenient.

--- a/encoding/ianaindex/ianaindex.go
+++ b/encoding/ianaindex/ianaindex.go
@@ -79,7 +79,11 @@ func (x *Index) Encoding(name string) (encoding.Encoding, error) {
            return nil, errInvalidName
        }
    }
-   return x.enc[i], nil
+   enc := x.enc[i]
+   if enc == nil {
+       return nil, errUnsupported
+   }
+   return enc, nil
 }

 // Name reports the canonical name of the given Encoding. It will return an

While we are waiting for the right people to look into this bug, I propose to add a Bugs section to the docs (like the one in package bytes and others), briefly explaining the problem.

This bug is already two years old and users are going to get unexpected panics (like I did), just because the information is not at hand.

Opened a pull request to fix this issue: https://github.com/golang/text/pull/10

@emersion, thanks.
On second thought I prefer your solution of just documenting the (nil, nil) case (partly because my code already depends on it :D). Anyway I wouldn't want to lose the ability to distinguish between alias not found and encoding not supported.

But to treat ASCII as encoding.Nop seems incorrect to me; see the docs about Encoder and Decoder:
An Encoder (UTF-8 to X) is supposed to translate invalid UTF-8 to the replacement character and to return an error for any rune that cannot be represented by X.
A Decoder (X to UTF-8) is supposed to translate source bytes that are not valid X encoding to the replacement character.
As a result, reading from a Decoder should always produce valid UTF-8, and writing to an Encoder should always produce valid X encoding.
This is not the case for encoding.Nop, which is just a byte copy.

I would prefer to either do nothing, or implement a "correct" ASCII Encoding.
If ianaindex.MIME.Encoding must return an encoding that doesn't satisfy the above conditions, it should at least be documented.
Often a different behavior is preferred but we should probably just let the user special-case it. Someone may want encoding.Nop while others may want unicode.UTF8, charmap.Windows1252 or even something else.

That's true. I updated the PR to make the ASCII encoding:

  • Decode non-ASCII bytes as unicode.ReplacementChar
  • Encode non-ASCII runes as encoding.ASCIISub

@emersion, thanks for the update.

Unlike Decoders, it seems that an Encoder is supposed to return a real error for any rune that cannot be represented by the target encoding.
https://play.golang.org/p/Wj2SLov2GNG

Existing Encoders return internal.RepertoireError, not only for unrepresentable runes, but also for invalid UTF-8 and even a literal \uFFFD itself.
The documentation is not wrong in saying that invalid UTF-8 is treated as \uFFFD, since they both produce the same error, but I still find it a little surprising that \uFFFD is not considered "representable".

I haven't looked into it, but an alternative could be to add ASCII as a charmap.Charmap here (it's a generated file), with bytes >127 mapped to \uFFFD in the decode field, and omitted from the encode field.

Ah, thanks, I missed that. Updated again to return a RepertoireError as expected.

I considered adding ASCII as a charmap but decided against because the script generating tables fetches an ICU table from a remote repository.

Change https://golang.org/cl/212077 mentions this issue: encoding/ianaindex: add ASCII, document Index.Encoding

@mpvl

One question that remains is whether ASCII should be included as an exported encoding. [...] The disadvantage is that it may be used when people are better off with Latin1. But a comment could be to suggest use Latin1 for common cases.

I think it's a good idea to give users access to the ascii encoding.
I understand it would be inappropriate to use it for html, because html producers are required to use utf-8, and consumers are required to resolve labels as htmlindex does. But it's the user's responsibility to know that.

If a user thinks he needs an ascii Encoder he's probably right: if he didn't want ascii-only output he would have chosen something else.
If a user thinks he needs an ascii Decoder, he *may* be better off with a superset instead. It depends on what he wants to do if the input expected to be ascii is not actually ascii, and what such input is most likely to be.
Again, I think it's the user's responsibility to handle mislabeled/malformed input as most appropriate for his use case.

Note that latin1 == iso-8859-1 != windows-1252.
Bytes in the range 0x80-0x9f map differently to unicode, e.g. windows-1252 0x80 is U+20AC (€) while latin1 0x80 is U+0080.
It is usually safe to decode or to label latin1 as windows-1252 but not the other way around (W3C labels are intended for decoding legacy html).
(OT: Let's do whatever we can to speed up the agonizingly slow death of non-utf-8 encodings :smiley:)

One could add a Get method to ianaindex, similar to Encoding, but returning an Unsupported error if non-existing and deprecate Encoding.

SGTM (to be tested with errors.Is)

Are there any encodings compatible with Unicode that are returning nil, nil besides US-ASCII? It'd be helpful application-wise. To avoid displaying garbage data I have to bring up an error if the encoding returned is nil, but now I have to make a special case for US-ASCII, because it's compatible with Unicode so I should still display it.

@pam4

@makeworld-the-better-one, by "compatible with Unicode" I assume you actually mean "a subset of utf-8", otherwise you will have to explain better.
AFAIK, us-ascii is the only subset of utf-8 that has a name, i.e. the only encoding you can handle as utf-8 without loss.

But you may want to treat text labeled us-ascii as windows-1252, because in case of mislabeling there is usually a higher chance for it to be windows-1252 or latin1 than utf-8 (and in case of correct labeling it makes no difference). It depends on your use case.

Hmm, thanks. But it seems that in general I should just not be displaying text that has a nil encoding, with US-ASCII being the exception.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

rsc picture rsc  Â·  3Comments

rakyll picture rakyll  Â·  3Comments

OneOfOne picture OneOfOne  Â·  3Comments

mingrammer picture mingrammer  Â·  3Comments

longzhizhi picture longzhizhi  Â·  3Comments